1. Namespace cannot be deleted

1. Check whether a finalizer is blocking the deletion:
root@ubuntu:~# kubectl get ns nebula -o json | jq '.spec.finalizers'
[
  "kubernetes"
]

2. The nebula namespace is stuck in Terminating because the "kubernetes" finalizer is blocking its deletion. Finalizers are a Kubernetes mechanism for making sure resources are cleaned up properly, but they can sometimes get stuck. You can strip the finalizers from the namespace spec and submit it to the finalize subresource so the namespace can be removed:
root@ubuntu:~# kubectl get ns nebula -o json | \
  jq 'del(.spec.finalizers)' | \
  kubectl replace --raw "/api/v1/namespaces/nebula/finalize" -f -
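
Once the call returns, the namespace should disappear shortly; a quick check:

kubectl get ns nebula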

2. Evicted Pods

The brute-force approach is to delete all of these abnormal Pods first:

kubectl get pods -A | grep Evicted
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
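
Note that --field-selector=status.phase=Failed matches every failed Pod, not only evicted ones. A sketch that removes only Pods whose status.reason is Evicted (requires jq):

kubectl get pods -A -o json | \
  jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done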

Each evicted Pod records the specific reason in its Events, for example: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.

The message says the container was using 524036Ki (about 512Mi) of ephemeral storage, but no ephemeral-storage request was set, so Kubernetes assumed it "needs none". When a node runs low on ephemeral storage, Pods that have not declared a request are evicted first. The fix is to declare ephemeral-storage requests and limits on the container:

resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"


The root cause is that the worker node's disk is nearly full; log on to that node and free up some space.
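
A minimal cleanup sketch on the affected node, assuming containerd is the container runtime, crictl is installed, and the default kubeadm paths are in use:

# see where the space is going
df -h /var/lib/containerd /var/lib/kubelet
# remove images not referenced by any container
crictl rmi --prune
# trim old journal logs
journalctl --vacuum-size=200M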


Detection script

This can be run as a scheduled job, and with a bit more polish the results can be pushed out as notifications (a cron sketch follows the sample output below).

kubectl get pods -A --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}{end}' | \
grep Evicted | \
while IFS=$'\t' read -r namespace pod node reason message; do
  echo "Pod: $namespace/$pod"
  echo "Node: $node"
  echo "Details: $message"
  echo ""
done


Pod: fssc/ems-base-application-5dbf9bd7-74tdp
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container ems-base-application was using 780640Ki, which exceeds its request of 0.

Pod: kubesphere-monitoring-system/notification-manager-deployment-798fdfc9b-fbbqr
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container notification-manager was using 16Ki, which exceeds its request of 0. Container tenant was using 754680Ki, which exceeds its request of 0.

Pod: zizhu/zzoms-service-d7754cbd7-9nwg9
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.
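
A minimal scheduling sketch, assuming the script above is saved as /opt/scripts/check-evicted.sh (a hypothetical path) and the cron user has a working kubeconfig:

# run every 30 minutes and append the report to a log file
*/30 * * * * /opt/scripts/check-evicted.sh >> /var/log/check-evicted.log 2>&1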

3. Headless Service cannot be pinged

Stepped into a pit today and nearly thought the cluster was broken: a headless Service could not be pinged at all. The Pods behind it were running fine, yet a name like kafka-0.kafka.default.svc.cluster.local simply would not resolve, failing with "bad address".

After comparing things carefully, it turned out to be the StatefulSet + headless Service combination. A StatefulSet declares a serviceName; that name must not be changed arbitrarily, and you cannot just create a separate Service under a different name. If the name does not match the headless Service, DNS returns nothing for the per-Pod hostnames and the services cannot reach each other.
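
A minimal sketch of a matching pair; kafka is only an illustrative name and image. The key point is that spec.serviceName in the StatefulSet equals metadata.name of the headless Service, which is what makes kafka-0.kafka.default.svc.cluster.local resolvable:

apiVersion: v1
kind: Service
metadata:
  name: kafka                # headless Service name
spec:
  clusterIP: None            # headless: DNS returns the Pod IPs directly
  selector:
    app: kafka
  ports:
    - port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka         # must match the Service name above
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: bitnami/kafka:3.7   # illustrative image
          ports:
            - containerPort: 9092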

4. PV, PVC, and claim concepts

In Kubernetes, a PV is like a large flat offered by a landlord (the storage resource), a PVC is the rental application a tenant submits (a request to use storage), and "claim" refers to that act of applying. A PVC is matched against PVs according to its requirements, and once bound, the tenant (the Pod) gets exclusive use of that piece of storage.
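
A minimal sketch of the pairing, using an illustrative hostPath PV; the names, size, and path are made up, and no storageClassName is set, so the PVC binds to the PV by capacity and access mode (assuming no default StorageClass intercepts the claim):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv              # the landlord's flat
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/demo-pv
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc             # the tenant's rental application
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi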

5. Get information on all Pending Pods

kubectl get pods -A --field-selector=status.phase=Pending --no-headers | \
awk '{print $1, $2}' | \
while read -r namespace pod; do
  echo "Namespace: $namespace, Pod: $pod"
  kubectl describe pod -n "$namespace" "$pod" | grep -A10 -E "Events|Warning|Failed|Error|FailedScheduling"
  echo "--------------------------"
done


Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-0
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0"
--------------------------
Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-1
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1"
--------------------------

6. endpoints

When a Service is created without a selector, Kubernetes associates it with the Endpoints object of the same name in the same namespace, which lets you manually bind an external address to the Service.

[root@hybxvuka01 harbor-svc]# cat harbor-endpoints.yaml 
apiVersion: v1
kind: Service
metadata:
  name: harbor
  namespace: bx
spec:
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: harbor
  namespace: bx
subsets:
  - addresses:
      - ip: 172.31.0.99
    ports:
      - port: 80
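
After applying the manifests above, traffic sent to harbor.bx.svc.cluster.local:80 inside the cluster is forwarded to 172.31.0.99:80. A quick sanity check:

kubectl -n bx get endpoints harbor
kubectl -n bx describe svc harbor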

7. Certificate renewal

# Check the certificates
[root@hybxvpka01 ~]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Nov 19, 2026 09:51 UTC   357d            ca                      no      
apiserver                  Nov 19, 2026 09:51 UTC   357d            ca                      no      
apiserver-etcd-client      Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
apiserver-kubelet-client   Nov 19, 2026 09:51 UTC   357d            ca                      no      
controller-manager.conf    Nov 19, 2026 09:51 UTC   357d            ca                      no      
etcd-healthcheck-client    Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
etcd-peer                  Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
etcd-server                Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
front-proxy-client         Nov 19, 2026 09:51 UTC   357d            front-proxy-ca          no      
scheduler.conf             Nov 19, 2026 09:51 UTC   357d            ca                      no      
super-admin.conf           Nov 19, 2026 09:51 UTC   357d            ca                      no      

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Nov 17, 2035 09:51 UTC   9y              no      
etcd-ca                 Nov 17, 2035 09:51 UTC   9y              no      
front-proxy-ca          Nov 17, 2035 09:51 UTC   9y              no      



# Renew; if you have multiple masters, run this on every one of them
[root@hybxvpka01 ~]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
certificate embedded in the kubeconfig file for the super-admin renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

# Restart kubelet on every master; per the message above, the control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler, etcd) also need to be restarted so they pick up the new certificates
[root@hybxvpka01 ~]# systemctl restart kubelet
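
One way to bounce the control-plane static Pods is to temporarily move the manifest directory out of the way (a sketch; assumes the default kubeadm layout under /etc/kubernetes, run on each master). Also refresh the local kubeconfig, since admin.conf was re-issued:

mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
sleep 30                                      # give kubelet time to stop the static Pods
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
cp /etc/kubernetes/admin.conf ~/.kube/config  # pick up the renewed admin credentials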

8. Get the IP pools

# Calico
kubectl get ippools.crd.projectcalico.org

# inspect the default IPv4 pool
kubectl describe ippools.crd.projectcalico.org default-ipv4-ippool
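
If calicoctl is installed and configured to reach the datastore, it can also show how much of each pool has actually been allocated (a sketch; calicoctl is not part of the cluster by default):

calicoctl ipam show --show-blocks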