1. Namespace cannot be deleted

1. Check whether Finalizers are blocking the deletion
root@ubuntu:~# kubectl get ns nebula -o json | jq '.spec.finalizers'
[
  "kubernetes"
]

2. The nebula namespace is stuck in Terminating because the namespace finalizer ("kubernetes") is blocking its deletion. Finalizers are a Kubernetes mechanism for making sure resources get cleaned up properly, but sometimes they get stuck. You can clear the namespace's spec.finalizers field (removing "kubernetes") through the finalize subresource so that the deletion can complete:
root@ubuntu:~# kubectl get ns nebula -o json | \
  jq 'del(.spec.finalizers)' | \
  kubectl replace --raw "/api/v1/namespaces/nebula/finalize" -f -
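
Before force-removing the finalizer it is worth checking whether some resource is still stuck inside the namespace, which is usually the real reason the deletion hangs. A quick check, using the nebula namespace from the example above:

kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n 1 kubectl get --show-kind --ignore-not-found -n nebula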

2. Evicted pods

The brute-force approach is to delete all of these abnormal pods first:

kubectl get pods -A | grep Evicted
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
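
Note that the field selector above removes every pod in the Failed phase, not just the evicted ones. If you only want to clean up pods whose status.reason is Evicted, a rough jq-based sketch:

kubectl get pods -A -o json | \
  jq -r '.items[] | select(.status.reason == "Evicted") | "\(.metadata.namespace)\t\(.metadata.name)"' | \
  while IFS=$'\t' read -r ns pod; do
    kubectl delete pod -n "$ns" "$pod"
  done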

Every Evicted pod records the specific reason in its Events, for example the one in the screenshot below: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.

The message says the container was using 524036Ki (about 512Mi) of ephemeral storage, but no ephemeral-storage request was set, so Kubernetes assumes it "needs none". When the node runs low on ephemeral storage, pods that have not declared a request are evicted first. The fix is to declare ephemeral-storage requests and limits on the container:

resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
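
To see which running containers have declared no ephemeral-storage request at all (and are therefore first in line for eviction when the disk fills up), a jq sketch along these lines can help:

kubectl get pods -A -o json | \
  jq -r '.items[]
         | .metadata.namespace as $ns
         | .metadata.name as $pod
         | .spec.containers[]
         | select(.resources.requests["ephemeral-storage"] == null)
         | [$ns, $pod, .name] | @tsv'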

[screenshot: image-XkAz.png]

The root cause is that the worker node's disk is almost full; log on to that node and clean up some space to free it up.

[screenshot: image-FiXo.png]
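
What exactly to clean depends on the node, but a rough checklist (assuming a containerd-based node; adjust the paths and tools to your runtime) looks like this:

df -h /var/lib /var/log                     # where the space is actually going
du -sh /var/lib/containerd /var/log/pods    # typical heavy hitters on containerd nodes
crictl rmi --prune                          # drop unused images (containerd/CRI-O via crictl)
# docker system prune -a                    # Docker equivalent, if the node still runs Docker
journalctl --vacuum-time=7d                 # trim old system logs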

Detection script

This can be run as a scheduled script, and with a bit more work the results can be pushed out as notifications (see the cron/webhook sketch after the sample output below).

kubectl get pods -A --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}{end}' | \
grep Evicted | \
while IFS=$'\t' read -r namespace pod node reason message; do
  echo "容器: $namespace/$pod"
  echo "节点: $node"
  echo "详情: $message"
  echo ""
done


Pod: fssc/ems-base-application-5dbf9bd7-74tdp
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container ems-base-application was using 780640Ki, which exceeds its request of 0.

Pod: kubesphere-monitoring-system/notification-manager-deployment-798fdfc9b-fbbqr
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container notification-manager was using 16Ki, which exceeds its request of 0. Container tenant was using 754680Ki, which exceeds its request of 0.

Pod: zizhu/zzoms-service-d7754cbd7-9nwg9
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.
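
One way to do the "scheduled + push" part mentioned above: a small wrapper that posts the raw list to a webhook whenever there is something to report. The script path, webhook URL, and schedule below are placeholders, not part of the original setup.

#!/usr/bin/env bash
# check-evicted.sh -- list Evicted pods and push the result if it is not empty.
WEBHOOK="https://example.com/webhook"   # placeholder
OUT=$(mktemp)

kubectl get pods -A --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}{end}' | \
  grep Evicted > "$OUT" || true

if [ -s "$OUT" ]; then
  curl -s -X POST -H 'Content-Type: text/plain' --data-binary @"$OUT" "$WEBHOOK"
fi
rm -f "$OUT"

# Example crontab entry to run it hourly:
#   0 * * * * /opt/scripts/check-evicted.sh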

3. Headless Service cannot be pinged

I hit a pit today and almost concluded the cluster itself was broken: a Headless Service simply could not be pinged. The backing Pods were all running fine, yet a name like kafka-0.kafka.default.svc.cluster.local refused to resolve, only returning "bad address".

Careful comparison finally showed that it was the StatefulSet + Headless Service combination at work. A StatefulSet specifies a serviceName, and that name must match the headless Service's name exactly; you cannot change it casually or create a separate Service under a different name. If the names don't line up, the per-Pod DNS records (<pod>.<service>.<namespace>.svc.cluster.local) are never created, nothing resolves, and the services naturally cannot talk to each other.
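
A quick way to verify the pairing (names follow the kafka example above and are only illustrative):

kubectl get statefulset kafka -o jsonpath='{.spec.serviceName}{"\n"}'   # must print the Service name below
kubectl get svc kafka -o jsonpath='{.spec.clusterIP}{"\n"}'             # a headless Service prints "None"

# resolve the per-Pod record from inside the cluster (busybox image is just an example)
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kafka-0.kafka.default.svc.cluster.local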

4. PV, PVC, and Claim concepts

In Kubernetes, a PV is like a large flat offered by a landlord (the storage resource), a PVC is the tenant's rental application (a request to use storage), and "claim" refers to that act of applying. A PVC is matched against suitable PVs according to its requirements, and once the binding succeeds the tenant (the Pod) gets exclusive use of that piece of storage.
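
A minimal sketch of that flow (names, size, and StorageClass behaviour are illustrative):

# the "rental application": a 1Gi claim
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF

# once a matching PV exists (or a default StorageClass provisions one),
# STATUS moves from Pending to Bound and the claim is tied to exactly one PV
kubectl get pvc demo-claim
kubectl get pv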

5. Get information on all Pending pods

kubectl get pods -A --field-selector=status.phase=Pending --no-headers | \
while read -r namespace pod _; do
  echo "Namespace: $namespace, Pod: $pod"
  kubectl describe pod -n "$namespace" "$pod" | \
    grep -A10 -E "Events|Warning|Failed|Error|FailedScheduling"
  echo "--------------------------"
done


Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-0
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0"
--------------------------
Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-1
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1"
--------------------------
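
In this particular output the scheduler is blocked on PersistentVolumeClaims that cannot be found, so the natural next step is to look at the claims and the StorageClass in that namespace:

kubectl get pvc -n kubesphere-monitoring-system
kubectl get storageclass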