容器技术 k8s pod 优雅退出

携手创作，共同成长！这是我参与「日新计划 8 月更文挑战」的第28天。

Kubernetes 提供了一种 Pod 优雅退出机制，使 Pod 在退出前可以完成一些清理工作。但若执行清理工作时出错了，Pod 能正常退出吗？多久能退出？退出时间可以指定吗？系统有默认参数吗？这其中有若干细节值得我们去注意，本文就从这些细节出发，梳理清楚每种情况下 Kubernetes 的组件的各项行为及其参数设定。

Pod 正常退出

Pod 正常退出是指非被驱逐时退出，包括人为删除、执行出错被删除等。在 Pod 退出时，kubelet 删除容器之前，会先执行 pod 的 preStop，允许 pod 在退出前执行一段脚本用以清除必要的资源等。然而 preStop 也有执行失败或者直接 hang 住的情况，这个时候 preStop 并不会阻止 pod 的退出，kubelet 也不会重复执行，而是会等一段时间，超过这个时间会直接删除容器，保证整个系统的稳定。

整个过程在函数 killContainer 中，我们在 pod 优雅退出时，需要明确的是，kubelet 的等待时间由那几个因素决定，用户可以设置的字段和系统组件的参数是如何共同作用的。

gracePeriod

kubelet 计算 gracePeriod 的过程为：

如果 pod 的 DeletionGracePeriodSeconds 不为 nil，表示是 ApiServer 删除的，gracePeriod 直接取值；
如果 pod 的 Spec.TerminationGracePeriodSeconds 不为 nil，再看 pod 删除的原因是什么；
若删除原因为执行 startupProbe 失败，gracePeriod 取值为 startupProbe 中设置的 TerminationGracePeriodSeconds
若删除原因为执行 livenessProbe 失败，gracePeriod 取值为 livenessProbe 中设置的 TerminationGracePeriodSeconds

获得到 gracePeriod 之后，kubelet 执行 pod 的 preStop，函数 executePreStopHook 中会起一个 goroutine ，并计算其执行的时间，gracePeriod 再减去该时间，就是最终传给 runtime 的删除容器的 timeout 时间。所以，若我们设置了 pod preStop，需要同时考虑到 preStop 的执行时间以及容器退出的时间，可以给 TerminationGracePeriodSeconds 设置一个大于 preStop + 容器退出的时间。

func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, reason containerKillReason, gracePeriodOverride *int64) error {
   ...
 // From this point, pod and container must be non-nil.
 gracePeriod := int64(minimumGracePeriodInSeconds)
 switch {
 case pod.DeletionGracePeriodSeconds != nil:
  gracePeriod = *pod.DeletionGracePeriodSeconds
 case pod.Spec.TerminationGracePeriodSeconds != nil:
  gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
  switch reason {
  case reasonStartupProbe:
   if containerSpec.StartupProbe != nil && containerSpec.StartupProbe.TerminationGracePeriodSeconds != nil {
    gracePeriod = *containerSpec.StartupProbe.TerminationGracePeriodSeconds
   }
  case reasonLivenessProbe:
   if containerSpec.LivenessProbe != nil && containerSpec.LivenessProbe.TerminationGracePeriodSeconds != nil {
    gracePeriod = *containerSpec.LivenessProbe.TerminationGracePeriodSeconds
   }
  }
 }
 // Run internal pre-stop lifecycle hook
 if err := m.internalLifecycle.PreStopContainer(containerID.ID); err != nil {
  return err
 }
 // Run the pre-stop lifecycle hooks if applicable and if there is enough time to run it
 if containerSpec.Lifecycle != nil && containerSpec.Lifecycle.PreStop != nil && gracePeriod > 0 {
  gracePeriod = gracePeriod - m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)
 }
 // always give containers a minimal shutdown window to avoid unnecessary SIGKILLs
 if gracePeriod < minimumGracePeriodInSeconds {
  gracePeriod = minimumGracePeriodInSeconds
 }
 if gracePeriodOverride != nil {
  gracePeriod = *gracePeriodOverride
 }
 err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
   ...
 return nil
}

gracePeriodOverride

在上面分析的过程中，kubelet 调用 runtime 接口之前，会再判断一步 gracePeriodOverride，若传进来的值不为空，直接用该值覆盖前面的 gracePeriod。

kubelet 计算 gracePeriodOverride 的主要过程如下：

取值 pod 的 DeletionGracePeriodSeconds ；
若 kubelet 是在驱逐 pod，则用驱逐的设置 pod 退出时间覆盖；

func calculateEffectiveGracePeriod(status *podSyncStatus, pod *v1.Pod, options *KillPodOptions) (int64, bool) {
 gracePeriod := status.gracePeriod
 // this value is bedrock truth - the apiserver owns telling us this value calculated by apiserver
 if override := pod.DeletionGracePeriodSeconds; override != nil {
  if gracePeriod == 0 || *override < gracePeriod {
   gracePeriod = *override
  }
 }
 // we allow other parts of the kubelet (namely eviction) to request this pod be terminated faster
 if options != nil {
  if override := options.PodTerminationGracePeriodSecondsOverride; override != nil {
   if gracePeriod == 0 || *override < gracePeriod {
    gracePeriod = *override
   }
  }
 }
 // make a best effort to default this value to the pod's desired intent, in the event
 // the kubelet provided no requested value (graceful termination?)
 if gracePeriod == 0 && pod.Spec.TerminationGracePeriodSeconds != nil {
  gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
 }
 // no matter what, we always supply a grace period of 1
 if gracePeriod < 1 {
  gracePeriod = 1
 }
 return gracePeriod, status.gracePeriod != 0 && status.gracePeriod != gracePeriod
}

ApiServer 的行为

在上面分析 kubelet 处理 pod 的退出时间时，我们会发现 kubelet 会首先用 pod 的 DeletionGracePeriodSeconds，而该值正是 ApiServer 删除 pod 时写入的。本节我们来分析 ApiServer 删除 pod 时的行为。

ApiServer 中计算 pod 的 GracePeriodSeconds 过程为：

若 options.GracePeriodSeconds 不为空，则设置为该值；否则设置为 spec 中用户指定的 Spec.TerminationGracePeriodSeconds（默认为 30s）；
若 pod 未被调度或已经退出，则设置为 0，即立即删除；

其中，options.GracePeriodSeconds 为 kubectl 删除 pod 时，可以指定的参数 --grace-period；或者程序里调用 ApiServer 接口时指定的参数，如 client-go 中的 DeleteOptions.GracePeriodSeconds。

func (podStrategy) CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool {
 if options == nil {
  return false
 }
 pod := obj.(*api.Pod)
 period := int64(0)
 // user has specified a value
 if options.GracePeriodSeconds != nil {
  period = *options.GracePeriodSeconds
 } else {
  // use the default value if set, or deletes the pod immediately (0)
  if pod.Spec.TerminationGracePeriodSeconds != nil {
   period = *pod.Spec.TerminationGracePeriodSeconds
  }
 }
 // if the pod is not scheduled, delete immediately
 if len(pod.Spec.NodeName) == 0 {
  period = 0
 }
 // if the pod is already terminated, delete immediately
 if pod.Status.Phase == api.PodFailed || pod.Status.Phase == api.PodSucceeded {
  period = 0
 }
 if period < 0 {
  period = 1
 }
 // ensure the options and the pod are in sync
 options.GracePeriodSeconds = &period
 return true
}

kubelet 驱逐 pod

另外，在 kubelet 驱逐 pod 时，pod 的优雅退出时间是被覆盖的。

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
   ...
 // we kill at most a single pod during each eviction interval
 for i := range activePods {
  pod := activePods[i]
  gracePeriodOverride := int64(0)
  if !isHardEvictionThreshold(thresholdToReclaim) {
   gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
  }
  message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
  if m.evictPod(pod, gracePeriodOverride, message, annotations) {
   metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
   return []*v1.Pod{pod}
  }
 }
 ...
}

其 override 值为 EvictionMaxPodGracePeriod，且只有软驱逐时有效，该值为 kubelet 的驱逐相关的配置参数：

// Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}.
 EvictionHard map[string]string
 // Map of signal names to quantities that defines soft eviction thresholds.  For example: {"memory.available": "300Mi"}.
 EvictionSoft map[string]string
 // Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {"memory.available": "30s"}.
 EvictionSoftGracePeriod map[string]string
 // Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
 EvictionPressureTransitionPeriod metav1.Duration
 // Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
 EvictionMaxPodGracePeriod int32

kubelet 驱逐 pod 的函数是启动时注入的，函数如下：

func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
 return func(pod *v1.Pod, isEvicted bool, gracePeriodOverride *int64, statusFn func(*v1.PodStatus)) error {
  // determine the grace period to use when killing the pod
  gracePeriod := int64(0)
  if gracePeriodOverride != nil {
   gracePeriod = *gracePeriodOverride
  } else if pod.Spec.TerminationGracePeriodSeconds != nil {
   gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
  }
  // we timeout and return an error if we don't get a callback within a reasonable time.
  // the default timeout is relative to the grace period (we settle on 10s to wait for kubelet->runtime traffic to complete in sigkill)
  timeout := int64(gracePeriod + (gracePeriod / 2))
  minTimeout := int64(10)
  if timeout < minTimeout {
   timeout = minTimeout
  }
  timeoutDuration := time.Duration(timeout) * time.Second
  // open a channel we block against until we get a result
  ch := make(chan struct{}, 1)
  podWorkers.UpdatePod(UpdatePodOptions{
   Pod:        pod,
   UpdateType: kubetypes.SyncPodKill,
   KillPodOptions: &KillPodOptions{
    CompletedCh:                              ch,
    Evict:                                    isEvicted,
    PodStatusFunc:                            statusFn,
    PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
   },
  })
  // wait for either a response, or a timeout
  select {
  case <-ch:
   return nil
  case <-time.After(timeoutDuration):
   recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
   return fmt.Errorf("timeout waiting to kill pod")
  }
 }
}

killPodNow 函数是 kubelet 在驱逐 pod 时所调用的函数，gracePeriodOverride 为软驱逐时设置的参数，当其没有设置时，gracePeriod 依然取值 pod.Spec.TerminationGracePeriodSeconds。然后该函数会调用 podWorkers.UpdatePod，传入相应参数，并且设置一个跟 gracePeriod 相关的超时时间，等待其返回。

总结

Pod 的优雅退出是由 preStop 实现的，本文就 Pod 正常退出和被驱逐时，Pod 的退出时间受哪些因素影响，各参数之间是如何相互作用的做了简要的分析。了解了这些细节后，我们对 Pod 的退出流程就有了一个更加全面的认知。

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。

容器技术 k8s pod 优雅退出

Pod 正常退出

gracePeriod

gracePeriodOverride

ApiServer 的行为

kubelet 驱逐 pod

总结

评论(0)

提示：请文明发言取消回复

近期文章

近期评论

容器技术 k8s pod 优雅退出

Pod 正常退出

gracePeriod

gracePeriodOverride

ApiServer 的行为

kubelet 驱逐 pod

总结

评论(0)

提示：请文明发言 取消回复

近期文章

近期评论

提示：请文明发言取消回复