A Pod is the smallest deployable unit of compute in Kubernetes. It runs user-specified workloads by managing one or more containers. Pods aren’t meant for direct use, as they provide very limited resilience and are considered ephemeral. That’s why their management is usually delegated to Workload Resources.

Designed to host tightly coupled workloads, Pods provide shared Networking and Storage, allowing an application to run alongside its supporting services.

Note

Each Pod is meant to run a single instance of a given application.

Workload Resources

As Pods don’t provide any HA mechanisms such as load balancing, redundancy, or scaling, they are used through higher-level primitives called workload resources. Examples include Deployment, StatefulSet, DaemonSet, and Job. Each of these primitives is managed by its respective controller (see k8s controllers). For a controller to manage Pods, however, a Pod Template has to be defined under the workload resource’s spec.template field.
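As a sketch, a minimal Deployment (names and image are illustrative) embeds the Pod Template under spec.template and pairs it with a matching spec.selector:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment    # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx            # must match the template's labels
  template:                 # the Pod Template: a Pod spec without kind/apiVersion
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: docker.io/library/nginx:1.25
```

The Deployment’s controller uses this template to stamp out (and, when needed, recreate) the three replica Pods.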

Pod Templates

The pod template is a complete Pod specification, lacking only the kind1 and apiVersion fields. This design may raise two questions:

  1. Why is it mandatory to specify the spec.selector field when the labels are apparent in the spec.template.metadata field; can’t k8s infer them?

This behaviour stems from how Kubernetes handles object deltas. As a 3-way diff, the strategic merge strategy of kubectl apply compares the current state of an object (stored in etcd), the last state of the object (written by kubectl, before sending the request to the kube-apiserver, under the kubectl.kubernetes.io/last-applied-configuration annotation), and the desired state. And since default values would be inferred server-side, the client (kubectl) cannot see them in the annotation as it only stores the last version of the desired state, i.e. the user-defined specification (client-side). Thus, the label selector is enforced and immutable to prevent edge cases. See Issue #26202, as well as Issue #15894, for more information.

  2. Since the label selector2 is mandatory, why is it necessary to embed the Pod definition under the spec.template field, instead of referencing existing Pod objects?

Embedding the Pod specification directly within the workload resource ensures the controller always has a template to recreate Pods when necessary.

Pod replacement and update

Upon updating a pod template, the workload resource will not patch the already existing Pods, rather it will replace them. The update strategy differs depending on the workload resource.

However, some Pod fields can still be updated via specific sub-resources3, such as status or ephemeralcontainers.

Lifecycle

In their lifetime, Pods are assigned to a node only once. The kube-scheduler determines which Node is most suitable for a Pod (scheduling), then the Pod is assigned to a Node, which triggers the kubelet to create containers for it, using a container runtime.

If a Node starts failing, a policy is triggered that terminates its Pods and moves them to a failed state. Neither the Pods awaiting scheduling on that Node, nor the ones already assigned to it, will be rescheduled onto another Node. For that, you would use a workload resource (e.g., Deployment).

Note

Pods are never “rescheduled”; they can be replaced by near-identical instances but in any case, there won’t be two pods with the same UID.

The Pod Status field

Within the k8s api, a Pod has both a specification and a status object4 (PodStatus), reflecting the real status of the Pod. The status field has two important sub-fields:

  • status.phase: A string with a predefined meaning5 which reflects the high-level lifecycle stage a Pod is in. It is not meant to be used as a state machine.

Important

A Pod’s phase is not to be confused with the STATUS column of kubectl get output. The former is a k8s primitive; the latter is simply a client-side (kubectl) convention for readability.

  • status.conditions: A list of boolean assertions6, representing the stages a Pod has or has not gone through. They are set in arbitrary order and could change over time.

Healing

Kubernetes provides basic healing for Pods, based on a Pod’s restart policy. This restart policy has two levels of coverage:

  • Pod-level: By default set to “Always”, this policy is inherited by all containers, except by sidecar containers:
apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
spec:
  restartPolicy: OnFailure # if a container exits with something other than 0, restart the relevant container.
  • Container-level (only with the ContainerRestartRules feature gate enabled): Takes precedence over the above, though sidecars remain a special case. You may also define conditions for the restart policy under restartPolicyRules:
apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never # Required when you have specified rules
    restartPolicyRules:  # Only restart the container if it exits with 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]

Note

The above setup is useful when you want to differentiate between restartable and non-restartable exit codes.

When a container within a Pod fails, the kubelet restarts (replaces) it according to the policy. With each consecutive failure, the kubelet exponentially increases the delay between restarts (doubling it: 10 s, 20 s, 40 s, etc.), capped at 5 min. This is to prevent overload from excessive restarts.
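The backoff schedule can be sketched in a few lines (assuming the default 10 s base and 5 min cap; the kubelet also resets the backoff after a container has run successfully for a while):

```python
# Sketch of the kubelet's restart backoff: the delay starts at 10 s,
# doubles after each consecutive failure, and is capped at 5 minutes.
BASE_DELAY_S = 10
MAX_DELAY_S = 300  # the 5 min cap

def restart_delay(consecutive_failures: int) -> int:
    """Delay (in seconds) before the next restart attempt."""
    return min(BASE_DELAY_S * 2 ** consecutive_failures, MAX_DELAY_S)

delays = [restart_delay(n) for n in range(6)]
# -> [10, 20, 40, 80, 160, 300]
```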

The maximum delay can be modified through the kubelet’s configuration object, passed as a file via the --config flag.
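For example, at the time of writing this requires the alpha KubeletCrashLoopBackOffMax feature gate; assuming it is available, a KubeletConfiguration file along these lines lowers the per-node cap:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletCrashLoopBackOffMax: true  # alpha feature gate (assumed enabled)
crashLoopBackOff:
  maxContainerRestartPeriod: "100s" # new cap, down from the 300 s default
```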

Readiness

In a general setup, a Pod is considered ready when the kubelet has successfully mounted any required storage volumes, has set up a runtime sandbox with networking configured through the relevant CNI plugin, the init containers have exited successfully, and all containers are running.

To determine a pod’s readiness, Kubernetes monitors the container states throughout its lifecycle phases. You can, however, ingest custom readiness data into the conditions field, if your use case requires it. To do so, you need to specify the readinessGates field:

kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready                              # a built-in PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1"        # an extra PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z

If the kubelet doesn’t find a specified gate in the conditions field, it sets the corresponding condition to False. Appending the condition can be done by a custom controller, using the PATCH operation on the PodStatus.
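As a sketch of what such a custom controller would send, the following builds the patch body for the status subresource (the condition type matches the readinessGates example above; a real controller would submit it via a client library, e.g. patch_namespaced_pod_status in the official Python client):

```python
# Hypothetical helper: build the JSON body for a PATCH against a Pod's
# `status` subresource, setting one custom readiness-gate condition.
def readiness_gate_patch(condition_type: str, ready: bool) -> dict:
    return {
        "status": {
            "conditions": [
                {
                    "type": condition_type,            # must match a readinessGate
                    "status": "True" if ready else "False",
                }
            ]
        }
    }

patch = readiness_gate_patch("www.example.com/feature-1", True)
```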

Warning

When using custom readiness conditions, a Pod is considered ready only if all its containers are ready and all custom readiness gates are True. If the containers are ready but one custom condition is missing or False, the kubelet still sets the ContainersReady condition to True; the Pod’s Ready condition, however, remains False.

Note

kubectl doesn’t support patch operations on the status field of Pods.

OS-specific scheduling and configuration

Kubernetes allows you to define OS-specific rules for Pods. For example, the securityContext.runAsNonRoot field determines whether a container within the Pod may run as the root user. You can read more about the related security standards here.
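A minimal sketch (name and image are illustrative); with this securityContext, the kubelet refuses to start any container that would run as UID 0:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: non-root-pod        # illustrative name
spec:
  securityContext:
    runAsNonRoot: true      # reject containers that would run as root
    runAsUser: 1000         # run container processes as this UID
  containers:
  - name: app
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
```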

Note

More low-level configuration is possible through the RuntimeClass in Kubernetes, which allows configuration changes at the runtime level.

Warning

In multi-OS clusters, .spec.os.name does NOT influence scheduling. To prevent Pods from landing on incompatible Nodes, ensure each Node has the correct kubernetes.io/os label. Otherwise, because the kube-scheduler does not validate OS compatibility, a Pod may be scheduled to a Node with an unsupported OS.
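To pin a Pod to Nodes of the right OS, combine .spec.os.name with a nodeSelector on the kubernetes.io/os label (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: linux-only-pod
spec:
  os:
    name: linux              # declares the intended OS (not used for scheduling)
  nodeSelector:
    kubernetes.io/os: linux  # this is what actually constrains scheduling
  containers:
  - name: app
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
```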

Networking and Storage

Pods provide shared networking and storage for their containers.

Networking

Containers within a Pod share the same network namespace, so they can communicate with each other through localhost. Each Pod is assigned a single IP address; when a container needs to reach a container in a different Pod, it uses that Pod’s IP.
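A minimal sketch of localhost communication within a Pod (names and images are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-pod
spec:
  containers:
  - name: web
    image: docker.io/library/nginx:1.25
    ports:
    - containerPort: 80
  - name: probe
    image: docker.io/library/busybox:1.28
    # Reaches the nginx container over the shared network namespace.
    command: ['sh', '-c', 'while true; do wget -qO- http://localhost:80; sleep 10; done']
```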

Volumes

Volumes can be attached to Pods to provide shared storage between containers, and certain volume types can also offer persistence.

Note

Non-persistent volumes share the lifetime of a pod, i.e., deleting/terminating the pod would delete the volume.
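For example, an emptyDir volume (non-persistent, deleted with the Pod) mounted into two containers lets them share files; names and images below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod
spec:
  volumes:
  - name: scratch
    emptyDir: {}             # non-persistent: deleted along with the Pod
  containers:
  - name: writer
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo hello > /data/msg && sleep 3600']
    volumeMounts:
    - name: scratch
      mountPath: /data
  - name: reader
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 5 && cat /data/msg && sleep 3600']
    volumeMounts:
    - name: scratch
      mountPath: /data
```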

Static Pods

Static Pods are Pods that are directly managed by a kubelet, i.e. the k8s API doesn’t oversee7 them; instead, the kubelet monitors their state and takes action when necessary (e.g., restarting them on failure). The kubelet also creates a mirror Pod in the API so that the Pod can still be observed, even though it cannot be controlled through the API. A static Pod always resides on the same Node as its managing kubelet, and its mirror is distinguishable by a name that has the Node’s hostname appended after a hyphen (e.g., etcd-controlplane1).

Note

Remember that the kubelet still uses the CRI to instruct a container runtime to run the static Pods.

Static Pods are mainly used for self-hosted k8s control plane Nodes, as this removes the kube-apiserver as a dependency.

Creating Static Pods

You create a Static Pod by placing a Pod manifest under the staticPodPath directory8 (commonly /etc/kubernetes/manifests). Alternatively, you can configure the kubelet with --manifest-url to pull static Pod manifests from a remote location.
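For example, a plain Pod manifest dropped into the staticPodPath directory is picked up by the kubelet automatically (the path and names below are illustrative, assuming the common default):

```yaml
# /etc/kubernetes/manifests/static-web.yaml (location assumed from staticPodPath)
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: docker.io/library/nginx:1.25
    ports:
    - containerPort: 80
```

Deleting the file causes the kubelet to tear the static Pod down again.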

Footnotes

  1. See k8s kind field.

  2. See Labels and Label Selectors in k8s.

  3. See Sub-resources.

  4. See k8s objects.

  5. See the descriptions of the phases here.

  6. See all conditions types here.

  7. This is meant more for the k8s controllers, responsible for managing the Pods through the API.

  8. This field is specified in the kubelet’s configuration file.