
[kubernetes/kubernetes] Implement the integration tests for requeueing scenarios

tozastation

source issue
https://github.com/kubernetes/kubernetes/issues/122305

  • check the scheduler's requeueing scenario in the integration test

QueueingHint KEP

QueueingHint

The scheduler gains a new functionality called QueueingHint, which collects suggestions from each plugin about how to requeue Pods. It helps reduce useless scheduling retries and thus improves scheduling throughput.
Also, by making it possible to skip backoff in appropriate cases, it shortens the time it takes to schedule Pods that use dynamic resource allocation.

Code Point

https://github.com/kubernetes/kubernetes/blob/746f08a8da0a45e8b1a46362822db05cb11ed294/test/integration/scheduler/queue_test.go#L188-L226

  • add NodeAffinity integration test for requeueing scenarios

Author Memo

  • What I want to make sure (eventually) is that all plugins take preCheck into consideration in EventsToRegister (+ QueueingHint)

https://github.com/tozastation/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L607-L621
https://github.com/tozastation/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L627-L658
https://github.com/tozastation/kubernetes/blob/master/staging/src/k8s.io/component-helpers/scheduling/corev1/nodeaffinity/nodeaffinity.go#L306-L319
https://github.com/tozastation/kubernetes/blob/master/staging/src/k8s.io/component-helpers/scheduling/corev1/helpers.go#L78-L86

https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L70-L83
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L86-L90
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L94-L120

  • So, for example, for NodeAffinity, we can just have two scenarios: one in which the Pod is supposed to be requeued to activeQ/backoffQ (= a node satisfying the node affinity is created), and one in which it is not supposed to be requeued (= a node unrelated to the node affinity is created). There is no need to cover other minor scenarios (e.g., one where addedNodeSelector comes into play, etc.)
  • Plus, it'd be great if we could have a scenario that covers the preCheck path.
    For example:
    (1) A Pod with node affinity is created.
    (2) It is rejected because no node satisfies the node affinity.
    (3) We create a new node that satisfies the node affinity, but has a taint.
    (4) The Node add event from (3) should be filtered out by preCheck.
    (5) We remove the taint from the node.
    (6) The Node update event from (5) requeues the Pod.
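The six steps above can be sketched as a pure-Go simulation. The types and function names here (`Node`, `Pod`, `preCheck`, `shouldRequeue`) are hypothetical simplifications, not the real scheduler APIs — the actual integration test drives an apiserver and the real NodeAffinity plugin:

```go
package main

import "fmt"

// Simplified stand-ins for the real scheduler types (hypothetical).
type Node struct {
	Labels map[string]string
	Taints []string
}

type Pod struct {
	NodeSelector map[string]string
	Tolerations  map[string]bool
}

// matchesSelector mimics the NodeAffinity/NodeSelector match.
func matchesSelector(p Pod, n Node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// preCheck mimics the scheduler's preCheckForNode: it filters out nodes the
// pod cannot run on for basic reasons such as untolerated taints, before any
// plugin's QueueingHint gets a chance to run.
func preCheck(p Pod, n Node) bool {
	for _, t := range n.Taints {
		if !p.Tolerations[t] {
			return false
		}
	}
	return true
}

// shouldRequeue models steps (3)-(6): the node event reaches the QueueingHint
// only if preCheck passes, and the hint then checks the node affinity.
func shouldRequeue(p Pod, n Node) bool {
	if !preCheck(p, n) {
		return false // event filtered out; the Pod stays in the unschedulable pool
	}
	return matchesSelector(p, n)
}

func main() {
	pod := Pod{NodeSelector: map[string]string{"zone": "a"}}

	// (3)-(4) node satisfies the affinity but carries a taint: filtered by preCheck.
	tainted := Node{Labels: map[string]string{"zone": "a"}, Taints: []string{"NoSchedule"}}
	fmt.Println("tainted node requeues pod:", shouldRequeue(pod, tainted))

	// (5)-(6) taint removed: the update event now requeues the Pod.
	tainted.Taints = nil
	fmt.Println("untainted node requeues pod:", shouldRequeue(pod, tainted))
}
```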

Test Scenario Assumption

  • A Pod rejected by the NodeAffinity plugin is requeued when a new Node is created and becomes ready
    • Required Node Affinity or Node Selector case

https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/


PreEnqueue

These plugins are called prior to adding Pods to the internal active queue, where Pods are marked as ready for scheduling.
Only when all PreEnqueue plugins return Success, the Pod is allowed to enter the active queue. Otherwise, it's placed in the internal unschedulable Pods list, and doesn't get an Unschedulable condition.

PreEnqueue is always executed before a Pod is added to kube-scheduler's internal queue (more precisely, the activeQ). If a PreEnqueue extension point returns a status other than Success, the Pod is not added to the internal queue, and as a result no scheduling is attempted for it.
https://qiita.com/everpeace/items/8ffce0195f7d8371c54f
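The PreEnqueue gating described above can be modeled with a minimal sketch. The names (`Status`, `PreEnqueuePlugin`, `gate`) are stand-ins for the real framework interfaces, not the upstream API:

```go
package main

import "fmt"

// Status is a simplified stand-in for framework.Status.
type Status int

const (
	Success Status = iota
	Unschedulable
)

// PreEnqueuePlugin is a hypothetical, simplified plugin signature.
type PreEnqueuePlugin func(pod string) Status

// gate models the PreEnqueue extension point: the Pod enters activeQ only if
// every PreEnqueue plugin returns Success; otherwise it goes to the internal
// unschedulable Pods list (and, per the docs, gets no Unschedulable condition).
func gate(pod string, plugins []PreEnqueuePlugin) string {
	for _, pl := range plugins {
		if pl(pod) != Success {
			return "unschedulablePods"
		}
	}
	return "activeQ"
}

func main() {
	allow := func(string) Status { return Success }
	deny := func(string) Status { return Unschedulable }

	fmt.Println(gate("pod-a", []PreEnqueuePlugin{allow, allow})) // activeQ
	fmt.Println(gate("pod-b", []PreEnqueuePlugin{allow, deny}))  // unschedulablePods
}
```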

func (pl *NodeAffinity) EventsToRegister(_ context.Context) ([]framework.ClusterEventWithHint, error) {
	return []framework.ClusterEventWithHint{
		{Event: framework.ClusterEvent{
			Resource: framework.Node, ActionType: framework.Add | framework.Update,
		}, QueueingHintFn: pl.isSchedulableAfterNodeChange},
	}, nil
}
type ClusterEventWithHint struct {
	Event ClusterEvent
	// QueueingHintFn is executed for the plugin rejected by this plugin when the above Event happens,
	// and filters out events to reduce useless retry of Pod's scheduling.
	// It's an optional field. If not set,
	// the scheduling of Pods will be always retried with backoff when this Event happens.
	// (the same as Queue)
	QueueingHintFn QueueingHintFn
}
// ClusterEvent abstracts how a system resource's state gets changed.
// Resource represents the standard API resources such as Pod, Node, etc.
// ActionType denotes the specific change such as Add, Update or Delete.
type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	// Label describes this cluster event, only used in logging and metrics.
	Label string
}
// Constants for ActionTypes.
const (
	Add ActionType = 1 << iota
	Delete

	// UpdateNodeXYZ is only applicable for Node events.
	// If you use UpdateNodeXYZ,
	// your plugin's QueueingHint is only executed for the specific sub-Update event.
	// It's better to narrow down the scope of the event by using them instead of just using Update event
	// for better performance in requeueing.
	UpdateNodeAllocatable
	UpdateNodeLabel
	UpdateNodeTaint
	UpdateNodeCondition
	UpdateNodeAnnotation

	// UpdatePodXYZ is only applicable for Pod events.
	// If you use UpdatePodXYZ,
	// your plugin's QueueingHint is only executed for the specific sub-Update event.
	// It's better to narrow down the scope of the event by using them instead of Update event
	// for better performance in requeueing.
	UpdatePodLabel
	// UpdatePodScaleDown is an update for pod's scale down (i.e., any resource request is reduced).
	UpdatePodScaleDown

	// updatePodOther is an update for pod's other fields.
	// It's used only for the internal event handling, and thus unexported.
	updatePodOther

	All ActionType = 1<<iota - 1

	// Use the general Update type if you don't either know or care the specific sub-Update type to use.
	Update = UpdateNodeAllocatable | UpdateNodeLabel | UpdateNodeTaint | UpdateNodeCondition | UpdateNodeAnnotation | UpdatePodLabel | UpdatePodScaleDown | updatePodOther
)
var (
	// AssignedPodAdd is the event when an assigned pod is added.
	AssignedPodAdd = ClusterEvent{Resource: Pod, ActionType: Add, Label: "AssignedPodAdd"}
	// NodeAdd is the event when a new node is added to the cluster.
	NodeAdd = ClusterEvent{Resource: Node, ActionType: Add, Label: "NodeAdd"}
	// NodeDelete is the event when a node is deleted from the cluster.
	NodeDelete = ClusterEvent{Resource: Node, ActionType: Delete, Label: "NodeDelete"}
	// AssignedPodUpdate is the event when an assigned pod is updated.
	AssignedPodUpdate = ClusterEvent{Resource: Pod, ActionType: Update, Label: "AssignedPodUpdate"}
	// UnscheduledPodAdd is the event when an unscheduled pod is added.
	UnscheduledPodAdd = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodAdd"}
	// UnscheduledPodUpdate is the event when an unscheduled pod is updated.
	UnscheduledPodUpdate = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodUpdate"}
	// UnscheduledPodDelete is the event when an unscheduled pod is deleted.
	UnscheduledPodDelete = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodDelete"}
	// assignedPodOtherUpdate is the event when an assigned pod got updated in fields that are not covered by UpdatePodXXX.
	assignedPodOtherUpdate = ClusterEvent{Resource: Pod, ActionType: updatePodOther, Label: "AssignedPodUpdate"}
	// AssignedPodDelete is the event when an assigned pod is deleted.
	AssignedPodDelete = ClusterEvent{Resource: Pod, ActionType: Delete, Label: "AssignedPodDelete"}
	// PodRequestScaledDown is the event when a pod's resource request is scaled down.
	PodRequestScaledDown = ClusterEvent{Resource: Pod, ActionType: UpdatePodScaleDown, Label: "PodRequestScaledDown"}
	// PodLabelChange is the event when a pod's label is changed.
	PodLabelChange = ClusterEvent{Resource: Pod, ActionType: UpdatePodLabel, Label: "PodLabelChange"}
	// NodeSpecUnschedulableChange is the event when unschedulable node spec is changed.
	NodeSpecUnschedulableChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeTaint, Label: "NodeSpecUnschedulableChange"}
	// NodeAllocatableChange is the event when node allocatable is changed.
	NodeAllocatableChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeAllocatable, Label: "NodeAllocatableChange"}
	// NodeLabelChange is the event when node label is changed.
	NodeLabelChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeLabel, Label: "NodeLabelChange"}
	// NodeAnnotationChange is the event when node annotation is changed.
	NodeAnnotationChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeAnnotation, Label: "NodeAnnotationChange"}
	// NodeTaintChange is the event when node taint is changed.
	NodeTaintChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeTaint, Label: "NodeTaintChange"}
	// NodeConditionChange is the event when node condition is changed.
	NodeConditionChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeCondition, Label: "NodeConditionChange"}
	// PvAdd is the event when a persistent volume is added in the cluster.
	PvAdd = ClusterEvent{Resource: PersistentVolume, ActionType: Add, Label: "PvAdd"}
	// PvUpdate is the event when a persistent volume is updated in the cluster.
	PvUpdate = ClusterEvent{Resource: PersistentVolume, ActionType: Update, Label: "PvUpdate"}
	// PvcAdd is the event when a persistent volume claim is added in the cluster.
	PvcAdd = ClusterEvent{Resource: PersistentVolumeClaim, ActionType: Add, Label: "PvcAdd"}
	// PvcUpdate is the event when a persistent volume claim is updated in the cluster.
	PvcUpdate = ClusterEvent{Resource: PersistentVolumeClaim, ActionType: Update, Label: "PvcUpdate"}
	// StorageClassAdd is the event when a StorageClass is added in the cluster.
	StorageClassAdd = ClusterEvent{Resource: StorageClass, ActionType: Add, Label: "StorageClassAdd"}
	// StorageClassUpdate is the event when a StorageClass is updated in the cluster.
	StorageClassUpdate = ClusterEvent{Resource: StorageClass, ActionType: Update, Label: "StorageClassUpdate"}
	// CSINodeAdd is the event when a CSI node is added in the cluster.
	CSINodeAdd = ClusterEvent{Resource: CSINode, ActionType: Add, Label: "CSINodeAdd"}
	// CSINodeUpdate is the event when a CSI node is updated in the cluster.
	CSINodeUpdate = ClusterEvent{Resource: CSINode, ActionType: Update, Label: "CSINodeUpdate"}
	// CSIDriverAdd is the event when a CSI driver is added in the cluster.
	CSIDriverAdd = ClusterEvent{Resource: CSIDriver, ActionType: Add, Label: "CSIDriverAdd"}
	// CSIDriverUpdate is the event when a CSI driver is updated in the cluster.
	CSIDriverUpdate = ClusterEvent{Resource: CSIDriver, ActionType: Update, Label: "CSIDriverUpdate"}
	// CSIStorageCapacityAdd is the event when a CSI storage capacity is added in the cluster.
	CSIStorageCapacityAdd = ClusterEvent{Resource: CSIStorageCapacity, ActionType: Add, Label: "CSIStorageCapacityAdd"}
	// CSIStorageCapacityUpdate is the event when a CSI storage capacity is updated in the cluster.
	CSIStorageCapacityUpdate = ClusterEvent{Resource: CSIStorageCapacity, ActionType: Update, Label: "CSIStorageCapacityUpdate"}
	// WildCardEvent semantically matches all resources on all actions.
	WildCardEvent = ClusterEvent{Resource: WildCard, ActionType: All, Label: "WildCardEvent"}
	// UnschedulableTimeout is the event when a pod stays in unschedulable for longer than timeout.
	UnschedulableTimeout = ClusterEvent{Resource: WildCard, ActionType: All, Label: "UnschedulableTimeout"}
)
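To see how the ActionType bit flags above compose, here is a reduced, self-contained copy of just the flag arithmetic (the GVK/Label machinery and the event-handler wiring are omitted; `match` is a hypothetical helper naming the bitwise test the handlers use):

```go
package main

import "fmt"

// ActionType mirrors the bit-flag scheme from the scheduler framework.
type ActionType int64

const (
	Add ActionType = 1 << iota // 1
	Delete

	// Node sub-Update flags.
	UpdateNodeAllocatable
	UpdateNodeLabel
	UpdateNodeTaint
	UpdateNodeCondition
	UpdateNodeAnnotation

	// Pod sub-Update flags.
	UpdatePodLabel
	UpdatePodScaleDown
	updatePodOther

	// All covers every bit defined above: 1<<10 - 1 = 1023.
	All ActionType = 1<<iota - 1

	// Update is the union of all sub-Update flags.
	Update = UpdateNodeAllocatable | UpdateNodeLabel | UpdateNodeTaint |
		UpdateNodeCondition | UpdateNodeAnnotation | UpdatePodLabel |
		UpdatePodScaleDown | updatePodOther
)

// match reports whether a plugin registered for `registered` actions should be
// notified of an event of type `event` — a simple bitwise intersection.
func match(registered, event ActionType) bool {
	return registered&event != 0
}

func main() {
	// A plugin registered for Add|Update sees node taint updates...
	fmt.Println(match(Add|Update, UpdateNodeTaint)) // true
	// ...but a plugin registered only for Delete does not.
	fmt.Println(match(Delete, UpdateNodeTaint)) // false
	// All matches everything, including the unexported updatePodOther.
	fmt.Println(match(All, Add), match(All, updatePodOther)) // true true
}
```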
  • util.As
    • return oldTyped, newTyped, nil
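The util.As helper noted above can be sketched roughly as follows. This is a generics-based approximation of its behavior (casting the untyped old/new objects handed to a QueueingHintFn, tolerating a nil oldObj for Add events), not the exact upstream implementation:

```go
package main

import "fmt"

// As casts oldObj and newObj to the concrete type T. A nil input is allowed
// (e.g. oldObj is nil for an Add event) and yields the zero value of T.
func As[T any](oldObj, newObj interface{}) (T, T, error) {
	var oldTyped, newTyped T
	if newObj != nil {
		var ok bool
		if newTyped, ok = newObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("unexpected type of newObj: %T", newObj)
		}
	}
	if oldObj != nil {
		var ok bool
		if oldTyped, ok = oldObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("unexpected type of oldObj: %T", oldObj)
		}
	}
	return oldTyped, newTyped, nil
}

// node is a toy stand-in for *v1.Node.
type node struct{ name string }

func main() {
	// Add event: oldObj is nil, only newObj carries the node.
	oldN, newN, err := As[*node](nil, &node{name: "node-1"})
	fmt.Println(oldN == nil, newN.name, err) // true node-1 <nil>
}
```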

// NodeAffinity is a plugin that checks if a pod node selector matches the node label.
type NodeAffinity struct {
	handle              framework.Handle
	addedNodeSelector   *nodeaffinity.NodeSelector
	addedPrefSchedTerms *nodeaffinity.PreferredSchedulingTerms
}
type RequiredNodeAffinity struct {
	labelSelector labels.Selector
	nodeSelector  *LazyErrorNodeSelector
}
// isSchedulableAfterNodeChange is invoked whenever a node changed. It checks whether
// that change made a previously unschedulable pod schedulable.
func (pl *NodeAffinity) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	_, modifiedNode, err := util.As[*v1.Node](oldObj, newObj)
	if err != nil {
		return framework.Queue, err
	}

	if pl.addedNodeSelector != nil && !pl.addedNodeSelector.Match(modifiedNode) {
		logger.V(4).Info("added or modified node didn't match scheduler-enforced node affinity and this event won't make the Pod schedulable", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
		return framework.QueueSkip, nil
	}

	requiredNodeAffinity := nodeaffinity.GetRequiredNodeAffinity(pod)
	isMatched, err := requiredNodeAffinity.Match(modifiedNode)
	if err != nil {
		return framework.Queue, err
	}
	if isMatched {
		logger.V(4).Info("node was created or updated, and matches with the pod's NodeAffinity", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
		return framework.Queue, nil
	}

	// TODO: also check if the original node meets the pod's requirements once preCheck is completely removed.
	// See: https://github.com/kubernetes/kubernetes/issues/110175

	logger.V(4).Info("node was created or updated, but it doesn't make this pod schedulable", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
	return framework.QueueSkip, nil
}

scheduler: move all preCheck to QueueingHint
kubernetes/kubernetes#110175

So, I think we should simply inform the queue of all events and let the queue move unschedulable Pods with those events.

As the preCheckForNode check will just pre-verify the basic
checks which are orthogonal with PodTopologySpread.

to define "which kind of resource's events should be ignored" in plugin side.


To summarize, the following cases don't properly re-queue relevant pods:

  1. Assigned Pod's add/update events
  2. Non-scheduled Pod's add/update/delete events
  3. Node's add/update/delete events

But I still believe that all logic related to a plugin should be controllable from the plugin side, and we should not create something that can be achieved in in-tree plugins but not in out-of-tree plugins.


Currently, NodeAdded QueueingHint could not always be called because of the internal feature called preCheck.
It's definitely not something expected for plugin developers,
and we're trying to eventually remove preCheck completely to fix this.
Until then we'll register UpdateNodeTaint event for plugins that have NodeAdded event, but don't have UpdateNodeTaint event.
It'd result in a bad impact on the requeuing efficiency though, a lot better than some Pods being stuck in the
unschedulable pod pool.


Run -> ScheduleOne or handleBindingCycleError -> sched.FailureHandler -> handleSchedulingFailure -> AddUnschedulableIfNotPresent -> requeuePodViaQueueingHint or isPodWorthRequeueing

// isPodWorthRequeuing calls QueueingHintFn of only plugins registered in pInfo.unschedulablePlugins and pInfo.PendingPlugins.
//
// If any of pInfo.PendingPlugins return Queue,
// the scheduling queue is supposed to enqueue this Pod to activeQ, skipping backoffQ.
// If any of pInfo.unschedulablePlugins return Queue,
// the scheduling queue is supposed to enqueue this Pod to activeQ/backoffQ depending on the remaining backoff time of the Pod.
// If all QueueingHintFns returns Skip, the scheduling queue enqueues the Pod back to unschedulable Pod pool
// because no plugin changes the scheduling result via the event.

https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/backend/queue/scheduling_queue.go#L393-L475
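The enqueue decision described in the isPodWorthRequeuing comment above can be modeled with a small sketch. The types and the `decide` helper are hypothetical simplifications of the logic in requeuePodViaQueueingHint, not the real queue code:

```go
package main

import "fmt"

// Hint is a simplified stand-in for framework.QueueingHint.
type Hint int

const (
	QueueSkip Hint = iota
	Queue
)

// decide returns the queue a rejected Pod moves to when an event fires,
// given the hints from its pending and unschedulable plugins and whether
// its backoff period has already expired:
//   - any pending plugin returning Queue  -> activeQ (backoff skipped)
//   - any unschedulable plugin returning Queue -> activeQ or backoffQ,
//     depending on the remaining backoff time
//   - all hints Skip -> back to the unschedulable Pod pool
func decide(pendingHints, unschedHints []Hint, backoffExpired bool) string {
	for _, h := range pendingHints {
		if h == Queue {
			return "activeQ"
		}
	}
	for _, h := range unschedHints {
		if h == Queue {
			if backoffExpired {
				return "activeQ"
			}
			return "backoffQ"
		}
	}
	return "unschedulablePods"
}

func main() {
	fmt.Println(decide([]Hint{Queue}, nil, false))     // activeQ
	fmt.Println(decide(nil, []Hint{Queue}, false))     // backoffQ
	fmt.Println(decide(nil, []Hint{QueueSkip}, true))  // unschedulablePods
}
```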