
[kubernetes/kubernetes] Implement the integration tests for requeueing scenarios

tozastation

source issue
https://github.com/kubernetes/kubernetes/issues/122305

  • check the scheduler's requeueing scenario in the integration test

QueueingHint KEP

QueueingHint

The scheduler gains a new functionality called QueueingHint, which collects suggestions from each plugin about how to requeue Pods. It helps reduce useless scheduling retries and thus improves scheduling throughput.
Also, by making it possible to skip backoff in appropriate cases, it shortens the time it takes to schedule Pods that use dynamic resource allocation.

Code Point

https://github.com/kubernetes/kubernetes/blob/746f08a8da0a45e8b1a46362822db05cb11ed294/test/integration/scheduler/queue_test.go#L188-L226

  • add NodeAffinity integration test for requeueing scenarios

Author Memo

  • What I want to make sure (eventually) is that all plugins take preCheck into consideration in EventsToRegister (+ QueueingHint)

https://github.com/tozastation/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L607-L621
https://github.com/tozastation/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L627-L658
https://github.com/tozastation/kubernetes/blob/master/staging/src/k8s.io/component-helpers/scheduling/corev1/nodeaffinity/nodeaffinity.go#L306-L319
https://github.com/tozastation/kubernetes/blob/master/staging/src/k8s.io/component-helpers/scheduling/corev1/helpers.go#L78-L86

https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/eventhandlers.go#L70-L83
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L86-L90
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L94-L120

  • So, for example, for NodeAffinity, we can just have two scenarios: one in which the Pod is supposed to be requeued to activeQ/backoffQ (= a node satisfying the node affinity is created), and one in which it is not supposed to be requeued (= a node unrelated to the node affinity is created). There is no need to cover other minor scenarios (e.g., one where addedNodeSelector comes into play, etc.)
  • Plus, it'd be great if we could have a scenario that covers the preCheck path.
    For example:
    (1) A Pod with node affinity is created.
    (2) It is rejected because no node satisfies the node affinity.
    (3) We create a new node that satisfies the node affinity, but has a taint.
    (4) The Node add event from (3) should be filtered out by preCheck.
    (5) We remove the taint from the node.
    (6) The Node update event from (5) requeues the Pod.
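The six steps above can be sketched as a pure-Go simulation. The types and function names here (`Node`, `Pod`, `preCheck`, `shouldRequeue`) are hypothetical simplifications, not the real scheduler APIs — the actual integration test drives an apiserver and the real NodeAffinity plugin:

```go
package main

import "fmt"

// Simplified stand-ins for the real scheduler types (hypothetical).
type Node struct {
	Labels map[string]string
	Taints []string
}

type Pod struct {
	NodeSelector map[string]string
	Tolerations  map[string]bool
}

// matchesSelector mimics the NodeAffinity/NodeSelector match.
func matchesSelector(p Pod, n Node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// preCheck mimics the scheduler's preCheckForNode: it filters out nodes the
// pod cannot run on for basic reasons such as untolerated taints, before any
// plugin's QueueingHint gets a chance to run.
func preCheck(p Pod, n Node) bool {
	for _, t := range n.Taints {
		if !p.Tolerations[t] {
			return false
		}
	}
	return true
}

// shouldRequeue models steps (3)-(6): the node event reaches the QueueingHint
// only if preCheck passes, and the hint then checks the node affinity.
func shouldRequeue(p Pod, n Node) bool {
	if !preCheck(p, n) {
		return false // event filtered out; the Pod stays in the unschedulable pool
	}
	return matchesSelector(p, n)
}

func main() {
	pod := Pod{NodeSelector: map[string]string{"zone": "a"}}

	// (3)-(4) node satisfies the affinity but carries a taint: filtered by preCheck.
	tainted := Node{Labels: map[string]string{"zone": "a"}, Taints: []string{"NoSchedule"}}
	fmt.Println("tainted node requeues pod:", shouldRequeue(pod, tainted))

	// (5)-(6) taint removed: the update event now requeues the Pod.
	tainted.Taints = nil
	fmt.Println("untainted node requeues pod:", shouldRequeue(pod, tainted))
}
```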

Test Scenario Assumption

  • A Pod rejected by the NodeAffinity plugin is requeued when a new Node is created and becomes ready
    • Required Node Affinity or Node Selector case

https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/


PreEnqueue

These plugins are called prior to adding Pods to the internal active queue, where Pods are marked as ready for scheduling.
Only when all PreEnqueue plugins return Success, the Pod is allowed to enter the active queue. Otherwise, it's placed in the internal unschedulable Pods list, and doesn't get an Unschedulable condition.

PreEnqueue is always executed before a Pod is added to kube-scheduler's internal queue (more precisely, the activeQ). If a PreEnqueue extension point returns a status other than Success, the Pod is not added to the internal queue, and as a result no scheduling is attempted for it.
https://qiita.com/everpeace/items/8ffce0195f7d8371c54f
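The PreEnqueue gating described above can be modeled with a minimal sketch. The names (`Status`, `PreEnqueuePlugin`, `gate`) are stand-ins for the real framework interfaces, not the upstream API:

```go
package main

import "fmt"

// Status is a simplified stand-in for framework.Status.
type Status int

const (
	Success Status = iota
	Unschedulable
)

// PreEnqueuePlugin is a hypothetical, simplified plugin signature.
type PreEnqueuePlugin func(pod string) Status

// gate models the PreEnqueue extension point: the Pod enters activeQ only if
// every PreEnqueue plugin returns Success; otherwise it goes to the internal
// unschedulable Pods list (and, per the docs, gets no Unschedulable condition).
func gate(pod string, plugins []PreEnqueuePlugin) string {
	for _, pl := range plugins {
		if pl(pod) != Success {
			return "unschedulablePods"
		}
	}
	return "activeQ"
}

func main() {
	allow := func(string) Status { return Success }
	deny := func(string) Status { return Unschedulable }

	fmt.Println(gate("pod-a", []PreEnqueuePlugin{allow, allow})) // activeQ
	fmt.Println(gate("pod-b", []PreEnqueuePlugin{allow, deny}))  // unschedulablePods
}
```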

func (pl *NodeAffinity) EventsToRegister(_ context.Context) ([]framework.ClusterEventWithHint, error) {
	return []framework.ClusterEventWithHint{
		{Event: framework.ClusterEvent{
			Resource: framework.Node, ActionType: framework.Add | framework.Update,
		}, QueueingHintFn: pl.isSchedulableAfterNodeChange},
	}, nil
}
type ClusterEventWithHint struct {
	Event ClusterEvent
	// QueueingHintFn is executed for the plugin rejected by this plugin when the above Event happens,
	// and filters out events to reduce useless retry of Pod's scheduling.
	// It's an optional field. If not set,
	// the scheduling of Pods will be always retried with backoff when this Event happens.
	// (the same as Queue)
	QueueingHintFn QueueingHintFn
}
// ClusterEvent abstracts how a system resource's state gets changed.
// Resource represents the standard API resources such as Pod, Node, etc.
// ActionType denotes the specific change such as Add, Update or Delete.
type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
	// Label describes this cluster event, only used in logging and metrics.
	Label string
}
// Constants for ActionTypes.
const (
	Add ActionType = 1 << iota
	Delete

	// UpdateNodeXYZ is only applicable for Node events.
	// If you use UpdateNodeXYZ,
	// your plugin's QueueingHint is only executed for the specific sub-Update event.
	// It's better to narrow down the scope of the event by using them instead of just using Update event
	// for better performance in requeueing.
	UpdateNodeAllocatable
	UpdateNodeLabel
	UpdateNodeTaint
	UpdateNodeCondition
	UpdateNodeAnnotation

	// UpdatePodXYZ is only applicable for Pod events.
	// If you use UpdatePodXYZ,
	// your plugin's QueueingHint is only executed for the specific sub-Update event.
	// It's better to narrow down the scope of the event by using them instead of Update event
	// for better performance in requeueing.
	UpdatePodLabel
	// UpdatePodScaleDown is an update for pod's scale down (i.e., any resource request is reduced).
	UpdatePodScaleDown

	// updatePodOther is an update for pod's other fields.
	// It's used only for the internal event handling, and thus unexported.
	updatePodOther

	All ActionType = 1<<iota - 1

	// Use the general Update type if you don't either know or care the specific sub-Update type to use.
	Update = UpdateNodeAllocatable | UpdateNodeLabel | UpdateNodeTaint | UpdateNodeCondition | UpdateNodeAnnotation | UpdatePodLabel | UpdatePodScaleDown | updatePodOther
)
var (
	// AssignedPodAdd is the event when an assigned pod is added.
	AssignedPodAdd = ClusterEvent{Resource: Pod, ActionType: Add, Label: "AssignedPodAdd"}
	// NodeAdd is the event when a new node is added to the cluster.
	NodeAdd = ClusterEvent{Resource: Node, ActionType: Add, Label: "NodeAdd"}
	// NodeDelete is the event when a node is deleted from the cluster.
	NodeDelete = ClusterEvent{Resource: Node, ActionType: Delete, Label: "NodeDelete"}
	// AssignedPodUpdate is the event when an assigned pod is updated.
	AssignedPodUpdate = ClusterEvent{Resource: Pod, ActionType: Update, Label: "AssignedPodUpdate"}
	// UnscheduledPodAdd is the event when an unscheduled pod is added.
	UnscheduledPodAdd = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodAdd"}
	// UnscheduledPodUpdate is the event when an unscheduled pod is updated.
	UnscheduledPodUpdate = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodUpdate"}
	// UnscheduledPodDelete is the event when an unscheduled pod is deleted.
	UnscheduledPodDelete = ClusterEvent{Resource: Pod, ActionType: Update, Label: "UnschedulablePodDelete"}
	// assignedPodOtherUpdate is the event when an assigned pod got updated in fields that are not covered by UpdatePodXXX.
	assignedPodOtherUpdate = ClusterEvent{Resource: Pod, ActionType: updatePodOther, Label: "AssignedPodUpdate"}
	// AssignedPodDelete is the event when an assigned pod is deleted.
	AssignedPodDelete = ClusterEvent{Resource: Pod, ActionType: Delete, Label: "AssignedPodDelete"}
	// PodRequestScaledDown is the event when a pod's resource request is scaled down.
	PodRequestScaledDown = ClusterEvent{Resource: Pod, ActionType: UpdatePodScaleDown, Label: "PodRequestScaledDown"}
	// PodLabelChange is the event when a pod's label is changed.
	PodLabelChange = ClusterEvent{Resource: Pod, ActionType: UpdatePodLabel, Label: "PodLabelChange"}
	// NodeSpecUnschedulableChange is the event when unschedulable node spec is changed.
	NodeSpecUnschedulableChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeTaint, Label: "NodeSpecUnschedulableChange"}
	// NodeAllocatableChange is the event when node allocatable is changed.
	NodeAllocatableChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeAllocatable, Label: "NodeAllocatableChange"}
	// NodeLabelChange is the event when node label is changed.
	NodeLabelChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeLabel, Label: "NodeLabelChange"}
	// NodeAnnotationChange is the event when node annotation is changed.
	NodeAnnotationChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeAnnotation, Label: "NodeAnnotationChange"}
	// NodeTaintChange is the event when node taint is changed.
	NodeTaintChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeTaint, Label: "NodeTaintChange"}
	// NodeConditionChange is the event when node condition is changed.
	NodeConditionChange = ClusterEvent{Resource: Node, ActionType: UpdateNodeCondition, Label: "NodeConditionChange"}
	// PvAdd is the event when a persistent volume is added in the cluster.
	PvAdd = ClusterEvent{Resource: PersistentVolume, ActionType: Add, Label: "PvAdd"}
	// PvUpdate is the event when a persistent volume is updated in the cluster.
	PvUpdate = ClusterEvent{Resource: PersistentVolume, ActionType: Update, Label: "PvUpdate"}
	// PvcAdd is the event when a persistent volume claim is added in the cluster.
	PvcAdd = ClusterEvent{Resource: PersistentVolumeClaim, ActionType: Add, Label: "PvcAdd"}
	// PvcUpdate is the event when a persistent volume claim is updated in the cluster.
	PvcUpdate = ClusterEvent{Resource: PersistentVolumeClaim, ActionType: Update, Label: "PvcUpdate"}
	// StorageClassAdd is the event when a StorageClass is added in the cluster.
	StorageClassAdd = ClusterEvent{Resource: StorageClass, ActionType: Add, Label: "StorageClassAdd"}
	// StorageClassUpdate is the event when a StorageClass is updated in the cluster.
	StorageClassUpdate = ClusterEvent{Resource: StorageClass, ActionType: Update, Label: "StorageClassUpdate"}
	// CSINodeAdd is the event when a CSI node is added in the cluster.
	CSINodeAdd = ClusterEvent{Resource: CSINode, ActionType: Add, Label: "CSINodeAdd"}
	// CSINodeUpdate is the event when a CSI node is updated in the cluster.
	CSINodeUpdate = ClusterEvent{Resource: CSINode, ActionType: Update, Label: "CSINodeUpdate"}
	// CSIDriverAdd is the event when a CSI driver is added in the cluster.
	CSIDriverAdd = ClusterEvent{Resource: CSIDriver, ActionType: Add, Label: "CSIDriverAdd"}
	// CSIDriverUpdate is the event when a CSI driver is updated in the cluster.
	CSIDriverUpdate = ClusterEvent{Resource: CSIDriver, ActionType: Update, Label: "CSIDriverUpdate"}
	// CSIStorageCapacityAdd is the event when a CSI storage capacity is added in the cluster.
	CSIStorageCapacityAdd = ClusterEvent{Resource: CSIStorageCapacity, ActionType: Add, Label: "CSIStorageCapacityAdd"}
	// CSIStorageCapacityUpdate is the event when a CSI storage capacity is updated in the cluster.
	CSIStorageCapacityUpdate = ClusterEvent{Resource: CSIStorageCapacity, ActionType: Update, Label: "CSIStorageCapacityUpdate"}
	// WildCardEvent semantically matches all resources on all actions.
	WildCardEvent = ClusterEvent{Resource: WildCard, ActionType: All, Label: "WildCardEvent"}
	// UnschedulableTimeout is the event when a pod stays in unschedulable for longer than timeout.
	UnschedulableTimeout = ClusterEvent{Resource: WildCard, ActionType: All, Label: "UnschedulableTimeout"}
)
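To see how the ActionType bit flags above compose, here is a reduced, self-contained copy of just the flag arithmetic (the GVK/Label machinery and the event-handler wiring are omitted; `match` is a hypothetical helper naming the bitwise test the handlers use):

```go
package main

import "fmt"

// ActionType mirrors the bit-flag scheme from the scheduler framework.
type ActionType int64

const (
	Add ActionType = 1 << iota // 1
	Delete

	// Node sub-Update flags.
	UpdateNodeAllocatable
	UpdateNodeLabel
	UpdateNodeTaint
	UpdateNodeCondition
	UpdateNodeAnnotation

	// Pod sub-Update flags.
	UpdatePodLabel
	UpdatePodScaleDown
	updatePodOther

	// All covers every bit defined above: 1<<10 - 1 = 1023.
	All ActionType = 1<<iota - 1

	// Update is the union of all sub-Update flags.
	Update = UpdateNodeAllocatable | UpdateNodeLabel | UpdateNodeTaint |
		UpdateNodeCondition | UpdateNodeAnnotation | UpdatePodLabel |
		UpdatePodScaleDown | updatePodOther
)

// match reports whether a plugin registered for `registered` actions should be
// notified of an event of type `event` — a simple bitwise intersection.
func match(registered, event ActionType) bool {
	return registered&event != 0
}

func main() {
	// A plugin registered for Add|Update sees node taint updates...
	fmt.Println(match(Add|Update, UpdateNodeTaint)) // true
	// ...but a plugin registered only for Delete does not.
	fmt.Println(match(Delete, UpdateNodeTaint)) // false
	// All matches everything, including the unexported updatePodOther.
	fmt.Println(match(All, Add), match(All, updatePodOther)) // true true
}
```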
  • util.As
    • return oldTyped, newTyped, nil
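The util.As helper noted above can be sketched roughly as follows. This is a generics-based approximation of its behavior (casting the untyped old/new objects handed to a QueueingHintFn, tolerating a nil oldObj for Add events), not the exact upstream implementation:

```go
package main

import "fmt"

// As casts oldObj and newObj to the concrete type T. A nil input is allowed
// (e.g. oldObj is nil for an Add event) and yields the zero value of T.
func As[T any](oldObj, newObj interface{}) (T, T, error) {
	var oldTyped, newTyped T
	if newObj != nil {
		var ok bool
		if newTyped, ok = newObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("unexpected type of newObj: %T", newObj)
		}
	}
	if oldObj != nil {
		var ok bool
		if oldTyped, ok = oldObj.(T); !ok {
			return oldTyped, newTyped, fmt.Errorf("unexpected type of oldObj: %T", oldObj)
		}
	}
	return oldTyped, newTyped, nil
}

// node is a toy stand-in for *v1.Node.
type node struct{ name string }

func main() {
	// Add event: oldObj is nil, only newObj carries the node.
	oldN, newN, err := As[*node](nil, &node{name: "node-1"})
	fmt.Println(oldN == nil, newN.name, err) // true node-1 <nil>
}
```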

// NodeAffinity is a plugin that checks if a pod node selector matches the node label.
type NodeAffinity struct {
	handle              framework.Handle
	addedNodeSelector   *nodeaffinity.NodeSelector
	addedPrefSchedTerms *nodeaffinity.PreferredSchedulingTerms
}
type RequiredNodeAffinity struct {
	labelSelector labels.Selector
	nodeSelector  *LazyErrorNodeSelector
}
// isSchedulableAfterNodeChange is invoked whenever a node changed. It checks whether
// that change made a previously unschedulable pod schedulable.
func (pl *NodeAffinity) isSchedulableAfterNodeChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	_, modifiedNode, err := util.As[*v1.Node](oldObj, newObj)
	if err != nil {
		return framework.Queue, err
	}

	if pl.addedNodeSelector != nil && !pl.addedNodeSelector.Match(modifiedNode) {
		logger.V(4).Info("added or modified node didn't match scheduler-enforced node affinity and this event won't make the Pod schedulable", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
		return framework.QueueSkip, nil
	}

	requiredNodeAffinity := nodeaffinity.GetRequiredNodeAffinity(pod)
	isMatched, err := requiredNodeAffinity.Match(modifiedNode)
	if err != nil {
		return framework.Queue, err
	}
	if isMatched {
		logger.V(4).Info("node was created or updated, and matches with the pod's NodeAffinity", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
		return framework.Queue, nil
	}

	// TODO: also check if the original node meets the pod's requirements once preCheck is completely removed.
	// See: https://github.com/kubernetes/kubernetes/issues/110175

	logger.V(4).Info("node was created or updated, but it doesn't make this pod schedulable", "pod", klog.KObj(pod), "node", klog.KObj(modifiedNode))
	return framework.QueueSkip, nil
}

scheduler: move all preCheck to QueueingHint
kubernetes/kubernetes#110175

So, I think we should simply inform the queue of all events and let the queue move unschedulable Pods with those events.

As the preCheckForNode check will just pre-verify the basic
checks which are orthogonal with PodTopologySpread.

to define "which kind of resource's events should be ignored" in plugin side.


To summarize, the following cases don't properly re-queue relevant pods:

  1. Assigned Pod's add/update events
  2. Non-scheduled Pod's add/update/delete events
  3. Node's add/update/delete events

But I still believe that all logic related to a plugin should be controllable from the plugin side, and we should not create something that can be achieved in in-tree plugins but not in out-of-tree plugins.


Currently, NodeAdded QueueingHint could not always be called because of the internal feature called preCheck.
It's definitely not something expected for plugin developers,
and we're trying to eventually remove preCheck completely to fix this.
Until then we'll register UpdateNodeTaint event for plugins that have NodeAdded event, but don't have UpdateNodeTaint event.
It'd result in a bad impact on the requeuing efficiency though, a lot better than some Pods being stuck in the
unschedulable pod pool.


Run -> ScheduleOne or handleBindingCycleError -> sched.FailureHandler -> handleSchedulingFailure -> AddUnschedulableIfNotPresent -> requeuePodViaQueueingHint or isPodWorthRequeueing

// isPodWorthRequeuing calls QueueingHintFn of only plugins registered in pInfo.unschedulablePlugins and pInfo.PendingPlugins.
//
// If any of pInfo.PendingPlugins return Queue,
// the scheduling queue is supposed to enqueue this Pod to activeQ, skipping backoffQ.
// If any of pInfo.unschedulablePlugins return Queue,
// the scheduling queue is supposed to enqueue this Pod to activeQ/backoffQ depending on the remaining backoff time of the Pod.
// If all QueueingHintFns returns Skip, the scheduling queue enqueues the Pod back to unschedulable Pod pool
// because no plugin changes the scheduling result via the event.

https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/backend/queue/scheduling_queue.go#L393-L475
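The enqueue decision described in the isPodWorthRequeuing comment above can be modeled with a small sketch. The types and the `decide` helper are hypothetical simplifications of the logic in requeuePodViaQueueingHint, not the real queue code:

```go
package main

import "fmt"

// Hint is a simplified stand-in for framework.QueueingHint.
type Hint int

const (
	QueueSkip Hint = iota
	Queue
)

// decide returns the queue a rejected Pod moves to when an event fires,
// given the hints from its pending and unschedulable plugins and whether
// its backoff period has already expired:
//   - any pending plugin returning Queue  -> activeQ (backoff skipped)
//   - any unschedulable plugin returning Queue -> activeQ or backoffQ,
//     depending on the remaining backoff time
//   - all hints Skip -> back to the unschedulable Pod pool
func decide(pendingHints, unschedHints []Hint, backoffExpired bool) string {
	for _, h := range pendingHints {
		if h == Queue {
			return "activeQ"
		}
	}
	for _, h := range unschedHints {
		if h == Queue {
			if backoffExpired {
				return "activeQ"
			}
			return "backoffQ"
		}
	}
	return "unschedulablePods"
}

func main() {
	fmt.Println(decide([]Hint{Queue}, nil, false))     // activeQ
	fmt.Println(decide(nil, []Hint{Queue}, false))     // backoffQ
	fmt.Println(decide(nil, []Hint{QueueSkip}, true))  // unschedulablePods
}
```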