I’m running Drone (Kubernetes-native) on AWS EKS, with an autoscaler running in the cluster. The hope is that when CPU utilization rises, the autoscaler will add new nodes and jobs will run on them. At first glance, adding CPU requests to the pod resources would get me what I need. However, some interrelated things seem to be thwarting me:
- pipelines are created with node affinity. Drone ‘sticks’ the pipeline steps to the same node as their services, which I read as the same node as their ‘drone-job-*’ pod
- job controllers are created without resources. The cluster doesn’t see any CPU requests up front, so it schedules these seemingly anywhere, even on CPU-starved nodes.
So autoscaling does recognize that a step requires more CPU than is available, but it can’t scale up because the pod is stuck on the same node due to node affinity (I think):
Scale-up predicate failed: GeneralPredicates predicate mismatch, cannot put [...] on [...], reason: node(s) didn't match node selector
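To illustrate what I think is happening, here is a sketch of what a step pod effectively ends up looking like (the name and image are illustrative, not the actual manifest Drone generates). The node name is the one from my cluster; note the pinning combined with the absence of CPU requests:

```yaml
# Hypothetical sketch of a Drone step pod -- not Drone's real output.
# The pod is pinned to the node its service pod landed on, and no
# resource requests are set, so the autoscaler has nothing to act on.
apiVersion: v1
kind: Pod
metadata:
  name: drone-job-example        # illustrative name
spec:
  nodeSelector:
    kubernetes.io/hostname: ip-10-0-96-165.eu-west-1.compute.internal
  containers:
    - name: step
      image: alpine:3.9          # illustrative image
      # resources.requests.cpu is missing, so the scheduler treats this
      # pod as nearly free and happily stacks it on a starved node
```

With the nodeSelector in place, a new node brought up by the autoscaler would never match, which seems consistent with the "node(s) didn't match node selector" message above.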
Any guidance here is welcome. I’m going to continue to experiment but I’m running out of ideas.
Essentially, right now it feels like I want to get CPU requests on these job pods so the scheduler and autoscaler have something to work with.
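For reference, on a plain Kubernetes workload the thing I’m after would look like the fragment below (a sketch with illustrative values, not something Drone exposes today as far as I can tell):

```yaml
# Sketch of the per-step requests I'd like Drone to set, so the
# scheduler can place steps on nodes with real capacity and the
# autoscaler can scale up when none have it.
resources:
  requests:
    cpu: "500m"      # illustrative value
    memory: "256Mi"  # illustrative value
```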
Hey I’m having the same problem. Would love some input from other Drone on K8s users.
I can confirm that this is exactly what is happening. I have a massively parallel build (with depends_on statements) and all the steps are started on the same node, hammering its CPU. Please check the NODE column in the listing below:
$ kubectl -n xvcp0k60xwr1r25eyvfnsu58j1ybf39p get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
8k8tjypwid4adter866km8nzv7pqgd4q 1/1 Running 0 17s 100.96.32.14 ip-10-0-96-165.eu-west-1.compute.internal <none>
a28rar2fhj6t8qfbhzsm1qvuq6lltkp5 1/1 Running 0 17s 100.96.32.15 ip-10-0-96-165.eu-west-1.compute.internal <none>
ggls4fz6bes2i3wpz4hmqkwlan6j2u36 1/1 Running 0 17s 100.96.32.16 ip-10-0-96-165.eu-west-1.compute.internal <none>
k1yz80sz57dvbz5ra4cepevfgql9opzm 1/1 Running 0 16s 100.96.32.17 ip-10-0-96-165.eu-west-1.compute.internal <none>
kpax184m5dxrfd76w1glkrbu9ocrxqmx 1/1 Running 0 17s 100.96.32.9 ip-10-0-96-165.eu-west-1.compute.internal <none>
lhe297eg809tdi13rg7p1hdaq5x6urv6 1/1 Running 0 17s 100.96.32.10 ip-10-0-96-165.eu-west-1.compute.internal <none>
m9kkvmff629cgt201zg6qzcz9b5gca9f 0/1 Completed 0 5m37s 100.96.32.8 ip-10-0-96-165.eu-west-1.compute.internal <none>
mc2slr1hiqs482pqvcpeve9zc9wy7aj3 1/1 Running 0 17s 100.96.32.12 ip-10-0-96-165.eu-west-1.compute.internal <none>
oinn10m4w0s8af7f0z0vu1mormc38kzy 1/1 Running 0 17s 100.96.32.11 ip-10-0-96-165.eu-west-1.compute.internal <none>
qmmzt55iq5iuh30moa29wf3uwtj1q6vz 0/1 Completed 0 5m37s 100.96.32.7 ip-10-0-96-165.eu-west-1.compute.internal <none>
rl1rctgy3qrycus43dvcrh7muhm8vmwg 1/1 Running 0 17s 100.96.32.13 ip-10-0-96-165.eu-west-1.compute.internal <none>
xadcb3v14xjpdg3t0uc2w6zon8139ye1 0/1 Completed 0 5m46s 100.96.32.6 ip-10-0-96-165.eu-west-1.compute.internal <none>
Just a reminder that native Kubernetes runtime is still experimental and is not recommended for production use. It may be deprecated and replaced by Tekton in the future, so just be careful if relying on this for a production deployment. With that being said, we will accept patches that fix bugs with the current implementation.
Fair enough, and there are labels in multiple spots specifying that it’s experimental. It could just be us confused early adopters. The only other ask I would have from this thread is a general update since the “drone goes k8s” blog post six months ago:
The Kubernetes Runtime is still considered experimental, however, initial testing has been very positive. There are some known issues and areas of improvement, however, I expect rapid progress over the coming weeks.
^ Dec 7th, 2018 - https://blog.drone.io/drone-goes-kubernetes-native/
Has testing remained very positive? Do we still expect rapid progress? Would an update somewhere help, to show a change in priorities? An update like this could save people from spending their experimentation time in the wrong place.
Has testing remained very positive? Do we still expect rapid progress?
I do not believe so. The documentation has been updated to recommend against production use while we re-assess. We are tracking various issues related to Kubernetes where we have summarized our concerns, although no final decisions have been made. Some further reading:
Our current focus is on enabling custom stage definitions and providing a runner framework (conceptually similar to the Kubernetes operator framework). This will enable the creation of custom runners and will decouple runners from Drone core. I expect this will lead to a community-driven Kubernetes runtime that supersedes what we have today. I also expect the current Kubernetes runtime to remain active as a community-driven runtime, assuming there is interest in maintaining it despite its faults.