Error with autoscaled drone agents

We have an issue with agents spinning up and reporting an error.

Via the autoscaler CLI commands I can see

error during connect: Post "https://{IPADDRESS}:2376/v1.40/images/create?fromImage=drone%2Fdrone-runner-docker&tag=1": EOF

I can ssh into the agents, and can’t really see anything wrong with them, although I’m not sure what I’m meant to look for. Docker does seem to be running on them.

It’s been a while since I touched the setup, but we host our agents in AWS, and they have a custom cloudinit that loads in docker hub credentials so they don’t die to rate limiting. Pretty sure it’s the template from the docs with a config file jammed in.

It’d be nice to fix this because those agents wait around for a bit, preventing us from scaling to our maximum, until the autoscaler kills them (we’ve enabled the autoscaler reaper feature).

We’re still having this issue, and it occurs relatively often. Of the last 10 agents to spin up, two were spun up in an error state.

Hey @maxgruebneraeroqual happy to help with this, can you provide more info?

One root cause we observed in the past is the following …

You choose a base AMI where the Linux distro or its package manager is configured to auto-upgrade packages on startup. As a result, when the autoscaler provisions a VM and it boots, the Docker daemon starts and then the package manager immediately stops Docker for the upgrade, making the daemon unreachable for a period of time. When you log in manually you see the Docker daemon running, because the upgrade has since completed.
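To see whether a new agent goes through an unreachable window like this, one option is to poll the daemon's port from outside and note how long it takes to respond. This is only a minimal sketch, assuming the daemon listens on TCP port 2376 as in the error above; `wait_for_port` and its parameters are hypothetical names, not part of Drone or Docker. It only checks that a TCP connection is accepted, not that TLS or the Docker API is working.

```python
import socket
import time


def wait_for_port(host, port=2376, timeout=120.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Returns the number of seconds waited, or None if the port never opened.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect only proves that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Refused or timed out: give up once the overall timeout has passed.
            if time.monotonic() - start >= timeout:
                return None
            time.sleep(interval)
```

Running this against an agent right after it boots, and again a minute later, should show whether the daemon drops out briefly during startup.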

I believe Amazon Linux is the biggest offender, but other distros may be impacted as well. This can be solved by disabling the auto-upgrade in the base AMI. Here is a link to a past discussion, for reference.

We were using the default AMI, which I thought would have been fine, but maybe because we’re using a custom cloudinit to shove the docker hub credentials in, that’s adjusted something.

After we updated to Ubuntu 22.04, we’re getting the same error as in the post you linked, Brad, but much less frequently:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

I’ll have a tinker and see where I get to.

Several months later, I’ve finally had the opportunity to look into this some more.

The docker service is indeed rebooting when an agent is created.

We’re using a custom cloud_init.yaml, basically identical to this one, to pass the docker hub credentials in, so that we can pull stuff without going over the docker hub rate limits.

At the bottom of that, there are a couple of runcmd commands that reload the docker daemon’s systemd files and restart docker, which is probably what’s causing my issues.
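For anyone wanting to check their own template for the same thing, here is a rough Python sketch that scans cloud-init text for lines that reload systemd or restart docker. The function name `find_docker_restarts` and the pattern are my own assumptions, the pattern is not exhaustive, and it does a plain line scan rather than parsing the YAML properly.

```python
import re

# Assumed, non-exhaustive pattern: matches both "systemctl restart docker"
# and the list form "[ systemctl, restart, docker ]".
RESTART_PATTERN = re.compile(r"systemctl[,\s]+(?:restart|stop|daemon-reload)\b")


def find_docker_restarts(cloud_init_text):
    """Return (line_number, stripped_line) pairs for matching lines."""
    hits = []
    for number, line in enumerate(cloud_init_text.splitlines(), start=1):
        if RESTART_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

Any line it reports is a candidate for the window in which the daemon is down while the autoscaler is already trying to connect.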

I’ll see if I can come up with a better one.