Drone on Kubernetes: Builds fail with Docker errors

Hello all! I’ve been trying lots of things, searching various issues to no effect; hoping someone can help me out here.

I have installed Drone 0.5 on my v1.3.7 Kubernetes cluster running on GCE, using a heavily modified rendition of the manifests in this repository. I was able to alter these scripts to get the server, agent, and proxy running and talking to each other. GitHub login works, builds start on a push, all of that is wonderful.

The problem lies in getting these builds to run. Every time I attempt a push, or restart a failed build manually, I get something like this:

My configuration to date is relatively trivial (just wanted to see a few things working before I go in full-bore):

pipeline:
  install:
    image: node:latest 
    commands:
        - npm install

  test:
    image: node:latest
    commands:
        - npm run-script test
  
  log:
    image: plugins/slack
    webhook: <redacted>
    channel: ci
    template: |
        {{ #success build.status }}
            {{ repo.name }}/{{ repo.branch }} \#{{ build.number }} - succeeded after {{ since build.started }}.
        {{ else }}
            {{ repo.name }}/{{ repo.branch }} \#{{ build.number }} - FAILED after {{ since build.started }}.
        {{ /success }}
    when:
        status: [ success, failure ]

I’ve tried a build-only script to the same end. After verifying it wasn’t configuration related, I moved to my cluster.

I have deployed every Drone asset in its own pod, federating them with Kubernetes services. I then began messing with the configuration of the agent vis-a-vis its Docker and network access on that pod.

I am currently mapping in the Docker socket from the host node without a hitch. I have tried mapping the host network into the pod to no avail, and changing the security context of the pod to privileged with and without host networking enabled; neither change has made a difference. The DOCKER_HOST environment variable has been of no use to me, since I am mapping the socket into the pod at the agent’s default location.

Here is a sample of my configuration (partially modified for brevity - the most relevant bits I have kept. I am not adjusting any aspect of the agent’s configuration with respect to Docker via environment variables):

{
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "metadata": { ... },
    "spec": {
        "replicas": 1,
        "template": {
            "metadata": { ... },
            "spec": {
                "hostNetwork": true,
                "containers": [
                    {
                        "command": [
                            "/drone",
                            "agent"
                        ],
                        "env": [ ... ],
                        "image": "drone/drone:0.5",
                        "name": "drone-agent",
                        "resources": { ... },
                        "securityContext": {
                            "privileged": true
                        },
                        "volumeMounts": [
                            {
                                "mountPath": "/var/run/docker.sock",
                                "name": "docker-socket"
                            }
                        ]
                    }
                ],
                "volumes": [
                    {
                        "hostPath": {
                            "path": "/var/run/docker.sock"
                        },
                        "name": "docker-socket"
                    }
                ]
            }
        }
    }
}

Removing host networking and/or a privileged security context yields the same result as the picture above.

Any ideas what I’m doing wrong? Coming from many, many months of frustration with Jenkins I am really excited to move my organization to Drone - I just need to get us past this! Thank you in advance for your help!

Totally forgot some integral info: Docker version on all Kubernetes hosts:

What do your Docker daemon logs say? This message is coming from Docker because Drone is trying to create containers that link to the drone_ambassador_* container using --volumes-from.

This error tells me that either 1) the ambassador container was never created, 2) Docker is having trouble linking the volumes, or 3) the ambassador container is starting and being killed by something. In any case your daemon logs should provide more insight, as this is not a common error.

Thank you for your reply! This is the daemon’s response. I was trying to elicit this response yesterday to no effect, but I have seen these messages bleeding through to Drone’s UI before (the “no such network interface” message):

time="2017-01-13T18:17:50.919665472Z" level=warning msg="DEPRECATED: Setting host configuration options when the container starts is deprecated and will be removed in Docker 1.12"
time="2017-01-13T18:17:51.002469878Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/09a0c650fadf4247b5823e9dc861064ff952fdbd52d8516f7c0f47591b5b872e/shm: no such file or directory"
time="2017-01-13T18:17:51.046351181Z" level=error msg="Handler for POST /v1.15/containers/09a0c650fadf4247b5823e9dc861064ff952fdbd52d8516f7c0f47591b5b872e/start returned error: failed to create endpoint drone_ambassador_Ws1CyBM82_g on network bridge: adding interface vethc43b3e2 to bridge docker0 failed: could not find bridge docker0: route ip+net: no such network interface"
time="2017-01-13T18:17:51.335255502Z" level=error msg="Handler for POST /v1.15/containers/create returned error: No such container: drone_ambassador_Ws1CyBM82_g"
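
On a busy node these lines are easy to lose among unrelated entries. A minimal sketch for filtering a daemon log down to the error-level lines that mention an ambassador container; ambassador_errors is a hypothetical helper name, and how you obtain the log depends on the host:

```shell
# Hypothetical helper: print only error-level daemon log lines that
# mention a Drone ambassador container. Reads the log on stdin.
ambassador_errors() {
    grep 'level=error' | grep 'drone_ambassador_'
}
```

It would be fed from wherever the daemon logs on your hosts, for example `journalctl -u docker | ambassador_errors` on a host where Docker logs to journald (an assumption; some images write to a file instead).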

While the host does not present a docker0 bridge interface:

[email protected]:~# ifconfig docker0
docker0: error fetching interface information: Device not found

Docker seems to think it is presenting this interface with this name:

[email protected]:~# docker network ls
NETWORK ID          NAME                DRIVER
60152d9be17f        bridge              bridge
e3aa6c6eae22        host                host
d0ad3642ffd2        none                null
[email protected]:~# docker network inspect 60152d9be17f
[
    {
        "Name": "bridge",
        "Id": "60152d9be17f7f7740a645e2ecea0461aef29141dddf49936e91221debfe5aeb",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16"
                }
            ]
        },
        "Internal": false,
        "Containers": {},
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "false",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]
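
For anyone wanting to check the same mismatch on their own nodes, here is a rough sketch that compares the bridge name Docker reports with the interfaces the host actually has. The helper names bridge_name and iface_exists are made up for illustration, and the /sys/class/net lookup assumes a Linux host:

```shell
# Hypothetical helper: extract the value of
# com.docker.network.bridge.name from `docker network inspect`
# JSON supplied on stdin.
bridge_name() {
    sed -n 's/.*"com\.docker\.network\.bridge\.name": *"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: succeed if a network interface with the given
# name exists. Assumes Linux, where interfaces appear under
# /sys/class/net.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}
```

Something like `name=$(docker network inspect bridge | bridge_name)` followed by `iface_exists "$name" || echo "missing: $name"` should report docker0 as missing on a node in the state shown above.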

Historically we have let Kubernetes handle turning up the Docker host on the node with minimal intervention on our behalf. Is this an incorrect assumption to make?

Does Drone present some way to redirect traffic to a particular interface? I admit I am kind of lost when it comes to how Kubernetes manages Docker here - apologies!

I would prefer NOT to create the interface if I can help it, but if there is no way out I suppose we will do what we can.

I would also add: Kubernetes up to 1.5.1 does not seem to create such an interface on GCE Debian/GCI hosts.

Sorry, I don’t have any Kubernetes experience :frowning:

I recommend reaching out to @nlf or @gtaylor or @ipedrazas in our chatroom. They are using Drone+Kubernetes and are probably better equipped to answer such questions.

Good luck!

Hi again!

After some thought and experimentation I was able to get it. I hope to publish this for more general use with Kube configuration files and a prebuilt image, as the agent docker container requires a bit of a different setup on more modern Kubernetes releases. I did want to close the loop on this thread for posterity at the very least.

Kubernetes creates a Docker interface with a non-standard name if you are running it with GCE (that name is cbr0, by the by). cbr0 is a virtual bridge and cannot easily be aliased to another name, which is required for Drone’s current client libraries (its Docker client appears to be fixated on the dockerd-provisioned name, docker0).

I was able to get the agent working by re-routing from the on-Kubernetes-host docker socket to an improvised one using the DIND official image and running a hackish entrypoint script. The net effect is that builds are done against this Docker-in-Docker flavor, and the host’s configuration and Docker API version are irrelevant. The Dockerfile:

FROM docker:1.12.6-dind
RUN apk add --no-cache bash
COPY bin/drone /drone
COPY entrypoint.sh /entrypoint.sh
RUN chmod u+x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

Obviously, a prebuilt, x86_64-based drone binary needs to be present at $(pwd)/bin.

The entrypoint script reaps the benefits of DIND and drone running in agent mode:

#!/bin/bash

if [ $# -eq 0 ]; then
    /usr/local/bin/dockerd-entrypoint.sh &
    sleep 10
    /drone agent
else
    exec "$@"
fi

There is probably a better way to do the above, but this was enough to get the agent responding and building.
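
One possible refinement, offered only as a sketch: rather than a fixed sleep, poll until the inner daemon answers. wait_for is a hypothetical helper name, not something Drone or the DIND image provides:

```shell
# Hypothetical helper: retry a probe command once per second until it
# succeeds or the attempt budget runs out.
# Usage: wait_for ATTEMPTS COMMAND [ARGS...]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_for() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

In the entrypoint above, `wait_for 30 docker info` could stand in for `sleep 10`, so the agent starts as soon as dockerd is reachable and the script can bail out if it never comes up. The 30-attempt budget is an arbitrary choice.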

Like I said, if I can find the bandwidth to standardize/repo-ize all of this, and make that entrypoint script not such a terrific and unbearable hack, I will do so and share.

Thank you again for jumping in!


Nice sleuthing! Just wanted to say that we are on GKE (Google Container Engine) which would explain why we are seeing the same symptoms. Doesn’t docker-in-docker require the pod to run in privileged mode, which GKE disables? How were you able to get around that, or were there no issues elevating it to privileged mode?

This only recently started to occur, and it only occurs every few builds so it seems like the majority of the time it’s fine. We have 3 agents all running in separate pods mounting in the host’s docker socket so perhaps it’s node/pod-specific.

@bacongobbler the docker plugin does use docker-in-docker. I have no GKE experience, but was searching on Stack Overflow and it looks like privileged mode should be supported http://stackoverflow.com/questions/31124368/allow-privileged-containers-in-kubernetes-on-google-container-gke