Weird issue with GCR specifically (but not gitlab registry)

I’m not sure what’s going on exactly, or why pushing to gcr specifically completely fails.

Docker version:

    Client:
     Version:       17.12.1-ce
     API version:   1.35
     Go version:    go1.9.4
     Git commit:    7390fc6
     Built:         Tue Feb 27 22:17:53 2018
     OS/Arch:       linux/amd64

    Server:
     Engine:
      Version:      18.03.0-ce-rc4
      API version:  1.37 (minimum version 1.12)
      Go version:   go1.9.4
      Git commit:   fbedb97
      Built:        Thu Mar 15 07:42:54 2018
      OS/Arch:      linux/amd64
    Experimental: false

Here are the error messages I’m getting:

From the dind container:

    time="2018-03-18T23:38:34.929137948Z" level=error msg="Upload failed, retrying: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56880->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:38:45.697296074Z" level=info msg="Layer sha256:7bd9c31f8d447bdf187e1b4481dd47dead674782425e7f8a0f5c7b38291f36ea cleaned up"
    time="2018-03-18T23:38:46.968848532Z" level=error msg="Upload failed, retrying: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56936->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:38:52.106884826Z" level=error msg="Upload failed, retrying: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56940->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:38:55.058103684Z" level=error msg="Upload failed: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56944->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:38:55.058470308Z" level=info msg="Attempting next endpoint for push after error: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56944->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:39:02.251787859Z" level=error msg="Upload failed, retrying: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56952->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:39:17.388838238Z" level=error msg="Upload failed, retrying: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56960->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:39:37.548286736Z" level=error msg="Upload failed: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56972->74.125.141.82:443: write: broken pipe"
    time="2018-03-18T23:39:37.548575722Z" level=info msg="Attempting next endpoint for push after error: net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56972->74.125.141.82:443: write: broken pipe"

And from my build runner:

Basically a whole lot of these:

    3dd242197f82: Retrying in 3 seconds
    f5d9842d46d0: Retrying in 2 seconds
    75645eedf26d: Retrying in 2 seconds
    86b86df1dd4d: Retrying in 2 seconds
    3dd242197f82: Retrying in 2 seconds
    f5d9842d46d0: Retrying in 1 second
    75645eedf26d: Retrying in 1 second
    86b86df1dd4d: Retrying in 1 second
    3dd242197f82: Retrying in 1 second
    net/http: HTTP/1.x transport connection broken: write tcp 10.28.4.2:56474->74.125.141.82:443: write: broken pipe
    Makefile:77: recipe for target 'deploy_docker' failed

and a whole lot of these:

    16756a95e889: Retrying in 3 seconds
    16756a95e889: Retrying in 2 seconds
    16756a95e889: Retrying in 1 second
    unexpected EOF

Here’s my setup:

I’m using a single GKE cluster with the drone helm chart, with dind enabled.

I’m using a custom build runner; it basically loops through listed sub-directories and runs an os.Exec.

In this case, it’s `gcloud docker -- build` (successful), then the push, which is the step that fails.

The relevant parts of my .drone.yml:

    pipeline:
      deploy_docker:
        image: my-build-runner-image-with-docker-ce
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        # network_mode: host
        environment:
          # I've tried using the provided docker-in-docker with network_mode: host
          # - DOCKER_HOST=tcp://localhost:2375
          - DOCKER_REGISTRY_DOMAIN=registry.example.io
          # Enabling this flag tells the makefile to use "gcloud docker --" instead of "docker --"
          # - DOCKER_USE_GCP=true
        commands:
          - docker version
          - gcloud docker --authorize-only
          - jules -stage deploy_docker

So I can do this, and it will work in this environment. This is as close as I can get to reproducing the actual build process.

This uses the ci-drone-agent's dind container that’s deployed on my cluster, with the image that I provided above, and the same options that I provided above.

I then clone the repository in the same location, and run the commands provided above.

    kubectl exec -it ci-drone-agent-75f65b6f99-zbq6p -c ci-drone-dind -- docker run --rm -it -e GKE_CLUSTER_NAME=my-cluster-1 -e GKE_CLUSTER_ZONE=us-east1-b -e GCP_PROJECT=my-gcp-project -e DOCKER_USE_GCP=true --network host my-build-runner-image-with-docker-ce /bin/bash

And it works perfectly when I do it manually.

I’ve tried:

  • Enabling host networking and using the dind docker daemon via tcp
  • Disabling dind and setting the pipeline container to privileged mode and mounting /var/run/docker.sock from the node.
  • Using various methods to authenticate with gcr.io, including gcloud docker -- push, gcloud docker --authorize-only, using the _json_token user with the key provided to docker login.
  • Switching the base image of my build runner jules from Debian to Ubuntu
  • Upgrading the dind image to :latest.
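For readers comparing the authentication methods in the list above, here is a rough sketch of what each one looks like on the command line. The `gcloud docker` invocations are the ones mentioned in this post; for the JSON-key login the sketch uses `_json_key`, which is the username GCR documents for that method. The `run` wrapper, the function names, the default registry URL and the key-file path are all hypothetical and only there for illustration.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Method 1: let gcloud write registry credentials for the docker client.
gcr_authorize_via_gcloud() {
  run gcloud docker --authorize-only
}

# Method 2: push through the gcloud wrapper instead of plain docker.
gcr_push_via_gcloud() {
  image="$1"   # e.g. a gcr.io/<project>/<name> reference (assumption)
  run gcloud docker -- push "$image"
}

# Method 3: log in with a service-account JSON key read from stdin.
gcr_login_with_key() {
  key_file="$1"                     # assumption: path to a JSON key file
  registry="${2:-https://gcr.io}"   # assumption: default registry host
  run docker login -u _json_key --password-stdin "$registry" < "$key_file"
}
```

Setting `DRY_RUN=1` before calling any of these only prints the command, which is handy for checking what a build runner would execute without touching the registry.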

The weird thing about this issue?

It’s exclusive to gcr, but only when it’s run by Drone.

FYI, these are really small FROM scratch images with just a binary on them.
Any ideas or anything else I could try?

> transport connection broken: write tcp 10.28.4.2:56880->74.125.141.82:443: write: broken pipe

This would indicate something is breaking the network connection.

> I then clone the repository in the same location, and run the commands provided above.
> And it works perfectly when I do it manually.

These tests might not accurately reflect how drone behaves, because drone creates a user-defined network per pipeline and does not use the default docker bridge network. All of your testing (that does not involve plugins/gcr) seems to assume the default network. Some kubernetes clusters have issues with user-defined networks.

This is the equivalent to how Drone will run a container:

    docker network create foo
    docker run --network=foo [...] plugins/docker

Note that Drone uses vanilla docker commands to launch containers. It uses a standard user-defined bridge network and does not interfere with networking in any way that would make it responsible for breaking a network connection.

Gotcha, I didn’t think about that.

I also have an instance of cockroachdb running alongside my tests; I’ll try to more accurately reproduce this using docker networks and report back.

Thanks for the quick response!

What really bothers me is how this is only an issue with GCR. Is it because of the different versions of docker registries that they’re running? v2 maybe uses gRPC with http/2 streams or something?

The gcr plugin uses the standard docker protocol. The plugins/gcr image is a lite shell script wrapper around the plugins/docker image [1] and just sets a few defaults. So probably not a protocol issue in this case :slight_smile:

[1] https://github.com/drone-plugins/drone-docker/blob/master/cmd/drone-docker-gcr/main.go

Right but I’m not using either of those. Instead my script runner is running the docker client that’s built into the image.

I’d like to be using the plugins but unfortunately they’re not very compatible with monorepos yet, so for now to get all of my images built I’ve gotta roll my own solution.

So here’s what I ended up doing.

    kubectl exec -it <ci-drone-agent-pod> -c <dind-container> -- /bin/sh
    $ docker network create test-network && \
    docker run --network=test-network -d cockroachdb/cockroach:v1.1.2 -c /cockroach sql --insecure && \
    docker run --rm -it -e GKE_CLUSTER_NAME=my-cluster-1 -e GKE_CLUSTER_ZONE=us-east1-b -e GCP_PROJECT=my-gcp-project -e DOCKER_USE_GCP=true -v /var/run/docker.sock:/var/run/docker.sock --network=test-network us.gcr.io/my-project/runner /bin/sh -c 'mkdir -p src/git.example.com/project && git clone https://user:PASSWORD@git.example.com/project/project $GOPATH/src/git.example.com/project/project && cd $GOPATH/src/git.example.com/project/project && git checkout gcr && jules -stage deploy_docker'

Aaaaand it failed!

    35d14a6c7a48: Retrying in 4 seconds
    2d35c11015ca: Layer already exists
    79ca8e799ed4: Layer already exists
    d3d5e1591287: Layer already exists
    d8e80354a27b: Layer already exists
    9dfa40a0da3b: Layer already exists
    88f64cb1430d: Retrying in 3 seconds
    bc3314b6b873: Retrying in 3 seconds
    aac2a0a8500c: Retrying in 3 seconds
    35d14a6c7a48: Retrying in 3 seconds
    88f64cb1430d: Retrying in 2 seconds
    bc3314b6b873: Retrying in 2 seconds
    aac2a0a8500c: Retrying in 2 seconds
    35d14a6c7a48: Retrying in 2 seconds
    88f64cb1430d: Retrying in 1 second
    bc3314b6b873: Retrying in 1 second
    aac2a0a8500c: Retrying in 1 second
    35d14a6c7a48: Retrying in 1 second
    net/http: HTTP/1.x transport connection broken: write tcp 10.28.1.34:36258->74.125.141.82:443: write: broken pipe

Perfect, now it’s reproducible.

So the problem is introduced when docker networks are thrown into the mix.

> Perfect, now it’s reproducible.

awesome, that is half the battle :slight_smile:

> So the problem is introduced when docker networks are thrown into the mix.

I unfortunately don’t have much kubernetes experience, so I’m afraid I won’t be much help at this point. I recommend posting to kubernetes support or stackoverflow. I’m sure someone there will know exactly what to do… And please post back if you figure it out. This has come up in the past, and I’m sure it would really help people reading this thread in the future. Good luck!

Does this also mean that the docker / gcr plugins would also not work correctly when using Drone on Kubernetes?

I’m not really in a position to try it with my project, but if you have a quick command / image to test with I’d be willing to throw that on my agent for you to test it

> Does this also mean that the docker / gcr plugins would also not work correctly when using Drone on Kubernetes?

nope, I can confirm we have a number of customers running Drone and the Docker plugins on Kubernetes without issue. In my experience dealing with some individuals that have had issues, it has always been related to host machine configuration [1]. It is my understanding that user-defined networks have different behavior and defaults than the default bridge [2] which means your default bridge configuration does not apply to the user-defined network. In some cases I am also told networking plugins (e.g. calico) also have compatibility issues with user-defined networks.

My understanding is that individuals who have experienced these issues have been able to resolve them [3]; however, they have not posted back the steps they took. So please post back when you figure it out :slight_smile:

[1] Couldn't resolve host on clone
[2] https://docs.docker.com/network/bridge/#differences-between-user-defined-bridges-and-the-default-bridge
[3] Couldn't resolve host on clone

hmm. Interesting. So plugins are also containers that are run in the user-defined network?

What do you think causes the difference between the Gitlab registry working well, but the Google Container Registry failing consistently?

Is there maybe a way or workaround to disable the --network flag on a pipeline step?

I’m typing up that stackoverflow question right now. :slight_smile:

> hmm. Interesting. So plugins are also containers that are run in the user-defined network?

yes, every step in your pipeline is a container. Similar to docker-compose, all containers launched in the pipeline share the same user-defined network, which is what allows the pipeline containers to communicate with the service containers.
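To make the shared-network behaviour concrete, here is a minimal sketch using plain docker commands: a "service" container and a "step" container are attached to the same user-defined bridge, and the step reaches the service by container name. This is only an illustration of the idea, not what Drone executes internally; the network name, container name, images and the `run` wrapper are all hypothetical.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a user-defined bridge, start a service container on it, then run a
# step container on the same network that talks to the service by name.
pipeline_network_demo() {
  net="${1:-demo-pipeline-net}"   # hypothetical network name
  run docker network create "$net"
  run docker run -d --name demo-service --network "$net" nginx:alpine
  # On a user-defined bridge, containers can resolve each other by name.
  run docker run --rm --network "$net" busybox wget -q -O /dev/null http://demo-service
  # Clean up the service container and the network.
  run docker rm -f demo-service
  run docker network rm "$net"
}
```

Running `DRY_RUN=1 pipeline_network_demo` prints the five docker commands without executing them; dropping `DRY_RUN` runs them against the local daemon.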

> What do you think causes the difference between the Gitlab registry working well, but the Google Container Registry failing consistently?

Is GitLab hosted on your kubernetes cluster or inside your firewall vs Google Container Registry which is external? Is there a proxy in place and do you need to adjust https_proxy or no_proxy variables? Do you have custom DNS? Unfortunately when it comes to networking there are many different variables to consider, and it is a bit outside of my area of expertise :frowning:

> Is there maybe a way or workaround to disable the --network flag on a pipeline step?

You can set the network_mode variable for the container, similar to docker-compose. See https://docs.docker.com/compose/compose-file/#network_mode for more information.

> Is GitLab hosted on your kubernetes cluster or inside your firewall

It is hosted in GCP while our cluster is in GKE in the same project, however the DNS resolves externally so all traffic should be going through the firewall.

I’m sticking to my guess that it’s the different Docker Registry versions. I’m gonna test it with docker.io.


Just tested it with docker.io and it worked fine… :thinking:

Thanks for the info about network_mode, I’ll try that with none and bridge!

Setting network_mode: bridge on that build step did not help, nor did setting it to none.

Is using network_mode: bridge synonymous with omitting the --network flag in docker run?

I misspoke, network_mode is not synonymous with the docker-compose implementation and does not prevent the user-defined network from attaching to the container.

Also is there a hard requirement to run your agents on your kubernetes cluster? You could run your server on kubernetes, and use the autoscaler to fully manage your agents https://github.com/drone/autoscaler

Gotcha, that makes sense.

Not necessarily, but I would like for everything to be in Kubernetes, as it’s much easier for me to manage and scale.

The autoscaler does seem promising. I’ll put that in a deployment and see how it goes! :joy:

Ok, I’ve got the autoscaler running and it’s behaving as I would expect, but I’m still having the same issues.

Whenever I ssh into one of the agents created by the autoscaler and run the same command as above, the same thing happens.

Any ideas?

This time, after removing the docker network, the issue persists.

I’ve emailed Google Container Registry support to see if they have any relevant logs or information.

No clue, unfortunately. I do not have enough expertise with the gcloud tool to understand why it would throw broken pipe errors when running inside a vanilla docker environment (albeit with user-defined networks). Maybe the gcloud authors could provide more insight into possible root causes?

edit: I see you contacted their support, I hope they will be able to provide some insight.

Ah. It was user error.

Somehow gcloud was using the auth metadata on the Google Compute Engine instance rather than the gcloud config that I provided, which resulted in all of the errors.
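For anyone landing here with the same symptom, a minimal sketch of how one might check which identity gcloud is actually using inside the build container, and then pin it to an explicit key. The gcloud subcommands and the metadata-server URL are standard; the `run` wrapper, the function names and the key-file path are hypothetical, and this is not necessarily how the problem was diagnosed in this thread.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show the account gcloud considers active, and the service account the
# GCE metadata server would hand out on this instance.
show_gcloud_identity() {
  run gcloud auth list
  run gcloud config list account
  run curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
}

# Activate a service-account key explicitly so gcloud does not fall back to
# the instance's metadata credentials.
use_service_account_key() {
  key_file="$1"   # assumption: path to a JSON key mounted into the container
  run gcloud auth activate-service-account --key-file="$key_file"
}
```

If the account printed by `gcloud auth list` matches the instance's default service account rather than the one you configured, the push is probably being authenticated with the wrong identity.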

Great, sounds like you got it figured out!