[0.8.1] Agent loses connection overnight

As in the title … the agent simply lost connection overnight. All pipelines were pending startup. docker restart <container> fixed it. It has now happened for the second time and I don’t believe it’s related to network issues. Any tips for debugging? Please note it never happened on version 0.7 with web sockets.

This is my docker-compose file.

version: '2'

services:
  drone-agent:
    image: drone/agent:latest
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DRONE_SERVER=my.ci.io:9000
      - DRONE_SECRET=${DRONE_SECRET}
      - DRONE_MAX_PROCS=16
      - DRONE_DEBUG=true
      - DRONE_DEBUG_PRETTY=true

I’ve run into this too. I have a dev environment that doesn’t run many jobs. It seems that when there is no activity the agent disconnects. Restarting the agents brings things back up again … until there is no activity and it disconnects again.

Is there any heartbeat between the server and the agent? The reason I ask is that I think in my environment the issue is a firewall that kills idle connections after a while. I’m still investigating that.

I’m running drone/drone:0.8.1 currently in dev. I have seen this also on drone/drone:0.8.0-rc.3.

I do not see the issue in Prod. Prod is more active and in an environment that does not have the firewall.

Going to test resolving this by tuning Linux keepalive:

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

So I tried the following keepalive settings, which have helped other people running Elasticsearch in the same environment who saw connectivity issues due to the firewall killing idle connections:

    net.ipv4.tcp_keepalive_time = 1800
    net.ipv4.tcp_keepalive_probes = 3
    net.ipv4.tcp_keepalive_intvl = 60

And it didn’t seem to help; I still lost the agent connection overnight.
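One reason system-wide sysctl tuning may not help: the kernel only sends keepalive probes on sockets that enable SO_KEEPALIVE, and the sysctls are just defaults for those sockets. In Go, a process can enable and tune keepalive per connection via `net.Dialer`. This is an illustrative stdlib sketch, not Drone’s actual dial code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Setting Dialer.KeepAlive enables SO_KEEPALIVE on the connection
	// and sets the probe interval, independent of the system sysctls.
	d := net.Dialer{
		Timeout:   10 * time.Second,
		KeepAlive: 60 * time.Second, // probe every 60s once the link is idle
	}

	// Loopback listener just so the dial succeeds in this demo.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	defer ln.Close()
	go ln.Accept()

	conn, err := d.Dial("tcp", ln.Addr().String())
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected with keepalive:", d.KeepAlive)
}
```

If the dialing process never opts in, the `tcp_keepalive_*` sysctls have no effect on that connection, which could explain why tuning them changed nothing here.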

@jmccann I hope we can find a solution, but my gut feeling is that the problem lies in the software.

@bradrydzewski Do you have any thoughts on this? Should it be considered an issue? If so I can open one for it. Thanks!

I think we would need to add a healthcheck as described here https://github.com/grpc/grpc/blob/master/doc/health-checking.md

I am told that sending a healthcheck every 30 seconds will keep something like haproxy from killing idle connections.
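The idea of a periodic healthcheck keeping idle paths alive can be sketched with a plain TCP heartbeat loop. This is a hypothetical illustration of the concept, not Drone’s actual gRPC health-check wiring (the interval, 30s in practice, should sit well under the shortest idle timeout on the path):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// heartbeat writes a small ping frame on an otherwise idle connection
// so that firewalls and proxies see traffic and keep the flow open.
func heartbeat(conn net.Conn, interval time.Duration, stop <-chan struct{}) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, err := conn.Write([]byte("PING\n")); err != nil {
				return fmt.Errorf("heartbeat failed, connection is dead: %w", err)
			}
		case <-stop:
			return nil
		}
	}
}

func main() {
	// Loopback demo: the "server" just discards pings.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	go func() {
		c, _ := ln.Accept()
		buf := make([]byte, 64)
		for {
			if _, err := c.Read(buf); err != nil {
				return
			}
		}
	}()

	conn, _ := net.Dial("tcp", ln.Addr().String())
	stop := make(chan struct{})
	go func() {
		time.Sleep(120 * time.Millisecond) // let a few pings through
		close(stop)
	}()
	err := heartbeat(conn, 30*time.Millisecond, stop)
	fmt.Println("heartbeat stopped, err =", err)
}
```

A side benefit over raw TCP keepalive is that an application-level ping also detects a broken stream (write error), not just a dead socket.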


I have the same issue running drone in Kubernetes.

I did create https://github.com/drone/drone/issues/2246 to track this

I created a new PR to add a grpc health check: https://github.com/drone/drone/pull/2262


I’d like to continue this discussion, please. I’ve tried to work around these apparent connectivity issues between the agent and server that continually cause missed builds (stuck Pending). One issue I have with most of the discussion points so far is that I’m not going through an LB or reverse proxy for agent communications, but through Docker’s overlay networking. @bradrydzewski tells me (likely rightly so) that there are some known reliability issues with overlay networking. That config looks like this => https://gist.github.com/d596784d3138e9ba22483501589aa600

The problem is that even working around this and going direct (e.g. with this Docker stack file: https://gist.github.com/prologic/ddd39faba96618ec802d079517359104), I still have issues.

The biggest pain point right now is that there is no way to get the resulting Pending builds going again. Restarting the servers and/or the agents doesn’t help. Restarting the build just displays an error in the UI.

Interestingly, with my updated config that connects the agent(s) directly over the host network (side-stepping any overlay issues), I see errors like this in the agent logs:

[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:42816->10.0.0.10:9000: read: connection reset by peer.
[email protected]    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:49262->10.0.0.10:9000: read: connection reset by peer.
[email protected]    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
[email protected]    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken EOF.
[email protected]    | INFO: 2017/11/25 17:39:55 transport: http2Client.notifyError got notified that the client transport was broken read tcp 172.17.0.3:33444->10.0.0.10:9000: read: connection reset by peer.
[email protected]    | INFO: 2017/11/25 17:39:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
[email protected]    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}
[email protected]    | INFO: 2017/11/25 17:39:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused"; Reconnecting to {10.0.0.10:9000 <nil>}

What I don’t see is the agent(s) trying to reconnect at all. Is this expected?

the agent uses the built-in reconnect logic provided by grpc [1]

	cli := &Client{
		endpoint: endpoint,
		retry:    defaultRetryCount,
		backoff:  defaultBackoff,
		headers:  map[string][]string{},
	}

It looks to me like the agent is trying to reconnect, but the connection is being refused. Why is your network refusing connections from the agent to the server?

 Error while dialing dial tcp 10.0.0.10:9000: getsockopt: connection refused

[1] see pipeline/rpc/client.go
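The retry/backoff fields on the client struct above suggest a retry loop with increasing delays. As a rough illustration of that pattern (a sketch only; the real logic lives in grpc-go and pipeline/rpc/client.go, and the helper name here is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dialWithBackoff retries a connect function up to `retries` times,
// doubling the wait after each failure (exponential backoff).
func dialWithBackoff(connect func() error, retries int, base time.Duration) error {
	backoff := base
	var err error
	for i := 0; i < retries; i++ {
		if err = connect(); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v; retrying in %s\n", i+1, err, backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	// Simulate a server that refuses the first two connection attempts.
	attempts := 0
	err := dialWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("connection refused")
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println("connected after", attempts, "attempts, err =", err)
}
```

Note that backoff like this only helps when the server eventually accepts again; a persistent "connection refused" (as in the logs above) points at the server side or the network path, not the retry loop.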

Status update: still not working on drone/agent:latest.

The following are the error messages from drone/agent.

2018/01/14 07:19:06 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:07 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:08 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:09 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/14 07:19:10 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1

Have you tried setting the keep alive values? https://github.com/drone/drone/pull/2294

Hi @bradrydzewski, yes, I have tried it but it doesn’t work.

See the setting:


agent.keepalive.time: "1s"
agent.keepalive.timeout: "5s"

Is this setting right?

The server will reject keepalive requests with a frequency under 5m by default, which will result in the client being blocked by the server. This is a built-in grpc security feature. If you want to use duration values under 5m, you need this patch: https://github.com/drone/drone/pull/2295

Also note that 1s and 5s are pretty low values. Load balancers and proxies usually have 60-second default timeouts, so 30s and 15s are probably better values, respectively.

I will try a 5m value for keepalive.time without the https://github.com/drone/drone/pull/2295 patch:

  agent.keepalive.time: "5m"
  agent.keepalive.timeout: "30s"

The drone agent always outputs the following when the container starts in Kubernetes:

2018-01-15T01:46:00Z |DEBU| request next execution
2018/01/15 01:46:00 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1
2018/01/15 01:46:01 grpc error: done(): code: Internal: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: 1

The message appears every second.

Maybe they can help you in this thread: https://github.com/grpc/grpc-go/issues/1134

I also tried the patch https://github.com/drone/drone/pull/2295 and set PermitWithoutStream to true:

            grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
                MinTime:             c.Duration("keepalive-min-time"),
                PermitWithoutStream: true,
            }),

Still not working with that.