The Drone stack runs fine, but when I check the Drone server logs I see these error messages. Is there anything I need to fix?
11:02:08 [GIN-debug] GET /metrics --> github.com/drone/drone/server/metrics.PromHandler.func1 (13 handlers)
11:05:16 INFO: 2017/09/07 11:05:16 grpc: Server.processUnaryRPC failed to write status stream error: code = Canceled desc = "context canceled"
11:05:28 INFO: 2017/09/07 11:05:28 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
At the same time, I get a lot of error logs from the Drone agents, as below:
11:16:50 INFO: 2017/09/07 11:16:50 transport: http2Client.notifyError got notified that the client transport was broken EOF.
My drone version is
This means something (external to drone) is breaking the http2 connection between your agent and server.
I don’t know if you’re using Amazon ECS (although the term “stack” says that to me), but load balancers on AWS have given me a fair amount of trouble with Drone. If you’re using ECS, I think only the Application Load Balancer will work with Drone - the Classic Load Balancer causes the same error message to appear.
Thanks, that’s useful information. Your guess is right: I set up the Drone stack in an AWS ECS cluster.
I am not sure whether the Drone stack works well with an ALB in version 0.8, since it uses gRPC.
Could you please confirm whether Drone can work with an AWS ALB (Application Load Balancer)?
Drone uses http2 which may not play nicely with load balancers. You should check your load balancer documentation to see if they support http2 and/or grpc. You might also want to check with AWS support to see if they have a recommended configuration. I do not use AWS, so this is not something I can confirm.
Did you deploy Drone v0.8 behind an AWS ALB? There are two groups of ports: 443/80 and 9000. HTTPS/HTTP traffic is handled by the external load balancer, but what about port 9000? That’s the communication between the Drone agents and the server, and it goes through the internal load balancer.
I currently use ELBs for both the external and internal load balancers.
I agree that I can switch the external load balancer to an ALB, which supports HTTPS/HTTP well. But what about the internal load balancer? Will TCP port 9000 be supported by an ALB? I don’t think so.
The errors I get are mostly from the connection between the agents and the server, so they come from port 9000. How can I fix it?
According to the Amazon documentation, the load balancers convert HTTP/2 requests to HTTP/1.1, which means they would not be compatible with Drone:
Application Load Balancers provide native support for HTTP/2 with HTTPS listeners. You can send up to 128 requests in parallel using one HTTP/2 connection. The load balancer converts these to individual HTTP/1.1 requests and distributes them across the healthy targets
Port 9000 exposes a grpc endpoint which requires http/2
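To make the mismatch concrete, here is a minimal Python sketch (not Drone or AWS code) of how a backend could tell the two protocols apart from the first bytes it receives. The HTTP/2 client connection preface is fixed by the HTTP/2 specification; the function name and the sample request path are hypothetical. A gRPC server expects the first kind of traffic, while a load balancer that converts requests to HTTP/1.1 would deliver the second.

```python
# Fixed client connection preface defined by the HTTP/2 specification.
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"


def classify_first_bytes(data: bytes) -> str:
    """Guess the protocol from the first bytes a backend receives.

    Hypothetical helper for illustration only.
    """
    if data.startswith(HTTP2_PREFACE):
        return "http/2"
    # An HTTP/1.x request line looks like: METHOD SP target SP HTTP/1.x
    request_line = data.split(b"\r\n", 1)[0]
    parts = request_line.split(b" ")
    if len(parts) == 3 and parts[2] in (b"HTTP/1.0", b"HTTP/1.1"):
        return "http/1.x"
    return "unknown"


if __name__ == "__main__":
    # What an HTTP/2 (and therefore gRPC) client sends first.
    print(classify_first_bytes(HTTP2_PREFACE))
    # What a backend sees after conversion to HTTP/1.1 (path is made up).
    print(classify_first_bytes(b"POST /example HTTP/1.1\r\nHost: drone\r\n\r\n"))
```

If the server side of port 9000 only ever sees the second form, the gRPC handshake cannot complete, regardless of what the agent sent to the load balancer.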
Thanks for following up. In fact, Drone currently runs well.
It seems I don’t need to spend too much time on these errors if there is nothing I can do.
I tried a lot of things, but in the end I just loaded 0.5, which worked perfectly over the classic load balancer. I guess you could try customising it with HAProxy or something like that, but I found that using the earlier version solved all of my problems, as it happily communicated with the server via TCP. I might post a blog at some point about my experiences configuring Drone behind classic AWS ELB as it’s useful, but I didn’t get it going via the ALB.
There is potential to use the new NLB (Network Load Balancer). I’m interested if anyone has tried it yet.
I’m not sure how it would help if gRPC requires HTTP2; the NLB is a layer 4 load balancer so it operates at a lower level than HTTP.
I have the same issue but with docker swarm without any LB between containers.
2018-02-04T18:37:[email protected] INFO: 2018/02/04 18:37:57 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
➜ infra git:(master) ✗ docker exec -ti c8 sh
/ # ping ci
PING ci (10.0.0.110): 56 data bytes
64 bytes from 10.0.0.110: seq=0 ttl=64 time=0.177 ms
64 bytes from 10.0.0.110: seq=1 ttl=64 time=0.093 ms
--- ci ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.093/0.135/0.177 ms
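A successful ping only shows that the host answers ICMP; it says nothing about the TCP port the agent actually uses. As a rough complement, a check like the following Python sketch could be run from inside the agent container. The helper name is hypothetical, and the host "ci" and port 9000 in the usage line are assumptions taken from this thread, so adjust them to your setup.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed values: "ci" is the server's hostname, 9000 the gRPC port.
    print(tcp_port_open("ci", 9000))
```

Note that this only confirms a new TCP connection can be opened; it does not show whether an already established gRPC stream is still healthy.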
It seems something is wrong with gRPC reconnection.
I would suggest that the drone agent should stop its process in this case.
But the actual hang happens on the server,
maybe here https://github.com/grpc/grpc-go/blob/5ba054bf3709c0a51d1ed76f68d274a8c1eef459/server.go#L854
grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing" is a gRPC warning. After this event the drone-server can’t use the drone-agent, which is why jobs get stuck.
I will dig deeper.
I believe I’m using the same stack as you: a single drone-server running on AWS ECS behind a Classic ELB for both external and internal access.
I did some tests a while ago and IIRC, increasing the ELB’s “Idle timeout” would increase the time delta between these error messages, so it seems that the ELB is closing the connection periodically. However, increasing it too much (so it would close the connection less often, e.g. only once every hour) would counter-intuitively cause the Agents to permanently lose connection and stay hanging.
So, like you, I’ve decided to just live with the error messages (Idle timeout is set to 30 sec ATM).
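The idle-timeout behaviour described above can be reproduced locally with a small Python sketch. This is a toy stand-in, not the ELB or Drone itself: a TCP echo server closes any connection that stays silent longer than a configurable timeout, and a client checks whether its connection is still usable after idling. All names and timings here are hypothetical.

```python
import socket
import threading
import time


def start_idle_closing_server(idle_timeout: float):
    """Start a localhost echo server that drops connections idle for too long."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen()
    port = listener.getsockname()[1]

    def handle(conn: socket.socket) -> None:
        conn.settimeout(idle_timeout)
        try:
            while True:
                data = conn.recv(1024)
                if not data:
                    break  # client closed the connection
                conn.sendall(data)
        except OSError:
            pass  # recv timed out (idle too long) or the peer reset
        finally:
            conn.close()  # the client observes this as EOF

    def accept_loop() -> None:
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, port


def connection_survives(port: int, idle_for: float) -> bool:
    """Connect, stay silent for idle_for seconds, then try one echo round trip."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(2.0)
        time.sleep(idle_for)
        try:
            sock.sendall(b"ping")
            return sock.recv(1024) == b"ping"
        except OSError:
            return False  # reset or broken pipe after the server closed


if __name__ == "__main__":
    listener, port = start_idle_closing_server(idle_timeout=0.3)
    print("short idle:", connection_survives(port, 0.05))
    print("long idle:", connection_survives(port, 0.8))
    listener.close()
```

A client that idles longer than the timeout finds the connection gone the next time it uses it, which matches the periodic "EOF" and "transport is closing" messages. The sketch does not model the case where a very long timeout leaves agents hanging.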
I also gave ALB/NLB a try, but I couldn’t get them to support multiple origin ports.
AWS’s load balancers do not yet fully support end-to-end HTTP/2 as of Feb 2018.
The best way to avoid those error messages is to have the agent communicate with the server directly via the server’s IP (public or private), and not behind a load balancer or reverse proxy of any kind.
This communication works in Kubernetes because there is no load balancing “product” at the Service layer, it is just some VIP mappings.