Cannot Use Task IAM Roles in ECS

I am currently running Drone in an ECS cluster and have created the Drone server and agent as separate ECS tasks. Our build pipelines both provision infrastructure and deploy code in the same pipeline. This means that the Drone agents need very liberal permissions when interacting with AWS in order to create and destroy infrastructure.

I had initially put these permissions at the EC2 instance level but obviously that means that everything in the cluster (there are other non-Drone related containers running) now has these crazy permissions. I decided to change the cluster to only give minimal permissions to the Drone server and other unrelated containers and the more elevated permissions to the agents.

Unfortunately, in order to use Task IAM roles, an environment variable called AWS_CONTAINER_CREDENTIALS_RELATIVE_URI needs to be present in the container; it is set dynamically based on the UUID of the task (more info here). The problem, of course, is that this variable is available in the agent container but not in any of the pipelines’ containers. As a result, any AWS commands that rely on the permissions in the Task IAM role will fail (both the CLI and the SDKs rely on this variable) unless those permissions are also granted to the EC2 instance the container is running on. This is obviously suboptimal from a security point of view.
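For anyone following along, here is a minimal sketch of the lookup that the CLI and SDKs perform with this variable. The link-local address 169.254.170.2 is the ECS credentials endpoint; the function name and the example path are hypothetical and only there for illustration. In a pipeline container the variable is missing, so the function returns None, which is the failure described above.

```python
import os
from typing import Mapping, Optional

# Link-local address where the ECS agent serves task role credentials.
ECS_CREDENTIALS_HOST = "http://169.254.170.2"


def task_credentials_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the task role credentials URL, or None if the variable is unset.

    Hypothetical helper: it mirrors how the relative URI is combined with
    the fixed host, but it is not the SDK's actual implementation.
    """
    relative_uri = env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        # This is the situation inside a pipeline container.
        return None
    return ECS_CREDENTIALS_HOST + relative_uri


if __name__ == "__main__":
    url = task_credentials_url()
    print(url if url else "No task role credentials available in this container")
```

One possible workaround, which I have not tested, would be to pass the agent’s value of the variable through to the pipeline containers so that they resolve the same URL.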

There is also a way to set an IAM role at the ECS service level (services are composed of one or more tasks) but unfortunately that only works if the service has a load balancer; having to assign a load balancer to every service regardless of whether they need one or not also feels suboptimal. Is there a way around this that I’m not seeing?

I have no experience with ECS, but teams using Kubernetes generally deploy agents with a docker:dind container linked to the agent container. Drone is then configured to use the docker:dind container instead of the host machine’s Docker daemon.

Perhaps this approach could be used with ECS to work around these issues.

Thanks for the quick reply, much appreciated! That sounds very complicated so I think I’ll just segregate the Drone server and agents in their own specific cluster.

Hi @svozza

I deployed Drone on ECS in a cluster that hosts different services. I use an application load balancer (ALB) in front of the cluster. Each service has its own role, and Drone (server and agents) uses its role with no problem.
I can explain how I deployed it if you’d like.

I’d also like to hear how you deployed them as different services and set up the communication between them.
You’re welcome to get in touch with me by email.

Do you have any code that you can reference for the dind -> drone-agent linking?

I don’t think it’s possible with the ALB right now, since Drone needs both TCP and HTTPS for incoming traffic. I just tried this with an NLB + ALB in AWS and it didn’t work, as ECS doesn’t want you to use two different load balancers.

If anyone has any insight, I would love it, as I’m keen on getting my setup to 0.8.

My current setup in ECS has Drone deployed as a single task containing the Drone server and a few agents. They are linked and communicate over TCP directly.
This is not ideal, though. I’m waiting for a change: either the AWS ALB starts supporting HTTP/2 or TCP, or Drone adds another communication option.

ALB does support HTTP/2, but not both HTTP/2 and TCP at the same time on different ports.

I am able to get away with multiple agents on a docker swarm setup that I have, but with ECS there is no solution for this that can scale effectively without an overlay network of custom origin.

No, ALB only supports incoming HTTP/2 and then translates the requests to HTTP/1,
which is why gRPC cannot communicate. I’ll be happy to learn otherwise.
Full HTTP/2 support could be a solution, as could a TCP mode.

Thanks for the further replies, I’m only getting a chance to reply now. So I have deployed the Drone server as an ECS service and the agents as a separate ECS service. That’s the only way to do it if you want to have multiple agents talking to one server. The agents service doesn’t need a load balancer (the agents know how to reach the Drone server from the DRONE_SERVER environment variable), so that’s why I can’t use the service role approach. My colleague has put the CloudFormation for a very similar cluster here so you can see the setup I’m discussing:

That example is running Drone 0.5 but I’m running 0.8. I’m not sure I follow why you need TCP; I have the full Drone cluster running with just one ALB, which is a layer 7 load balancer that only operates at the HTTP/S level.

Can we see your cloudformation?

Unfortunately, my template is in our private company GitHub repo so I can’t share it, but it is virtually identical to the one I’ve posted. The only difference is that I’m using Drone 0.8.

Could you give details on how you set up the ALB with Drone?

I explained my environment here:

So I am curious how you can work it out with ALB.

^^

Yeah, I was going to ask. It can’t be the same because they changed the ports in Drone, and it’s now gRPC instead of websockets.

OK, that’s weird. I’m using websockets and it still works. I have set DRONE_SERVER to wss://drone.example.com/ws/broker and am forwarding any traffic sent to the load balancer on port 8000 to the Drone server task.

Ah wait, my mistake. I’m actually on Drone 0.7.
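For anyone hitting the same confusion, a quick way to tell which transport an agent is configured for is to look at the shape of DRONE_SERVER. Below is a rough sketch, assuming that websocket-based agents use a ws:// or wss:// URL like the one quoted above and that gRPC-based agents use a bare host:port address. The function name and the port 9000 in the usage example are assumptions on my part, not something confirmed in this thread.

```python
def drone_server_transport(drone_server: str) -> str:
    """Guess the agent transport from the DRONE_SERVER value.

    Hypothetical helper: a ws:// or wss:// URL is treated as the websocket
    setup; anything else is assumed to be a host:port gRPC address.
    """
    if drone_server.startswith(("ws://", "wss://")):
        return "websocket"
    return "grpc"


if __name__ == "__main__":
    # Value quoted earlier in the thread (websocket-style agent).
    print(drone_server_transport("wss://drone.example.com/ws/broker"))
    # Assumed gRPC-style value; the port is illustrative only.
    print(drone_server_transport("drone.example.com:9000"))
```

If that assumption holds, a websocket-style value can sit behind a single ALB as described above, while a host:port value runs into the TCP and HTTP/2 problem discussed earlier.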