Hello,
We are running Drone v2.0.6, and noticing that prometheus scrapes will intermittently time out after 10 seconds, when a normal scrape time will typically take under a second -
This is leading to gaps in our data where we are experiencing the scrape failures -
Our scrape configuration -
- job_name: drone-server
# bearer-token was generated manually following https://docs.drone.io/server/user/machine/
bearer_token: "%{hiera('drone_prometheus_bearer_token')}"
ec2_sd_configs:
# dev
- region: eu-west-1
role_arn: arn:aws:iam::01234556789:role/prometheus_ec2_sd
port: 80
# foundation-prod
- region: eu-west-1
role_arn: arn:aws:iam::1234567890:role/prometheus_ec2_sd
port: 80
relabel_configs:
- source_labels: [__meta_ec2_tag_Name]
regex: drone-server.*
action: keep
- source_labels: [__meta_ec2_instance_id]
target_label: instance
- source_labels: [__meta_ec2_tag_Name]
target_label: instance_name
- source_labels: [__meta_ec2_tag_Vpc]
target_label: vpc
- source_labels: [__meta_ec2_tag_Rootaccount]
target_label: root_account
- source_labels: [__meta_ec2_tag_Subaccount]
target_label: sub_account
- source_labels: [__meta_ec2_tag_Owner]
target_label: owner
- source_labels: [__meta_ec2_tag_Subsystem]
target_label: sub_system
I’ve taken a look at our drone-server, and there isn’t any high utilization of resources that I can tell that could be contributing to the timeouts. We’re thinking it might be that the drone server process is sometimes locking resources when the /metrics endpoint is hit.