Kubernetes runner stuck in running state / bringing the Kubernetes runner to GA

marko-gacesa recently joined the team and he will be focusing on bringing the Kubernetes runner to GA. He authored this pull request that redesigns how we watch for updates, including periodic polling as a fallback just in case the runner fails to receive an event from Kubernetes (for whatever reason) with the hopes of making the runner much more resilient. We would ask everyone in reading this thread to begin testing using the latest runner image. Also please remember that when you are using our latest runner image you are using an unstable build; this should not be tested in production.

This particular issue has been very difficult for us to triage because it is inconsistent; some people use the runner in production without issues and others periodically see pipelines or steps stuck in a running state. Going forward when someone reports an issue we are going to ask for cluster details so that we can try to mirror your cluster internally and reproduce. We may also ask to schedule a Zoom call to triage live. We would love to have the runner reach GA by early July but that will depend on community testing and feedback, so we look forward to your help.

If you continue to experience issues with pipelines or steps getting stuck with the Kubernetes runner, please let us know and please provide the following:

  1. Ensure you are using the latest Kubernetes runner image (please avoid reporting issues with previous versions)
  2. Enable debug and trace logging
  3. Publish your runner logs to a gist and provide a link
  4. Publish your yaml to a gist and provide a link (bonus points for providing a simplified yaml that we can use to easily reproduce)
  5. Publish your cluster details to a gist and provide a link; provide whatever information it takes for us to re-create your cluster in our environment

If you experience issues with the kubernetes runner unrelated to pipelines or steps stuck in pending or running status, please create a separate thread and we will triage and address separately. Thanks!


EDIT 2021-06-08

we uncovered an issue with an unknown container error being returned and causing steps to fail sporadically. We are investigating. See Latest drone-runner-kube fails successful steps and does not display output


EDIT 2021-06-15

we uncovered an issue where an invalid or non-existing image would cause the pipeline to hang due to an uncaught error. This has been patched and a new version of the runner image is available. https://github.com/drone-runners/drone-runner-kube/pull/58

2 Likes

i’ve built the runner with the latest changes and it is running now on our cluster… WIll monitor it today and will come back with some results.

Thanks!

Any chance for a 1.0.0-beta.7 version to be published to Docker Hub?

@alonsodomin yes, we will plan on tagging 1.0.0-beta.7 end of this week if we do not receive any new issue reports. We would continue to ask everyone to test the latest kubernetes runner image (we just pushed another update) and continue reporting issues they encounter related to stuck pipelines.

I thought new version it is perfect already, but I noticed issue after I re enabled one repo what has over 20 steps. One step got stuck on running

  - name: db-connections
    image: ecr.ad.dice.fm/base:waitdbs-latest
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - waitpostgres $PG_URL
      - waitmongo $MONGO_HOST
      - waitredis $REDIS_HOST

Redis check is simple script and it clearly finished:

#!/bin/bash
failure=0
while [ true ]; do
    redis-cli -h $1 ping
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 120 ]; then
        exit 1
    fi
    echo "Waiting for redis..."
done%           

It said “PONG”, what means it connected and finished

Normally it takes 8s, but in this case it was Running until job timeout

Jobs with less step, are running without single failure since I am using new latest runner image

It looked exactly like this, with a difference that it never finish. But all looked the same in a step logs

@michalzxc thanks for the report. Is it possible to provide a yaml that can be used to reproduce? as well as your kubernetes runner logs with trace logging enabled? @marko-gacesa any other information that would be helpful?

I enabled trace only after, so I am trying to reproduce it now myself (also I wasn’t sure if my latest runner image was really the latest, so restarted runner pod with pull always).
As said, new version was working perfectly until now, and it was a first time when step got stuck since a week, so maybe it was just accident that it happened to this particular job at my first try to run it.
I will get back to you when I will have logs collected, thank you!

I tried to remove all private images while keeping all drone specific parts (persistent volumes, secret used, service type containers) and this is what I got:

name: default
kind: pipeline
type: kubernetes

volumes:
  - name: cache
    host:
      path: /tmp/drone/cache/v1

services:
  - name: postgres
    image: bitnami/postgresql:11
    environment:
      POSTGRESQL_POSTGRES_CONNECTION_LIMIT: 500
      POSTGRESQL_LOG_CONNECTIONS: YES
      POSTGRESQL_PASSWORD: "pass123"
      POSTGRESQL_USERNAME: postgres
  - name: mongodb
    image: mongo:3
    command: [ --smallfiles ]
  - name: redis
    image: redis
  - name: stripe-mock
    image: stripemock/stripe-mock:v0.84.0
  - name: geoip
    image: fiorix/freegeoip

basic_env_vars: &basic_env_vars
  PG_URL: "postgres://postgres:[email protected]:5432/test?ssl=false"
  MONGO_HOST: "mongodb"
  REDIS_HOST: redis
  REDIS_URL: "redis://redis:6379"
  CACHE_TEST_REDIS_URL: "redis://redis:6379/2"


steps:
  - name: restore-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      restore: true
      ttl: 7
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:
        - ./deps
        - ./_build

  - name: db-connections
    image: debian
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - apt update && apt install -y postgresql-client redis-tools
      - wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add -
      - echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list
      - apt update && apt install -y mongodb-org
      - while [ true ]; do mongo --norc --quiet --host=$MONGO_HOST:27017 <<< "db.getMongo()";     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for mongo..."; done
      - while [ true ]; do     DB="$(echo -e ''"${PG_URL}"''|sed 's/?ssl=false//g'|sed 's/?ssl=true//g')";     psql -Atx ''"${DB}"'' -c "select current_date";     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for postgres..."; done
      - while [ true ]; do     redis-cli -h $REDIS_HOST ping;     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for redis..."; done

  - name: install-deps
    image: elixir:latest
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Do some elixir stuff"
    depends_on:
      - restore-cache
      - db-connections
  - name: rebuild-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      rebuild: true
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:  
        - ./deps
        - ./_build
    depends_on:
      - install-deps

  - name: k-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Some app tests"
    depends_on:
      - install-deps
  - name: t-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "More tests"
    depends_on:
      - install-deps
  - name: other-tests-and-credo
    image: elixir:latest
    environment:
      <<: *basic_env_vars
    commands:
      - echo "other tests"
    depends_on:
      - install-deps
  - name: build-image
    image: debian
    privileged: true
    environment:
      SOME_SECRET:
        from_secret: SOME_SECRET
    commands:
      - echo "Build image and push it"
   
  - name: deploy-image
    image: debian
    environment:
      OLLY_SECRET:
        from_secret: OLLY_SECRET
    commands:
      - echo "Push chart to chart museum"
    depends_on:
      - other-tests-and-credo
      - t-tests
      - k-tests
      - build-image

@michalzxc thank you very much! It would be really great if you could get trace logs of a stuck runner…

At which step the runner was stuck? What was the step doing? Can you provide some details about your kubernetes setup?

Does the issue happen only with this repo (with a lot of steps)? Other repos (with pipelines with fewer steps) are not causing any problems?

It was the db-connections one, it was keep “running” after redis connected and displayed “PONG” result
I wasn’t able to replicated it so far, previously with old version of runner it was happening quite often, at this moment I saw it only once, I am restarting jobs today randomly today in hope to replicate it at some point
This is bare metal kubernetes on OVH dedicated hosts INFRA-3 ones (Processor: AMD Epyc 7371 - 16 c / 32 t - 3.1 GHz / 3.8 GHz with 256Gigs of memory, and hardware SSD raid), it is kubernetes 1.18 and is using calico networking. It is using crio to manage containers, with this path I got when I was asking for help there with previous version of drone runner - Different container state between kubernetes and crictl · Issue #4693 · cri-o/cri-o · GitHub

I got something else this time

After I noticed that “db-connections” step failed like this, I cancelled the job and restarted it

time="2021-06-24T14:33:30Z" level=debug msg="stage details fetched" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:33:31Z" level=debug msg="updated stage to running" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:33:32Z" level=debug msg="PodWatcher: Started" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:32Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-b8xnm0v4wi67znzpbi8b image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:33Z" level=debug msg="PodWatcher: Waiting..." container=drone-b8xnm0v4wi67znzpbi8b pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:40Z" level=debug msg="PodWatcher: Wait finished. Duration=7.66s" container=drone-b8xnm0v4wi67znzpbi8b error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-b8xnm0v4wi67znzpbi8b pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-b8xnm0v4wi67znzpbi8b error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:50Z" level=debug msg="received exit code 0" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 step.name=clone thread=66
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"build-image\"" container=drone-abg3r89abnbdzlg9i258 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"redis\"" container=drone-aw5mytoznqji4kqdkmf4 image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"postgres\"" container=drone-ynqhyrea0ftwqf51270v image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"db-connections\"" container=drone-sxt7effysqf6n01a15g6 image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"restore-cache\"" container=drone-zbg97fzhxezfx2lsip66 image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"mongodb\"" container=drone-9slbly9y70ivg9dhok2g image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" container=drone-kjv8whfgek2k5ntr5r91 image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"geoip\"" container=drone-t43uzmbdmf0b8s8wdryw image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-abg3r89abnbdzlg9i258 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-aw5mytoznqji4kqdkmf4 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-ynqhyrea0ftwqf51270v pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-sxt7effysqf6n01a15g6 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-zbg97fzhxezfx2lsip66 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-kjv8whfgek2k5ntr5r91 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-t43uzmbdmf0b8s8wdryw pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Wait finished. Duration=1.35s" container=drone-abg3r89abnbdzlg9i258 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:52Z" level=debug msg="PodWatcher: Waiting..." container=drone-9slbly9y70ivg9dhok2g pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.16s" container=drone-zbg97fzhxezfx2lsip66 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.32s" container=drone-aw5mytoznqji4kqdkmf4 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.27s" container=drone-ynqhyrea0ftwqf51270v error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.22s" container=drone-sxt7effysqf6n01a15g6 error="container failed to start (invalid image?)" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Waiting..." container=drone-zbg97fzhxezfx2lsip66 pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-zbg97fzhxezfx2lsip66 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:54Z" level=debug msg="received exit code 0" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 step.name=restore-cache thread=66
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=3.63s" container=drone-9slbly9y70ivg9dhok2g error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=4.23s" container=drone-t43uzmbdmf0b8s8wdryw error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=4.63s" container=drone-kjv8whfgek2k5ntr5r91 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:34:43Z" level=debug msg="PodWatcher: Waiting..." container=drone-vzivcajpr2mmprx7kn9f pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:43Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-vzivcajpr2mmprx7kn9f error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:43Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=tournesol-build-and-test thread=83
time="2021-06-24T14:34:43Z" level=debug msg="Engine: Starting step: \"tournesol-deployment\"" container=drone-pr0qw6hlqjeexlgspugo image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:44Z" level=debug msg="PodWatcher: Waiting..." container=drone-pr0qw6hlqjeexlgspugo pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Wait finished. Duration=1.28s" container=drone-pr0qw6hlqjeexlgspugo error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-pr0qw6hlqjeexlgspugo pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-pr0qw6hlqjeexlgspugo error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:45Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=tournesol-deployment thread=83
time="2021-06-24T14:34:45Z" level=debug msg="Engine: Starting step: \"prefect-deployment\"" container=drone-o2j0sjimgpuvkxg7go76 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:45Z" level=debug msg="Engine: Starting step: \"dam-deployment\"" container=drone-00tu1zxwlhzfg5yuz1rz image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-o2j0sjimgpuvkxg7go76 pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-00tu1zxwlhzfg5yuz1rz pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=2.96s" container=drone-o2j0sjimgpuvkxg7go76 error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=2.92s" container=drone-00tu1zxwlhzfg5yuz1rz error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Waiting..." container=drone-00tu1zxwlhzfg5yuz1rz pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-00tu1zxwlhzfg5yuz1rz error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:48Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=dam-deployment thread=83
time="2021-06-24T14:35:06Z" level=debug msg="stage received" stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="stage received" stage.id=4663 stage.name=default stage.number=1 thread=84
time="2021-06-24T14:35:06Z" level=debug msg="stage accepted" stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="stage accepted by another runner" stage.id=4663 stage.name=default stage.number=1 thread=84
time="2021-06-24T14:35:06Z" level=debug msg="poller: request stage from remote server" thread=84
time="2021-06-24T14:35:06Z" level=debug msg="stage details fetched" build.id=4280 build.number=3045 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="updated stage to running" build.id=4280 build.number=3045 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:07Z" level=debug msg="PodWatcher: Started" pod=drone-zjpllesl0kcrfdrv8ao9
time="2021-06-24T14:35:07Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-mov6r84ugyl04dw5k7jc image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-zjpllesl0kcrfdrv8ao9
time="2021-06-24T14:35:08Z" level=debug msg="PodWatcher: Waiting..." container=drone-mov6r84ugyl04dw5k7jc pod=drone-zjpllesl0kcrfdrv8ao9 state=RUNNING
time="2021-06-24T14:35:08Z" level=debug msg="received cancellation" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:10Z" level=debug msg="stage received" stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="stage received" stage.id=4664 stage.name=default stage.number=1 thread=85
time="2021-06-24T14:35:10Z" level=debug msg="stage accepted" stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="stage accepted by another runner" stage.id=4664 stage.name=default stage.number=1 thread=85
time="2021-06-24T14:35:10Z" level=debug msg="poller: request stage from remote server" thread=85
time="2021-06-24T14:35:10Z" level=debug msg="stage details fetched" build.id=4281 build.number=419 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="updated stage to running" build.id=4281 build.number=419 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:11Z" level=debug msg="PodWatcher: Started" pod=drone-alt3npdar9ttei61w56p
time="2021-06-24T14:35:11Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-zpga7msqmvbo9viioqdy image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-alt3npdar9ttei61w56p
time="2021-06-24T14:35:11Z" level=debug msg="PodWatcher: Waiting..." container=drone-zpga7msqmvbo9viioqdy pod=drone-alt3npdar9ttei61w56p state=RUNNING
time="2021-06-24T14:35:18Z" level=debug msg="destroying the pipeline environment" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:23Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:35:23Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:35:23Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:23Z" level=debug msg="stage failed" build.id=4279 build.number=418 duration=108 error="1 error occurred:\n\t* Optimistic Lock Error\n\n" repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66

I think there is some pattern that it only happens to this job. Maybe use of volumes, or amount of steps :thinking:

There are other jobs with less steps using “db-connections” container the same way and no single issue since I updated to latest runner last week

Wow @michalzxc ! Thank you very much. This is super useful.

In the log I noticed an error: ‘error=“container failed to start (invalid image?)”’. It happened twice! Both times for the same “db-connections” step. This error happens when Kubernetes fails to run a container. As the error text explains this can happen if there is a typo in the image name or the image doesn’t exist, but can also happen for other reasons. Do you have a idea why would this container fail to start? Memory requirement? Some other limits?

So after job restart it was fine (so image is there, and there was no typo), I noticed it has “pull:always” some maybe some repository short issue?
Removing pull:always now

I can’t imagine it might be resource issue, drone has 2 very big hosts just for itself:

free -m
              total        used        free      shared  buff/cache   available
Mem:         257890        8151       86191         644      163546      250510
Swap:             0           0           0
w
 16:51:17 up 30 days, 54 min,  1 user,  load average: 3.95, 1.96, 1.03

Out of 32 threads (16 physical cores)

All it runs are 3 scripts:

waitdbs❯ cat waitmongo                                                                                                                                  ecr-base/git/main !
#!/bin/bash
failure=0
while [ true ]; do
    mongo --norc --quiet --host=$1:27017 <<< "db.getMongo()"
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for mongo..."
done                                                                                                                                                           waitdbs❯ cat waitpostgres                                                                                                                               ecr-base/git/main !
#!/bin/bash
failure=0
while [ true ]; do
    DB="$(echo -e ''"${1}"''|sed 's/?ssl=false//g'|sed 's/?ssl=true//g')"
    psql -Atx ''"${DB}"'' -c "select current_date"
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for postgres..."
done                                                                                                                                                               waitdbs❯ cat waitredis                                                                                                                                  ecr-base/git/main !
#!/bin/bash
failure=0
while [ true ]; do
    redis-cli -h $1 ping
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for redis..."
done

In order to block pipeline test steps until mongo, redis, postgress (“drone services”) will start

I got the same case for different image now, without pull:always, like before it was all fine after job Restart


Like before I terminated it after observed issue and restarted a job

time="2021-06-24T17:23:13Z" level=debug msg="stage details fetched" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 thread=14
time="2021-06-24T17:23:13Z" level=debug msg="updated stage to running" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 thread=14
time="2021-06-24T17:23:13Z" level=debug msg="PodWatcher: Waiting..." container=drone-jmq50lkyndn7mhfl9mnb pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:13Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-jmq50lkyndn7mhfl9mnb error="<nil>" pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:13Z" level=debug msg="received exit code 0" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 step.name=kim-tests thread=23
time="2021-06-24T17:23:14Z" level=debug msg="PodWatcher: Started" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:14Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-wrnpbai071qsvie81kvy image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:15Z" level=debug msg="PodWatcher: Waiting..." container=drone-wrnpbai071qsvie81kvy pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:18Z" level=debug msg="PodWatcher: Waiting..." container=drone-vkvrzel3l1bzpeqxwis0 pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:18Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-vkvrzel3l1bzpeqxwis0 error="<nil>" pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:18Z" level=debug msg="received exit code 0" build.id=4316 build.number=431 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 step.name=ticketing-tests thread=64
time="2021-06-24T17:23:18Z" level=debug msg="Engine: Starting step: \"deploy-image\"" container=drone-x2mzoyh35lh4h99ofusg image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-g2lowjiru2bavhn48ekx
time="2021-06-24T17:23:18Z" level=debug msg="PodWatcher: Waiting..." container=drone-x2mzoyh35lh4h99ofusg pod=drone-g2lowjiru2bavhn48ekx state=RUNNING
time="2021-06-24T17:23:18Z" level=debug msg="PodWatcher: Wait finished. Duration=3.46s" container=drone-wrnpbai071qsvie81kvy error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:19Z" level=debug msg="PodWatcher: Wait finished. Duration=1.68s" container=drone-x2mzoyh35lh4h99ofusg error="<nil>" pod=drone-g2lowjiru2bavhn48ekx state=RUNNING
time="2021-06-24T17:23:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-x2mzoyh35lh4h99ofusg pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:19Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-x2mzoyh35lh4h99ofusg error="<nil>" pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:19Z" level=debug msg="received exit code 0" build.id=4316 build.number=431 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 step.name=deploy-image thread=64
time="2021-06-24T17:23:20Z" level=debug msg="destroying the pipeline environment" build.id=4316 build.number=431 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 thread=64
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-wrnpbai071qsvie81kvy pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-wrnpbai071qsvie81kvy error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:21Z" level=debug msg="received exit code 0" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 step.name=clone thread=14
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" container=drone-cpeyljjqaaffpdo4739g image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"geoip\"" container=drone-9ciy3gxehre9vi7y0kp7 image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"mongodb\"" container=drone-5582lk8efm8gk1stizbz image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"build-image\"" container=drone-r7arz62lvic2ky2g5nex image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"postgres\"" container=drone-6erl1jsbgmxebcbgqsjn image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"redis\"" container=drone-0lns3akpqxqgzoxl9a5a image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"db-connections\"" container=drone-knscxdhnncijhovg9idw image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="Engine: Starting step: \"restore-cache\"" container=drone-sjeba6nz8loeuzqi80ky image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-cpeyljjqaaffpdo4739g pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-9ciy3gxehre9vi7y0kp7 pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-5582lk8efm8gk1stizbz pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-r7arz62lvic2ky2g5nex pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:21Z" level=debug msg="PodWatcher: Waiting..." container=drone-6erl1jsbgmxebcbgqsjn pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:22Z" level=debug msg="PodWatcher: Waiting..." container=drone-knscxdhnncijhovg9idw pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:22Z" level=debug msg="PodWatcher: Waiting..." container=drone-sjeba6nz8loeuzqi80ky pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:22Z" level=debug msg="PodWatcher: Wait finished. Duration=1.37s" container=drone-cpeyljjqaaffpdo4739g error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:22Z" level=debug msg="PodWatcher: Waiting..." container=drone-0lns3akpqxqgzoxl9a5a pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:23Z" level=debug msg="PodWatcher: Waiting..." container=drone-sfz1tcrjxgb2gaq84l15 pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:23Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-sfz1tcrjxgb2gaq84l15 error="<nil>" pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:23Z" level=debug msg="received exit code 0" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 step.name=other-tests-and-credo thread=23
time="2021-06-24T17:23:24Z" level=debug msg="PodWatcher: Wait finished. Duration=3.30s" container=drone-5582lk8efm8gk1stizbz error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:24Z" level=debug msg="PodWatcher: Wait finished. Duration=3.25s" container=drone-r7arz62lvic2ky2g5nex error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:24Z" level=debug msg="PodWatcher: Wait finished. Duration=3.35s" container=drone-9ciy3gxehre9vi7y0kp7 error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:24Z" level=debug msg="PodWatcher: Wait finished. Duration=3.20s" container=drone-6erl1jsbgmxebcbgqsjn error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-g2lowjiru2bavhn48ekx state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4316 build.number=431 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 thread=64
time="2021-06-24T17:23:25Z" level=debug msg="updated stage to complete" build.id=4316 build.number=431 duration=307 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 thread=64
time="2021-06-24T17:23:25Z" level=debug msg="poller: request stage from remote server" thread=64
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=3.22s" container=drone-sjeba6nz8loeuzqi80ky error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=3.63s" container=drone-knscxdhnncijhovg9idw error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=2.81s" container=drone-0lns3akpqxqgzoxl9a5a error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Waiting..." container=drone-knscxdhnncijhovg9idw pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-knscxdhnncijhovg9idw error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Waiting..." container=drone-sjeba6nz8loeuzqi80ky pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-sjeba6nz8loeuzqi80ky error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:23:25Z" level=debug msg="received exit code 0" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 step.name=db-connections thread=14
time="2021-06-24T17:23:25Z" level=debug msg="received exit code 0" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 step.name=restore-cache thread=14
time="2021-06-24T17:23:25Z" level=debug msg="Engine: Starting step: \"install-deps\"" container=drone-22zn3979ezw4m56tr06p image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:23:26Z" level=debug msg="PodWatcher: Waiting..." container=drone-22zn3979ezw4m56tr06p pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:28Z" level=debug msg="PodWatcher: Wait finished. Duration=2.47s" container=drone-22zn3979ezw4m56tr06p error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:23:32Z" level=debug msg="done listening for cancellations" build.id=4316 build.number=431 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4699 stage.name=default stage.number=1 thread=64
time="2021-06-24T17:23:56Z" level=debug msg="PodWatcher: Waiting..." container=drone-m9n67vxujtpm2da4l8cr pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:56Z" level=debug msg="PodWatcher: Wait finished. Duration=0.07s" container=drone-m9n67vxujtpm2da4l8cr error="<nil>" pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:23:56Z" level=debug msg="received exit code 0" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 step.name=ticketing-tests thread=23
time="2021-06-24T17:23:56Z" level=debug msg="Engine: Starting step: \"deploy-image\"" container=drone-7s9xc6cuqaaiiwwkmrmh image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-3uv54ngenrvexyembl5c
time="2021-06-24T17:23:56Z" level=debug msg="PodWatcher: Waiting..." container=drone-7s9xc6cuqaaiiwwkmrmh pod=drone-3uv54ngenrvexyembl5c state=RUNNING
time="2021-06-24T17:24:00Z" level=debug msg="PodWatcher: Wait finished. Duration=3.99s" container=drone-7s9xc6cuqaaiiwwkmrmh error="<nil>" pod=drone-3uv54ngenrvexyembl5c state=RUNNING
time="2021-06-24T17:24:00Z" level=debug msg="PodWatcher: Waiting..." container=drone-7s9xc6cuqaaiiwwkmrmh pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:24:00Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-7s9xc6cuqaaiiwwkmrmh error="<nil>" pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:24:00Z" level=debug msg="received exit code 0" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 step.name=deploy-image thread=23
time="2021-06-24T17:24:00Z" level=debug msg="destroying the pipeline environment" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 thread=23
time="2021-06-24T17:24:05Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:24:05Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-3uv54ngenrvexyembl5c state=TERMINATED
time="2021-06-24T17:24:05Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 thread=23
time="2021-06-24T17:24:05Z" level=debug msg="updated stage to complete" build.id=4315 build.number=430 duration=428 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 thread=23
time="2021-06-24T17:24:05Z" level=debug msg="poller: request stage from remote server" thread=23
time="2021-06-24T17:24:06Z" level=debug msg="PodWatcher: Waiting..." container=drone-eqdokb54m2mdibsgx304 pod=drone-6jmzs7qziuimqz6of0an state=TERMINATED
time="2021-06-24T17:24:06Z" level=debug msg="PodWatcher: Wait finished. Duration=0.12s" container=drone-eqdokb54m2mdibsgx304 error="<nil>" pod=drone-6jmzs7qziuimqz6of0an state=TERMINATED
time="2021-06-24T17:24:06Z" level=debug msg="received exit code 0" build.id=4313 build.number=283 repo.id=1722 repo.name=ecr-base repo.namespace=dicefm stage.id=4696 stage.name=base-image stage.number=1 step.name=elixir-1.8-otp-22 thread=32
time="2021-06-24T17:24:11Z" level=debug msg="PodWatcher: Waiting..." container=drone-r7arz62lvic2ky2g5nex pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:24:11Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-r7arz62lvic2ky2g5nex error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:24:11Z" level=debug msg="received exit code 0" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 step.name=build-image thread=14
time="2021-06-24T17:24:12Z" level=debug msg="done listening for cancellations" build.id=4315 build.number=430 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4698 stage.name=default stage.number=1 thread=23
time="2021-06-24T17:24:26Z" level=debug msg="PodWatcher: Waiting..." container=drone-9plcd9veo0mr13nccqoh pod=drone-6jmzs7qziuimqz6of0an state=TERMINATED
time="2021-06-24T17:24:26Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-9plcd9veo0mr13nccqoh error="<nil>" pod=drone-6jmzs7qziuimqz6of0an state=TERMINATED
time="2021-06-24T17:24:26Z" level=debug msg="received exit code 0" build.id=4313 build.number=283 repo.id=1722 repo.name=ecr-base repo.namespace=dicefm stage.id=4696 stage.name=base-image stage.number=1 step.name=python3.6 thread=32
time="2021-06-24T17:24:26Z" level=debug msg="Engine: Starting step: \"data\"" container=drone-01kkwuq8d06edueyeez0 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-6jmzs7qziuimqz6of0an
time="2021-06-24T17:24:26Z" level=debug msg="PodWatcher: Waiting..." container=drone-01kkwuq8d06edueyeez0 pod=drone-6jmzs7qziuimqz6of0an state=RUNNING
time="2021-06-24T17:24:29Z" level=debug msg="PodWatcher: Wait finished. Duration=2.58s" container=drone-01kkwuq8d06edueyeez0 error="<nil>" pod=drone-6jmzs7qziuimqz6of0an state=RUNNING
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Waiting..." container=drone-22zn3979ezw4m56tr06p pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-22zn3979ezw4m56tr06p error="<nil>" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:26:42Z" level=debug msg="received exit code 0" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 step.name=install-deps thread=14
time="2021-06-24T17:26:42Z" level=debug msg="Engine: Starting step: \"other-tests-and-credo\"" container=drone-7cqtpie1szryclzyhzbm image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:26:42Z" level=debug msg="Engine: Starting step: \"ticketing-tests\"" container=drone-9t2fglr4xbi7f0v1ay6j image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:26:42Z" level=debug msg="Engine: Starting step: \"rebuild-cache\"" container=drone-bmtmwxrvujoliglad8dj image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:26:42Z" level=debug msg="Engine: Starting step: \"kim-tests\"" container=drone-iu5vmmu3twpha8s3cih4 image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-kgaa3vuwl17rzj40du0x
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Waiting..." container=drone-7cqtpie1szryclzyhzbm pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Waiting..." container=drone-9t2fglr4xbi7f0v1ay6j pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Waiting..." container=drone-bmtmwxrvujoliglad8dj pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:26:42Z" level=debug msg="PodWatcher: Waiting..." container=drone-iu5vmmu3twpha8s3cih4 pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:26:43Z" level=debug msg="PodWatcher: Wait finished. Duration=1.52s" container=drone-7cqtpie1szryclzyhzbm error="container failed to start (invalid image?)" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:27:13Z" level=debug msg="received cancellation" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 thread=14
time="2021-06-24T17:27:13Z" level=debug msg="PodWatcher: Wait finished. Duration=31.12s" container=drone-iu5vmmu3twpha8s3cih4 error="context canceled" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:27:13Z" level=debug msg="PodWatcher: Wait finished. Duration=31.17s" container=drone-bmtmwxrvujoliglad8dj error="context canceled" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:27:13Z" level=debug msg="PodWatcher: Wait finished. Duration=31.22s" container=drone-9t2fglr4xbi7f0v1ay6j error="context canceled" pod=drone-kgaa3vuwl17rzj40du0x state=RUNNING
time="2021-06-24T17:27:13Z" level=debug msg="destroying the pipeline environment" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 thread=14
time="2021-06-24T17:27:14Z" level=debug msg="stage received" stage.id=4701 stage.name=default stage.number=1 thread=7
time="2021-06-24T17:27:14Z" level=debug msg="stage received" stage.id=4701 stage.name=default stage.number=1 thread=81
time="2021-06-24T17:27:14Z" level=debug msg="stage accepted" stage.id=4701 stage.name=default stage.number=1 thread=7
time="2021-06-24T17:27:14Z" level=debug msg="stage accepted by another runner" stage.id=4701 stage.name=default stage.number=1 thread=81
time="2021-06-24T17:27:14Z" level=debug msg="poller: request stage from remote server" thread=81
time="2021-06-24T17:27:14Z" level=debug msg="stage details fetched" build.id=4318 build.number=433 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4701 stage.name=default stage.number=1 thread=7
time="2021-06-24T17:27:15Z" level=debug msg="updated stage to running" build.id=4318 build.number=433 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4701 stage.name=default stage.number=1 thread=7
time="2021-06-24T17:27:16Z" level=debug msg="PodWatcher: Started" pod=drone-t5sb3ns93cu45iz8y4py
time="2021-06-24T17:27:16Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-693jmu1xbd88t9awoqe3 image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-t5sb3ns93cu45iz8y4py
time="2021-06-24T17:27:16Z" level=debug msg="PodWatcher: Waiting..." container=drone-693jmu1xbd88t9awoqe3 pod=drone-t5sb3ns93cu45iz8y4py state=RUNNING
time="2021-06-24T17:27:18Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:27:18Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-kgaa3vuwl17rzj40du0x state=TERMINATED
time="2021-06-24T17:27:18Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4317 build.number=432 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4700 stage.name=default stage.number=1 thread=14

All other jobs (other than this a lot of steps one) still without a single issue

It happened 3 times today, out of 25 runs. Fixing it would be perfect, but if it could be just detected and job would get restarted it would be good enough

Just got second case today of this job getting stuck in “Running” state
3 steps got stuck without any logs at all, while one finished but was still in “Running” state:



time="2021-06-24T18:32:42Z" level=debug msg="stage details fetched" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 thread=67
time="2021-06-24T18:32:43Z" level=debug msg="updated stage to running" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 thread=67
time="2021-06-24T18:32:44Z" level=debug msg="PodWatcher: Started" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:44Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-xmfjilqdxznzx4rgzalb image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-xmfjilqdxznzx4rgzalb pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:48Z" level=debug msg="PodWatcher: Wait finished. Duration=3.25s" container=drone-xmfjilqdxznzx4rgzalb error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-xmfjilqdxznzx4rgzalb pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-xmfjilqdxznzx4rgzalb error="<nil>" pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:51Z" level=debug msg="received exit code 0" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 step.name=clone thread=67
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"restore-cache\"" container=drone-gfhak49lezpx1g1b62wj image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" container=drone-1pkzqybdrqjl9rmshr8u image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"build-image\"" container=drone-z8xnmnh1deumazp5ai6g image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"postgres\"" container=drone-qelryfwfwu3jgssyza7r image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"redis\"" container=drone-iniymqrpiqbgn4jbntot image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"geoip\"" container=drone-vzc71v5xkkipio7us3u3 image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"mongodb\"" container=drone-7rucbjmjffypipteios6 image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="Engine: Starting step: \"db-connections\"" container=drone-tt3y3tyec3a8wya66nsz image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-gfhak49lezpx1g1b62wj pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-1pkzqybdrqjl9rmshr8u pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-z8xnmnh1deumazp5ai6g pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-qelryfwfwu3jgssyza7r pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-iniymqrpiqbgn4jbntot pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:52Z" level=debug msg="PodWatcher: Waiting..." container=drone-7rucbjmjffypipteios6 pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:52Z" level=debug msg="PodWatcher: Waiting..." container=drone-tt3y3tyec3a8wya66nsz pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:52Z" level=debug msg="PodWatcher: Wait finished. Duration=1.34s" container=drone-gfhak49lezpx1g1b62wj error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:52Z" level=debug msg="PodWatcher: Waiting..." container=drone-gfhak49lezpx1g1b62wj pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:52Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-gfhak49lezpx1g1b62wj error="<nil>" pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:52Z" level=debug msg="received exit code 0" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 step.name=restore-cache thread=67
time="2021-06-24T18:32:53Z" level=debug msg="PodWatcher: Waiting..." container=drone-vzc71v5xkkipio7us3u3 pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.26s" container=drone-z8xnmnh1deumazp5ai6g error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.20s" container=drone-qelryfwfwu3jgssyza7r error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.15s" container=drone-iniymqrpiqbgn4jbntot error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.31s" container=drone-1pkzqybdrqjl9rmshr8u error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:55Z" level=debug msg="PodWatcher: Wait finished. Duration=3.59s" container=drone-7rucbjmjffypipteios6 error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:55Z" level=debug msg="PodWatcher: Wait finished. Duration=3.19s" container=drone-tt3y3tyec3a8wya66nsz error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:55Z" level=debug msg="PodWatcher: Wait finished. Duration=2.59s" container=drone-vzc71v5xkkipio7us3u3 error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:57Z" level=debug msg="PodWatcher: Waiting..." container=drone-tt3y3tyec3a8wya66nsz pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:57Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-tt3y3tyec3a8wya66nsz error="<nil>" pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:32:57Z" level=debug msg="received exit code 0" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 step.name=db-connections thread=67
time="2021-06-24T18:32:57Z" level=debug msg="Engine: Starting step: \"install-deps\"" container=drone-0b9va22xxbw2du68pq3m image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:32:57Z" level=debug msg="PodWatcher: Waiting..." container=drone-0b9va22xxbw2du68pq3m pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:32:59Z" level=debug msg="PodWatcher: Wait finished. Duration=2.08s" container=drone-0b9va22xxbw2du68pq3m error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:35:10Z" level=debug msg="PodWatcher: Waiting..." container=drone-z8xnmnh1deumazp5ai6g pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:35:10Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-z8xnmnh1deumazp5ai6g error="<nil>" pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:35:10Z" level=debug msg="received exit code 0" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 step.name=build-image thread=67
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-0b9va22xxbw2du68pq3m pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-0b9va22xxbw2du68pq3m error="<nil>" pod=drone-juz5rn291vioip9bistj state=TERMINATED
time="2021-06-24T18:36:19Z" level=debug msg="received exit code 0" build.id=4322 build.number=435 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4705 stage.name=default stage.number=1 step.name=install-deps thread=67
time="2021-06-24T18:36:19Z" level=debug msg="Engine: Starting step: \"other-tests-and-credo\"" container=drone-j7cntkylgio34zn4na8q image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:36:19Z" level=debug msg="Engine: Starting step: \"rebuild-cache\"" container=drone-fqipsud4zes37kl794eg image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:36:19Z" level=debug msg="Engine: Starting step: \"ticketing-tests\"" container=drone-qluumnl0c01zc6tuyvig image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:36:19Z" level=debug msg="Engine: Starting step: \"kim-tests\"" container=drone-jryag42v9q7zkhcydlpg image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-juz5rn291vioip9bistj
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-j7cntkylgio34zn4na8q pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-fqipsud4zes37kl794eg pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-qluumnl0c01zc6tuyvig pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:36:19Z" level=debug msg="PodWatcher: Waiting..." container=drone-jryag42v9q7zkhcydlpg pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:36:20Z" level=debug msg="PodWatcher: Wait finished. Duration=1.20s" container=drone-j7cntkylgio34zn4na8q error="<nil>" pod=drone-juz5rn291vioip9bistj state=RUNNING
time="2021-06-24T18:37:53Z" level=debug msg="PodWatcher: Waiting..." container=drone-j7cntkylgio34zn4na8q pod=drone-juz5rn291vioip9bistj state=TERMINATED

After restarting the job all was successful (for the same commit)

I managed to catch describe of this container and it included:

  drone-j7cntkylgio34zn4na8q:
    Container ID:  cri-o://7224e89207777a4b8134ea66d1dc361cb4a31904eb38fdd01d7f67b2fa689b1b
    Image:         docker.io/dicefm/elixir-image:elixir-1.8.2
    Image ID:      docker.io/dicefm/[email protected]:fa6947ffba8a2d3b8cfa04d9b13a2fc60279526fd3c28992622a4375e7217dfd
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      echo "$DRONE_SCRIPT" | /bin/sh
    State:          Running
      Started:      Thu, 24 Jun 2021 19:36:19 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 24 Jun 2021 19:32:46 +0100
      Finished:     Thu, 24 Jun 2021 19:36:19 +0100
    Ready:          True
[...]
  drone-z8xnmnh1deumazp5ai6g:
    Container ID:  cri-o://0fd214e4950cd4abd285934b97341485cea4735b53aa45b93e2566c360748c89
    Image:         ecr.ad.dice.fm/base:imagebuilder-latest
    Image ID:      ecr.ad.dice.fm/[email protected]:22738ce36d56ede72fee86762bc3b7a2c6d7b0369339cbf225469a1a0c2ee9d3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      echo "$DRONE_SCRIPT" | /bin/sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 24 Jun 2021 19:32:53 +0100
      Finished:     Thu, 24 Jun 2021 19:35:09 +0100
[...]
    Container ID:  cri-o://f5075c1970de8447bbe00d8f1a87a2297c7e004bc09b3e17e86ef7ef2f005c3f
    Image:         drone/placeholder:1
    Image ID:      docker.io/drone/[email protected]:1b791b37e0589c7f1470fe0210ec0e7ef1e1de479c8f742d25942d2b4617bc07
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      echo "$DRONE_SCRIPT" | /bin/sh
    State:          Running
      Started:      Thu, 24 Jun 2021 19:32:46 +0100
    Ready:          True

The first one (base on logs) is the one what finished but was stuck running:

 kubectl --namespace drone-ci-exec logs drone-juz5rn291vioip9bistj drone-j7cntkylgio34zn4na8q
[...]
Please report incorrect results: https://github.com/rrrene/credo/issues

Analysis took 39.4 seconds (1.5s to load, 37.9s running 53 checks on 1824 files)
10287 mods/funs, found no issues.

Just got weird case

https://pastebin.pl/view/453de23a (can’t make 3 posts in row, and it is too long to paste in any of my current posts)


As I mentioned before, we have other repos using drone, and they don’t have so many steps, but today they did run over 70 jobs, and no single one had any issues at all (there was no issues at all in a week time)

I believe we have some sort of periodic poller, which should check kubernetes for Pod updates just in case we miss a kubernetes event. I think ideally the periodic poller would have kicked-in an updated the step status, but this appears to have not been the case. Thoughts @marko-gacesa ?

Green failure for a change:


Step db-connection did timeout, so it wasn’t just display issue, it didn’t start

time="2021-06-25T10:09:38Z" level=debug msg="stage details fetched" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:09:39Z" level=debug msg="updated stage to running" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:09:40Z" level=debug msg="PodWatcher: Started" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:40Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-9o3uzwakntqj27vn2117 image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:40Z" level=debug msg="PodWatcher: Waiting..." container=drone-9o3uzwakntqj27vn2117 pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:44Z" level=debug msg="PodWatcher: Wait finished. Duration=3.82s" container=drone-9o3uzwakntqj27vn2117 error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-9o3uzwakntqj27vn2117 pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-9o3uzwakntqj27vn2117 error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:09:47Z" level=debug msg="received exit code 0" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 step.name=clone thread=16
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"build-image\"" container=drone-96zi3fe3wmtq7smnhc1v image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"geoip\"" container=drone-657vhq9710hi75s4b233 image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"restore-cache\"" container=drone-pi7hmsorn5bzn5c0v5ks image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"mongodb\"" container=drone-c3xguaofjipbgnlfr6te image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" container=drone-a8pm3r11mztjsa5ck3gh image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"db-connections\"" container=drone-4k3vteaxn1bbvipavfyr image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"redis\"" container=drone-8yi5ujpvg3ow6w73cz5z image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="Engine: Starting step: \"postgres\"" container=drone-y9ybod85bk2lfh9hdz0h image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-xtv2o3lt0kojgdqymyd3
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-96zi3fe3wmtq7smnhc1v pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-pi7hmsorn5bzn5c0v5ks pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-c3xguaofjipbgnlfr6te pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-a8pm3r11mztjsa5ck3gh pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-4k3vteaxn1bbvipavfyr pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:47Z" level=debug msg="PodWatcher: Waiting..." container=drone-8yi5ujpvg3ow6w73cz5z pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:48Z" level=debug msg="PodWatcher: Wait finished. Duration=1.05s" container=drone-96zi3fe3wmtq7smnhc1v error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:48Z" level=debug msg="PodWatcher: Waiting..." container=drone-657vhq9710hi75s4b233 pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:49Z" level=debug msg="PodWatcher: Wait finished. Duration=1.84s" container=drone-a8pm3r11mztjsa5ck3gh error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:49Z" level=debug msg="PodWatcher: Wait finished. Duration=1.90s" container=drone-c3xguaofjipbgnlfr6te error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:49Z" level=debug msg="PodWatcher: Wait finished. Duration=1.96s" container=drone-pi7hmsorn5bzn5c0v5ks error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-y9ybod85bk2lfh9hdz0h pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-pi7hmsorn5bzn5c0v5ks pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:09:51Z" level=debug msg="PodWatcher: Wait finished. Duration=0.33s" container=drone-pi7hmsorn5bzn5c0v5ks error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:09:51Z" level=debug msg="PodWatcher: Wait finished. Duration=3.71s" container=drone-4k3vteaxn1bbvipavfyr error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:51Z" level=debug msg="PodWatcher: Wait finished. Duration=2.31s" container=drone-657vhq9710hi75s4b233 error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:51Z" level=debug msg="PodWatcher: Wait finished. Duration=3.31s" container=drone-8yi5ujpvg3ow6w73cz5z error="container failed to start (invalid image?)" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:09:51Z" level=debug msg="received exit code 0" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 step.name=restore-cache thread=16
time="2021-06-25T10:09:53Z" level=debug msg="PodWatcher: Wait finished. Duration=2.92s" container=drone-y9ybod85bk2lfh9hdz0h error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=RUNNING
time="2021-06-25T10:12:01Z" level=debug msg="PodWatcher: Waiting..." container=drone-96zi3fe3wmtq7smnhc1v pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:12:01Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-96zi3fe3wmtq7smnhc1v error="<nil>" pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:12:01Z" level=debug msg="received exit code 0" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 step.name=build-image thread=16
time="2021-06-25T10:13:57Z" level=debug msg="PodWatcher: Waiting..." container=drone-4k3vteaxn1bbvipavfyr pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:13:57Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-4k3vteaxn1bbvipavfyr error="exitCode=1" pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:13:57Z" level=debug msg="received exit code 1" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 step.name=db-connections thread=16
time="2021-06-25T10:13:58Z" level=debug msg="destroying the pipeline environment" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:14:03Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:14:03Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-xtv2o3lt0kojgdqymyd3 state=TERMINATED
time="2021-06-25T10:14:03Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:14:03Z" level=debug msg="updated stage to complete" build.id=4357 build.number=458 duration=259 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:14:03Z" level=debug msg="poller: request stage from remote server" thread=16
time="2021-06-25T10:14:03Z" level=debug msg="done listening for cancellations" build.id=4357 build.number=458 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4740 stage.name=default stage.number=1 thread=16
time="2021-06-25T10:18:35Z" level=debug msg="stage received" stage.id=4741 stage.name=default stage.number=1 thread=31
time="2021-06-25T10:18:35Z" level=debug msg="stage received" stage.id=4741 stage.name=default stage.number=1 thread=25
time="2021-06-25T10:18:35Z" level=debug msg="stage accepted" stage.id=4741 stage.name=default stage.number=1 thread=31
time="2021-06-25T10:18:35Z" level=debug msg="stage accepted by another runner" stage.id=4741 stage.name=default stage.number=1 thread=25
time="2021-06-25T10:18:35Z" level=debug msg="poller: request stage from remote server" thread=25

The green bar indicates the step exited with a 0 exit code (passed) and there were no logs to display (hence the green bar instead of displaying logs). I think it would be helpful if we could have a sample yaml that can be used to reproduce the problem.

I do appreciate the report, but this looks like it might be an intra-pod networking issue, as opposed to an issue with builds hanging. If that is the case, I would recommend creating a separate thread so we can address, track and triage the issue separately.

Let me confirm if I can replicate it with yaml I shared before

Kubernetes runner version 1.0.0-beta.7 is just released. Please use it to try replicate the issue. Thank you!