Docker daemon fails to start after container forced termination due to stale PID file #4362

@taruishi-ma

Controller Version

N/A (not using Kubernetes controller, running dind image directly on Docker)

Deployment Method

Other

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start an actions-runner-dind container with the command above
2. Verify inner Docker is working: docker exec github-runner-1 docker ps (should succeed)
3. Forcefully terminate the container: docker kill github-runner-1
4. Restart the container: docker start github-runner-1
5. Check inner Docker: docker exec github-runner-1 docker ps
   -> Fails with: "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

Describe the bug

When an actions-runner-dind container is forcefully terminated (e.g., via docker kill, VM preemption, or system crash), the inner Docker daemon's PID file (/var/run/docker.pid) remains inside the container. On subsequent container restart, the inner Docker daemon fails to start because it detects the stale PID file.

This is a common scenario when running self-hosted runners on Spot/Preemptible VMs, where the VM can be terminated at any time without graceful shutdown.

Root Cause:
When the container is killed with SIGKILL, the inner Docker daemon doesn't have a chance to clean up /var/run/docker.pid. When the container restarts, the stale PID file prevents the new dockerd process from starting.

Describe the expected behavior

The inner Docker daemon should start successfully after container restart, even if the container was previously forcefully terminated.

Additional Context

Environment:

  • Image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind:ubuntu-22.04
  • Runner Version: 2.320.0
  • Platform: GCP Spot VMs (both amd64 and arm64)
  • Docker version on host: 24.x

Workaround:
After container restart, manually clean up the PID file and restart dockerd:

# For a single runner: remove the stale PID file, then start dockerd
# detached (-d) so that docker exec returns immediately
docker exec github-runner-1 sudo rm -f /var/run/docker.pid
docker exec -d github-runner-1 sudo dockerd
sleep 3

# For all runners: repair only those whose inner daemon is unreachable
docker ps -a --filter name=github-runner --format '{{.Names}}' | while read -r runner; do
  echo "$runner"
  docker exec "$runner" docker ps 2>&1 || {
    docker exec "$runner" sudo rm -f /var/run/docker.pid
    docker exec -d "$runner" sudo dockerd
    sleep 3
  }
done
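
The fixed `sleep 3` in the workaround is a guess at how long dockerd needs to come up. A minimal sketch of polling instead is shown below. `wait_for` is a hypothetical helper, not something shipped in the runner image, and the attempt count and delay in the usage line are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (assumed values): poll the inner daemon of one runner once per
# second, for at most 30 attempts, using the same check as the workaround.
# wait_for 30 1 docker exec github-runner-1 docker ps
```

The helper returns non-zero if the command never succeeds, so a calling script can report a runner that is still broken instead of silently moving on.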

Suggested Fix:
The container's entrypoint script should clean up stale PID files before starting the Docker daemon:

# In the entrypoint script, before starting dockerd
rm -f /var/run/docker.pid /var/run/containerd/containerd.pid
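
A slightly more defensive variant of this cleanup, as a minimal sketch: it removes a PID file only when the PID recorded in it does not belong to a live process with the expected command name. The function name `remove_stale_pidfile` and the check against `/proc/<pid>/comm` are assumptions made for illustration. They are not taken from the actual entrypoint script, and the check assumes a Linux `/proc` filesystem.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the actual entrypoint script).
# Removes a PID file unless the PID it records belongs to a live process
# whose command name matches the expected one.
# Usage: remove_stale_pidfile <pidfile> <expected_command_name>
remove_stale_pidfile() {
  local pidfile="$1" expected="$2" pid comm
  [ -f "$pidfile" ] || return 0             # nothing to clean up
  pid="$(tr -cd '0-9' < "$pidfile")"         # keep digits only
  if [ -n "$pid" ] && [ -r "/proc/$pid/comm" ]; then
    comm="$(cat "/proc/$pid/comm" 2>/dev/null)"
    if [ "$comm" = "$expected" ]; then
      return 1                               # process is really running; keep the file
    fi
  fi
  rm -f "$pidfile"                           # stale: no matching live process
}

# Possible use in an entrypoint, before starting dockerd:
# remove_stale_pidfile /var/run/docker.pid dockerd
# remove_stale_pidfile /var/run/containerd/containerd.pid containerd
```

Since nothing from the previous run survives a container restart, the unconditional `rm -f` is likely sufficient at container start. The extra check only matters if the same code path could also run while a daemon is alive.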

Related Issues:

Controller Logs

N/A (not using Kubernetes controller)

Runner Pod Logs

# After forced termination and restart, checking docker inside container:
$ docker exec github-runner-1 docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

# The stale PID file exists:
$ docker exec github-runner-1 ls -la /var/run/docker.pid
-rw-r--r-- 1 root root 5 Jan 25 07:00 /var/run/docker.pid

$ docker exec github-runner-1 cat /var/run/docker.pid
123  # Old PID from before termination


Labels: bug, gha-runner-scale-set, needs triage
