gitlab omnibus 11.10.3 – gitaly not starting


The logs led me to realise gitaly was having issues.  Gitaly serves up the repositories to the rest of the gitlab installation.

The only bits of gitaly running were these:

           ├─3670 runsv gitaly
           ├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
           └─9405 svlogd -tt /var/log/gitlab/gitaly

These would not stop via gitlab-ctl stop.

It should have a gitaly process and a number of ruby processes (one by default, according to the docs; perhaps two in practice; more if you’ve configured it to run more), like this:

root      9442  9436  0 10:21 ?        00:00:00 runsv gitaly
root      9455  9442  0 10:21 ?        00:00:00 svlogd -tt /var/log/gitlab/gitaly
git       9456  9442  0 10:21 ?        00:00:00 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git       9521  9456  0 10:21 ?        00:00:00 /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git       9564  9521 22 10:21 ?        00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.0
git       9566  9521 22 10:21 ?        00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.1
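
A minimal sketch of how the two states could be told apart from a script, assuming `ps -ef` style output as in the listing above (command in column 8). The function name gitaly_tree_summary is made up for illustration, and the paths are the ones shown in the listing.

```shell
# Summarise a gitaly process tree from `ps -ef` output read on stdin.
# gitaly_tree_summary is a hypothetical helper, not part of gitlab.
gitaly_tree_summary() {
    awk '
        # The main daemon: column 8 is exactly the gitaly binary
        # (the gitaly-wrapper line has the wrapper path there instead).
        $8 == "/opt/gitlab/embedded/bin/gitaly" { main++ }
        # The ruby workers: "ruby .../gitaly-ruby/bin/gitaly-ruby ..."
        $8 == "ruby" && $9 ~ /gitaly-ruby$/     { ruby++ }
        END { printf "gitaly=%d gitaly-ruby=%d\n", main, ruby }
    '
}
```

Something like `ps -ef | gitaly_tree_summary` would then report zero for both counts in the stuck state, where only runsv, gitaly-wrapper and svlogd were left.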

This might have been caused by my backup filesystem filling up.  Fixing that is the only thing I changed.

Restarting the service in systemd cleaned up gitaly-wrapper, and then it started.


what I found ..

  • Bits of gitlab were working (like issue boards) but access to repositories via git client or GUI was failing.
# rpm -qa gitlab-ce
  • Some log output .. the core dumps were during system shut down.  Not nice, but journald goes back a month, and it’s not happened during any of the last half a dozen or more reboots. The socket is stale from the last boot.
# egrep 'error.*gitaly' /var/log/gitlab/gitlab-shell/gitlab-shell.log
time="2019-05-26T09:35:24+01:00" level=error msg="error: %v" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /var/opt/gitlab/gitaly/gitaly.socket: connect: connection refused\"" pid=5062

# journalctl | grep -i gitaly
May 19 11:15:46 akira abrt-hook-ccpp[2322]: Process 3754 (gitaly) of user 993 killed by SIGBUS - dumping core
May 19 11:15:46 akira abrt-hook-ccpp[2315]: Process 3700 (gitaly-wrapper) of user 993 killed by SIGBUS - dumping core
# ls -al /var/opt/gitlab/gitaly/gitaly.socket
srwxr-xr-x. 1 git git 0 May 25 20:03 /var/opt/gitlab/gitaly/gitaly.socket
  • Manipulation via service control:
# gitlab-ctl status
run: alertmanager: (pid 3707) 1613s; run: log: (pid 3706) 1613s
run: gitaly: (pid 3691) 1613s; run: log: (pid 3687) 1613s
run: gitlab-monitor: (pid 3680) 1613s; run: log: (pid 3679) 1613s
run: gitlab-workhorse: (pid 3690) 1613s; run: log: (pid 3688) 1613s
run: logrotate: (pid 3686) 1613s; run: log: (pid 3685) 1613s
run: nginx: (pid 3689) 1613s; run: log: (pid 3684) 1613s
run: node-exporter: (pid 3699) 1613s; run: log: (pid 3694) 1613s
run: postgres-exporter: (pid 3693) 1613s; run: log: (pid 3683) 1613s
run: postgresql: (pid 3701) 1613s; run: log: (pid 3696) 1613s
run: prometheus: (pid 3682) 1613s; run: log: (pid 3681) 1613s
run: redis: (pid 3704) 1613s; run: log: (pid 3702) 1613s
run: redis-exporter: (pid 3700) 1613s; run: log: (pid 3692) 1613s
run: sidekiq: (pid 3703) 1613s; run: log: (pid 3695) 1613s
run: unicorn: (pid 3698) 1613s; run: log: (pid 3697) 1613s

# gitlab-ctl restart
ok: run: alertmanager: (pid 7189) 0s
timeout: run: gitaly: (pid 3691) 1840s, got TERM
ok: run: gitlab-monitor: (pid 7246) 0s
ok: run: gitlab-workhorse: (pid 7250) 0s

# gitlab-ctl status
run: alertmanager: (pid 7189) 62s; run: log: (pid 3706) 1872s
run: gitaly: (pid 3691) 1872s, got TERM; run: log: (pid 3687) 1872s
run: gitlab-monitor: (pid 7246) 31s; run: log: (pid 3679) 1872s
run: gitlab-workhorse: (pid 7250) 31s; run: log: (pid 3688) 1872s

# gitlab-ctl stop
ok: down: alertmanager: 1s, normally up
timeout: run: gitaly: (pid 3691) 1919s, want down, got TERM
ok: down: gitlab-monitor: 1s, normally up
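
The "got TERM" marker in the output above is the useful signal: the service was sent TERM but is still running. A minimal sketch for pulling out such services, assuming the line formats shown above; stuck_services is a made-up helper name.

```shell
# Print the names of services that have been sent TERM but are still
# running. Reads `gitlab-ctl status` (or stop/restart) output on stdin.
# stuck_services is a hypothetical helper, not part of gitlab.
stuck_services() {
    awk '/got TERM/ {
        # "timeout: run: gitaly: ..." has the name in field 3,
        # "run: gitaly: ..." has it in field 2.
        i = ($1 == "timeout:") ? 3 : 2
        name = $i
        sub(/:$/, "", name)
        print name
    }' | sort -u
}
```

Used as `gitlab-ctl status | stuck_services`, this would have printed just gitaly here.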

Turning it off and on again clearly not working ..

It magically restarted itself again, so I decided to ..

# systemctl stop gitlab-runsvdir
# systemctl status gitlab-runsvdir
● gitlab-runsvdir.service - GitLab Runit supervision process
   Loaded: loaded (/usr/lib/systemd/system/gitlab-runsvdir.service; enabled; vendor preset: disabled)
   Active: deactivating (final-sigterm) since Sun 2019-05-26 10:16:57 BST; 1min 28s ago
  Process: 3663 ExecStart=/opt/gitlab/embedded/bin/runsvdir-start (code=exited, status=0/SUCCESS)
 Main PID: 3663 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/gitlab-runsvdir.service
           ├─3670 runsv gitaly
           ├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
           └─9405 svlogd -tt /var/log/gitlab/gitaly

May 26 09:22:26 akira systemd[1]: Started GitLab Runit supervision process.
May 26 10:16:57 akira systemd[1]: Stopping GitLab Runit supervision process...

May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service stop-final-sigterm timed out. Killing.
May 26 10:18:27 akira systemd[1]: Stopped GitLab Runit supervision process.
May 26 10:18:27 akira systemd[1]: Unit gitlab-runsvdir.service entered failed state.
May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service failed.

docs ..

There’s a page about gitaly here:

The debug stuff at the bottom doesn’t (at 11.10) include anything relevant for this.

Googling for issues didn’t turn up anything particularly relevant; a search for ‘starting’ amongst gitaly issues is here.



It looks like turning it off and on again fixed it.

The filesystem I had mounted on /var/opt/gitlab/backups had filled up .. clearly that needed fixing regardless, and it might have been the root cause.  Even with that fixed, gitaly-wrapper was still locked up; once systemd had killed it, restarting the service fixed it.
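
Since a full backup filesystem is the suspected trigger, a check like the following could flag it earlier. This is a minimal sketch, assuming POSIX `df -P` output (capacity in column 5, mount point in column 6) and mount points without spaces; full_filesystems and the default threshold of 90 are made up for illustration.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin.
# full_filesystems is a hypothetical helper, not part of gitlab.
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # "100%" -> "100"
        if (use + 0 >= t + 0) print $6
    }'
}
```

For example, `df -P /var/opt/gitlab/backups | full_filesystems 95` prints the mount point only when it is at least 95% full, which makes it easy to hang off cron or a monitoring check.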

I’m one fix release behind; 11.10.4 is released, so I’ll upgrade.  11.11.0 is also out, but I’ll wait for some fix releases before deploying.

