gitlab omnibus 11.10.3 – gitaly not starting

tl;dr

The logs led me to realise gitaly was having issues. Gitaly is the service that serves up the repositories to the rest of the gitlab installation.

The only bits of gitaly running were this:

           ├─3670 runsv gitaly
           ├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
           └─9405 svlogd -tt /var/log/gitlab/gitaly

These were refusing to stop via gitlab-ctl stop.

A healthy install should have a gitaly process plus a number of gitaly-ruby processes (one by default according to the docs, though two in practice, and more if you've configured it to run more).
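Something like ps -ef | grep gitaly should show that layout: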

root      9442  9436  0 10:21 ?        00:00:00 runsv gitaly
root      9455  9442  0 10:21 ?        00:00:00 svlogd -tt /var/log/gitlab/gitaly
git       9456  9442  0 10:21 ?        00:00:00 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git       9521  9456  0 10:21 ?        00:00:00 /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git       9564  9521 22 10:21 ?        00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.0
git       9566  9521 22 10:21 ?        00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.1

Might have been caused by my backup filesystem filling up.  It’s the only thing I changed/fixed.

Restarting the gitlab-runsvdir service in systemd cleaned up the stuck gitaly-wrapper, and then gitaly started.
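Roughly, the recovery boiled down to something like this (a sketch, assuming the stock Omnibus gitlab-runsvdir unit; in my case systemd did the killing itself once the stop timed out):

sudo systemctl stop gitlab-runsvdir      # systemd eventually SIGKILLs anything (like the stuck gitaly-wrapper) that ignores the TERM
sudo systemctl status gitlab-runsvdir    # wait for it to report inactive / failed
sudo systemctl start gitlab-runsvdir
sudo gitlab-ctl status gitaly            # should now show a fresh pid and a running log process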

updates

2019-07-03 Happened again! No full filesystem this time. Will upgrade to 12.0.x

My VM isn’t exactly massive: one vCPU and about 4 GB of RAM.  That’s fine for one user, but I suspect the issue arises during boot, perhaps when the KVM host machine is busy starting other things as well.
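If it recurs, scoping the journal to the current boot should show whether gitaly (or its wrapper) died during startup; something like:

sudo journalctl -b -u gitlab-runsvdir    # this boot's messages for the runit supervisor
sudo journalctl -b | grep -i gitaly      # core dumps, SIGBUS, etc. from this boot only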

# rpm -qa | grep gitlab
gitlab-ce-11.10.4-ce.0.el7.x86_64
# gitlab-ctl stop
ok: down: alertmanager: 0s, normally up
timeout: run: gitaly: (pid 3685) 69289s, want down, got TERM
ok: down: gitlab-monitor: 0s, normally up
[...]

Useful info about troubleshooting / checking

Commands to run when gathering info / checking prior to raising a bug. Hat tip.

sudo gitlab-rake gitlab:check
sudo gitlab-rake gitlab:env:info

The only issue found was incorrect permissions within /var/opt/gitlab/gitlab-rails/uploads.
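Not gitaly-related, but easy to tidy up. gitlab:check prints a suggested fix next to each failure; for the uploads directory it’s along these lines (treat this as a sketch and use the exact chmod/chown the check itself prints):

sudo chown git:git /var/opt/gitlab/gitlab-rails/uploads
sudo chmod 0700 /var/opt/gitlab/gitlab-rails/uploads
sudo gitlab-rake gitlab:check    # re-run to confirm the complaint has gone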

gitlab SOS .. interesting ..
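For anyone else: gitlabsos is GitLab’s support script for gathering logs and config into one tarball, handy before raising a bug. Roughly (assuming the upstream repo location and entry point haven’t changed):

git clone https://gitlab.com/gitlab-com/support/toolbox/gitlabsos.git
cd gitlabsos
sudo ./gitlabsos.rb    # bundles logs and config into a tarball in the current directory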


what I found ..

  • Bits of gitlab were working (like issue boards) but access to repositories via the git client or the GUI was failing.
# rpm -qa gitlab-ce
gitlab-ce-11.10.3-ce.0.el7.x86_64
  • Some log output .. the core dumps were during system shutdown.  Not nice, but journald goes back a month and it hasn’t happened during any of the last half a dozen or more reboots. The socket is stale, left over from the last boot (see the quick check sketched at the end of these notes).
# egrep 'error.*gitaly' /var/log/gitlab/gitlab-shell/gitlab-shell.log
time="2019-05-26T09:35:24+01:00" level=error msg="error: %v" error="rpc error: code = Unavailable 
desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = 
\"transport: Error while dialing dial unix /var/opt/gitlab/gitaly/gitaly.socket: connect: connection refused\"" pid=5062

# journalctl | grep -i gitaly
May 19 11:15:46 akira abrt-hook-ccpp[2322]: Process 3754 (gitaly) of user 993 killed by SIGBUS - dumping core
May 19 11:15:46 akira abrt-hook-ccpp[2315]: Process 3700 (gitaly-wrapper) of user 993 killed by SIGBUS - dumping core
# ls -al /var/opt/gitlab/gitaly/gitaly.socket
srwxr-xr-x. 1 git git 0 May 25 20:03 /var/opt/gitlab/gitaly/gitaly.socket
  • Manipulation via service control:
# gitlab-ctl status
run: alertmanager: (pid 3707) 1613s; run: log: (pid 3706) 1613s
run: gitaly: (pid 3691) 1613s; run: log: (pid 3687) 1613s
run: gitlab-monitor: (pid 3680) 1613s; run: log: (pid 3679) 1613s
run: gitlab-workhorse: (pid 3690) 1613s; run: log: (pid 3688) 1613s
run: logrotate: (pid 3686) 1613s; run: log: (pid 3685) 1613s
run: nginx: (pid 3689) 1613s; run: log: (pid 3684) 1613s
run: node-exporter: (pid 3699) 1613s; run: log: (pid 3694) 1613s
run: postgres-exporter: (pid 3693) 1613s; run: log: (pid 3683) 1613s
run: postgresql: (pid 3701) 1613s; run: log: (pid 3696) 1613s
run: prometheus: (pid 3682) 1613s; run: log: (pid 3681) 1613s
run: redis: (pid 3704) 1613s; run: log: (pid 3702) 1613s
run: redis-exporter: (pid 3700) 1613s; run: log: (pid 3692) 1613s
run: sidekiq: (pid 3703) 1613s; run: log: (pid 3695) 1613s
run: unicorn: (pid 3698) 1613s; run: log: (pid 3697) 1613s

# gitlab-ctl restart
ok: run: alertmanager: (pid 7189) 0s
timeout: run: gitaly: (pid 3691) 1840s, got TERM
ok: run: gitlab-monitor: (pid 7246) 0s
ok: run: gitlab-workhorse: (pid 7250) 0s
[snip]

# gitlab-ctl status
run: alertmanager: (pid 7189) 62s; run: log: (pid 3706) 1872s
run: gitaly: (pid 3691) 1872s, got TERM; run: log: (pid 3687) 1872s
run: gitlab-monitor: (pid 7246) 31s; run: log: (pid 3679) 1872s
run: gitlab-workhorse: (pid 7250) 31s; run: log: (pid 3688) 1872s
[snip]

# gitlab-ctl stop
ok: down: alertmanager: 1s, normally up
timeout: run: gitaly: (pid 3691) 1919s, want down, got TERM
ok: down: gitlab-monitor: 1s, normally up
[snip]

Turning it off and on again was clearly not working ..

It magically restarted itself again, so I decided to ..

# systemctl stop gitlab-runsvdir
# systemctl status gitlab-runsvdir
● gitlab-runsvdir.service - GitLab Runit supervision process
   Loaded: loaded (/usr/lib/systemd/system/gitlab-runsvdir.service; enabled; vendor preset: disabled)
   Active: deactivating (final-sigterm) since Sun 2019-05-26 10:16:57 BST; 1min 28s ago
  Process: 3663 ExecStart=/opt/gitlab/embedded/bin/runsvdir-start (code=exited, status=0/SUCCESS)
 Main PID: 3663 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/gitlab-runsvdir.service
           ├─3670 runsv gitaly
           ├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
           └─9405 svlogd -tt /var/log/gitlab/gitaly

May 26 09:22:26 akira systemd[1]: Started GitLab Runit supervision process.
May 26 10:16:57 akira systemd[1]: Stopping GitLab Runit supervision process...

[eventually]
May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service stop-final-sigterm timed out. Killing.
May 26 10:18:27 akira systemd[1]: Stopped GitLab Runit supervision process.
May 26 10:18:27 akira systemd[1]: Unit gitlab-runsvdir.service entered failed state.
May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service failed.
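The quick check on the stale socket mentioned above: the file exists but nothing is listening on it, which matches the “connection refused” from gitlab-shell. Something like:

ls -al /var/opt/gitlab/gitaly/gitaly.socket    # mtime is from before the current boot
who -b                                         # boot time, for comparison
sudo ss -xlp | grep gitaly.socket              # no output means nothing is listening on the socket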

docs ..

There’s a page about gitaly here: https://docs.gitlab.com/ee/administration/gitaly

The debug stuff at the bottom doesn’t (at 11.10) include anything relevant for this.

Googling for issues didn’t turn up anything particularly relevant; a search for ‘starting’ amongst gitaly issues is here.


fix

It looks like turning it off and on again fixed it.

The filesystem I had mounted on /var/opt/gitlab/backups had filled up .. that needed fixing regardless, and it might have been the root cause.  Even after clearing space, gitaly-wrapper was still locked up; once systemd had killed it, restarting fixed things.
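Spotting and confirming the full filesystem is just a df, e.g.:

df -h /var/opt/gitlab/backups    # the backup mount was the full one
df -h /var/opt/gitlab            # worth an eye on the main data filesystem too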

I’m one fix release behind; 11.10.4 is released, so I’ll upgrade.  11.11.0 is also out, but I’ll wait for some fix releases before deploying.
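The upgrade itself is just a package update on CentOS 7; assuming the package came from the standard Omnibus yum repository, something like:

sudo yum install gitlab-ce-11.10.4-ce.0.el7    # pin to the specific fix release
sudo yum update gitlab-ce                      # or just take whatever is newest in the repo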
