tl;dr
The logs led me to realise that gitaly was having issues. Gitaly serves the Git repositories to the rest of the GitLab installation.
The only bits of gitaly running were these:
├─3670 runsv gitaly
├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
└─9405 svlogd -tt /var/log/gitlab/gitaly
These processes were failing to stop via gitlab-ctl stop.
A healthy install should have a gitaly process and a number of gitaly-ruby processes: one by default according to the docs, perhaps two in practice, and more if you’ve configured it to run more.
root 9442 9436 0 10:21 ? 00:00:00 runsv gitaly
root 9455 9442 0 10:21 ? 00:00:00 svlogd -tt /var/log/gitlab/gitaly
git 9456 9442 0 10:21 ? 00:00:00 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git 9521 9456 0 10:21 ? 00:00:00 /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
git 9564 9521 22 10:21 ? 00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.0
git 9566 9521 22 10:21 ? 00:00:01 ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 9521 /tmp/gitaly-ruby370067131/socket.1
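The gitaly-ruby worker count is configurable if you want to know how many of those ruby processes to expect. A minimal sketch, assuming the Omnibus setting is gitaly['ruby_num_workers'] (the key name is an assumption; check the gitlab.rb template shipped with your version):

# Show the configured gitaly-ruby worker count, if set (assumed setting name:
# gitaly['ruby_num_workers'] in /etc/gitlab/gitlab.rb).
sudo grep -n "ruby_num_workers" /etc/gitlab/gitlab.rb

# Apply any gitlab.rb change in the usual Omnibus way.
sudo gitlab-ctl reconfigure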
Might have been caused by my backup filesystem filling up. It’s the only thing I changed/fixed.
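If you suspect the same root cause, a quick check of the relevant filesystems is cheap (paths as mounted on my install):

# Check free space on the backup filesystem and the main GitLab data area.
df -h /var/opt/gitlab/backups /var/opt/gitlab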
Restarting the gitlab-runsvdir service via systemd cleaned up the stuck gitaly-wrapper, and gitaly then started.
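As a sketch, the recovery sequence amounts to the following (unit and service names as they appear later in this post):

# Stop the runit supervisor; systemd will eventually SIGKILL anything, such as
# a stuck gitaly-wrapper, that ignores the TERM.
sudo systemctl stop gitlab-runsvdir

# Start it again and confirm gitaly and its gitaly-ruby children come back.
sudo systemctl start gitlab-runsvdir
sudo gitlab-ctl status gitaly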
updates
2019-07-03 Happened again! No full filesystem this time. Will upgrade to 12.0.x
My VM isn’t exactly massive: one vCPU and about 4 GB of RAM. That’s fine for one user, but I suspect the issue arises during boot, perhaps when the KVM host machine is busy starting other things as well.
# rpm -qa | grep gitlab
gitlab-ce-11.10.4-ce.0.el7.x86_64
# gitlab-ctl stop
ok: down: alertmanager: 0s, normally up
timeout: run: gitaly: (pid 3685) 69289s, want down, got TERM
ok: down: gitlab-monitor: 0s, normally up
[...]
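For the record, upgrading the Omnibus package on el7 is just a yum transaction; a sketch (the pinned version string is illustrative, not a recommendation):

# Upgrade to the newest available package ...
sudo yum update gitlab-ce

# ... or pin a specific build, following the naming seen above,
# e.g. gitlab-ce-12.0.2-ce.0.el7 (illustrative version).
sudo yum install gitlab-ce-12.0.2-ce.0.el7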
Useful info about troubleshooting / checking
Commands to run when gathering info / checking prior to raising a bug. Hat tip.
sudo gitlab-rake gitlab:check
sudo gitlab-rake gitlab:env:info
The only issue found was incorrect permissions within /var/opt/gitlab/gitlab-rails/uploads.
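gitlab:check prints its own suggested commands for that; the generic shape is to hand the directory back to the git user with tight modes. A sketch, not the exact commands the check suggests:

# Rough permissions fix for the uploads directory; prefer whatever
# `sudo gitlab-rake gitlab:check` actually suggests for your version.
sudo chown -R git:git /var/opt/gitlab/gitlab-rails/uploads
sudo find /var/opt/gitlab/gitlab-rails/uploads -type d -exec chmod 0700 {} \;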
what I found ..
- Bits of GitLab were working (like issue boards), but access to repositories via a git client or the web GUI was failing.
# rpm -qa gitlab-ce
gitlab-ce-11.10.3-ce.0.el7.x86_64
- Some log output .. the core dumps happened during system shutdown. Not nice, but journald goes back a month and it hasn’t happened during any of the last half a dozen or more reboots. The socket is stale, left over from the last boot (a quick liveness check follows the log output).
# egrep 'error.*gitaly' /var/log/gitlab/gitlab-shell/gitlab-shell.log
time="2019-05-26T09:35:24+01:00" level=error msg="error: %v" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /var/opt/gitlab/gitaly/gitaly.socket: connect: connection refused\"" pid=5062
# journalctl | grep -i gitaly
May 19 11:15:46 akira abrt-hook-ccpp[2322]: Process 3754 (gitaly) of user 993 killed by SIGBUS - dumping core
May 19 11:15:46 akira abrt-hook-ccpp[2315]: Process 3700 (gitaly-wrapper) of user 993 killed by SIGBUS - dumping core
# ls -al /var/opt/gitlab/gitaly/gitaly.socket
srwxr-xr-x. 1 git git 0 May 25 20:03 /var/opt/gitlab/gitaly/gitaly.socket
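Whether that socket is live or stale can be confirmed by asking whether anything is actually listening on it (ss ships with iproute on el7):

# List listening unix sockets with their owning process; a stale socket file
# left over from a previous boot simply won't show up here.
sudo ss -lxp | grep gitaly.socket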
- Manipulation via service control (a sketch for poking just gitaly follows the output):
# gitlab-ctl status
run: alertmanager: (pid 3707) 1613s; run: log: (pid 3706) 1613s
run: gitaly: (pid 3691) 1613s; run: log: (pid 3687) 1613s
run: gitlab-monitor: (pid 3680) 1613s; run: log: (pid 3679) 1613s
run: gitlab-workhorse: (pid 3690) 1613s; run: log: (pid 3688) 1613s
run: logrotate: (pid 3686) 1613s; run: log: (pid 3685) 1613s
run: nginx: (pid 3689) 1613s; run: log: (pid 3684) 1613s
run: node-exporter: (pid 3699) 1613s; run: log: (pid 3694) 1613s
run: postgres-exporter: (pid 3693) 1613s; run: log: (pid 3683) 1613s
run: postgresql: (pid 3701) 1613s; run: log: (pid 3696) 1613s
run: prometheus: (pid 3682) 1613s; run: log: (pid 3681) 1613s
run: redis: (pid 3704) 1613s; run: log: (pid 3702) 1613s
run: redis-exporter: (pid 3700) 1613s; run: log: (pid 3692) 1613s
run: sidekiq: (pid 3703) 1613s; run: log: (pid 3695) 1613s
run: unicorn: (pid 3698) 1613s; run: log: (pid 3697) 1613s
# gitlab-ctl restart
ok: run: alertmanager: (pid 7189) 0s
timeout: run: gitaly: (pid 3691) 1840s, got TERM
ok: run: gitlab-monitor: (pid 7246) 0s
ok: run: gitlab-workhorse: (pid 7250) 0s
[snip]
# gitlab-ctl status
run: alertmanager: (pid 7189) 62s; run: log: (pid 3706) 1872s
run: gitaly: (pid 3691) 1872s, got TERM; run: log: (pid 3687) 1872s
run: gitlab-monitor: (pid 7246) 31s; run: log: (pid 3679) 1872s
run: gitlab-workhorse: (pid 7250) 31s; run: log: (pid 3688) 1872s
[snip]
# gitlab-ctl stop
ok: down: alertmanager: 1s, normally up
timeout: run: gitaly: (pid 3691) 1919s, want down, got TERM
ok: down: gitlab-monitor: 1s, normally up
[snip]
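gitlab-ctl passes per-service commands through to runit, so in principle the stuck service can be escalated on directly instead of restarting everything. A sketch I didn’t end up needing:

# Ask runit to TERM, then KILL, just the gitaly service, and check whether
# it actually goes down (service-level commands via gitlab-ctl).
sudo gitlab-ctl term gitaly
sudo gitlab-ctl kill gitaly
sudo gitlab-ctl status gitaly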
Turning it off and on again was clearly not working ..
It magically restarted itself again, so I decided to ..
# systemctl stop gitlab-runsvdir
# systemctl status gitlab-runsvdir
● gitlab-runsvdir.service - GitLab Runit supervision process
   Loaded: loaded (/usr/lib/systemd/system/gitlab-runsvdir.service; enabled; vendor preset: disabled)
   Active: deactivating (final-sigterm) since Sun 2019-05-26 10:16:57 BST; 1min 28s ago
  Process: 3663 ExecStart=/opt/gitlab/embedded/bin/runsvdir-start (code=exited, status=0/SUCCESS)
 Main PID: 3663 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/gitlab-runsvdir.service
           ├─3670 runsv gitaly
           ├─3691 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
           └─9405 svlogd -tt /var/log/gitlab/gitaly

May 26 09:22:26 akira systemd[1]: Started GitLab Runit supervision process.
May 26 10:16:57 akira systemd[1]: Stopping GitLab Runit supervision process...

[eventually]

May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service stop-final-sigterm timed out. Killing.
May 26 10:18:27 akira systemd[1]: Stopped GitLab Runit supervision process.
May 26 10:18:27 akira systemd[1]: Unit gitlab-runsvdir.service entered failed state.
May 26 10:18:27 akira systemd[1]: gitlab-runsvdir.service failed.
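Before starting gitlab-runsvdir again it’s worth confirming the stuck wrapper really has gone; if systemd’s final SIGKILL somehow leaves it behind, a manual kill does the same job (process names as in the output above):

# Anything still matching after the unit has stopped is a leftover.
ps -ef | grep -E 'runsv gitaly|gitaly-wrapper' | grep -v grep

# If the wrapper survived, kill it by name before restarting the supervisor.
sudo pkill -9 -f gitaly-wrapper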
docs ..
There’s a page about gitaly here: https://docs.gitlab.com/ee/administration/gitaly
The debug stuff at the bottom doesn’t (at 11.10) include anything relevant for this.
Googling for issues didn’t turn up anything particularly relevant; a search for ‘starting’ amongst gitaly issues is here.
fix
It looks like turning it off and on again fixed it.
The filesystem I had mounted on /var/opt/gitlab/backups had filled up .. clearly that needed fixing regardless, and it might have been the root cause. Even with space freed up, gitaly-wrapper was clearly locked up; once systemd had killed it, restarting fixed things.
I’m one fix release behind; 11.10.4 is released, so I’ll upgrade. 11.11.0 is also out, but I’ll wait for some fix releases before deploying.