GitLab’s Prometheus misbehaving – unexpected end of JSON input

Starting point

  • upgraded GitLab Omnibus from 12.1.6 to 12.2.4
  • afterwards, Prometheus no longer showed as up:
# gitlab-ctl status
[..]
down: prometheus: 1s, normally up, want up; run: log: (pid 1666) 2881s
[..]
  • The last known-good state (for sure) was just after the previous upgrade, to 12.1.6, for which I have output stashed.
  • A restart (systemctl restart gitlab-runsvdir) didn’t fix it, nor did a reboot.

Spoilers

  • The upgrade (done on the afternoon of September 9th) had nothing to do with it.
  • Empty meta.json files are a not entirely uncommon Prometheus problem, so this will be relevant to some extent for non-GitLab installations.
  • Lesson learned: I will add gitlab-ctl status to my pre-flight checks for upgrading GitLab.

Logs and Processes

GitLab Omnibus has a runit-based service management program which owns everything. We can see it trying ..

root      1648  1642  0 14:24 ?        00:00:04 runsv prometheus
root      1666  1648  0 14:24 ?        00:00:00 svlogd -tt /var/log/gitlab/prometheus

This provides a useful clue to where the logs are; the rotated files use an unusual naming format ..

# ls -altr /var/log/gitlab/prometheus
total 244000
-rw-r--r--.  1 root              root        30 Oct 15  2017 config
-rw-------.  1 root              root         0 Oct 15  2017 lock
-rwxr--r--.  1 root              root    140120 Jan  1  2018 @400000005a4a67741657f2a4.s
-rw-r--r--.  1 root              root  10908704 Jun  4 14:49 @400000005cf690de3828462c.u
-rw-r--r--.  1 root              root         0 Jun 16 15:28 state
-rwxr--r--.  1 root              root    149095 Jun 16 15:28 @400000005d0652191df66af4.s
drwxr-xr-x. 20 root              root      4096 Jul  3 12:21 ..
-rw-r--r--.  1 root              root    626069 Jul  3 20:00 @400000005d2063a11e7ddebc.u
-rw-r--r--.  1 root              root    593560 Sep  8 10:28 @400000005d76324435f19fac.u
drwx------.  2 gitlab-prometheus root      4096 Sep  9 17:11 .
-rw-r--r--.  1 root              root 170182287 Sep 10 15:24 current
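Those @-prefixed filenames are svlogd’s TAI64N timestamps of when each file was rotated. A small decoder for them (my own sketch, not part of GitLab; it assumes daemontools’ convention of a fixed 10-second TAI offset in the label):

```python
from datetime import datetime, timezone

def tai64n_to_utc(name):
    """Decode an svlogd filename such as '@400000005d76324435f19fac.u'.

    The 16 hex digits after '@' are a TAI64 second label; daemontools'
    tai64n writes 2**62 + 10 + Unix seconds (a fixed 10s TAI offset).
    The following 8 hex digits are nanoseconds, ignored here.
    """
    label = name.lstrip("@").split(".")[0]
    seconds = int(label[:16], 16) - 2**62 - 10
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(tai64n_to_utc("@400000005d76324435f19fac.u"))
# → 2019-09-09 11:06:34+00:00 (the rotation time; the mtime, by
#   contrast, is the time of the last line written to the file)
```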

The current log shows repeated and ongoing attempts to start Prometheus. Most of it is general chattiness; highlights:

caller=main.go:321 msg="Starting Prometheus" version="(version=2.8.1, branch=master, revision=)"
caller=main.go:322 build_context="(go=go1.12.7, user=GitLab-Omnibus, date=)"
caller=main.go:640 msg="Starting TSDB ..."
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566547200000 maxt=1566583200000 ulid=01DK97NVZQTSK9H1SQJT2D153S
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566842400000 maxt=1566856800000 ulid=01DKBKMTMEGGGA718A4RRKFH56
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566892800000 maxt=1566914400000 ulid=01DKNW2N3JSDKCEA2JTME0WGPJ
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566972000000 maxt=1566986400000 ulid=01DKPZQ90DGN0RW5H9R10NFVSN
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1567317600000 maxt=1567382400000 ulid=01DKQYMSP8RRJ95J8DPGTWHMBB

It then logs stopping scrape discovery manager, notify discovery manager, rule manager, notification manager and ..

caller=main.go:717 err="opening storage failed: block dir: \"/var/opt/gitlab/prometheus/data/01DM83875Y7WH9AK7EFTXYVFEA\": unexpected end of JSON input"

Yay. Nice full path to where the problem is. It’s like it’s making it easy for me.
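Those block directory names are ULIDs, whose first 10 characters encode a millisecond timestamp in Crockford base32, so they can be used to date a block. A quick decoder (my sketch, not GitLab or Prometheus code):

```python
from datetime import datetime, timedelta, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O or U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_time(ulid):
    """The first 10 characters of a ULID encode a 48-bit millisecond timestamp."""
    ms = 0
    for c in ulid[:10]:
        ms = ms * 32 + CROCKFORD.index(c)
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=ms)

print(ulid_time("01DM83875Y7WH9AK7EFTXYVFEA"))
# → 2019-09-08 09:28:33.726000+00:00, i.e. the failing block was created
#   the day before the upgrade
```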

Broken data?

The error suggests there’s a broken JSON file to find.

There’s a bunch of directories in the data directory (exactly where the error message points) ..

01DK97NVZQTSK9H1SQJT2D153S
01DKBKMTMEGGGA718A4RRKFH56
01DKNW2N3JSDKCEA2JTME0WGPJ
01DKPZQ90DGN0RW5H9R10NFVSN
01DKQYMSP8RRJ95J8DPGTWHMBB
01DM83875Y7WH9AK7EFTXYVFEA
01DM838A0DY0VDJNFYGP3RW6CW
  • At this level, the datestamps aren’t very interesting; all were last updated the day before the upgrade, probably when the machine booted.
  • The logs indicate that the first five load OK; it stumbles on the penultimate one.
  • Last good:
           # sizes in bytes
drwx------ 68       Sep 8 10:27 01DKQYMSP8RRJ95J8DPGTWHMBB
drwx------ 20       Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/chunks
-rw------- 12867691 Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/chunks/000001
-rw------- 1629968  Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/index
-rw------- 9        Sep 2 04:00 01DKQYMSP8RRJ95J8DPGTWHMBB/tombstones
-rw------- 827      Sep 8 10:27 01DKQYMSP8RRJ95J8DPGTWHMBB/meta.json
  • Broken:
drwx------ 68       Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA
drwx------ 20       Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/chunks
-rw------- 7424     Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/chunks/000001
-rw------- 47747    Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/index
-rw------- 0        Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/tombstones
-rw------- 0        Sep 8 10:28 01DM83875Y7WH9AK7EFTXYVFEA/meta.json
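For reference, a healthy meta.json looks roughly like this — the ulid, minTime and maxTime here are the last good block’s values from the log above, while the stats and compaction values are purely illustrative:

```json
{
  "ulid": "01DKQYMSP8RRJ95J8DPGTWHMBB",
  "minTime": 1567317600000,
  "maxTime": 1567382400000,
  "stats": {
    "numSamples": 1234567,
    "numSeries": 890,
    "numChunks": 7654
  },
  "compaction": {
    "level": 3,
    "sources": ["..."]
  },
  "version": 1
}
```

The zero-byte meta.json above clearly can’t parse as any of this, hence "unexpected end of JSON input".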

A Google search for the error turned up reports of the same empty-meta.json problem, and a straightforward fix.

Fix

Prometheus keeps trying to restart, and I want to stop it doing that while I work.

#  systemctl stop gitlab-runsvdir
#  gitlab-ctl status
fail: alertmanager: runsv not running
fail: gitaly: runsv not running
fail: gitlab-monitor: runsv not running
fail: gitlab-workhorse: runsv not running
fail: grafana: runsv not running
fail: logrotate: runsv not running
fail: nginx: runsv not running
fail: node-exporter: runsv not running
fail: postgres-exporter: runsv not running
fail: postgresql: runsv not running
fail: prometheus: runsv not running
fail: redis: runsv not running
fail: redis-exporter: runsv not running
fail: sidekiq: runsv not running
fail: unicorn: runsv not running

meta.json is empty for the two most recent directories, so I’ll delete both.

# ls -alh /var/opt/gitlab/prometheus/data/*/meta.json | cut -b 51-
680 Sep  8 10:27 /var/opt/gitlab/prometheus/data/01DK97NVZQTSK9H1SQJT2D153S/meta.json
571 Sep  8 10:27 /var/opt/gitlab/prometheus/data/01DKBKMTMEGGGA718A4RRKFH56/meta.json
612 Sep  8 10:27 /var/opt/gitlab/prometheus/data/01DKNW2N3JSDKCEA2JTME0WGPJ/meta.json
579 Sep  8 10:27 /var/opt/gitlab/prometheus/data/01DKPZQ90DGN0RW5H9R10NFVSN/meta.json
827 Sep  8 10:27 /var/opt/gitlab/prometheus/data/01DKQYMSP8RRJ95J8DPGTWHMBB/meta.json
  0 Sep  8 10:28 /var/opt/gitlab/prometheus/data/01DM83875Y7WH9AK7EFTXYVFEA/meta.json
  0 Sep  8 10:28 /var/opt/gitlab/prometheus/data/01DM838A0DY0VDJNFYGP3RW6CW/meta.json
# rm -rf /var/opt/gitlab/prometheus/data/01DM83875Y7WH9AK7EFTXYVFEA \
> /var/opt/gitlab/prometheus/data/01DM838A0DY0VDJNFYGP3RW6CW
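The same check can be scripted. A sketch (mine, not part of GitLab) that lists block directories whose meta.json is empty or unparseable:

```python
import json
from pathlib import Path

def broken_blocks(data_dir):
    """List TSDB block directories whose meta.json is missing, empty or invalid."""
    bad = []
    for block in sorted(Path(data_dir).iterdir()):
        if not block.is_dir() or block.name == "wal":  # the WAL has no meta.json
            continue
        try:
            json.loads((block / "meta.json").read_text())
        except (OSError, ValueError):  # missing file, or empty/invalid JSON
            bad.append(block.name)
    return bad

# e.g. broken_blocks("/var/opt/gitlab/prometheus/data")
```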

Restart.

#  systemctl start gitlab-runsvdir
#  gitlab-ctl status
run: alertmanager: (pid 8115) 139s; run: log: (pid 8114) 139s
run: gitaly: (pid 8121) 139s; run: log: (pid 8120) 139s
run: gitlab-monitor: (pid 8123) 139s; run: log: (pid 8122) 139s
run: gitlab-workhorse: (pid 8133) 139s; run: log: (pid 8132) 139s
run: grafana: (pid 8113) 139s; run: log: (pid 8112) 139s
run: logrotate: (pid 8129) 139s; run: log: (pid 8128) 139s
run: nginx: (pid 8131) 139s; run: log: (pid 8130) 139s
run: node-exporter: (pid 8137) 139s; run: log: (pid 8136) 139s
run: postgres-exporter: (pid 8125) 139s; run: log: (pid 8124) 139s
run: postgresql: (pid 8135) 139s; run: log: (pid 8134) 139s
run: prometheus: (pid 8119) 139s; run: log: (pid 8118) 139s
run: redis: (pid 8111) 139s; run: log: (pid 8110) 139s
run: redis-exporter: (pid 8127) 139s; run: log: (pid 8126) 139s
run: sidekiq: (pid 8117) 139s; run: log: (pid 8116) 139s
run: unicorn: (pid 8139) 139s; run: log: (pid 8138) 139s


Hunt for root cause

  • This investigation was done before applying the fix.
  • There’s enough free space to write files in this location.
  • I don’t use NFS; the data is on XFS

The previous log file (in /var/log/gitlab/prometheus) is @400000005d76324435f19fac.u

It contains a clean startup (on the 8th, the day before the GitLab upgrade) with many more data directories than are present now ..

2019-09-08_09:27:49.17776 level=info ts=2019-09-08T09:27:49.105611306Z 
caller=main.go:321 msg="Starting Prometheus" version="(version=2.8.1, branch=master, revision=)"
[..]
caller=main.go:640 msg="Starting TSDB ..."
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566122400000 maxt=1566136800000 ulid=01DJMZD18ZDB9R83M6GKVBGDVP
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566201600000 maxt=1566216000000 ulid=01DJQG461YW0R13VDE0KAABSKC
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566216000000 maxt=1566230400000 ulid=01DJSQXBT99DX6KGQZKD75SZ61
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566302400000 maxt=1566316800000 ulid=01DJT46FEVE95Q07ES7993K5Z3
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566374400000 maxt=1566403200000 ulid=01DJYYDAPFM23M44SX4HYKS42W
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566489600000 maxt=1566511200000 ulid=01DJZ8ZY26ZZ1QH9Y1X5Y41B5H
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566547200000 maxt=1566583200000 ulid=01DK97NVZQTSK9H1SQJT2D153S
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566842400000 maxt=1566856800000 ulid=01DKBKMTMEGGGA718A4RRKFH56
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566892800000 maxt=1566914400000 ulid=01DKNW2N3JSDKCEA2JTME0WGPJ
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1566972000000 maxt=1566986400000 ulid=01DKPZQ90DGN0RW5H9R10NFVSN
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1567382400000 maxt=1567389600000 ulid=01DKQYMEBW70D41CFTY4GSH80E
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1567317600000 maxt=1567382400000 ulid=01DKQYMSP8RRJ95J8DPGTWHMBB
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1567389600000 maxt=1567396800000 ulid=01DKR5G5KQ2ZX0NXMYJA5X90YC
caller=repair.go:48 component=tsdb msg="found healthy block" mint=1567396800000 maxt=1567404000000 ulid=01DM67W10QZKDTN1W2F22KBF1S
[..]
caller=main.go:655 msg="TSDB started"
[..]
caller=main.go:724 msg="Loading configuration file" filename=/var/opt/gitlab/prometheus/prometheus.yml
[..]
caller=main.go:609 msg="Server is ready to receive web requests."

It then appears to do some compaction (logged from compact.go):

2019-09-08_09:28:34.15589 level=info ts=2019-09-08T09:28:34.15526043Z caller=compact.go:496 
    component=tsdb msg="write block" mint=1567872000000 maxt=1567879200000 ulid=01DM83875Y7WH9AK7EFTXYVFEA duration=428.768903ms
2019-09-08_09:28:37.50895 level=info ts=2019-09-08T09:28:37.498124475Z caller=compact.go:441 
    component=tsdb msg="compact blocks" count=3 mint=1567382400000 maxt=1567404000000 ulid=01DM838A0DY0VDJNFYGP3RW6CW 
    sources="[01DKQYMEBW70D41CFTY4GSH80E 01DKR5G5KQ2ZX0NXMYJA5X90YC 01DM67W10QZKDTN1W2F22KBF1S]" duration=876.363746ms
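The mint/maxt values are milliseconds since the Unix epoch, so the range of the block being written is easy to decode (a quick sketch of my own):

```python
from datetime import datetime, timezone

def block_range(mint, maxt):
    """Convert a TSDB block's mint/maxt (milliseconds since epoch) to UTC."""
    to_utc = lambda ms: datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return to_utc(mint), to_utc(maxt)

start, end = block_range(1567872000000, 1567879200000)
print(start, end)
# → 2019-09-07 16:00:00+00:00 2019-09-07 18:00:00+00:00 (a two-hour block)
```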

And that’s the end of that log.

The current log starts with:

2019-09-09_11:06:35.89163 level=info ts=2019-09-09T11:06:35.888309089Z caller=main.go:321 
    msg="Starting Prometheus" version="(version=2.8.1, branch=master, revision=)"

And that fails on 01DM83875Y7WH9AK7EFTXYVFEA, the very block that compaction was writing the previous day when the old log stopped.

The Prometheus logs are in UTC, not system time, which is currently UTC+1.

  • 09:28 (UTC) is consistent with the boot at 10:27 local time on Sept 8th.
  • Journald (I have persistent logging set up) has nothing after Sep 08 10:27:59 (i.e. just before 09:28 Prometheus time), and the reboot info is odd: the end time for three of the last five uptimes is apparently 19:24 on September 9th.
# last -5 reboot
reboot   system boot  3.10.0-957.27.2. Tue Sep 10 14:23 - 17:02  (02:38)    
reboot   system boot  3.10.0-957.27.2. Mon Sep  9 19:24 - 17:02  (21:38)    
reboot   system boot  3.10.0-957.27.2. Mon Sep  9 12:06 - 19:24  (07:18)    
reboot   system boot  3.10.0-957.27.2. Sun Sep  8 10:27 - 19:24 (1+08:56)   
reboot   system boot  3.10.0-957.27.2. Sat Sep  7 17:56 - 19:24 (2+01:27) 
  • It’s a KVM guest; the host was up until 10:29 (just after 09:28 Prometheus time), although the uptime info there includes negative durations.
$ last -5 reboot
reboot   system boot  3.10.0-957.27.2. Tue Sep 10 15:23 - 17:07  (01:43)    
reboot   system boot  3.10.0-957.27.2. Mon Sep  9 13:05 - 17:07 (1+04:01)   
reboot   system boot  3.10.0-957.27.2. Sun Sep  8 11:26 - 10:29  (00:-57)   
reboot   system boot  3.10.0-957.27.2. Sat Sep  7 18:55 - 18:12  (00:-43)   
reboot   system boot  3.10.0-957.27.2. Sat Sep  7 18:09 - 18:12  (00:03) 
  • I’m not going to go digging further.

Conclusion

I don’t think my guests are always going down cleanly at shutdown.

Prometheus was compacting its data when the KVM guest went away.

Looking at Prometheus issue 4058, it’s noted that the code doesn’t sync this file before closing it, although it does everywhere else; I guess that fix will be in 2.9.0. That might have helped, though it’s a case, here, of putting on a tourniquet before shooting oneself in the foot.
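The general crash-safe pattern that fix implies is: write to a temp file, fsync it, rename over the target, then fsync the directory. A Python sketch of that pattern (not Prometheus’s actual Go code; all names here are mine):

```python
import json
import os
import tempfile

def write_json_durably(path, obj):
    """Write JSON so a crash leaves the old file or the new one, never an empty one."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".meta.json.")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())   # flush file contents to disk before the rename
        os.replace(tmp, path)      # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    dir_fd = os.open(dirname, os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)           # make the rename itself durable
    finally:
        os.close(dir_fd)
```

Without the fsync before the rename, a hard power-off can leave a zero-length file exactly as seen here.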
