Looks like there may be a bug in lvm2-2.02.180-10.el7_6.3.
Will raise a ticket for it with Centos.
Ticket raised with Centos. I tried to reproduce it on another machine; it takes more than just upgrading the package – as one would expect. No idea what the other variables might be.
I think the ticket needs to go upstream; no bites, and the prospect of it getting traction on the Centos bug tracker is probably low.
Got a second machine that’s demonstrating the issue, and I’ve also seen it on a KVM guest. As well as the package downgrade (see below), stopping lvmetad is another workaround. LVM commands moan about it not running, but I understand it’s going away anyway in RHEL 8 (?), and the commands work fine by scanning the disks directly, so perhaps this can be done persistently ..
systemctl stop lvm2-lvmetad.socket
systemctl stop lvm2-lvmetad.service
The issue doesn’t necessarily reappear when lvmetad is restarted or when the server is rebooted, which makes this even more subtle. Downgrading the package has the benefit of not being subtle.
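Making the stop persistent could look something like the following – a sketch only, with the `use_lvmetad` edit done against a throwaway copy of the config for illustration (on a real box it would be /etc/lvm/lvm.conf, plus `systemctl disable lvm2-lvmetad.socket lvm2-lvmetad.service` to keep the units down across reboots):

```shell
# Hypothetical minimal lvm.conf fragment, for illustration only
cat > /tmp/lvm.conf <<'EOF'
global {
    use_lvmetad = 1
}
EOF

# Flip the setting so LVM commands stop expecting lvmetad and
# fall back to scanning the disks directly
sed -i 's/use_lvmetad = 1/use_lvmetad = 0/' /tmp/lvm.conf

grep use_lvmetad /tmp/lvm.conf
```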
Centos 7.6; something odd has happened to one of the volume groups.
I noticed something was wrong because it’s enforced by puppet, and puppet was trying to recreate it from the PV level upwards. The safeguard in the module is at the PV level, and because the PV wasn’t showing either, puppet thought it could proceed. This could potentially have been destructive.
Now you see it ..
$ df -h | grep mapper
/dev/mapper/yeagervg-lv00sys00   32G  7.1G   25G  23% /
/dev/mapper/datavg-lv01bkp00    400G  337G   64G  85% /srv/data/backup
/dev/mapper/yeagervg-lv00hom00   10G  1.5G  8.6G  15% /home
Now you don’t.
# vgs
  VG       #PV #LV #SN Attr   VSize    VFree
  yeagervg   1   3   0 wz--n- <230.76g <156.76g
# vgdisplay datavg
  Volume group "datavg" not found
  Cannot process volume group datavg
Possible clues in syslog
lvmetad: vg_lookup vgid blah-blah-etc-etc-blah-blah-blah name datavg found incomplete mapping uuid none name none
- Googling for that error didn’t turn up a fix.
- /etc/lvm/backup/datavg (10 months old) contains a readable specification for the VG, including ids for the volume group, logical volume, and physical volume. The ID for the VG matches that in the error.
- There’s a volume with the right PV ID in /dev/disk/by-id – based on the backup
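The ids can be pulled straight out of the backup file for comparison with the vgid in the syslog error. A sketch, against a made-up fragment in the shape of /etc/lvm/backup/datavg (the ids here are invented):

```shell
# Made-up fragment in the shape of /etc/lvm/backup/datavg;
# the ids are invented for illustration
cat > /tmp/datavg.backup <<'EOF'
datavg {
    id = "AAAAAA-1111-2222-3333-4444-5555-666666"
    physical_volumes {
        pv0 {
            id = "BBBBBB-7777-8888-9999-aaaa-bbbb-cccccc"
            device = "/dev/md10"
        }
    }
}
EOF

# The VG id (first) and PV id (second), ready to compare against
# the vg_lookup vgid in syslog and the entries in /dev/disk/by-id
grep 'id = ' /tmp/datavg.backup
```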
Other clues include
- The single physical volume (an MD RAID device) is not showing as a known LVM PV.
- /proc/mdstat shows the device. pvdisplay refuses to acknowledge it.
- lvmdiskscan acknowledges it, but this doesn’t wake anything up.
# pvdisplay /dev/md10
  Failed to find physical volume "/dev/md10".
# lvmdiskscan
  [..]
  /dev/md2  [    <230.76 GiB] LVM physical volume
  [..]
  /dev/md10 [      <1.82 TiB] LVM physical volume
  [..]
  2 LVM physical volumes
# pvs
  PV         VG       Fmt  Attr PSize    PFree
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g
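Another check that sidesteps lvmetad entirely: LVM2 writes its on-disk label (magic string LABELONE) into one of the first four sectors of a PV, normally sector 1, so it can be read back with dd. A sketch using a throwaway file standing in for /dev/md10:

```shell
# Build a fake 4-sector device with the LVM2 label magic in
# sector 1, standing in for /dev/md10
dd if=/dev/zero of=/tmp/fakepv bs=512 count=4 2>/dev/null
printf 'LABELONE' | dd of=/tmp/fakepv bs=512 seek=1 conv=notrunc 2>/dev/null

# If the on-disk label is intact this prints LABELONE, however
# confused lvmetad's cache is
dd if=/tmp/fakepv bs=512 skip=1 count=1 2>/dev/null | grep -a -o LABELONE
```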
One of the handful of references Google turns up is Redhat bugzilla 1647167. Not a lot of relevance, except that it’s closed with a comment that lvmetad is going away.
From its man page:
The lvmetad daemon caches LVM metadata so that LVM commands can read metadata from the cache rather than scanning disks. This can be an advantage because scanning disks is time consuming and may interfere with the normal work of the system. lvmetad can be a disadvantage when disk event notifications from the system are unreliable.
lvmetad does not read metadata from disks itself. Instead, it relies on an LVM command, like pvscan --cache, to read metadata from disks and send it to lvmetad to be cached.
So, it has its uses, but it’s not critical. Caches have problems though .. a glance at the pvscan man page suggests:
# pvscan --cache /dev/md10
# pvs
  PV         VG       Fmt  Attr PSize    PFree
  /dev/md10  datavg   lvm2 a--  <1.82t   <1.43t
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g
# vgs
  VG       #PV #LV #SN Attr   VSize    VFree
  datavg     1   1   0 wz--n- <1.82t   <1.43t
  yeagervg   1   3   0 wz--n- <230.76g <156.76g
Question is, will it survive a reboot.
Still not fixed
Answer is: no.
Now that I have a clearer idea of the problem (PV and VG vanished, volume still happily mounted), perhaps Google will be more helpful.
This stackexchange question was the best thing I found; possibly exporting and importing the VG might fix it.
Alternatively, if it looks like lvmetad is the issue, what about turning it off?
# lvmconfig --withcomments global/use_lvmetad
# Configuration option global/use_lvmetad.
# Use lvmetad to cache metadata and reduce disk scanning.
# When enabled (and running), lvmetad provides LVM commands with VG
# metadata and PV state. LVM commands then avoid reading this
# information from disks which can be slow. When disabled (or not
# running), LVM commands fall back to scanning disks to obtain VG
# metadata. lvmetad is kept updated via udev rules which must be set
# up for LVM to work correctly. (The udev rules should be installed
# by default.) Without a proper udev setup, changes in the system's
# block device configuration will be unknown to LVM, and ignored
# until a manual 'pvscan --cache' is run. If lvmetad was running
# while use_lvmetad was disabled, it must be stopped, use_lvmetad
# enabled, and then started. When using lvmetad, LV activation is
# switched to an automatic, event-based mode. In this mode, LVs are
# activated based on incoming udev events that inform lvmetad when
# PVs appear on the system. When a VG is complete (all PVs present),
# it is auto-activated. The auto_activation_volume_list setting
# controls which LVs are auto-activated (all by default.)
# When lvmetad is updated (automatically by udev events, or directly
# by pvscan --cache), devices/filter is ignored and all devices are
# scanned by default. lvmetad always keeps unfiltered information
# which is provided to LVM commands. Each LVM command then filters
# based on devices/filter. This does not apply to other, non-regexp,
# filtering settings: component filters such as multipath and MD
# are checked during pvscan --cache. To filter a device and prevent
# scanning from the LVM system entirely, including lvmetad, use
# devices/global_filter.
use_lvmetad=1
This might be a way to tackle it (tedious via puppet!) but alternatively, what broke it?
- The messages logs go back about two months, to Dec 11th; the first and only incident of puppet trying to rebuild the PV and VG was Feb 19th.
- I have puppet runs on Jan 2nd and Jan 8th – the machine wasn’t online at the same time as the master in between.
- yum.log runs from Jan 4th and was last updated Feb 19th.
Feb 18 08:33:23 Updated: 7:lvm2-2.02.180-10.el7_6.3.x86_64
- /var/log/yum.log-20190104 runs from Jan 1st and was last updated Jan 2nd.
Jan 02 14:38:35 Updated: 7:lvm2-2.02.180-10.el7_6.2.x86_64
- /var/log/yum.log-20190101 runs from May 27th and was last updated Nov 23rd.
May 27 16:39:35 Updated: 7:lvm2-2.02.177-4.el7.x86_64
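The trawl through the rotated logs boils down to one grep. A sketch using the lines quoted above (on the real system it would be `grep -h 'lvm2-' /var/log/yum.log*`):

```shell
# The lvm2 update lines from the logs above, in one file
cat > /tmp/yum.log <<'EOF'
Feb 18 08:33:23 Updated: 7:lvm2-2.02.180-10.el7_6.3.x86_64
Jan 02 14:38:35 Updated: 7:lvm2-2.02.180-10.el7_6.2.x86_64
May 27 16:39:35 Updated: 7:lvm2-2.02.177-4.el7.x86_64
EOF

# One line of package history per update
grep -h 'lvm2-' /tmp/yum.log
```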
The smoking gun looks like it could be the Feb 18th update. The system is running the el7_6.3 packages.
The following puppet run has the issue; if it had been introduced on Jan 2nd, I’d expect the Jan 8th puppet runs to have tripped over it. The system was rebooted several times between the Jan 2nd update and the Jan 8th puppet run, so there was plenty of opportunity for the PV and VG to vanish.
So, can I find the release notes for the package (the errata from Redhat is below)? Google says:
The lvm2-2.02.180-10.el7_6.3 package introduces a new way of detecting MDRAID and multipath devices.
Which would seem to be the right sort of change to produce this sort of misbehaviour.
# yum downgrade lvm2-libs-2.02.180-10.el7_6.2 \
    lvm2-2.02.180-10.el7_6.2 \
    device-mapper-event-1.02.149-10.el7_6.2 \
    device-mapper-event-libs-1.02.149-10.el7_6.2 \
    device-mapper-1.02.149-10.el7_6.2 \
    device-mapper-libs-1.02.149-10.el7_6.2
[..]
Dependencies Resolved
==============================================================================================
 Package                  Arch   Version               Repository Size
==============================================================================================
Downgrading:
 device-mapper            x86_64 7:1.02.149-10.el7_6.2 updates    292 k
 device-mapper-event      x86_64 7:1.02.149-10.el7_6.2 updates    188 k
 device-mapper-event-libs x86_64 7:1.02.149-10.el7_6.2 updates    187 k
 device-mapper-libs       x86_64 7:1.02.149-10.el7_6.2 updates    320 k
 lvm2                     x86_64 7:2.02.180-10.el7_6.2 updates    1.3 M
 lvm2-libs                x86_64 7:2.02.180-10.el7_6.2 updates    1.1 M

Transaction Summary
==============================================================================================
Downgrade  6 Packages
# uptime
 14:13:36 up 1 min,  2 users,  load average: 2.31, 1.02, 0.38
# vgs
  VG       #PV #LV #SN Attr   VSize    VFree
  datavg     1   1   0 wz--n- <1.82t   <1.43t
  yeagervg   1   3   0 wz--n- <230.76g <156.76g
# pvs
  PV         VG       Fmt  Attr PSize    PFree
  /dev/md10  datavg   lvm2 a--  <1.82t   <1.43t
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g
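Until a fixed build arrives, something needs to stop the next yum update reinstating el7_6.3. One option is a yum exclude – shown here against a throwaway copy of the config for illustration (on the real box it would be /etc/yum.conf, or yum-plugin-versionlock for a tidier pin):

```shell
# Hypothetical minimal yum.conf, for illustration only
cat > /tmp/yum.conf <<'EOF'
[main]
gpgcheck=1
EOF

# Stop yum (and hence puppet-driven updates) pulling the broken
# packages straight back in
echo 'exclude=lvm2* device-mapper*' >> /tmp/yum.conf

grep exclude /tmp/yum.conf
```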