AWOL Linux LVM volume group and physical volume

TL;DR

Looks like there may be a bug in lvm2-2.02.180-10.el7_6.3. Will raise a ticket for it with CentOS.

Update 2019-02-23

Ticket raised with CentOS. I tried to reproduce it on another machine; as one would expect, it takes more than just upgrading the package. No idea what the other variables might be.

Update 2019-04-22

I think the ticket needs to go upstream; no bites so far, and the prospect of it getting traction on the CentOS bug tracker is probably low.

Got a second machine that’s demonstrating the issue, and I’ve also seen it on a KVM guest. Besides the package downgrade (see below), stopping lvmetad is another workaround. LVM commands moan about it not running, but they work fine by scanning the disks directly, and I understand lvmetad is going away in RHEL8 (?) anyway, so perhaps this can be done persistently ..

systemctl stop lvm2-lvmetad.socket
systemctl stop lvm2-lvmetad.service

The issue doesn’t necessarily reappear when lvmetad is restarted or when the server is rebooted, which makes this even more subtle. Downgrading the package has the benefit of not being subtle.

Workings

CentOS 7.6; something odd has happened to one of the volume groups.

I noticed something was wrong because the volume group is enforced by puppet, and puppet was trying to recreate it from the PV level upwards. The safeguard in the module is at the PV level, and because the PV wasn’t showing either, puppet thought it could proceed. This could potentially have been destructive.
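A guard that reads the device itself, rather than asking LVM, might have avoided that risk. The following is only a sketch and not what the puppet module does: has_lvm_signature is a made-up name, and it assumes util-linux blkid is installed and the check runs as root. blkid -p probes the device directly, so the answer does not come from lvmetad’s cache.

```shell
#!/bin/sh
# Succeed only if the device carries an on-disk LVM2 PV signature.
# blkid -p does a low-level probe of the device; it does not consult LVM.
# has_lvm_signature is a hypothetical helper name, not part of any package.
has_lvm_signature() {
    [ "$(blkid -p -o value -s TYPE "$1" 2>/dev/null)" = "LVM2_member" ]
}
```

Something like `has_lvm_signature /dev/md10 && echo "refusing to pvcreate"` could then sit in front of any step that initialises the device.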

Now you see it ..

$ df -h | grep mapper
/dev/mapper/yeagervg-lv00sys00   32G  7.1G   25G  23% /
/dev/mapper/datavg-lv01bkp00    400G  337G   64G  85% /srv/data/backup
/dev/mapper/yeagervg-lv00hom00   10G  1.5G  8.6G  15% /home

Now you don’t.

# vgs
  VG       #PV #LV #SN Attr   VSize    VFree   
  yeagervg   1   3   0 wz--n- <230.76g <156.76g
# vgdisplay datavg
  Volume group "datavg" not found
  Cannot process volume group datavg

Possible clues in syslog

lvmetad[2599]: vg_lookup vgid blah-blah-etc-etc-blah-blah-blah name datavg found incomplete mapping uuid none name none
  • Googling for that error didn’t turn up a fix.
  • /etc/lvm/backup/datavg (10 months old) contains a readable specification for the VG, including ids for the volume group, logical volume, and physical volume.  The ID for the VG matches that in the error.
  • Going by the backup, there’s a device with the right PV ID in /dev/disk/by-id.
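Comparing the ID in the backup file with the one in the syslog error can be scripted. A rough sketch, assuming the usual backup layout in which the VG’s own id is the first id = "..." entry in the file, ahead of the PV and LV ids; vg_id_from_backup is a made-up helper name.

```shell
#!/bin/sh
# Print the first id = "..." value found in an LVM metadata backup file.
# Assumption: the VG's own id comes before the PV and LV ids in the file.
# $1: path to the backup file, e.g. /etc/lvm/backup/datavg
vg_id_from_backup() {
    awk -F'"' '/^[[:space:]]*id[[:space:]]*=/ { print $2; exit }' "$1"
}
```

Running `vg_id_from_backup /etc/lvm/backup/datavg` and comparing the result by eye with the vgid in the lvmetad message is enough for a one-off check.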

Other clues include

  • The single physical volume (an MD RAID device) is not showing as a known LVM PV.
  • /proc/mdstat shows the device.  pvdisplay refuses to acknowledge it.

lvmdiskscan acknowledges it, but this doesn’t wake anything up.

#  pvdisplay /dev/md10
  Failed to find physical volume "/dev/md10".
# lvmdiskscan
[..]
  /dev/md2                [    <230.76 GiB] LVM physical volume
[..]
  /dev/md10               [      <1.82 TiB] LVM physical volume
[..]
  2 LVM physical volumes
# pvs
  PV         VG       Fmt  Attr PSize    PFree   
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g
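The disagreement between lvmdiskscan and pvs is the detectable symptom, so it can be checked for mechanically. A sketch only: missing_pvs is a made-up name, and it works on saved command output (two files) so that the parsing can be tried without root.

```shell
#!/bin/sh
# Print devices that lvmdiskscan labels as PVs but that pvs does not list.
# $1: file holding `lvmdiskscan` output
# $2: file holding `pvs --noheadings -o pv_name` output
# missing_pvs is a hypothetical helper name.
missing_pvs() {
    scan_list=$(mktemp)
    pvs_list=$(mktemp)
    # Keep only per-device lines; the "N LVM physical volumes" summary is skipped.
    awk '/LVM physical volume[[:space:]]*$/ && $1 ~ /^\// { print $1 }' "$1" \
        | sort -u > "$scan_list"
    awk 'NF { print $1 }' "$2" | sort -u > "$pvs_list"
    # comm -23 prints lines that appear only in the first (scan) list.
    comm -23 "$scan_list" "$pvs_list"
    rm -f "$scan_list" "$pvs_list"
}
```

With the output shown above, this would print /dev/md10 and nothing else; any output at all means the cache and the disks disagree.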

One of the handful of references in Google is Red Hat Bugzilla 1647167. Not a lot of relevance, except it’s closed with a comment that lvmetad is going away.

From its man page:

The lvmetad daemon caches LVM metadata so that LVM commands can read metadata from the cache rather than scanning disks. This can be an advantage because scanning disks is time consuming and may interfere with the normal work of the system. lvmetad can be a disadvantage when disk event notifications from the system are unreliable.

lvmetad does not read metadata from disks itself. Instead, it relies on an LVM command, like pvscan --cache, to read metadata from disks and send it to lvmetad to be cached.

So, it has its uses, but it’s not critical. Caches have problems though .. a glance at the pvscan manpage suggests:

# pvscan --cache /dev/md10
# pvs
  PV         VG       Fmt  Attr PSize    PFree   
  /dev/md10  datavg   lvm2 a--    <1.82t   <1.43t
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g
# vgs
  VG       #PV #LV #SN Attr   VSize    VFree   
  datavg     1   1   0 wz--n-   <1.82t   <1.43t
  yeagervg   1   3   0 wz--n- <230.76g <156.76g

Question is, will it survive a reboot?

Still not fixed

Answer is: no.

Now that I have a clearer idea of the problem (PV and VG vanished, logical volume still happily mounted), perhaps Google will be more helpful.

This Stack Exchange question was the best thing I found; exporting and importing the VG might fix it.

Alternatively, if it looks like lvmetad is the issue, what about turning it off?

# lvmconfig --withcomments global/use_lvmetad

# Configuration option global/use_lvmetad.
# Use lvmetad to cache metadata and reduce disk scanning.
# When enabled (and running), lvmetad provides LVM commands with VG
# metadata and PV state. LVM commands then avoid reading this
# information from disks which can be slow. When disabled (or not
# running), LVM commands fall back to scanning disks to obtain VG
# metadata. lvmetad is kept updated via udev rules which must be set
# up for LVM to work correctly. (The udev rules should be installed
# by default.) Without a proper udev setup, changes in the system's
# block device configuration will be unknown to LVM, and ignored
# until a manual 'pvscan --cache' is run. If lvmetad was running
# while use_lvmetad was disabled, it must be stopped, use_lvmetad
# enabled, and then started. When using lvmetad, LV activation is
# switched to an automatic, event-based mode. In this mode, LVs are
# activated based on incoming udev events that inform lvmetad when
# PVs appear on the system. When a VG is complete (all PVs present),
# it is auto-activated. The auto_activation_volume_list setting
# controls which LVs are auto-activated (all by default.)
# When lvmetad is updated (automatically by udev events, or directly
# by pvscan --cache), devices/filter is ignored and all devices are
# scanned by default. lvmetad always keeps unfiltered information
# which is provided to LVM commands. Each LVM command then filters
# based on devices/filter. This does not apply to other, non-regexp,
# filtering settings: component filters such as multipath and MD
# are checked during pvscan --cache. To filter a device and prevent
# scanning from the LVM system entirely, including lvmetad, use
# devices/global_filter.
use_lvmetad=1
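Turning it off for good would mean changing that setting and keeping the units from starting again. A sketch under assumptions, not something I have run on the affected machine: the function names are made up, the config file is assumed to be /etc/lvm/lvm.conf, and masking the units is my assumption of how to stop socket activation bringing the daemon back.

```shell
#!/bin/sh
# Set use_lvmetad to 0 in an lvm.conf-style file, keeping a .bak copy.
# Commented-out lines are left alone. $1: path to the config file.
# Both function names here are hypothetical.
disable_lvmetad_in_conf() {
    sed -i.bak -E \
        's/^([[:space:]]*use_lvmetad[[:space:]]*=)[[:space:]]*1[[:space:]]*$/\1 0/' "$1"
}

# Stop the daemon and its socket, then mask both so they stay off (needs root).
stop_lvmetad_units() {
    systemctl stop lvm2-lvmetad.socket lvm2-lvmetad.service
    systemctl mask lvm2-lvmetad.socket lvm2-lvmetad.service
}
```

Usage would be `disable_lvmetad_in_conf /etc/lvm/lvm.conf && stop_lvmetad_units`, followed by `lvmconfig global/use_lvmetad` to confirm the setting took.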

What changed?

Turning lvmetad off might be a way to tackle it (tedious via puppet!) but alternatively, what broke it?

  • messages logs go back about two months to Dec 11th; the first and only incident of puppet trying to rebuild the PV and VG was Feb 19th.
  • I have puppet runs on Jan 2nd and Jan 8th – the machine wasn’t online at the same time as the master in between.
  • yum.log runs from Jan 4th and was last updated Feb 19th.
    Feb 18 08:33:23 Updated: 7:lvm2-2.02.180-10.el7_6.3.x86_64
  • /var/log/yum.log-20190104 runs from Jan 1st and was last updated Jan 2nd.
    Jan 02 14:38:35 Updated: 7:lvm2-2.02.180-10.el7_6.2.x86_64
  • /var/log/yum.log-20190101 runs from May 27th and was last updated Nov 23rd.
    May 27 16:39:35 Updated: 7:lvm2-2.02.177-4.el7.x86_64
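Pulling those lines out of the current and rotated yum logs is a one-line grep. A small sketch; lvm2_updates is a made-up name, and the pattern assumes the "Updated:" log format shown above.

```shell
#!/bin/sh
# Print the yum log lines recording an update of the lvm2 package itself
# (lvm2-libs and friends are not matched). Arguments: one or more yum log files.
# lvm2_updates is a hypothetical helper name.
lvm2_updates() {
    grep -hE ' Updated: ([0-9]+:)?lvm2-[0-9]' "$@"
}
```

For example, `lvm2_updates /var/log/yum.log*` covers the live log and the rotated copies in one go.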

The smoking gun looks like it could be the Feb 18th update.  System is running the el7_6.3 packages.

The puppet run following that update has the issue. If it had been introduced by the Jan 2nd update, I’d expect the Jan 8th puppet run to have tripped over it: the system was rebooted several times between the Jan 2nd update and the Jan 8th puppet run, so there was plenty of opportunity for the PV and VG to vanish.

So, can I find the release notes for the package? (Errata from Red Hat is below.) Google says:

Bugzilla 1676921:

The lvm2-2.02.180-10.el7_6.3 package introduces a new way of detecting MDRAID and multipath devices.

Which would seem to be the right sort of change to produce this sort of misbehaviour.

Downgrade!

# yum downgrade lvm2-libs-2.02.180-10.el7_6.2 \
                lvm2-2.02.180-10.el7_6.2 \
                device-mapper-event-1.02.149-10.el7_6.2 \
                device-mapper-event-libs-1.02.149-10.el7_6.2 \
                device-mapper-1.02.149-10.el7_6.2 \
                device-mapper-libs-1.02.149-10.el7_6.2
[..]
Dependencies Resolved

==============================================================================================
 Package                        Arch         Version                      Repository     Size
==============================================================================================
Downgrading:
 device-mapper                  x86_64       7:1.02.149-10.el7_6.2        updates       292 k
 device-mapper-event            x86_64       7:1.02.149-10.el7_6.2        updates       188 k
 device-mapper-event-libs       x86_64       7:1.02.149-10.el7_6.2        updates       187 k
 device-mapper-libs             x86_64       7:1.02.149-10.el7_6.2        updates       320 k
 lvm2                           x86_64       7:2.02.180-10.el7_6.2        updates       1.3 M
 lvm2-libs                      x86_64       7:2.02.180-10.el7_6.2        updates       1.1 M

Transaction Summary
==============================================================================================
Downgrade  6 Packages

Later ..

# uptime
 14:13:36 up 1 min,  2 users,  load average: 2.31, 1.02, 0.38
# vgs
  VG       #PV #LV #SN Attr   VSize    VFree   
  datavg     1   1   0 wz--n-   <1.82t   <1.43t
  yeagervg   1   3   0 wz--n- <230.76g <156.76g
# pvs
  PV         VG       Fmt  Attr PSize    PFree   
  /dev/md10  datavg   lvm2 a--    <1.82t   <1.43t
  /dev/md2   yeagervg lvm2 a--  <230.76g <156.76g

More information

 
