Details
Type: Bug
Resolution: Fixed
Priority: Critical
Version: Lustre 2.8.0
Components: None

Environment:
CentOS 7.2 (Kernel: Various)
lustre-master Build 3424 & 3423
Hardware:
10x Lustre Servers (Intel Wildcat Pass, E5 v3 & 128GB)
Single LNET Network - o2ib0 (Omni-Path) IFS 10.1.1.0.9
All block devices for Lustre are NVMe, either DC P3700 or DC P3600 (excluding the MGT, which is on a standard SSD).
_____
kernel-3.10.0-327.22.2.el7_lustre.x86_64
kernel-debuginfo-3.10.0-327.22.2.el7_lustre.x86_64
kernel-debuginfo-common-x86_64-3.10.0-327.22.2.el7_lustre.x86_64
kernel-devel-3.10.0-327.22.2.el7_lustre.x86_64
kernel-headers-3.10.0-327.22.2.el7_lustre.x86_64
kernel-tools-3.10.0-327.22.2.el7_lustre.x86_64
kernel-tools-debuginfo-3.10.0-327.22.2.el7_lustre.x86_64
kernel-tools-libs-3.10.0-327.22.2.el7_lustre.x86_64
kernel-tools-libs-devel-3.10.0-327.22.2.el7_lustre.x86_64
kmod-lustre-2.8.56_26_g6fad3ab-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.8.56_26_g6fad3ab-1.el7.x86_64
kmod-lustre-osd-zfs-2.8.56_26_g6fad3ab-1.el7.x86_64
kmod-lustre-tests-2.8.56_26_g6fad3ab-1.el7.x86_64
kmod-spl-3.10.0-327.22.2.el7_lustre.x86_64-0.6.5.7-1.el7.x86_64
kmod-spl-devel-3.10.0-327.22.2.el7_lustre.x86_64-0.6.5.7-1.el7.x86_64
kmod-zfs-3.10.0-327.22.2.el7_lustre.x86_64-0.6.5.7-1.el7.x86_64
kmod-zfs-devel-3.10.0-327.22.2.el7_lustre.x86_64-0.6.5.7-1.el7.x86_64
lustre-2.8.56_26_g6fad3ab-1.el7.x86_64
lustre-debuginfo-2.8.56_26_g6fad3ab-1.el7.x86_64
lustre-iokit-2.8.56_26_g6fad3ab-1.el7.x86_64
lustre-osd-ldiskfs-mount-2.8.56_26_g6fad3ab-1.el7.x86_64
lustre-osd-zfs-mount-2.8.56_26_g6fad3ab-1.el7.x86_64
lustre-tests-2.8.56_26_g6fad3ab-1.el7.x86_64
perf-3.10.0-327.22.2.el7_lustre.x86_64
perf-debuginfo-3.10.0-327.22.2.el7_lustre.x86_64
python-perf-3.10.0-327.22.2.el7_lustre.x86_64
python-perf-debuginfo-3.10.0-327.22.2.el7_lustre.x86_64
Description
While testing Lustre DNE2, I noticed an issue with the latest master builds. When mounting storage targets on servers other than the one hosting the MGT, I get a kernel panic with the output below. I have verified, to the best of my ability, that this is not a network problem. I have also tried an FE build, which works, and an earlier master build (3419), which also works:
[root@zlfs2-oss1 ~]# mount -vvv -t lustre /dev/nvme0n1 /mnt/MDT0000
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = /dev/nvme0n1
arg[5] = /mnt/MDT0000
source = /dev/nvme0n1 (/dev/nvme0n1), target = /mnt/MDT0000
options = rw
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Writing CONFIGS/mountdata
mounting device /dev/nvme0n1 at /mnt/MDT0000, flags=0x1000000 options=osd=osd-ldiskfs,user_xattr,errors=remount-ro,mgsnode=192.168.5.21@o2ib,virgin,update,param=mgsnode=192.168.5.21@o2ib,svname=zlfs2-MDT0000,device=/dev/nvme0n1
mount.lustre: cannot parse scheduler options for '/sys/block/nvme0n1/queue/scheduler'

Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) LBUG

Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
kernel:Kernel panic - not syncing: LBUG
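For context, the LBUG above is a refcount sanity check during device teardown: lu_device_fini() asserts that the device's reference count has already dropped to zero and panics if a reference is still held. The following standalone sketch only mimics that pattern; the struct, macro, and function names are illustrative stand-ins, not the actual code from lu_object.c.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for Lustre's lu_device refcounting; simplified,
 * not copied from lu_object.c. */
struct toy_device {
        int td_ref;             /* references held on this device */
};

/* Mimics an LASSERTF-style check: print the failed condition and abort,
 * which on a real server shows up as the LBUG / kernel panic above. */
#define TOY_ASSERTF(cond, fmt, ...)                                     \
        do {                                                            \
                if (!(cond)) {                                          \
                        fprintf(stderr, "ASSERTION( %s ) failed: " fmt, \
                                #cond, __VA_ARGS__);                    \
                        abort();        /* LBUG equivalent */           \
                }                                                       \
        } while (0)

static void toy_device_get(struct toy_device *d) { d->td_ref++; }
static void toy_device_put(struct toy_device *d) { d->td_ref--; }

/* Teardown requires every reference to have been released first. */
static void toy_device_fini(struct toy_device *d)
{
        TOY_ASSERTF(d->td_ref == 0, "Refcount is %d\n", d->td_ref);
}

int main(void)
{
        struct toy_device dev = { .td_ref = 0 };

        toy_device_get(&dev);   /* e.g. the mount path takes a reference */
        /* toy_device_put(&dev);   missing release: ref leaks to 1       */
        toy_device_fini(&dev);  /* trips the assertion, like the panic   */
        return 0;
}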
Attached is some debugging output and additional info.
Builds Tried:
master b3424 - issues
master b3423 - issues
master b3420 - issues
master b3419 - works
fe 2.8 b18 - works
BTW, the cause of the second bug is that if a new OST mounts before the MGC has pulled the nodemap config from the MGS, it creates a new blank config on disk. Part of that code erroneously assumed it was running on the MGS (normally all new records are created there and then sent to the OSTs), so it returned an error. That is why the first OST failed to mount. By the time the other OSTs were mounted, the MGC was already connected to the MGS, so it could pull the config and save it properly. That is why the other OSTs could mount after rebooting, while nvme0n1 could not mount until the others had. A minimal sketch of this logic follows.
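The sketch below uses hypothetical names and is not the actual nodemap code; it just illustrates the control flow described above: a target loading its local nodemap config finds none on disk and needs to create a blank one, but the buggy check rejected that on any node that is not the MGS. The fix, as described, lets non-MGS targets create the blank local copy, which the MGC later overwrites with the config pulled from the MGS.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical node state; illustrative only. */
struct node {
        bool is_mgs;             /* true only on the MGS node            */
        bool has_config_on_disk; /* local copy of the nodemap config?    */
};

/* Buggy version: assumed new records are only ever created on the MGS,
 * so a non-MGS target needing a blank local config got an error and its
 * first mount failed (the behaviour reported above). */
static int load_config_buggy(struct node *n)
{
        if (n->has_config_on_disk)
                return 0;               /* nothing to create             */
        if (!n->is_mgs)
                return -1;              /* wrong: rejects OST/MDT nodes  */
        n->has_config_on_disk = true;   /* create blank config           */
        return 0;
}

/* Fixed version (per the description): any target may create the blank
 * local copy; the MGC later replaces it with the real config from the
 * MGS once connected. */
static int load_config_fixed(struct node *n)
{
        if (!n->has_config_on_disk)
                n->has_config_on_disk = true;
        return 0;
}

int main(void)
{
        struct node ost = { .is_mgs = false, .has_config_on_disk = false };

        printf("buggy: %d\n", load_config_buggy(&ost));  /* -1: mount fails */
        printf("fixed: %d\n", load_config_fixed(&ost));  /*  0: mount works */
        return 0;
}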