Description
conf-sanity test_99 started failing on 2018-09-01 in SLES12 SP3 testing. For the failure at https://testing.whamcloud.com/test_sets/dfe70e50-ba62-11e8-a7de-52540065bddc, the test_log shows:
== conf-sanity test 99: Adding meta_bg option ======================================================== 11:07:52 (1537121272)
CMD: trevis-9vm3 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-9vm3 debugfs -c -R stats /dev/mapper/ost1_flakey
trevis-9vm3: debugfs 1.42.13.wc6 (05-Feb-2017)
trevis-9vm3: /dev/mapper/ost1_flakey: catastrophic mode - not reading inode or group bitmaps
params: --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
CMD: trevis-9vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts || true
CMD: trevis-9vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
CMD: trevis-9vm3 mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
trevis-9vm3: mkfs.lustre: Unable to mount /dev/mapper/ost1_flakey: Structure needs cleaning
trevis-9vm3:
trevis-9vm3: mkfs.lustre FATAL: failed to write local files
trevis-9vm3: mkfs.lustre: exiting with 117 (Structure needs cleaning)
Permanent disk data:
Target: lustre:OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x62 (OST first_time update )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.9.4.97@tcp sys.timeout=20
device size = 2048MB
formatting backing filesystem ldiskfs on /dev/mapper/ost1_flakey
target name lustre:OST0000
4k blocks 50000
options -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:OST0000 -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F /dev/mapper/ost1_flakey 50000
conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params
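For reference, the "4k blocks 50000" figure in the log above is consistent with --device-size=200000 being read in KiB. The sketch below reproduces that arithmetic and estimates how many block groups such a filesystem has. It assumes the KiB interpretation and the usual ext4 layout of 8 * block_size blocks per group; the function names are hypothetical and this is not how mkfs.lustre or mke2fs compute these values internally.

```python
import math

BLOCK_SIZE = 4096  # bytes, from "-b 4096" in the mkfs_cmd line


def blocks_for_device(device_size_kib, block_size=BLOCK_SIZE):
    """Convert a device size in KiB (assumed unit) to a block count."""
    return device_size_kib * 1024 // block_size


def group_count(blocks, block_size=BLOCK_SIZE):
    """Estimate block groups, assuming 8 * block_size blocks per group."""
    blocks_per_group = 8 * block_size  # one bitmap block covers this many
    return math.ceil(blocks / blocks_per_group)


blocks = blocks_for_device(200000)
print(blocks, group_count(blocks))
```

Under these assumptions the target is a very small filesystem with only two block groups, which may be relevant when checking how meta_bg places the group 0 bitmaps relative to the descriptors.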
The OST (vm3) console log shows:
[37266.504045] Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-O ^resize_inode,meta_bg -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params
After the conf-sanity test 99 failure, conf-sanity test_100, 101, 102, 104, 105, 106, 107, 108b, 109a, 109b, and 122 all fail with:
Kernel error detected: [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
conf-sanity test_100: @@@@@@ FAIL: Error in dmesg detected
In many of the test sessions where conf-sanity test 99 fails, racer test 1 and replay-single tests 0a, 0b, and 0c also fail with corrupted group descriptors.
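To triage other sessions for the same signature, a console log can be scanned for the two LDISKFS-fs messages quoted above. This is a minimal sketch only: the function name and the pattern list are hypothetical, cover just the messages in this report, and are not the check the test framework itself uses for "Error in dmesg detected".

```python
import re

# Hypothetical signature list: only the two kernel messages quoted in this report.
SIGNATURES = [
    re.compile(r"LDISKFS-fs \((?P<dev>[^)]+)\): group descriptors corrupted!"),
    re.compile(
        r"LDISKFS-fs \((?P<dev>[^)]+)\): ldiskfs_check_descriptors: "
        r".*overlaps block group descriptors"
    ),
]
TIMESTAMP = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]")


def find_ldiskfs_errors(lines):
    """Return (timestamp, device, message) for each line matching a signature."""
    hits = []
    for line in lines:
        for sig in SIGNATURES:
            match = sig.search(line)
            if match:
                ts = TIMESTAMP.match(line)
                stamp = float(ts.group("ts")) if ts else None
                hits.append((stamp, match.group("dev"), line.strip()))
                break  # one hit per line is enough
    return hits


console = [
    "[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: "
    "Block bitmap for group 0 overlaps block group descriptors",
    "[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!",
    "[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_99",
]
for stamp, dev, msg in find_ldiskfs_errors(console):
    print(stamp, dev, msg)
```

Because the later conf-sanity failures report the same timestamp (37266.683428) as the test 99 failure, comparing timestamps across hits may help separate a single stale dmesg entry from fresh corruption events.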
conf-sanity test 99 started failing with this error on 2018-09-01, and we have seen it several times in SUSE 12 SP3 client and server testing. Some failures are at:
https://testing.whamcloud.com/test_sets/561cf376-ae44-11e8-bd05-52540065bddc
https://testing.whamcloud.com/test_sets/58fb8682-b09e-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/d657ba78-b63a-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/76f11f72-b921-11e8-9df3-52540065bddc