[LU-11395] conf-sanity test 99 fails with 'add ost1 failed with new params - group descriptors corrupted!' Created: 18/Sep/18 Updated: 19/Dec/18 Resolved: 26/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | sles12, suse | ||
| Environment: |
SUSE12 SP3 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
conf-sanity test_99 started failing on 2018-09-01 for SLES12 SP3 testing. For the failure at https://testing.whamcloud.com/test_sets/dfe70e50-ba62-11e8-a7de-52540065bddc, we see the following in the test_log:
== conf-sanity test 99: Adding meta_bg option ======================================================== 11:07:52 (1537121272)
CMD: trevis-9vm3 /usr/sbin/lctl get_param -n version 2>/dev/null ||
/usr/sbin/lctl lustre_build_version 2>/dev/null ||
/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-9vm3 debugfs -c -R stats /dev/mapper/ost1_flakey
trevis-9vm3: debugfs 1.42.13.wc6 (05-Feb-2017)
trevis-9vm3: /dev/mapper/ost1_flakey: catastrophic mode - not reading inode or group bitmaps
params: --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
CMD: trevis-9vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts || true
CMD: trevis-9vm3 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: trevis-9vm3 mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
trevis-9vm3: mkfs.lustre: Unable to mount /dev/mapper/ost1_flakey: Structure needs cleaning
trevis-9vm3:
trevis-9vm3: mkfs.lustre FATAL: failed to write local files
trevis-9vm3: mkfs.lustre: exiting with 117 (Structure needs cleaning)
Permanent disk data:
Target: lustre:OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x62
(OST first_time update )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.9.4.97@tcp sys.timeout=20
device size = 2048MB
formatting backing filesystem ldiskfs on /dev/mapper/ost1_flakey
target name lustre:OST0000
4k blocks 50000
options -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:OST0000 -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F /dev/mapper/ost1_flakey 50000
conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params
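The exit status 117 reported by mkfs.lustre in the log above corresponds to the Linux errno EUCLEAN, whose message is "Structure needs cleaning". The following is a minimal sketch for decoding such a status. It assumes a Linux host, since errno numbering is platform-specific, and that the tool exits with a raw errno value, as the log output suggests. The helper name `describe_exit_status` is hypothetical and not part of Lustre.

```python
import errno
import os


def describe_exit_status(status: int) -> str:
    """Return 'NAME (message)' for an exit status that mirrors an errno value.

    Hypothetical helper for illustration only. It assumes the exit status
    is a raw errno number, which is an assumption based on the log text
    'exiting with 117 (Structure needs cleaning)'.
    """
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(status, "UNKNOWN")
    # os.strerror returns the platform's message for the error number.
    return f"{name} ({os.strerror(status)})"


if __name__ == "__main__":
    # 117 is the status reported by mkfs.lustre in the test log.
    print(describe_exit_status(117))
```

On a typical Linux system this should report EUCLEAN, the same error the mount attempt returned after the ldiskfs descriptor check failed.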
In the OST (vm3) console log, we see:
[37266.504045] Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-O ^resize_inode,meta_bg -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params
After the conf-sanity test 99 failure, conf-sanity test_100, 101, 102, 104, 105, 106, 107, 108b, 109a, 109b, and 122 all fail with:
Kernel error detected: [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
conf-sanity test_100: @@@@@@ FAIL: Error in dmesg detected
In many of the test sessions where conf-sanity test 99 fails, racer test 1 and replay-single tests 0a, 0b, and 0c also fail with corrupted group descriptors. conf-sanity test 99 started failing with this error on 2018-09-01, and we have seen it several times in SUSE 12 SP3 client and server testing. Some failures are at: |
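The LDISKFS-fs kernel messages quoted in the console log above can be pulled out of a saved log with a short script, which helps when checking whether later test failures repeat the same dmesg error. This is a minimal sketch, not part of the Lustre test framework. It assumes one kernel message per line, and the names `LDISKFS_LINE`, `find_ldiskfs_messages`, and `SAMPLE` are hypothetical. The sample lines are copied from the console log in this ticket.

```python
import re

# Matches kernel console lines such as:
# [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
LDISKFS_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LDISKFS-fs \((?P<dev>[^)]+)\): (?P<msg>.*)"
)


def find_ldiskfs_messages(console_log: str):
    """Return (timestamp, device, message) tuples for LDISKFS-fs lines.

    Assumes one kernel message per line of the input text.
    """
    hits = []
    for line in console_log.splitlines():
        m = LDISKFS_LINE.search(line)
        if m:
            hits.append(
                (float(m.group("ts")), m.group("dev"), m.group("msg").strip())
            )
    return hits


# Lines copied from the OST console log quoted in this ticket.
SAMPLE = """\
[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params
"""

if __name__ == "__main__":
    for ts, dev, msg in find_ldiskfs_messages(SAMPLE):
        print(f"{ts:.6f} {dev}: {msg}")
```

Only the two LDISKFS-fs lines are returned. The DEBUG MARKER line is skipped because it does not carry the LDISKFS-fs prefix.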
| Comments |
| Comment by James A Simmons [ 18/Sep/18 ] |
|
I wonder if https://review.whamcloud.com/#/c/33144 would help here? |
| Comment by Oleg Drokin [ 19/Sep/18 ] |
|
No, that Ubuntu patch is not used for SLES builds, unless you mean there might be a similar problem there? |
| Comment by Peter Jones [ 19/Sep/18 ] |
|
Jian, this seems to line up with when the kernel update landed. Could you please investigate? Thanks, Peter |
| Comment by Jian Yu [ 23/Sep/18 ] |
|
After updating the SLES12 SP3 kernel to version 4.4.155-94.50.1, conf-sanity test 99 passed: I'm investigating the conf-sanity test 103 and sanity-sec test failures and will resolve the issues in the patch for |
| Comment by Jian Yu [ 26/Sep/18 ] |
|
It turns out conf-sanity test 103 is a known issue on the master branch and is tracked in |