[LU-11395] conf-sanity test 99 fails with 'add ost1 failed with new params - group descriptors corrupted!' Created: 18/Sep/18  Updated: 19/Dec/18  Resolved: 26/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: sles12, suse
Environment:

SUSE12 SP3


Issue Links:
Duplicate
duplicates LU-11412 kernel update [SLES12 SP3 4.4.155-94.... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_99 started failing on 2018-09-01 in SLES12 SP3 testing. For the failure at https://testing.whamcloud.com/test_sets/dfe70e50-ba62-11e8-a7de-52540065bddc, the test_log shows:

== conf-sanity test 99: Adding meta_bg option ======================================================== 11:07:52 (1537121272)
CMD: trevis-9vm3 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-9vm3 debugfs -c -R stats /dev/mapper/ost1_flakey
trevis-9vm3: debugfs 1.42.13.wc6 (05-Feb-2017)
trevis-9vm3: /dev/mapper/ost1_flakey: catastrophic mode - not reading inode or group bitmaps
params: --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost 			--index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000  		--mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" 		--reformat /dev/mapper/ost1_flakey 
CMD: trevis-9vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts || true
CMD: trevis-9vm3 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: trevis-9vm3 mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
trevis-9vm3: mkfs.lustre: Unable to mount /dev/mapper/ost1_flakey: Structure needs cleaning
trevis-9vm3: 
trevis-9vm3: mkfs.lustre FATAL: failed to write local files
trevis-9vm3: mkfs.lustre: exiting with 117 (Structure needs cleaning)

   Permanent disk data:
Target:     lustre:OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x62
              (OST first_time update )
Persistent mount opts: ,errors=remount-ro
Parameters: mgsnode=10.9.4.97@tcp sys.timeout=20

device size = 2048MB
formatting backing filesystem ldiskfs on /dev/mapper/ost1_flakey
	target name   lustre:OST0000
	4k blocks     50000
	options         -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:OST0000   -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F /dev/mapper/ost1_flakey 50000
 conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params 
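
For reference, the exit status in the log above can be pulled out and compared with the local errno table. This is only a sketch: parse_exit and EXIT_RE are hypothetical helper names, not part of the Lustre test framework, and the symbolic name is looked up on whatever platform runs the snippet (on Linux, 117 is expected to map to EUCLEAN).

```python
import errno
import os
import re

# Line copied from the test_log above.
LOG_LINE = "trevis-9vm3: mkfs.lustre: exiting with 117 (Structure needs cleaning)"

# Hypothetical pattern for the "exiting with N (text)" suffix.
EXIT_RE = re.compile(r"exiting with (\d+) \((.+)\)\s*$")


def parse_exit(line):
    """Return (code, message) from an 'exiting with N (text)' line, or None."""
    m = EXIT_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


code, message = parse_exit(LOG_LINE)
# errno.errorcode maps numbers to symbolic names for the local platform only;
# other systems may not define 117 at all.
symbol = errno.errorcode.get(code, "<not defined on this platform>")
print(code, message, symbol, os.strerror(code) == message)
```

So mkfs.lustre is passing the errno from the failed mount straight through as its exit status, which is consistent with the "Unable to mount" line just before it.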

In the OST (vm3) console log, we see:

[37266.504045] Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-O ^resize_inode,meta_bg -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params 
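
As a sanity check on the size of the filesystem involved, the arithmetic below reproduces the "4k blocks 50000" figure from the test_log and estimates the number of block groups. It is a rough sketch under two assumptions: --device-size is given in KiB, and the filesystem uses the usual ext4 default of 8 * block_size blocks per group. The geometry function is a hypothetical helper, and this says nothing about where mke2fs actually placed the descriptors or bitmaps.

```python
import math

DEVICE_SIZE_KIB = 200000  # --device-size=200000 (assumed to be KiB)
BLOCK_SIZE = 4096         # mke2fs -b 4096, from mkfs_cmd in the test_log
# Assumption: one block bitmap block per group, so 8 * block_size blocks per group.
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE


def geometry(size_kib, block_size, blocks_per_group):
    """Return (total blocks, number of block groups) for the given size."""
    blocks = size_kib * 1024 // block_size
    groups = math.ceil(blocks / blocks_per_group)
    return blocks, groups


blocks, groups = geometry(DEVICE_SIZE_KIB, BLOCK_SIZE, BLOCKS_PER_GROUP)
print(blocks, groups)
```

Under these assumptions the target is a very small filesystem with only two block groups, so the group 0 complaint from ldiskfs_check_descriptors concerns the very first group of a freshly formatted device.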

After the conf-sanity test 99 failure, conf-sanity test_100, 101, 102, 104, 105, 106, 107, 108b, 109a, 109b, and 122 all fail with:

Kernel error detected: [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
 conf-sanity test_100: @@@@@@ FAIL: Error in dmesg detected 
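
Note that the kernel message reported for test_100 carries the same timestamp as the one logged during test 99. The sketch below compares that timestamp against the test_99 FAIL marker from the console log to show that the error was logged before test 99 finished. parse_kmsg, marker_time, and is_stale are hypothetical helpers for illustration only and are not how the test framework itself checks dmesg.

```python
import re

# Lines copied from the OST console log and the test_100 failure above.
CONSOLE_LOG = [
    "[37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: "
    "Block bitmap for group 0 overlaps block group descriptors",
    "[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!",
    "[37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  "
    "conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params",
]
REPORTED_BY_TEST_100 = "[37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!"

KMSG_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_kmsg(line):
    """Split a '[seconds.micros] text' kernel log line into (time, text)."""
    m = KMSG_RE.match(line)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


def marker_time(lines, test_name):
    """Timestamp of the first DEBUG MARKER line mentioning test_name, or None."""
    for line in lines:
        parsed = parse_kmsg(line)
        if parsed and "DEBUG MARKER" in parsed[1] and test_name in parsed[1]:
            return parsed[0]
    return None


def is_stale(reported_line, lines, earlier_test):
    """True if the reported error was logged before earlier_test's marker."""
    reported = parse_kmsg(reported_line)
    marker = marker_time(lines, earlier_test)
    if reported is None or marker is None:
        return False
    return reported[0] < marker


print(is_stale(REPORTED_BY_TEST_100, CONSOLE_LOG, "test_99"))
```

One reading of this is that the later conf-sanity failures are fallout from the test 99 error still being present in the kernel log, rather than new corruption in each test.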

In many of the test sessions where conf-sanity test 99 fails, racer test 1 and replay-single tests 0a, 0b, and 0c also fail with corrupted group descriptors.

conf-sanity test 99 started failing with this error on 2018-09-01, and we have seen it several times in SUSE 12 SP3 client and server testing. Some failures are at:
https://testing.whamcloud.com/test_sets/561cf376-ae44-11e8-bd05-52540065bddc
https://testing.whamcloud.com/test_sets/58fb8682-b09e-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/d657ba78-b63a-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/76f11f72-b921-11e8-9df3-52540065bddc



 Comments   
Comment by James A Simmons [ 18/Sep/18 ]

I wonder if https://review.whamcloud.com/#/c/33144 would help here?

Comment by Oleg Drokin [ 19/Sep/18 ]

No, that Ubuntu patch is not used for SLES builds. Or do you mean there might be a similar problem there?

Comment by Peter Jones [ 19/Sep/18 ]

Jian

This seems to line up with when the kernel update landed. Could you please investigate?

Thanks

Peter

Comment by Jian Yu [ 23/Sep/18 ]

After updating the SLES12 SP3 kernel to version 4.4.155-94.50.1, conf-sanity test 99 passed:
https://testing.whamcloud.com/sub_tests/c3fce2da-be5b-11e8-a9d9-52540065bddc

I'm investigating the conf-sanity test 103 and sanity-sec test failures and will resolve the issues in the patch for LU-11412.

Comment by Jian Yu [ 26/Sep/18 ]

It turns out the conf-sanity test 103 failure is a known issue on the master branch and is tracked in LU-11196.
The conf-sanity test 99 failure will be fixed in LU-11412.
