
conf-sanity test 99 fails with 'add ost1 failed with new params - group descriptors corrupted!'


Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.12.0
    • SUSE12 SP3
    • 3

    Description

      conf-sanity test_99 started failing on 2018-09-01 in SLES12 SP3 testing. For the failure at https://testing.whamcloud.com/test_sets/dfe70e50-ba62-11e8-a7de-52540065bddc, the test_log shows:

      == conf-sanity test 99: Adding meta_bg option ======================================================== 11:07:52 (1537121272)
      CMD: trevis-9vm3 /usr/sbin/lctl get_param -n version 2>/dev/null ||
      				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      CMD: trevis-9vm3 debugfs -c -R stats /dev/mapper/ost1_flakey
      trevis-9vm3: debugfs 1.42.13.wc6 (05-Feb-2017)
      trevis-9vm3: /dev/mapper/ost1_flakey: catastrophic mode - not reading inode or group bitmaps
      params: --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost 			--index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000  		--mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" 		--reformat /dev/mapper/ost1_flakey 
      CMD: trevis-9vm3 grep -c /mnt/lustre-ost1' ' /proc/mounts || true
      CMD: trevis-9vm3 lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      CMD: trevis-9vm3 mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions=\"-O ^resize_inode,meta_bg -E lazy_itable_init\" --reformat /dev/mapper/ost1_flakey
      trevis-9vm3: mkfs.lustre: Unable to mount /dev/mapper/ost1_flakey: Structure needs cleaning
      trevis-9vm3: 
      trevis-9vm3: mkfs.lustre FATAL: failed to write local files
      trevis-9vm3: mkfs.lustre: exiting with 117 (Structure needs cleaning)
      
         Permanent disk data:
      Target:     lustre:OST0000
      Index:      0
      Lustre FS:  lustre
      Mount type: ldiskfs
      Flags:      0x62
                    (OST first_time update )
      Persistent mount opts: ,errors=remount-ro
      Parameters: mgsnode=10.9.4.97@tcp sys.timeout=20
      
      device size = 2048MB
      formatting backing filesystem ldiskfs on /dev/mapper/ost1_flakey
      	target name   lustre:OST0000
      	4k blocks     50000
      	options         -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F
      mkfs_cmd = mke2fs -j -b 4096 -L lustre:OST0000   -I 512 -q -O ^resize_inode,meta_bg,extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,lazy_journal_init -F /dev/mapper/ost1_flakey 50000
       conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params 
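
A quick sanity check on the sizes in the log above: the "4k blocks 50000" figure is consistent with --device-size=200000 if that value is read as KiB, which is an assumption here. It also means only about 195 MiB of the 2048MB device is formatted. The helper name kib_to_blocks is hypothetical and used only for illustration.

```shell
# Convert a mkfs.lustre --device-size value (assumed to be in KiB) into
# a count of filesystem blocks of the given block size in bytes.
kib_to_blocks() {
    size_kib=$1
    block_bytes=${2:-4096}   # default to the 4096-byte blocks used in the log
    echo $(( size_kib * 1024 / block_bytes ))
}

kib_to_blocks 200000 4096   # prints 50000, matching the mke2fs block count
```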
      

      In the OST (vm3) console log, we see:

      [37266.504045] Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=trevis-9vm4@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=ldiskfs --device-size=200000 --mkfsoptions="-O ^resize_inode,meta_bg -E lazy_itable_init" --reformat /dev/mapper/ost1_flakey
      [37266.683422] LDISKFS-fs (dm-11): ldiskfs_check_descriptors: Block bitmap for group 0 overlaps block group descriptors
      [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
      [37266.789545] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_99: @@@@@@ FAIL: add ost1 failed with new params 
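
To check other sessions for the same signature, a saved console log or dmesg dump can be filtered for the two LDISKFS-fs messages shown above. This is only a minimal sketch: the function name and the log file path are hypothetical, and the pattern is taken from the messages in this report rather than from any Lustre tool.

```shell
# Print LDISKFS-fs lines that report group descriptor problems.
# $1 is the path to a saved console log or dmesg dump (hypothetical).
find_ldiskfs_descriptor_errors() {
    grep -E 'LDISKFS-fs \([^)]*\): (ldiskfs_check_descriptors:|group descriptors corrupted!)' "$1"
}

# Example (path is an assumption):
#   find_ldiskfs_descriptor_errors ./ost-console.log
```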
      

      After the conf-sanity test 99 failure, conf-sanity test_100, 101, 102, 104, 105, 106, 107, 108b, 109a, 109b, and 122 all fail because the same kernel message (note the identical timestamp) is detected in dmesg:

      Kernel error detected: [37266.683428] LDISKFS-fs (dm-11): group descriptors corrupted!
       conf-sanity test_100: @@@@@@ FAIL: Error in dmesg detected 
      

      In many of the test sessions where conf-sanity test 99 fails, racer test 1 and replay-single tests 0a, 0b, and 0c also fail with corrupted group descriptors.

      conf-sanity test 99 started failing with this error on 2018-09-01, and we have seen it several times in SUSE 12 SP3 client and server testing. Some failures are at:
      https://testing.whamcloud.com/test_sets/561cf376-ae44-11e8-bd05-52540065bddc
      https://testing.whamcloud.com/test_sets/58fb8682-b09e-11e8-80f7-52540065bddc
      https://testing.whamcloud.com/test_sets/d657ba78-b63a-11e8-b86b-52540065bddc
      https://testing.whamcloud.com/test_sets/76f11f72-b921-11e8-9df3-52540065bddc

            People

              yujian Jian Yu
              jamesanunez James Nunez (Inactive)