Lustre / LU-11915

conf-sanity test 115 is skipped or hangs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Labels: DNE
    • Severity: 3

    Description

      conf-sanity test_115 is run only for ldiskfs MDS file systems and is skipped for ZFS. Looking back over the past couple of weeks, this test was consistently skipped, and in the past few days it has started to hang.
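
A guard of roughly this shape is what restricts a test to ldiskfs. This is a minimal sketch, not the actual conf-sanity.sh code: `facet_fstype` and the skip reporting are stubbed here so the sketch runs standalone, and the exact wording in test_115 may differ.

```shell
#!/bin/bash
# Stub of the test-framework.sh fstype helper, so this sketch is self-contained.
# The real helper inspects the configured backing filesystem of the facet.
facet_fstype() { echo "${FSTYPE:-ldiskfs}"; }

run_test_115() {
    # Guard: test_115 exercises ldiskfs-specific mkfs options, so it is
    # skipped outright for any other backing filesystem (e.g. ZFS).
    if [ "$(facet_fstype mds1)" != ldiskfs ]; then
        echo "SKIP: ldiskfs only test"
        return 0
    fi
    echo "running test_115 body"
}

skipped=$(FSTYPE=zfs run_test_115)   # ZFS backing fs: test is skipped
ran=$(run_test_115)                  # default (ldiskfs): test body runs
```
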

      For some reason, the test is skipped when the formatting of MDS1 fails:

      local mds_opts="$(mkfs_opts mds1 ${mdsdev}) --device-size=$IMAGESIZE   \
              --mkfsoptions='-O lazy_itable_init,ea_inode,^resize_inode,meta_bg \
              -i 1024'"
      add mds1 $mds_opts --mgs --reformat $mdsdev ||
              { skip_env "format large MDT failed"; return 0; }
      

      Shouldn’t this be an error?
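
For comparison, here is a minimal sketch of what changes if the failure is reported as an error rather than an environment skip. The helpers below are stubs, not the real test-framework.sh implementations:

```shell
#!/bin/bash
# Stubs standing in for the test-framework.sh reporting helpers.
skip_env() { echo "SKIP: $*"; }   # test counted as skipped; the run still passes
error()    { echo "FAIL: $*"; }   # test counted as failed; the run goes red

format_large_mdt() { return 1; }  # stand-in for the failing "add mds1 ..." step

# Current behaviour: a formatting failure is silently downgraded to a skip.
as_skip=$(format_large_mdt || skip_env "format large MDT failed")

# Alternative: the same failure surfaces as a test error.
as_error=$(format_large_mdt || error "format large MDT failed")
```
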

      Starting on January 30, 2019, conf-sanity test 115 started hanging, but only for review-dne-part-3 test sessions. Looking at the logs from a recent hang, https://testing.whamcloud.com/test_sets/d49db868-2610-11e9-8486-52540065bddc , the last thing seen in the client test_log is

      == conf-sanity test 115: Access large xattr with inodes number over 2TB ============================== 09:51:24 (1549014684)
      Stopping clients: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 /mnt/lustre (opts:)
      CMD: onyx-34vm6.onyx.whamcloud.com,onyx-34vm7 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
      if [ \$running -ne 0 ] ; then
      echo Stopping client \$(hostname) /mnt/lustre opts:;
      lsof /mnt/lustre || need_kill=no;
      if [ x != x -a x\$need_kill != xno ]; then
          pids=\$(lsof -t /mnt/lustre | sort -u);
          if [ -n \"\$pids\" ]; then
                   kill -9 \$pids;
          fi
      fi;
      while umount  /mnt/lustre 2>&1 | grep -q busy; do
          echo /mnt/lustre is still busy, wait one second && sleep 1;
      done;
      fi
      

      The console logs don’t have much information about this test; there are no errors, LBUGs, etc.

      There were two new tests added to conf-sanity that run right before test 115: conf-sanity tests 110 and 111, added by https://review.whamcloud.com/22009 . Maybe there is some residual effect from these tests running in a DNE environment.

      Logs for other hangs are at
      https://testing.whamcloud.com/test_sets/d49db868-2610-11e9-8486-52540065bddc
      https://testing.whamcloud.com/test_sets/8d6ec5d2-25ab-11e9-a318-52540065bddc

      Logs for skipping this test are at
      https://testing.whamcloud.com/test_sets/bb4dbd90-25a7-11e9-b97f-52540065bddc
      https://testing.whamcloud.com/test_sets/1adf3cf6-2590-11e9-b54c-52540065bddc


          Activity

            [LU-11915] conf-sanity test 115 is skipped or hangs

            Artem Blagodarenko (Inactive) added a comment:

            bzzz, despite the fact that the test "requires" 3072 GiB, it actually expects sparse files to be used. Here is the disk usage on my local VM just after the test passed successfully:

            [root@CO82 lustre-wc]# ls -ls /tmp/lustre-*
             340 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-mdt1
            4256 -rw-r--r-- 1 root root 204800000 Dec 13 16:04 /tmp/lustre-ost1
            4268 -rw-r--r-- 1 root root 409600000 Dec 13 16:04 /tmp/lustre-ost2
            
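
The sparse-file behaviour described above can be demonstrated with GNU coreutils alone. The file below is a throwaway temp file; the 3 TiB figure mirrors the test's nominal requirement:

```shell
#!/bin/bash
# Create a sparse file with a 3 TiB apparent size. Almost no blocks are
# allocated until data is actually written, which is why the test can
# "require" 3072 GiB yet fit comfortably in a small /tmp.
f=$(mktemp)
truncate -s 3T "$f"
apparent_bytes=$(stat -c %s "$f")   # reported file size in bytes
alloc_blocks=$(stat -c %b "$f")     # 512-byte blocks actually allocated
rm -f "$f"
```
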

            Alex Zhuravlev added a comment:

            artem_blagodarenko, I can't; this is a local setup (with a small /tmp). I think you can reproduce this easily in a VM with a limited /tmp.


            Artem Blagodarenko (Inactive) added a comment:

            bzzz, could you please share the session URL so I can get more details? Thanks.


            Alex Zhuravlev added a comment:

            The last patch breaks local testing:

             SKIP: conf-sanity test_115 format large MDT failed
            ./../tests/test-framework.sh: line 5894: cannot create temp file for here-document: No space left on device
            cannot run remote command on /mnt/build/lustre/tests/../utils/lctl with no_dsh
            ./../tests/test-framework.sh: line 6391: echo: write error: No space left on device
            ./../tests/test-framework.sh: line 6395: echo: write error: No space left on device
            Stopping clients: tmp.pMDa8L24t2 /mnt/lustre (opts:)
            Stopping clients: tmp.pMDa8L24t2 /mnt/lustre2 (opts:)
            SKIP 115 (38s)
            tee: /tmp/conf-sanity.log: No space left on device
            
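
A pre-flight free-space check would let the test skip cleanly before /tmp fills up, instead of failing mid-run with "No space left on device". This is illustrative only; none of these names come from test-framework.sh, and the 400 MB figure is a placeholder, not a measured requirement:

```shell
#!/bin/bash
# Space the non-sparse parts of the backing images are expected to need
# (illustrative figure, in KB).
need_kb=$((400 * 1024))

# Available space in the directory holding the backing files (GNU df).
avail_kb=$(df -k --output=avail /tmp | tail -n 1 | tr -d ' ')

if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "SKIP: not enough space in /tmp (${avail_kb}KB < ${need_kb}KB)"
fi
```
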

            Gerrit Updater added a comment:

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38849/
            Subject: LU-11915 tests: fix conf-sanity 115 test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ef13d5464ec7c91c8479ef3e987732dc6355d5ee


            Artem Blagodarenko (Inactive) added a comment:

            With the parameter "FLAKEY=false", the error "trevis-54vm3: The target service's index is already in use (/dev/mapper/ost1_flakey)" is gone. With some fixes, the test passed in my local test environment.


            Gerrit Updater added a comment:

            Artem Blagodarenko (artem.blagodarenko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38849
            Subject: LU-11915 tests: fix conf-sanity 115 test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 76c4c3f42b655db45a632f3608c2607e25a90f29


            Gerrit Updater added a comment:

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37548/
            Subject: LU-11915 tests: add debugging to conf-sanity test_115
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 037abde0c33f4fff49c237e8588a45cc3da21c59


            Andreas Dilger added a comment:

            I verified that the filesystems are newly formatted, and none of the modules are loaded during the test:
            https://testing.whamcloud.com/test_sets/4244009c-4d94-11ea-b69a-52540065bddc

            client: 
            opening /dev/obd failed: No such file or directory
            hint: the kernel modules may not be loaded
            mds1: 
            CMD: trevis-27vm12 lctl dl; mount | grep lustre
            ost1: 
            CMD: trevis-27vm3 lctl dl; mount | grep lustre
            

            So it isn't at all clear to me why it considers the device already in use.


            Andreas Dilger added a comment:

            Looking further at the logs, I see:

            mkfs.lustre --mgs --fsname=lustre --mdt --index=0 ...
               Permanent disk data:
            Target:     lustre:MDT0000
            Mount type: ldiskfs
            Flags:      0x65
                          (MDT MGS first_time update)
            

            So this is actually a combined MGS+MDS that is just newly formatted.

            I've pushed a debug patch to see if there is something wrong with the config (left-over MGS or MDS device mounted).

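
A check along these lines could confirm there is no left-over MGS or MDS state before reformatting. This is a hypothetical helper, not part of test-framework.sh; the `lctl dl` and `mount` outputs are passed in as strings so the sketch is testable without a live Lustre setup:

```shell
#!/bin/bash
# Return 1 if either "lctl dl" shows registered obd devices or "mount"
# shows a mounted Lustre target; return 0 when the node is clean.
check_no_leftover_lustre() {
    local lctl_out="$1" mount_out="$2"
    if [ -n "$lctl_out" ]; then
        echo "left-over obd device(s): $lctl_out"
        return 1
    fi
    if printf '%s\n' "$mount_out" | grep -q ' type lustre'; then
        echo "left-over Lustre mount detected"
        return 1
    fi
    return 0
}

# Clean node: no obd devices registered, only non-Lustre mounts.
check_no_leftover_lustre "" "/dev/vda1 on / type ext4 (rw)" && clean=yes

# Dirty node: a stale MGS is still registered with the obd layer.
check_no_leftover_lustre "0 UP mgs MGS MGS 5" "" || dirty=yes
```
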

            People

              Assignee: Artem Blagodarenko (ablagodarenko)
              Reporter: James Nunez (jamesanunez, Inactive)
              Votes: 0
              Watchers: 6
