Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11774

replay-single test 0c fails with ‘mount fails’

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.0, Lustre 2.12.1, Lustre 2.14.0, Lustre 2.12.6
    • None
    • 3
    • 9223372036854775807

    Description

      replay-singe test_0c fails with ‘mount fails’.

      Looking at a recent failure, https://testing.whamcloud.com/test_sets/5532dea2-fd87-11e8-93ea-52540065bddc , we can see in the client test_log that there is a problem unmounting the file system because it is busy and then failover MDS1 (vm9)

      CMD: trevis-9vm9 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      umount: /mnt/lustre: target is busy
              (In some cases useful info about processes that
               use the device is found by lsof(8) or fuser(1).)
      Failing mds1 on trevis-9vm9
      

      Then we see that the file system fails to mount because it is already mouted

      CMD: trevis-9vm6 mkdir -p /mnt/lustre
      CMD: trevis-9vm6 mount -t lustre -o user_xattr,flock trevis-9vm9@tcp:/lustre /mnt/lustre
      mount.lustre: according to /etc/mtab trevis-9vm9@tcp:/lustre is already mounted on /mnt/lustre
       replay-single test_0c: @@@@@@ FAIL: mount fails 
      

      It’s possible that the file system never unmounted.

      replay-single test_0d always fails after 0c and in the same way.

      Looking over all branches for the past year, this test started to fail on July 31, 2018.

      Logs for past failures are at
      https://testing.whamcloud.com/test_sets/d02535e4-958e-11e8-8ee3-52540065bddc
      https://testing.whamcloud.com/test_sets/ab6bf330-9838-11e8-b0aa-52540065bddc
      https://testing.whamcloud.com/test_sets/26d280e4-a35f-11e8-8853-52540065bddc
      https://testing.whamcloud.com/test_sets/9438cbfa-dee9-11e8-89f8-52540065bddc

      aarch64
      https://testing.whamcloud.com/test_sets/6d97765e-ed94-11e8-815b-52540065bddc

      ppc64
      https://testing.whamcloud.com/test_sets/925cab5a-a783-11e8-80f7-52540065bddc
      https://testing.whamcloud.com/test_sets/33d0ef58-fb55-11e8-8a18-52540065bddc

      Suse 12
      https://testing.whamcloud.com/test_sets/3c98eace-9a8d-11e8-8ee3-52540065bddc

      Attachments

        Issue Links

          Activity

            [LU-11774] replay-single test 0c fails with ‘mount fails’
            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/fc1418b8-b684-4a7e-94e5-79c70f711f2b
            yujian Jian Yu added a comment - +1 on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/124eefae-c997-11e9-a25b-52540065bddc

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33904
            Subject: LU-11774 test: wait to finish in racer's file_exec
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3197ce11538db8953be09ef940ce0cf5f0cda1a3

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33904 Subject: LU-11774 test: wait to finish in racer's file_exec Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 3197ce11538db8953be09ef940ce0cf5f0cda1a3

            As per the retry count of this running process "cp", it should the process in the previous test "racer" and could be in the
            test script "file_exec.sh".

            while /bin/true ; do
                    file=$((RANDOM % MAX))
                    cp -p $PROG $DIR/$file > /dev/null 2>&1
                    $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
                    sleep $((RANDOM % 3))
            done 2>&1 | egrep -v "Segmentation fault|Bus error"
            
            hongchao.zhang Hongchao Zhang added a comment - As per the retry count of this running process "cp", it should the process in the previous test "racer" and could be in the test script "file_exec.sh". while /bin/true ; do file=$((RANDOM % MAX)) cp -p $PROG $DIR/$file > /dev/null 2>&1 $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null sleep $((RANDOM % 3)) done 2>&1 | egrep -v "Segmentation fault|Bus error"

            As per the logs, there is a running process "cp" in the background for a long time, which affect the mount/umount

            00020000:00200000:1.0:1533057091.189815:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 878, mirror count: 2
            00020000:00200000:0.0:1533057092.190236:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
            00020000:00200000:0.0:1533057092.194654:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 879, mirror count: 2
            00020000:00200000:0.0:1533057093.194224:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
            ...
            00020000:00200000:0.0:1533057124.268068:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 911, mirror count: 2
            00020000:00200000:0.0:1533057125.267217:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
            00020000:00200000:0.0:1533057125.273560:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 912, mirror count: 2
            00020000:00200000:0.0:1533057126.273228:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
            
            hongchao.zhang Hongchao Zhang added a comment - As per the logs, there is a running process "cp" in the background for a long time, which affect the mount/umount 00020000:00200000:1.0:1533057091.189815:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 878, mirror count: 2 00020000:00200000:0.0:1533057092.190236:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO 00020000:00200000:0.0:1533057092.194654:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 879, mirror count: 2 00020000:00200000:0.0:1533057093.194224:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO ... 00020000:00200000:0.0:1533057124.268068:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 911, mirror count: 2 00020000:00200000:0.0:1533057125.267217:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO 00020000:00200000:0.0:1533057125.273560:0:12598:0:(lov_io.c:435:lov_io_mirror_init()) [0x200000401:0x218:0x0]: flr state: 2, move mirror from 0 to 0, have retried: 912, mirror count: 2 00020000:00200000:0.0:1533057126.273228:0:12598:0:(lov_io.c:454:lov_io_mirror_init()) use non-delayed RPC state for this IO
            pjones Peter Jones added a comment -

            Hongchao

            Could you please assess this issue?

            Thanks

            Peter

            pjones Peter Jones added a comment - Hongchao Could you please assess this issue? Thanks Peter

            People

              hongchao.zhang Hongchao Zhang
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: