Lustre / LU-14992

replay-vbr test 7a fails with 'Test 7a.2 failed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Versions: Lustre 2.16.0, Lustre 2.15.0
    • Labels: DNE
    • Severity: 3

    Description

      replay-vbr test_7a started failing with the error message 'Test 7a.2 failed' on 03 AUG 2021 and fails 100% of the time for the full-dne-part-2 test sessions. For Lustre 2.14.53 build #4206, this test does NOT fail. For Lustre 2.14.53.7 build #4207, this test fails 100% of the time for DNE test sessions.

      Looking at a recent failure at https://testing.whamcloud.com/test_sets/9b87da49-9024-48ba-91b9-e5d006b73d65, we see the following in the suite_log:

      test_7a.2 first: createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      CMD: trevis-68vm5.trevis.whamcloud.com createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      total: 1 open/close in 0.00 seconds: 284.40 ops/second
      test_7a.2 lost: rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      test_7a.2 last: mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm5.trevis.whamcloud.com mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 grep -c /mnt/lustre2' ' /proc/mounts
      Stopping client trevis-68vm6 /mnt/lustre2 (opts:)
      CMD: trevis-68vm6 lsof -t /mnt/lustre2
      pdsh@trevis-68vm5: trevis-68vm6: ssh exited with exit code 1
      CMD: trevis-68vm6 umount  /mnt/lustre2 2>&1
      Failing mds1 on trevis-68vm8
      CMD: trevis-68vm8 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      Stopping /mnt/lustre-mds1 (opts:) on trevis-68vm8
      CMD: trevis-68vm8 umount -d /mnt/lustre-mds1
      CMD: trevis-68vm8 lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      reboot facets: mds1
      Failover mds1 to trevis-68vm8
      CMD: trevis-68vm8 hostname
      mount facets: mds1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey 2>&1
      CMD: trevis-68vm8 dmsetup table /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup load /dev/mapper/mds1_flakey --table "0 4194304 linear 252:0 0"
      CMD: trevis-68vm8 dmsetup resume /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 test -b /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey
      Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 /usr/sbin/lctl get_param -n health_check
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug "vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck" "all" 4 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      pdsh@trevis-68vm5: trevis-68vm8: ssh exited with exit code 1
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 2>/dev/null
      Started lustre-MDT0000
      CMD: trevis-68vm5.trevis.whamcloud.com lctl get_param -n at_max
      affected facets: mds1
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
      trevis-68vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
      Waiting for orphan cleanup...
      CMD: trevis-68vm8 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
      osp.lustre-OST0000-osc-MDT0000.old_sync_processed
      osp.lustre-OST0000-osc-MDT0002.old_sync_processed
      osp.lustre-OST0001-osc-MDT0000.old_sync_processed
      osp.lustre-OST0001-osc-MDT0002.old_sync_processed
      osp.lustre-OST0002-osc-MDT0000.old_sync_processed
      osp.lustre-OST0002-osc-MDT0002.old_sync_processed
      osp.lustre-OST0003-osc-MDT0000.old_sync_processed
      osp.lustre-OST0003-osc-MDT0002.old_sync_processed
      osp.lustre-OST0004-osc-MDT0000.old_sync_processed
      osp.lustre-OST0004-osc-MDT0002.old_sync_processed
      osp.lustre-OST0005-osc-MDT0000.old_sync_processed
      osp.lustre-OST0005-osc-MDT0002.old_sync_processed
      osp.lustre-OST0006-osc-MDT0000.old_sync_processed
      osp.lustre-OST0006-osc-MDT0002.old_sync_processed
      osp.lustre-OST0007-osc-MDT0000.old_sync_processed
      osp.lustre-OST0007-osc-MDT0002.old_sync_processed
      wait 40 secs maximumly for trevis-68vm8,trevis-68vm9 mds-ost sync done.
      CMD: trevis-68vm8,trevis-68vm9 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
       replay-vbr test_7a: @@@@@@ FAIL: Test 7a.2 failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6237:error()
        = /usr/lib64/lustre/tests/replay-vbr.sh:727:test_7a()
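      The suite_log above corresponds to the "first / lost / last" operation pattern that replay-vbr builds its version-based-recovery checks around. A minimal sketch of the test_7a.2 sequence, where the mount points and the createmany invocation are placeholders taken from the log (the real logic lives in replay-vbr.sh test_7a and the test-framework helpers):

      ```shell
      #!/bin/sh
      # Hedged sketch of the test_7a.2 sequence from the suite_log above.
      # MOUNT/MOUNT2 are assumed to be two client mounts of the same filesystem;
      # the real test additionally drops the "lost" reply and fails over mds1.
      MOUNT=${MOUNT:-/mnt/lustre}
      MOUNT2=${MOUNT2:-/mnt/lustre2}
      TDIR=d7a.replay-vbr

      first_op() { createmany -o "$MOUNT/$TDIR/f7a.replay-vbr-" 1; }  # create on client 1
      lost_op()  { rm "$MOUNT2/$TDIR/f7a.replay-vbr-0"; }             # remove via client 2; its reply is "lost"
      last_op()  { mkdir "$MOUNT/$TDIR/f7a.replay-vbr-0"; }           # recreate the same name as a directory
      ```

      After these three operations the test unmounts the second client and fails over mds1; replay of the "last" mkdir is expected to succeed via VBR, and "Test 7a.2 failed" means it did not.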
      

      This must be a result of patch https://review.whamcloud.com/38553 (commit 3e04b0fd6c3dd36372f33c54ea5f401c27485d60, "LU-13417 mdd: set default LMV on ROOT"). We may need to use the routine mkdir_on_mdt0() as a temporary fix.
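      For context, mkdir_on_mdt0() pins a new directory to MDT0000 so a test is not affected by the default striped-directory layout on a DNE root. A minimal sketch of what such a helper does, assuming lfs is on PATH (the actual helper lives in test-framework.sh and may differ in detail):

      ```shell
      #!/bin/sh
      # Hedged sketch: pin a new directory to MDT index 0 with "lfs mkdir -i 0",
      # so DNE default-LMV placement cannot put it on another MDT.
      # LFS is overridable for environments where lfs is not on PATH (assumption).
      LFS=${LFS:-lfs}

      mkdir_on_mdt0() {
          $LFS mkdir -i 0 "$@"
      }
      ```

      A test would then call, e.g., mkdir_on_mdt0 $DIR/d7a.replay-vbr instead of a plain mkdir.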

      Logs for more failures are at
      https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5
      https://testing.whamcloud.com/test_sets/a00d3625-d4b0-48ef-88b1-e50707d75462
      https://testing.whamcloud.com/test_sets/08c2b227-3285-438e-87b6-1d34e147a412

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55714/
            Subject: LU-14992 mdt: restore mkdir VBR support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 652032d6c18caffc0782d49e5d5e373010f2bc61


            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55714
            Subject: LU-14992 mdt: restore mkdir VBR support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d3400e2576aeac489976df1c9b34642acac136e9

            adilger Andreas Dilger added a comment -

            It looks like the tests are still failing regularly for master servers, if they are not being skipped because of SLOW: https://testing.whamcloud.com/search?client_branch_type_id=24a6947e-04a9-11e1-bb5f-52540025f9af&horizon=518400&test_set_script_id=9f182464-4070-11e0-8bad-52540025f9af&sub_test_script_id=738cad1e-5c00-11e0-a272-52540025f9af&source=sub_tests#redirect

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49404/
            Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: b71ad0c466c4e5a69e55af7f20a37bf4c2b883ec

            adilger Andreas Dilger added a comment (edited) -

            It looks like replay-vbr test_7a still failed 133 times in the past 4 weeks (100% of the actual runs), all of them with master servers, mostly with master clients. There were 12 failures during interop testing with master servers and EXA6 clients, but the test does not fail with EXA6 servers, and it also passes (at least some of the time) with b2_14 and b2_12 servers, so the issue appears to be on the master servers.

            It looks like the test only runs during full test sessions, and is skipped by "SLOW" for all patch review test runs.

            It looks like the mkdir is forced to be on MDT0000, so I don't know if this is some other aspect of the test expecting a specific MDT, or some other issue?


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52065/
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 04775bc14fe12d15784ccfec78d0ed1975cbc45f


            "xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52065
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 93f16ff3f5a1744a8bd62e07eada5ff2a4dd7bca


            "Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49404
            Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 7e2edf8157fc715887fb8e10dff5c92fab81fefb


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49252/
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d56ea0c80a959ebd9b393f2da048cc179cb16127


            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 7
