Lustre / LU-14992

replay-vbr test 7a fails with 'Test 7a.2 failed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.0
    • Labels: DNE
    • Severity: 3

    Description

      replay-vbr test_7a started failing with the error message 'Test 7a.2 failed' on 03 AUG 2021 and fails 100% of the time for the full-dne-part-2 test sessions. For Lustre 2.14.53 build #4206, this test does NOT fail. For Lustre 2.14.53.7 build #4207, this test fails 100% of the time for DNE test sessions.

      Looking at a recent failure at https://testing.whamcloud.com/test_sets/9b87da49-9024-48ba-91b9-e5d006b73d65, we see the following in the suite_log:

      test_7a.2 first: createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      CMD: trevis-68vm5.trevis.whamcloud.com createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      total: 1 open/close in 0.00 seconds: 284.40 ops/second
      test_7a.2 lost: rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      test_7a.2 last: mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm5.trevis.whamcloud.com mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 grep -c /mnt/lustre2' ' /proc/mounts
      Stopping client trevis-68vm6 /mnt/lustre2 (opts:)
      CMD: trevis-68vm6 lsof -t /mnt/lustre2
      pdsh@trevis-68vm5: trevis-68vm6: ssh exited with exit code 1
      CMD: trevis-68vm6 umount  /mnt/lustre2 2>&1
      Failing mds1 on trevis-68vm8
      CMD: trevis-68vm8 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      Stopping /mnt/lustre-mds1 (opts:) on trevis-68vm8
      CMD: trevis-68vm8 umount -d /mnt/lustre-mds1
      CMD: trevis-68vm8 lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      reboot facets: mds1
      Failover mds1 to trevis-68vm8
      CMD: trevis-68vm8 hostname
      mount facets: mds1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey 2>&1
      CMD: trevis-68vm8 dmsetup table /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup load /dev/mapper/mds1_flakey --table \"0 4194304 linear 252:0 0\"
      CMD: trevis-68vm8 dmsetup resume /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 test -b /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey
      Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 /usr/sbin/lctl get_param -n health_check
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      pdsh@trevis-68vm5: trevis-68vm8: ssh exited with exit code 1
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 2>/dev/null
      Started lustre-MDT0000
      CMD: trevis-68vm5.trevis.whamcloud.com lctl get_param -n at_max
      affected facets: mds1
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
      trevis-68vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
      Waiting for orphan cleanup...
      CMD: trevis-68vm8 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
      osp.lustre-OST0000-osc-MDT0000.old_sync_processed
      osp.lustre-OST0000-osc-MDT0002.old_sync_processed
      osp.lustre-OST0001-osc-MDT0000.old_sync_processed
      osp.lustre-OST0001-osc-MDT0002.old_sync_processed
      osp.lustre-OST0002-osc-MDT0000.old_sync_processed
      osp.lustre-OST0002-osc-MDT0002.old_sync_processed
      osp.lustre-OST0003-osc-MDT0000.old_sync_processed
      osp.lustre-OST0003-osc-MDT0002.old_sync_processed
      osp.lustre-OST0004-osc-MDT0000.old_sync_processed
      osp.lustre-OST0004-osc-MDT0002.old_sync_processed
      osp.lustre-OST0005-osc-MDT0000.old_sync_processed
      osp.lustre-OST0005-osc-MDT0002.old_sync_processed
      osp.lustre-OST0006-osc-MDT0000.old_sync_processed
      osp.lustre-OST0006-osc-MDT0002.old_sync_processed
      osp.lustre-OST0007-osc-MDT0000.old_sync_processed
      osp.lustre-OST0007-osc-MDT0002.old_sync_processed
      wait 40 secs maximumly for trevis-68vm8,trevis-68vm9 mds-ost sync done.
      CMD: trevis-68vm8,trevis-68vm9 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
       replay-vbr test_7a: @@@@@@ FAIL: Test 7a.2 failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6237:error()
        = /usr/lib64/lustre/tests/replay-vbr.sh:727:test_7a()
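      The log shows the test's three-step VBR pattern, labelled "first", "lost", and "last" above. A condensed sketch of the sequence in the shell style of the test scripts ($MOUNT and $MOUNT2 stand for the two client mounts; directory names are abbreviated, and the exact verification step in replay-vbr.sh test_7a may differ):

      # Condensed sketch of the test_7a.2 sequence from the log above.
      createmany -o $MOUNT/d7a/f7a- 1   # "first" op from client 1
      rm $MOUNT2/d7a/f7a-0              # "lost" op from client 2
      mkdir $MOUNT/d7a/f7a-0            # "last" op from client 1
      umount $MOUNT2                    # client 2 unmounts; its op will be lost
      fail mds1                         # restart mds1 and wait for recovery
      # After recovery the test checks the 7a.2 result; in these sessions
      # that check fails, producing 'Test 7a.2 failed'.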
      

      This must be a result of patch https://review.whamcloud.com/38553 (commit 3e04b0fd6c3dd36372f33c54ea5f401c27485d60, "LU-13417 mdd: set default LMV on ROOT"). We may need to use the routine mkdir_on_mdt0() as a temporary fix; a sketch of that approach follows.
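      A minimal sketch of that workaround, assuming the helper follows the test-framework convention of pinning a directory to MDT index 0 with lfs mkdir (the real mkdir_on_mdt0() in test-framework.sh may differ in detail):

      # Create the test directory explicitly on MDT0 rather than letting
      # the default LMV on ROOT place it on an arbitrary MDT.
      mkdir_on_mdt0() {
              $LFS mkdir -i 0 -c 1 "$@"   # -i 0: MDT index 0; -c 1: not striped
      }

      # Used in place of a plain 'mkdir $DIR/$tdir' in the test scripts.
      mkdir_on_mdt0 $DIR/$tdir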

      Logs for more failures are at:
      https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5
      https://testing.whamcloud.com/test_sets/a00d3625-d4b0-48ef-88b1-e50707d75462
      https://testing.whamcloud.com/test_sets/08c2b227-3285-438e-87b6-1d34e147a412

Activity

            "xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52065
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 93f16ff3f5a1744a8bd62e07eada5ff2a4dd7bca

            gerrit Gerrit Updater added a comment - "xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52065 Subject: LU-14992 tests: add more mkdir_on_mdt0 calls Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 93f16ff3f5a1744a8bd62e07eada5ff2a4dd7bca

            "Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49404
            Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 7e2edf8157fc715887fb8e10dff5c92fab81fefb

            gerrit Gerrit Updater added a comment - "Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49404 Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 7e2edf8157fc715887fb8e10dff5c92fab81fefb

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49252/
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d56ea0c80a959ebd9b393f2da048cc179cb16127

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49252/ Subject: LU-14992 tests: add more mkdir_on_mdt0 calls Project: fs/lustre-release Branch: master Current Patch Set: Commit: d56ea0c80a959ebd9b393f2da048cc179cb16127

            "Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49252
            Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 53979bf9a3a41c68df30a28c7555711f2eb4d20e

            gerrit Gerrit Updater added a comment - "Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49252 Subject: LU-14992 tests: add more mkdir_on_mdt0 calls Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 53979bf9a3a41c68df30a28c7555711f2eb4d20e
            Peter Jones added a comment:

            Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44902/
            Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f0324c5c2f4390d6d7e93ed799e95d8eef4704f4

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44902/ Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0 Project: fs/lustre-release Branch: master Current Patch Set: Commit: f0324c5c2f4390d6d7e93ed799e95d8eef4704f4
            Colin Faber added a comment:

            Hi yujian, can you take a look?

            Thank you!

            James Nunez (Inactive) added a comment:

            From the test results for patch https://review.whamcloud.com/44902, it looks like the sanity 133a/133b failures are a separate issue. Ticket LU-15042 was opened to track the sanity test failures.

            "James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44902
            Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4e78399b3e3b8504ea811ae93d6cb61fb4317533

            gerrit Gerrit Updater added a comment - "James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44902 Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4e78399b3e3b8504ea811ae93d6cb61fb4317533

            James Nunez (Inactive) added a comment:

            I think there are a few other tests that hit this same issue: they check MDT0/MDS0 for stats or other information, but mkdir may have created the directory on a different MDT. We can create separate tickets for each failure if this turns out to be wrong.

            sanity test 133a/b - error 'The counter for mkdir on mds1 was not incremented'
            https://testing.whamcloud.com/test_sets/f7481079-7fbe-47af-abbb-d376b877700b
            https://testing.whamcloud.com/test_sets/802ba343-7f6e-4d01-beba-ac1060836f5f
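            For illustration, a hedged sketch of that mismatch (md_stats is the standard per-MDT stats parameter; the directory names here are hypothetical and the exact checks in sanity 133a/b may differ):

            # In a DNE config a plain mkdir may be serviced by, say, MDT0002,
            # so the mkdir counter on MDT0000 (mds1) never moves:
            mkdir /mnt/lustre/dtest
            lctl get_param mdt.lustre-MDT0000.md_stats | grep mkdir  # unchanged

            # Pinning the parent directory to MDT0 first makes the mds1
            # counter check valid:
            lfs mkdir -i 0 /mnt/lustre/dtest0
            mkdir /mnt/lustre/dtest0/sub
            lctl get_param mdt.lustre-MDT0000.md_stats | grep mkdir  # incremented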

            People

              Assignee: Lai Siyao
              Reporter: James Nunez (Inactive)
              Votes: 0
              Watchers: 7