Lustre / LU-14992

replay-vbr test 7a fails with 'Test 7a.2 failed'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Fix Version/s: Lustre 2.15.0
    • Labels: DNE
    • Severity: 3

    Description

      replay-vbr test_7a started failing with the error message 'Test 7a.2 failed' on 03 AUG 2021 and has failed 100% of the time since then in full-dne-part-2 test sessions. For Lustre 2.14.53 build #4206 this test does NOT fail; for Lustre 2.14.53.7 build #4207 it fails 100% of the time in DNE test sessions.
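
      For reference, a minimal sketch of how this might be reproduced locally with the Lustre test framework. The MDS/OST counts and the use of a second local client mount (MOUNT_2) are assumptions based on the full-dne-part-2 configuration, not values taken from the autotest session:

      # run only replay-vbr test_7a on a DNE setup with a second client mount
      # (MDSCOUNT/OSTCOUNT/MOUNT_2 are assumed, adjust to the local test config)
      MDSCOUNT=4 OSTCOUNT=8 MOUNT_2=yes ONLY=7a \
          bash /usr/lib64/lustre/tests/replay-vbr.sh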

      Looking at a recent failure at https://testing.whamcloud.com/test_sets/9b87da49-9024-48ba-91b9-e5d006b73d65, we see the following in the suite_log:

      test_7a.2 first: createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      CMD: trevis-68vm5.trevis.whamcloud.com createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
      total: 1 open/close in 0.00 seconds: 284.40 ops/second
      test_7a.2 lost: rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
      test_7a.2 last: mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm5.trevis.whamcloud.com mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
      CMD: trevis-68vm6 grep -c /mnt/lustre2' ' /proc/mounts
      Stopping client trevis-68vm6 /mnt/lustre2 (opts:)
      CMD: trevis-68vm6 lsof -t /mnt/lustre2
      pdsh@trevis-68vm5: trevis-68vm6: ssh exited with exit code 1
      CMD: trevis-68vm6 umount  /mnt/lustre2 2>&1
      Failing mds1 on trevis-68vm8
      CMD: trevis-68vm8 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      Stopping /mnt/lustre-mds1 (opts:) on trevis-68vm8
      CMD: trevis-68vm8 umount -d /mnt/lustre-mds1
      CMD: trevis-68vm8 lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      reboot facets: mds1
      Failover mds1 to trevis-68vm8
      CMD: trevis-68vm8 hostname
      mount facets: mds1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
      CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey 2>&1
      CMD: trevis-68vm8 dmsetup table /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 dmsetup load /dev/mapper/mds1_flakey --table \"0 4194304 linear 252:0 0\"
      CMD: trevis-68vm8 dmsetup resume /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 test -b /dev/mapper/mds1_flakey
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey
      Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
      CMD: trevis-68vm8 /usr/sbin/lctl get_param -n health_check
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      pdsh@trevis-68vm5: trevis-68vm8: ssh exited with exit code 1
      CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 2>/dev/null
      Started lustre-MDT0000
      CMD: trevis-68vm5.trevis.whamcloud.com lctl get_param -n at_max
      affected facets: mds1
      CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
      trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
      trevis-68vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
      Waiting for orphan cleanup...
      CMD: trevis-68vm8 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
      osp.lustre-OST0000-osc-MDT0000.old_sync_processed
      osp.lustre-OST0000-osc-MDT0002.old_sync_processed
      osp.lustre-OST0001-osc-MDT0000.old_sync_processed
      osp.lustre-OST0001-osc-MDT0002.old_sync_processed
      osp.lustre-OST0002-osc-MDT0000.old_sync_processed
      osp.lustre-OST0002-osc-MDT0002.old_sync_processed
      osp.lustre-OST0003-osc-MDT0000.old_sync_processed
      osp.lustre-OST0003-osc-MDT0002.old_sync_processed
      osp.lustre-OST0004-osc-MDT0000.old_sync_processed
      osp.lustre-OST0004-osc-MDT0002.old_sync_processed
      osp.lustre-OST0005-osc-MDT0000.old_sync_processed
      osp.lustre-OST0005-osc-MDT0002.old_sync_processed
      osp.lustre-OST0006-osc-MDT0000.old_sync_processed
      osp.lustre-OST0006-osc-MDT0002.old_sync_processed
      osp.lustre-OST0007-osc-MDT0000.old_sync_processed
      osp.lustre-OST0007-osc-MDT0002.old_sync_processed
      wait 40 secs maximumly for trevis-68vm8,trevis-68vm9 mds-ost sync done.
      CMD: trevis-68vm8,trevis-68vm9 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
       replay-vbr test_7a: @@@@@@ FAIL: Test 7a.2 failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6237:error()
        = /usr/lib64/lustre/tests/replay-vbr.sh:727:test_7a()
      

      This must be a result of patch https://review.whamcloud.com/38553, commit 3e04b0fd6c3dd36372f33c54ea5f401c27485d60 "LU-13417 mdd: set default LMV on ROOT". We may need to use the routine mkdir_on_mdt0() as a temporary fix.
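
      For illustration, a minimal sketch of the kind of temporary workaround meant above, assuming the per-test directory in replay-vbr.sh is currently created with a plain mkdir (the exact spot where $tdir is created is an assumption):

      # replay-vbr.sh (sketch): create the per-test directory on MDT0000 so the
      # VBR replay sequence is not affected by the default LMV now set on ROOT
      # before (assumed):
      #     mkdir -p $DIR/$tdir
      # temporary fix, using the test-framework.sh helper (roughly 'lfs mkdir -i 0'):
      mkdir_on_mdt0 $DIR/$tdir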

      Logs for more failures are at
      https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5
      https://testing.whamcloud.com/test_sets/a00d3625-d4b0-48ef-88b1-e50707d75462
      https://testing.whamcloud.com/test_sets/08c2b227-3285-438e-87b6-1d34e147a412


People

  Assignee: Lai Siyao (laisiyao)
  Reporter: James Nunez (jamesanunez) (Inactive)
  Votes: 0
  Watchers: 7
