  1. Lustre
  2. LU-11538

replay-single test 80g fails with '/usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed'

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.12.0
    • DNE/ZFS
    • 3
    • 9223372036854775807

    Description

      replay-single test_80g fails on Lustre configurations that combine ZFS with DNE. In a recent failure, https://testing.whamcloud.com/test_sets/6c06f67c-cf6a-11e8-82f2-52540065bddc, 'lfs getstripe' cannot open the remote directory and fails with ENOENT:

      onyx-42vm6: CMD: onyx-42vm6.onyx.whamcloud.com lctl get_param -n at_max
      onyx-42vm7: CMD: onyx-42vm7.onyx.whamcloud.com lctl get_param -n at_max
      onyx-42vm6: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-42vm7: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      lfs getstripe: cannot open '/mnt/lustre/d80g.replay-single/remote_dir': No such file or directory (2)
      error: getstripe failed for /mnt/lustre/d80g.replay-single/remote_dir.
       replay-single test_80g: @@@@@@ FAIL: /usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5788:error()
        = /usr/lib64/lustre/tests/replay-single.sh:2580:remote_dir_check_80()
        = /usr/lib64/lustre/tests/replay-single.sh:2792:test_80g()
      

      Comparing the console log from this failed test session with one from a session where test 80g passes, a few errors stand out in the MDS2/MDS4 (vm10) log:

      [75477.742299] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
      [75477.926652] Lustre: Failing over lustre-MDT0001
      [75477.946482] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8f07870a0f00 x1614211128795744/t0(0) o1000->lustre-MDT0000-osp-MDT0001@10.2.8.153@tcp:24/4 lens 304/4320 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      [75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      [75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5
      [75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)
      [75478.605896] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
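
      When comparing console logs between failing and passing sessions, it can help to pull out only the LustreError lines and decode any trailing return code. The following is a minimal sketch, not part of the Lustre test framework: the regular expressions and the lustre_errors() helper are hypothetical and were written only against the line shapes quoted above, so other message formats may not match.

```python
import errno
import re

# Assumed line shape, taken from the console excerpt above:
#   [timestamp] LustreError: pid:0:(file.c:line:func()) message
LUSTRE_ERROR = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:\s+\d+:\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)$"
)
# Only a trailing "rc = N" is decoded; other rc notations are ignored.
RC = re.compile(r"rc = (-?\d+)\s*$")

# Lines copied verbatim from the console log in this ticket.
SAMPLE = [
    "[75477.926652] Lustre: Failing over lustre-MDT0001",
    "[75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages",
    "[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5",
    "[75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)",
]


def lustre_errors(console_lines):
    """Return one dict per LustreError line: timestamp, function, errno name."""
    found = []
    for raw in console_lines:
        match = LUSTRE_ERROR.match(raw.strip())
        if match is None:
            continue  # plain "Lustre:" lines and debug markers are skipped
        entry = {"ts": float(match["ts"]), "func": match["func"], "errno": None}
        rc = RC.search(match["msg"])
        if rc is not None and int(rc.group(1)) < 0:
            # Negative return codes are treated as negated errno values.
            entry["errno"] = errno.errorcode.get(-int(rc.group(1)))
        found.append(entry)
    return found


if __name__ == "__main__":
    for entry in lustre_errors(SAMPLE):
        print(entry)
```

      Run against the excerpt, this keeps the ptlrpc_import_delay_req() and osp_attr_get() entries and, on Linux, decodes the rc = -5 from osp_attr_get() as EIO. Running the same filter over a passing session's console log and diffing the two result lists is one way to see which errors are specific to the failure.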
      

      This test fails almost 100% of the time on a DNE with ZFS configuration. Test 80g frequently fails after test 80f has failed, but not always.

      Some other recent failures:
      https://testing.whamcloud.com/test_sets/121a90c6-c6e4-11e8-82f2-52540065bddc
      https://testing.whamcloud.com/test_sets/cb442ad8-d17c-11e8-b589-52540065bddc

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)