[LU-11538] replay-single test 80g fails with '/usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed'

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: DNE/ZFS
    • Severity: 3

    Description

      replay-single test_80g fails for Lustre configurations that combine ZFS with DNE. In a recent failure ( https://testing.whamcloud.com/test_sets/6c06f67c-cf6a-11e8-82f2-52540065bddc ), 'lfs getstripe' fails with "No such file or directory":

      onyx-42vm6: CMD: onyx-42vm6.onyx.whamcloud.com lctl get_param -n at_max
      onyx-42vm7: CMD: onyx-42vm7.onyx.whamcloud.com lctl get_param -n at_max
      onyx-42vm6: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      onyx-42vm7: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      lfs getstripe: cannot open '/mnt/lustre/d80g.replay-single/remote_dir': No such file or directory (2)
      error: getstripe failed for /mnt/lustre/d80g.replay-single/remote_dir.
       replay-single test_80g: @@@@@@ FAIL: /usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5788:error()
        = /usr/lib64/lustre/tests/replay-single.sh:2580:remote_dir_check_80()
        = /usr/lib64/lustre/tests/replay-single.sh:2792:test_80g()
      

      Comparing the console log from this failed test session to one where test 80g passes, we see a few errors in the MDS2, MDS4 (vm10) log:

      [75477.742299] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
      [75477.926652] Lustre: Failing over lustre-MDT0001
      [75477.946482] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8f07870a0f00 x1614211128795744/t0(0) o1000->lustre-MDT0000-osp-MDT0001@10.2.8.153@tcp:24/4 lens 304/4320 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      [75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      [75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5
      [75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)
      [75478.605896] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
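
      When comparing console logs between sessions, it can help to pull out just the osp_attr_get failures and decode the return code. The following is a minimal sketch, not part of the Lustre test framework: the function name find_osp_attr_get_errors and the regular expression are assumptions based only on the message format shown in the log above. On Linux, rc = -5 corresponds to EIO.

```python
import errno
import re
from typing import List, NamedTuple


class OspAttrGetError(NamedTuple):
    device: str      # e.g. lustre-MDT0000-osp-MDT0001
    fid: str         # FID as printed in the log, brackets included
    rc: int          # negative errno value reported by the kernel
    errno_name: str  # symbolic errno name, e.g. EIO


# Hypothetical pattern, derived from the single message format seen in this ticket.
_PATTERN = re.compile(
    r"\(osp_object\.c:\d+:osp_attr_get\(\)\)\s+"
    r"(?P<device>\S+):osp_attr_get update error\s+"
    r"(?P<fid>\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]):\s+"
    r"rc = (?P<rc>-?\d+)"
)


def find_osp_attr_get_errors(console_log: str) -> List[OspAttrGetError]:
    """Return every osp_attr_get update error found in a console log."""
    errors = []
    for match in _PATTERN.finditer(console_log):
        rc = int(match.group("rc"))
        name = errno.errorcode.get(abs(rc), "UNKNOWN")
        errors.append(
            OspAttrGetError(match.group("device"), match.group("fid"), rc, name)
        )
    return errors


# Two lines copied from the MDS console log in this ticket.
SAMPLE = (
    "[75477.926652] Lustre: Failing over lustre-MDT0001\n"
    "[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) "
    "lustre-MDT0000-osp-MDT0001:osp_attr_get update error "
    "[0x200000401:0x1:0x0]: rc = -5\n"
)

for err in find_osp_attr_get_errors(SAMPLE):
    print(err.device, err.fid, err.rc, err.errno_name)
```

      Running the same function over the console log of a passing session would show whether the osp_attr_get error is specific to the failing runs.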
      

      This test fails almost 100% of the time for a DNE with ZFS configuration. replay-single test 80g frequently fails right after test 80f fails, but not always.

      Other recent failures:
      https://testing.whamcloud.com/test_sets/121a90c6-c6e4-11e8-82f2-52540065bddc
      https://testing.whamcloud.com/test_sets/cb442ad8-d17c-11e8-b589-52540065bddc

          Activity

            adilger Andreas Dilger added a comment -

            Issue was fixed via patch https://review.whamcloud.com/34069 "LU-10143 osd-zfs: allocate sequence in advance"

            sarah Sarah Liu added a comment -

            Seeing a similar error on soak, which is running b2_10-ib build 98, on MDS 0 (ldiskfs):

            [13029.868775] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [13052.013064] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 4 seconds
            [13112.234837] Lustre: MGS: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
            [13112.244106] Lustre: Skipped 1 previous similar message
            [13112.621426] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.111@o2ib, removing former export from same NID
            [13173.841531] LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
            [13173.857375] LustreError: 19073:0:(osp_object.c:582:osp_attr_get()) soaked-MDT0003-osp-MDT0000:osp_attr_get update error [0x2c000ee60:0x1:0x0]: rc = -5
            [13173.862277] Lustre: soaked-MDT0003-osp-MDT0000: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
            [13173.862280] Lustre: Skipped 2 previous similar messages
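
            To check whether two console logs share the same failure point, one option is to reduce each LustreError line to its source location (file, line, function) and intersect the results. This is only an illustrative sketch: the helper name error_signatures and the regular expression are assumptions, and the sample strings are a few lines copied from the ZFS/DNE log in the description and from the soak log above.

```python
import re
from typing import Set, Tuple

# (source file, line number, function) taken from a LustreError message.
Signature = Tuple[str, int, str]

# Hypothetical pattern for messages of the form
#   LustreError: <pid>:<n>:(<file>.c:<line>:<function>()) ...
# Messages without a source location (e.g. "LustreError: 167-0: ...") are skipped.
_LOCATION = re.compile(
    r"LustreError: \d+:\d+:"
    r"\((?P<file>[\w-]+\.c):(?P<line>\d+):(?P<func>\w+)\(\)\)"
)


def error_signatures(console_log: str) -> Set[Signature]:
    """Collect the distinct source locations of LustreError messages."""
    return {
        (m.group("file"), int(m.group("line")), m.group("func"))
        for m in _LOCATION.finditer(console_log)
    }


ZFS_DNE_LOG = (
    "[75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) "
    "Skipped 2 previous similar messages\n"
    "[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) "
    "lustre-MDT0000-osp-MDT0001:osp_attr_get update error "
    "[0x200000401:0x1:0x0]: rc = -5\n"
)

SOAK_LOG = (
    "[13173.841531] LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client "
    "was evicted by soaked-MDT0003; in progress operations using this service "
    "will fail.\n"
    "[13173.857375] LustreError: 19073:0:(osp_object.c:582:osp_attr_get()) "
    "soaked-MDT0003-osp-MDT0000:osp_attr_get update error "
    "[0x2c000ee60:0x1:0x0]: rc = -5\n"
)

shared = error_signatures(ZFS_DNE_LOG) & error_signatures(SOAK_LOG)
print(sorted(shared))
```

            A shared source location only indicates that both systems hit the same error path; it does not by itself show that the root cause is the same.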
            

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)