[LU-11538] replay-single test 80g fails with '/usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed' Created: 17/Oct/18  Updated: 03/Aug/22  Resolved: 26/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: DNE, zfs
Environment:

DNE/ZFS


Issue Links:
Duplicate
duplicates LU-10143 LBUG dt_object.h:2166:dt_declare_reco... Resolved
Related
is related to LU-11366 replay-single timeout test 80f: rm: c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_80g fails for ZFS with DNE Lustre configurations. Looking at a recent failure, https://testing.whamcloud.com/test_sets/6c06f67c-cf6a-11e8-82f2-52540065bddc , we see that 'lfs getstripe' fails:

onyx-42vm6: CMD: onyx-42vm6.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm7: CMD: onyx-42vm7.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm6: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-42vm7: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
lfs getstripe: cannot open '/mnt/lustre/d80g.replay-single/remote_dir': No such file or directory (2)
error: getstripe failed for /mnt/lustre/d80g.replay-single/remote_dir.
 replay-single test_80g: @@@@@@ FAIL: /usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5788:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2580:remote_dir_check_80()
  = /usr/lib64/lustre/tests/replay-single.sh:2792:test_80g()
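
For reference, the failing check boils down to creating a directory on a remote MDT and, after failover and recovery, reading its MDT index back with 'lfs getstripe -m'. Below is a minimal sketch of the kind of sequence being exercised, not the actual test script; the mount point, directory name, and MDT index mirror this report, everything else is illustrative:

# Create a DNE remote directory on MDT0001 (index 1).
lfs mkdir -i 1 /mnt/lustre/d80g.replay-single/remote_dir

# ... MDT0001 is failed over and recovered here ...

# After recovery, read the directory's MDT index back. This is the call
# that returns ENOENT ("No such file or directory") in the failure above.
idx=$(lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir) ||
        echo "getstripe failed"
[ "$idx" = "1" ] || echo "unexpected MDT index: $idx"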

Comparing the console log from this failed test session to one where test 80g passes, we see a few errors in the MDS2/MDS4 (vm10) console log:

[75477.742299] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[75477.926652] Lustre: Failing over lustre-MDT0001
[75477.946482] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8f07870a0f00 x1614211128795744/t0(0) o1000->lustre-MDT0000-osp-MDT0001@10.2.8.153@tcp:24/4 lens 304/4320 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5
[75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)
[75478.605896] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
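
The osp_attr_get failure (rc = -5, i.e. -EIO) follows directly from the IMP_CLOSED message above it: the OSP device on MDT0001 could not fetch attributes for the object because its import to MDT0000 was already closed while MDT0001 was being failed over. The failover itself is driven by the test framework; a rough manual equivalent, assuming a ZFS-backed target named lustre-mdt2/mdt2 and the mount point from the log (both names are illustrative):

# Stop the MDT, as in the DEBUG MARKER above; this starts the failover.
umount -d /mnt/lustre-mds2

# Restart the target and check that recovery has completed.
mount -t lustre lustre-mdt2/mdt2 /mnt/lustre-mds2
lctl get_param mdt.lustre-MDT0001.recovery_status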

This test fails almost 100% of the time for DNE with ZFS configurations. Frequently, replay-single test 80g fails after test 80f fails, but this is not always the case.

Some other recent failures are at
https://testing.whamcloud.com/test_sets/121a90c6-c6e4-11e8-82f2-52540065bddc
https://testing.whamcloud.com/test_sets/cb442ad8-d17c-11e8-b589-52540065bddc



 Comments   
Comment by Sarah Liu [ 05/Feb/19 ]

Seeing a similar error on soak, which is running b2_10-ib build 98, on MDS 0 with ldiskfs:

[13029.868775] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[13052.013064] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 4 seconds
[13112.234837] Lustre: MGS: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[13112.244106] Lustre: Skipped 1 previous similar message
[13112.621426] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.111@o2ib, removing former export from same NID
[13173.841531] LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
[13173.857375] LustreError: 19073:0:(osp_object.c:582:osp_attr_get()) soaked-MDT0003-osp-MDT0000:osp_attr_get update error [0x2c000ee60:0x1:0x0]: rc = -5
[13173.862277] Lustre: soaked-MDT0003-osp-MDT0000: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[13173.862280] Lustre: Skipped 2 previous similar messages
Comment by Andreas Dilger [ 26/Feb/19 ]

The issue was fixed via patch https://review.whamcloud.com/34069 "LU-10143 osd-zfs: allocate sequence in advance".
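
For anyone needing this fix on a local branch, it can be fetched from Gerrit in the usual way; a sketch assuming the standard Gerrit change-ref layout (substitute the desired patch set number for <N>):

git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/69/34069/<N>
git cherry-pick FETCH_HEAD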
