[LU-11538] replay-single test 80g fails with '/usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed' Created: 17/Oct/18 Updated: 03/Aug/22 Resolved: 26/Feb/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | DNE, zfs |
| Environment: | DNE/ZFS |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_80g fails for ZFS with DNE Lustre configurations. Looking at a recent failure, https://testing.whamcloud.com/test_sets/6c06f67c-cf6a-11e8-82f2-52540065bddc , we see ‘lfs getstripe’ fail on onyx-42vm6:

onyx-42vm6: CMD: onyx-42vm6.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm7: CMD: onyx-42vm7.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm6: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-42vm7: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
lfs getstripe: cannot open '/mnt/lustre/d80g.replay-single/remote_dir': No such file or directory (2)
error: getstripe failed for /mnt/lustre/d80g.replay-single/remote_dir.
 replay-single test_80g: @@@@@@ FAIL: /usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5788:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2580:remote_dir_check_80()
  = /usr/lib64/lustre/tests/replay-single.sh:2792:test_80g()

Comparing the console log from this failed test session to one where test 80g passes, we see a few errors in the MDS2, MDS4 (vm10) log:

[75477.742299] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[75477.926652] Lustre: Failing over lustre-MDT0001
[75477.946482] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8f07870a0f00 x1614211128795744/t0(0) o1000->lustre-MDT0000-osp-MDT0001@10.2.8.153@tcp:24/4 lens 304/4320 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5
[75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)
[75478.605896] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&

This test fails almost 100% of the time for a DNE with ZFS configuration. Frequently, replay-single test 80g fails after test 80f fails, but this is not always true. Some other recent failures are at |
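For context, the failing check boils down to the following shell sketch. This is a minimal reproduction outline, not the actual replay-single.sh code; the standalone lfs mkdir/getstripe commands and the placeholder failover step are assumptions about the test flow, and the MDT index and paths are illustrative:

# Sketch of the scenario test_80g exercises, assuming a DNE filesystem
# with at least two MDTs mounted at /mnt/lustre (hypothetical standalone form).
MOUNT=/mnt/lustre
DIR=$MOUNT/d80g.replay-single/remote_dir

mkdir -p $MOUNT/d80g.replay-single
lfs mkdir -i 1 $DIR          # create the remote directory on MDT0001

# <fail over and restart MDT0001 here; replay-single.sh drives this
#  with its test-framework helpers>

# The check reported as failing above: after recovery the remote
# directory should still exist and report its MDT index.
lfs getstripe -m $DIR || echo "remote_dir missing after failover"

In the failing runs the final getstripe returns ENOENT, consistent with the "No such file or directory" error and the osp_attr_get() rc = -5 messages in the logs above.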
| Comments |
| Comment by Sarah Liu [ 05/Feb/19 ] |
|
Seeing a similar error on soak, which is running b2_10-ib build 98:

[13029.868775] Lustre: 12363:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[13052.013064] LNet: 12352:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 192.168.1.111@o2ib: 4 seconds
[13112.234837] Lustre: MGS: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[13112.244106] Lustre: Skipped 1 previous similar message
[13112.621426] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.111@o2ib, removing former export from same NID
[13173.841531] LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
[13173.857375] LustreError: 19073:0:(osp_object.c:582:osp_attr_get()) soaked-MDT0003-osp-MDT0000:osp_attr_get update error [0x2c000ee60:0x1:0x0]: rc = -5
[13173.862277] Lustre: soaked-MDT0003-osp-MDT0000: Connection restored to 192.168.1.111@o2ib (at 192.168.1.111@o2ib)
[13173.862280] Lustre: Skipped 2 previous similar messages |
| Comment by Andreas Dilger [ 26/Feb/19 ] |
|
Issue was fixed via patch https://review.whamcloud.com/34069. |