Description
replay-single test_80g fails for ZFS with DNE Lustre configurations. Looking at a recent failure, https://testing.whamcloud.com/test_sets/6c06f67c-cf6a-11e8-82f2-52540065bddc , we see that 'lfs getstripe' fails:
onyx-42vm6: CMD: onyx-42vm6.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm7: CMD: onyx-42vm7.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm6: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-42vm7: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
lfs getstripe: cannot open '/mnt/lustre/d80g.replay-single/remote_dir': No such file or directory (2)
error: getstripe failed for /mnt/lustre/d80g.replay-single/remote_dir.
replay-single test_80g: @@@@@@ FAIL: /usr/bin/lfs getstripe -m /mnt/lustre/d80g.replay-single/remote_dir failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5788:error()
= /usr/lib64/lustre/tests/replay-single.sh:2580:remote_dir_check_80()
= /usr/lib64/lustre/tests/replay-single.sh:2792:test_80g()
Comparing the console log from this failed test session to one where test 80g passes, we see a few errors in the MDS2, MDS4 (vm10) log:
[75477.742299] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[75477.926652] Lustre: Failing over lustre-MDT0001
[75477.946482] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8f07870a0f00 x1614211128795744/t0(0) o1000->lustre-MDT0000-osp-MDT0001@10.2.8.153@tcp:24/4 lens 304/4320 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[75477.948593] LustreError: 6854:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[75477.949570] LustreError: 6854:0:(osp_object.c:582:osp_attr_get()) lustre-MDT0000-osp-MDT0001:osp_attr_get update error [0x200000401:0x1:0x0]: rc = -5
[75478.049796] Lustre: lustre-MDT0001: Not available for connect from 10.2.8.153@tcp (stopping)
[75478.605896] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
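The comparison of a failing console log against a passing one can be approximated with a short script. This is only a sketch: the function names lustre_errors and only_in_failed are hypothetical, and it assumes each console message sits on its own line with a leading kernel timestamp, as in the excerpt above. It strips the timestamp and the pid prefix so that the same message from two different runs compares equal.

```python
import re

# Kernel timestamp prefix such as "[75477.946482] "
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Thread prefix right after "LustreError: ", e.g. "6854:0:" (varies between runs)
_PID_PREFIX = re.compile(r"(?<=LustreError: )\d+:\d+:")


def lustre_errors(console_text):
    """Return the set of LustreError messages, without timestamps or pid prefixes."""
    errors = set()
    for line in console_text.splitlines():
        message = _TIMESTAMP.sub("", line.strip())
        if message.startswith("LustreError:"):
            errors.add(_PID_PREFIX.sub("", message))
    return errors


def only_in_failed(failed_text, passed_text):
    """Return LustreError messages seen in the failed log but not in the passing one."""
    return sorted(lustre_errors(failed_text) - lustre_errors(passed_text))
```

Messages that embed request pointers or transaction IDs (such as the IMP_CLOSED line) will still differ between runs, so the output is a starting point for manual review rather than an exact diff.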
This test fails almost 100% of the time for a DNE with ZFS configuration. Frequently, replay-single test 80g fails after test 80f fails, but this is not always the case.
Some other recent failures are at:
https://testing.whamcloud.com/test_sets/121a90c6-c6e4-11e8-82f2-52540065bddc
https://testing.whamcloud.com/test_sets/cb442ad8-d17c-11e8-b589-52540065bddc