Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.12.3, Lustre 2.12.4
-
3
Description
replay-single test_70f fails on the first OST failover, yet the test goes on to fail over OSTs a total of nine times. The suite_log for a recent Lustre 2.12.4 failure, https://testing.whamcloud.com/test_sets/e74bc5bc-44d3-11ea-bffa-52540065bddc, shows:
CMD: trevis-47vm10 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
test_70f failing OST 1 times
replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1
CMD: trevis-47vm10 /usr/sbin/lctl dl
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5900:error()
= /usr/lib64/lustre/tests/replay-single.sh:2355:test_70f_write_and_read()
= /usr/lib64/lustre/tests/replay-single.sh:2388:test_70f_loop()
= /usr/lib64/lustre/tests/replay-single.sh:2435:test_70f()
Client 3 (vm8) issued the failed dd, but its console log does not reveal the cause:
[26207.492527] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/dev/urandom of=/tmp/f70f.replay-single
[26208.392570] Lustre: DEBUG MARKER: md5sum /tmp/f70f.replay-single
[26213.468597] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[26213.847635] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[26216.624820] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8
[26216.837739] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 1 times
[26217.072446] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10 if=\/tmp\/f70f.replay-single of=\/mnt\/lustre\/d70f.replay-single\/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1
[26217.108406] Lustre: DEBUG MARKER: test_70f failing OST 1 times
[26217.340902] Lustre: DEBUG MARKER: replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1
[26217.777956] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2020-01-29/lustre-b2_12-el7_7-x86_64--failover--1_6__62___56738b6f-28bc-459e-a9cf-a3f728fca5df/replay-single.test_70f.debug_log.$(hostname -s).1580545204.log;
[26217.777956] dmesg > /autotest/autotest2/2020-01-29/lustre-b
[26219.354501] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[26229.481266] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580545208/real 1580545208] req@ffff8fc5fadb7a80 x1657294312905536/t0(0) o400->lustre-OST0000-osc-ffff8fc5fa46e800@10.9.3.130@tcp:28/4 lens 224/224 e 0 to 1 dl 1580545215 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[26229.486117] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[26229.487755] Lustre: lustre-OST0000-osc-ffff8fc5fa46e800: Connection to lustre-OST0000 (at 10.9.3.130@tcp) was lost; in progress operations using this service will wait for recovery to complete
[26229.490534] Lustre: Skipped 2 previous similar messages
[26235.133943] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.44@tcp added to recovery queue. Health = 900
[26235.136037] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message
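When reading a console log like the one above, it can help to line up the kernel timestamps of the dd marker, the FAIL marker, and the RPC timeout message. The following is only a rough sketch, not part of the Lustre test framework: the helper names (parse_entries, first_time, rpc_window) are made up for illustration, and it assumes one log entry per line, as in the embedded sample.

```python
import re

# Three entries copied from the vm8 console log quoted above; the long
# FAIL and timeout messages are cut short here for readability.
SAMPLE = """\
[26216.624820] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8
[26217.072446] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10
[26229.481266] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580545208/real 1580545208] dl 1580545215
"""

# A dmesg-style entry: "[seconds.micros] message".
ENTRY = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_entries(text):
    """Return (timestamp, message) pairs for every line that looks like a log entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def first_time(entries, needle):
    """Kernel timestamp of the first message containing needle, or None."""
    for stamp, message in entries:
        if needle in message:
            return stamp
    return None


def rpc_window(entries):
    """Seconds between 'sent' and the 'dl' deadline of the first timed-out request."""
    for _, message in entries:
        match = re.search(r"\[sent (\d+)/real \d+\].*\bdl (\d+)", message)
        if match:
            return int(match.group(2)) - int(match.group(1))
    return None


if __name__ == "__main__":
    entries = parse_entries(SAMPLE)
    dd_start = first_time(entries, "DEBUG MARKER: dd ")
    fail_mark = first_time(entries, "FAIL:")
    timeout_msg = first_time(entries, "ptlrpc_expire_one_request")
    print(f"dd marker -> FAIL marker: {fail_mark - dd_start:.3f} s")
    print(f"FAIL marker -> timeout message: {timeout_msg - fail_mark:.3f} s")
    print(f"sent -> deadline of timed-out request: {rpc_window(entries)} s")
```

For this excerpt, the FAIL marker is logged well under a second after the dd marker, while the request timeout message only appears in the console several seconds later.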
Logs for the same replay-single test_70f failures are at
https://testing.whamcloud.com/test_sets/93b685d4-1f28-11ea-b1e8-52540065bddc
https://testing.whamcloud.com/test_sets/b3fce110-1962-11ea-98f1-52540065bddc
https://testing.whamcloud.com/test_sets/f09eaa44-ff96-11e9-a9d7-52540065bddc