[LU-13203] replay-single test 70f fails on OST failover with “dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.<node> failed on <node>, rc=1” Created: 04/Feb/20  Updated: 05/Feb/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: failover

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_70f fails on the first OST failover, then continues to fail over OSTs for a total of nine times. The suite_log for a recent failure on Lustre 2.12.4, https://testing.whamcloud.com/test_sets/e74bc5bc-44d3-11ea-bffa-52540065bddc, shows:

CMD: trevis-47vm10 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
test_70f failing OST 1 times
 replay-single test_70f: @@@@@@ FAIL: dd  bs=1M count=10 if=/tmp/f70f.replay-single  of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1 
CMD: trevis-47vm10 /usr/sbin/lctl dl
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5900:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2355:test_70f_write_and_read()
  = /usr/lib64/lustre/tests/replay-single.sh:2388:test_70f_loop()
  = /usr/lib64/lustre/tests/replay-single.sh:2435:test_70f()

Client 3 (trevis-47vm8) issued the failed dd, but its console log does not reveal the cause:

[26207.492527] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/dev/urandom of=/tmp/f70f.replay-single
[26208.392570] Lustre: DEBUG MARKER: md5sum /tmp/f70f.replay-single
[26213.468597] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[26213.847635] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[26216.624820] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8
[26216.837739] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 1 times
[26217.072446] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70f: @@@@@@ FAIL: dd  bs=1M count=10 if=\/tmp\/f70f.replay-single  of=\/mnt\/lustre\/d70f.replay-single\/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1 
[26217.108406] Lustre: DEBUG MARKER: test_70f failing OST 1 times
[26217.340902] Lustre: DEBUG MARKER: replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1
[26217.777956] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2020-01-29/lustre-b2_12-el7_7-x86_64--failover--1_6__62___56738b6f-28bc-459e-a9cf-a3f728fca5df/replay-single.test_70f.debug_log.$(hostname -s).1580545204.log;
[26217.777956]          dmesg > /autotest/autotest2/2020-01-29/lustre-b
[26219.354501] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[26229.481266] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580545208/real 1580545208]  req@ffff8fc5fadb7a80 x1657294312905536/t0(0) o400->lustre-OST0000-osc-ffff8fc5fa46e800@10.9.3.130@tcp:28/4 lens 224/224 e 0 to 1 dl 1580545215 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[26229.486117] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[26229.487755] Lustre: lustre-OST0000-osc-ffff8fc5fa46e800: Connection to lustre-OST0000 (at 10.9.3.130@tcp) was lost; in progress operations using this service will wait for recovery to complete
[26229.490534] Lustre: Skipped 2 previous similar messages
[26235.133943] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.44@tcp added to recovery queue. Health = 900
[26235.136037] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message
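
The client-side sequence in the DEBUG MARKER lines above (generate a 10 MiB source file, checksum it, then dd it to the target) can be approximated by hand with the sketch below. It is not the code of test_70f_write_and_read() in replay-single.sh: the function name write_and_verify and its two arguments are hypothetical, and the read-back comparison is an assumption based only on the "write_and_read" name in the trace dump. It assumes GNU dd and md5sum on a Linux client.

```shell
#!/bin/bash
# Hypothetical helper, not taken from replay-single.sh.
# $1: local source file (the log uses /tmp/f70f.replay-single)
# $2: destination file (the log uses a per-client file under
#     /mnt/lustre/d70f.replay-single/)
write_and_verify() {
    local src=$1 dst=$2 want got rc

    # Generate 10 MiB of random data and record its checksum.
    dd bs=1M count=10 if=/dev/urandom of="$src" 2>/dev/null || return 1
    want=$(md5sum < "$src") || return 1

    # Copy to the destination. A non-zero exit from this dd corresponds
    # to the "rc=1" failure reported in this ticket.
    dd bs=1M count=10 if="$src" of="$dst" 2>/dev/null
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "dd to $dst failed, rc=$rc" >&2
        return 1
    fi

    # Assumed read-back step: compare the checksum of the copy.
    got=$(md5sum < "$dst") || return 1
    if [ "$want" != "$got" ]; then
        echo "checksum mismatch on $dst" >&2
        return 1
    fi
    return 0
}

# Example (paths as seen in the log; adjust the client suffix):
#   write_and_verify /tmp/f70f.replay-single \
#       /mnt/lustre/d70f.replay-single/f70f.replay-single.$(hostname -s)
```

Running this in a loop on a client while an OST is failed over separates a failing write (dd exits non-zero) from corrupted data (checksum mismatch), a distinction the console log alone does not make.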

Logs for the same replay-single test_70f failure are at:
https://testing.whamcloud.com/test_sets/93b685d4-1f28-11ea-b1e8-52540065bddc
https://testing.whamcloud.com/test_sets/b3fce110-1962-11ea-98f1-52540065bddc
https://testing.whamcloud.com/test_sets/f09eaa44-ff96-11e9-a9d7-52540065bddc

