Lustre / LU-13203

replay-single test 70f fails on OST failover with “dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.<node> failed on <node>, rc=1”


Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.3, Lustre 2.12.4
    • 3
    • 9223372036854775807

    Description

      replay-single test_70f fails on the first OST failover; the test then goes on to fail over the OSTs a total of nine times. The suite_log for a recent Lustre 2.12.4 failure, https://testing.whamcloud.com/test_sets/e74bc5bc-44d3-11ea-bffa-52540065bddc, shows:

      CMD: trevis-47vm10 /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
      test_70f failing OST 1 times
       replay-single test_70f: @@@@@@ FAIL: dd  bs=1M count=10 if=/tmp/f70f.replay-single  of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1 
      CMD: trevis-47vm10 /usr/sbin/lctl dl
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5900:error()
        = /usr/lib64/lustre/tests/replay-single.sh:2355:test_70f_write_and_read()
        = /usr/lib64/lustre/tests/replay-single.sh:2388:test_70f_loop()
        = /usr/lib64/lustre/tests/replay-single.sh:2435:test_70f()
      

      Client 3 (vm8) issued the failed dd, but its console log does not show why the dd failed:

      [26207.492527] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/dev/urandom of=/tmp/f70f.replay-single
      [26208.392570] Lustre: DEBUG MARKER: md5sum /tmp/f70f.replay-single
      [26213.468597] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
      [26213.847635] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
      [26216.624820] Lustre: DEBUG MARKER: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8
      [26216.837739] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 1 times
      [26217.072446] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70f: @@@@@@ FAIL: dd  bs=1M count=10 if=\/tmp\/f70f.replay-single  of=\/mnt\/lustre\/d70f.replay-single\/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1 
      [26217.108406] Lustre: DEBUG MARKER: test_70f failing OST 1 times
      [26217.340902] Lustre: DEBUG MARKER: replay-single test_70f: @@@@@@ FAIL: dd bs=1M count=10 if=/tmp/f70f.replay-single of=/mnt/lustre/d70f.replay-single/f70f.replay-single.trevis-47vm8 failed on trevis-47vm8, rc=1
      [26217.777956] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2020-01-29/lustre-b2_12-el7_7-x86_64--failover--1_6__62___56738b6f-28bc-459e-a9cf-a3f728fca5df/replay-single.test_70f.debug_log.$(hostname -s).1580545204.log;
      [26217.777956]          dmesg > /autotest/autotest2/2020-01-29/lustre-b
      [26219.354501] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [26229.481266] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1580545208/real 1580545208]  req@ffff8fc5fadb7a80 x1657294312905536/t0(0) o400->lustre-OST0000-osc-ffff8fc5fa46e800@10.9.3.130@tcp:28/4 lens 224/224 e 0 to 1 dl 1580545215 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [26229.486117] Lustre: 1650:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
      [26229.487755] Lustre: lustre-OST0000-osc-ffff8fc5fa46e800: Connection to lustre-OST0000 (at 10.9.3.130@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [26229.490534] Lustre: Skipped 2 previous similar messages
      [26235.133943] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) ni 10.9.6.44@tcp added to recovery queue. Health = 900
      [26235.136037] LNetError: 1642:0:(lib-msg.c:485:lnet_handle_local_failure()) Skipped 1 previous similar message
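
      The debug markers above show the sequence the client ran: create a 10 MiB source file from /dev/urandom, checksum it, then dd it into the Lustre directory. A minimal shell sketch of that write-and-verify step, for reproducing the dd by hand, might look like the following. The function name write_and_verify and its arguments are hypothetical, this is not the code from replay-single.sh, and the read-back checksum comparison is an assumption based on the function name test_70f_write_and_read in the trace.

```shell
#!/bin/bash
# Hypothetical helper, not taken from replay-single.sh.
# Copies SRC to DST with the same dd options seen in the debug markers,
# then compares md5 checksums of the two files.
# Assumes SRC is at most 10 MiB, since dd is limited to bs=1M count=10.
# Returns dd's exit code if the copy fails, or 2 if the copy succeeds
# but the checksums differ.
write_and_verify() {
    local src=$1 dst=$2 rc

    dd bs=1M count=10 if="$src" of="$dst" 2>/dev/null
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "dd bs=1M count=10 if=$src of=$dst failed, rc=$rc" >&2
        return "$rc"
    fi

    local want got
    want=$(md5sum < "$src" | cut -d' ' -f1)
    got=$(md5sum < "$dst" | cut -d' ' -f1)
    [ "$want" = "$got" ] || return 2
}
```

      Calling it with /tmp/f70f.replay-single as the source and a file under /mnt/lustre/d70f.replay-single as the destination while an OST is being failed over would exercise the same dd that returned rc=1 here.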
      

      Logs for other occurrences of the same replay-single test_70f failure are at:
      https://testing.whamcloud.com/test_sets/93b685d4-1f28-11ea-b1e8-52540065bddc
      https://testing.whamcloud.com/test_sets/b3fce110-1962-11ea-98f1-52540065bddc
      https://testing.whamcloud.com/test_sets/f09eaa44-ff96-11e9-a9d7-52540065bddc
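
      When comparing client console logs across these runs, it can help to pull out only the dd failure markers and the client-side recovery messages, so their order is easy to see (in the log above, the FAIL marker at 26217 precedes the "Connection to lustre-OST0000 ... was lost" message at 26229). A small filter, sketched under the assumption that the console log has been saved to a local file; the function name failover_timeline is hypothetical and the patterns are taken only from the messages quoted in this ticket:

```shell
#!/bin/bash
# Hypothetical helper: print the dd failure markers and the client-side
# timeout / connection-lost messages from a saved console log, in the
# order they were logged.
failover_timeline() {
    grep -E 'FAIL: dd|ptlrpc_expire_one_request|Connection to .* was lost' "$1"
}
```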

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)