Lustre / LU-8668

sanityn test 77a, 77b, 77c, 77d, 77e, 77f all fail with 'dd on client failed'

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.9.0
    • Labels: autotest
    • Severity: 3

    Description

      All of the sanityn tests test_77a, test_77b, test_77c, test_77d, test_77e, and test_77f fail with

      'dd on client failed'
      

      The test log for 77a looks like:

      == sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
      CMD: trevis-66vm4 lctl set_param ost.OSS.*.nrs_policies=fifo
      ost.OSS.ost.nrs_policies=fifo
      ost.OSS.ost_create.nrs_policies=fifo
      ost.OSS.ost_io.nrs_policies=fifo
      ost.OSS.ost_out.nrs_policies=fifo
      ost.OSS.ost_seq.nrs_policies=fifo
      CMD: trevis-66vm5.trevis.hpdd.intel.com,trevis-66vm6 sync
      sanityn test_77a: @@@@@@ FAIL: dd on client failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4835:error()
        = /usr/lib64/lustre/tests/sanityn.sh:2975:nrs_write_read()
        = /usr/lib64/lustre/tests/sanityn.sh:2999:test_77a()
      

      Since all of the NRS policy tests fail, the problem is likely in the routine they all call, nrs_write_read().
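
      For orientation, here is a minimal sketch of the pattern the description refers to; it is not the actual sanityn.sh code, and the helper do_node plus the variables CLIENT1, DIR and tfile are used only as illustrative stand-ins for the test framework's own names. The point is that one dd is started in the background per 1MB chunk and only a single aggregated error is reported:

      # Illustrative sketch only, not the code in sanityn.sh.
      pids=()
      for seek in $(seq 0 15); do
              # start one 1MB dd per chunk on the client, in the background
              do_node $CLIENT1 "dd if=/dev/zero of=$DIR/$tfile bs=1M seek=$seek count=1" &
              pids+=($!)
      done
      rc=0
      for pid in "${pids[@]}"; do
              # collect each background dd's exit status
              wait $pid || rc=$?
      done
      # a single aggregated message loses track of which dd actually failed
      [ $rc -eq 0 ] || error "dd on client failed"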

      Since the dd commands run in the background, it is not clear which of them failed. From the client console:

      23:47:04:[ 8104.314157] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
      23:47:04:[ 8105.118942] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_r_trevis-66vm5.trevis.hpdd.intel.com bs=1M count=16
      23:47:04:[ 8109.441422] Lustre: DEBUG MARKER: sync
      23:47:04:[ 8109.560257] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=10 count=1
      23:47:04:[ 8109.575839] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=15 count=1
      23:47:04:[ 8109.590028] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=1 count=1
      23:47:04:[ 8109.604393] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=9 count=1
      23:47:04:[ 8109.619225] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=12 count=1
      23:47:04:[ 8109.635401] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=2 count=1
      23:47:04:[ 8109.650030] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=4 count=1
      23:47:04:[ 8109.665628] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=13 count=1
      23:47:04:[ 8109.680064] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=11 count=1
      23:47:35:[ 8109.696518] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=7 count=1
      23:47:35:[ 8109.716382] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=0 count=1
      23:47:35:[ 8109.747046] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=8 count=1
      23:47:35:[ 8109.767675] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=5 count=1
      23:47:35:[ 8109.784620] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=6 count=1
      23:47:35:[ 8109.801728] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=14 count=1
      23:47:35:[ 8114.267018] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=0 count=1
      23:47:35:[ 8114.467267] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_77a: @@@@@@ FAIL: dd on client failed 
      23:47:35:[ 8115.076977] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd on client failed
      23:47:35:[ 8115.196188] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=1 count=1
      23:47:35:[ 8115.682787] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-10-03/lustre-reviews-el7-x86_64--review-zfs-part-1--1_8_1__41796__-70202352898540-210143/sanityn.test_77a.debug_log.$(hostname -s).1475538429.log;
      23:47:35:[ 8115.682787]          dmesg > /logdir/test_logs/2016-10-03/lustre-reviews-el7-
      23:47:35:[ 8116.138228] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=2 count=1
      23:47:35:[ 8116.960193] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=3 count=1
      23:47:35:[ 8117.502537] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=4 count=1
      23:47:35:[ 8117.900023] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      23:47:35:[ 8118.524133] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=5 count=1
      23:47:35:[ 8119.059330] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=6 count=1
      23:47:35:[ 8119.505604] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=7 count=1
      23:47:35:[ 8119.938421] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=8 count=1
      23:47:35:[ 8120.365768] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=9 count=1
      23:47:35:[ 8120.859821] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=10 count=1
      23:47:35:[ 8121.303947] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=11 count=1
      23:47:35:[ 8121.747911] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=12 count=1
      23:47:35:[ 8122.180064] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=13 count=1
      23:47:35:[ 8122.619751] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=14 count=1
      23:47:35:[ 8123.046963] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=15 count=1
      23:47:35:[ 8123.590134] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      23:47:35:[ 8124.254671] Lustre: DEBUG MARKER: rc=0;
      

      These tests started failing on October 1, 2016. Logs for the failures are at:
      2016-10-04 - https://testing.hpdd.intel.com/test_sets/b0ff1952-8a03-11e6-a9b0-5254006e85c2
      2016-10-03 - https://testing.hpdd.intel.com/test_sets/371f30a8-89d7-11e6-a9b0-5254006e85c2
      2016-10-01 - https://testing.hpdd.intel.com/test_sets/903c4160-87b3-11e6-a9b0-5254006e85c2
      2016-10-01 - https://testing.hpdd.intel.com/test_sets/332ebf58-87df-11e6-91aa-5254006e85c2
      2016-10-01 - https://testing.hpdd.intel.com/test_sets/e0e6f028-87ff-11e6-a8b7-5254006e85c2
      2016-10-01 - https://testing.hpdd.intel.com/test_sets/30dee3fe-87df-11e6-a9b0-5254006e85c2

      Attachments

        Activity

          [LU-8668] sanityn test 77a, 77b, 77c, 77d, 77e, 77f all fail with 'dd on client failed'

          adilger Andreas Dilger added a comment -

          Problem was caused by the test environment.
          pjones Peter Jones added a comment -

          So I see that a change has been made to autotest under DCO-6061. Will this change just take effect at the next autotest upgrade? Should we close this LU ticket as a duplicate of the DCO ticket?

          niu Niu Yawei (Inactive) added a comment - edited

          Yes, I think so. Many test cases rely on pdsh to issue commands remotely; if pdsh returns an error, the test cannot tell whether the command actually succeeded, so failing the test seems a reasonable choice to me.
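
          For illustration (a sketch, not taken from the test framework; the node names come from this ticket and /tmp/nrs_test is a hypothetical path): with the -S flag that the test environment's PDSH already carries, pdsh reports the largest exit status among the remote commands, and an ssh transport failure surfaces as 255, so the caller cannot tell "dd failed" apart from "ssh never ran dd".

          PDSH="pdsh -t 120 -S -Rssh -w"
          $PDSH trevis-66vm1,trevis-66vm2 "dd if=/dev/zero of=/tmp/nrs_test bs=1M count=1"
          rc=$?
          # rc == 0   : dd succeeded on every node
          # rc == 255 : ssh itself failed on at least one node; the dd may never
          #             have run, but the caller only sees a non-zero status
          # other     : the worst dd exit status among the nodes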


          colmstea Charlie Olmstead added a comment -

          From the testing I ran, the cause was narrowed down to Autotest changing the rcmd module for pdsh from mrsh to ssh:

          PDSH="pdsh -t 120 -S -Rssh -w"
          

          whereas before it was

          PDSH="pdsh -t 120 -S -Rmrsh -w"
          

          I can easily change pdsh back to mrsh, but should these 77* tests fail just because pdsh is using ssh?
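
          As a side note, a sketch of how one could check which rcmd modules a given pdsh build supports and switch the default back to mrsh (assuming a standard pdsh installation; exact -V output varies by build):

          pdsh -V                           # prints the pdsh version and the rcmd modules it was built with
          export PDSH_RCMD_TYPE=mrsh        # make mrsh the default rcmd module for this shell
          PDSH="pdsh -t 120 -S -Rmrsh -w"   # or select it explicitly per invocation, as autotest did before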


          adilger Andreas Dilger added a comment -

          All of the last 50 failures are on trevis-66 through trevis-71.

          niu Niu Yawei (Inactive) added a comment -

          All of the failures happened on 'trevis'. Perhaps we should ask the DCO team to take a look at what kind of change to 'trevis' may be causing such frequent ssh connection failures?
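
          One way to follow up on that (a sketch only; it assumes EL7-style nodes that log sshd messages to /var/log/secure, and the VM range is just an example) would be to look on the trevis VMs for sshd dropping connections before authentication, which is what produces the "ssh_exchange_identification" errors quoted in the next comment:

          pdsh -Rssh -w trevis-66vm[1-6] \
                  "grep sshd /var/log/secure | grep -iE 'maxstartups|drop|refused|error' | tail -n 20"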

          niu Niu Yawei (Inactive) added a comment -

          OK, take https://testing.hpdd.intel.com/test_logs/66d51444-8e95-11e6-91aa-5254006e85c2/show_text as an example:

          We can see 4 "ssh_exchange_identification" errors:

          CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=5 count=1
          CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=4 count=1
          CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=0 count=1
          trevis-66vm2: 1+0 records in
          trevis-66vm2: 1+0 records out
          trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0365618 s, 28.7 MB/s
          trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
          pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
          trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
          pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
          trevis-66vm2: ssh_exchange_identification: Connection closed by remote host
          pdsh@trevis-66vm1: trevis-66vm2: ssh exited with exit code 255
          trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
          pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
          trevis-66vm2: 1+0 records in
          trevis-66vm2: 1+0 records out
          trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0534754 s, 19.6 MB/s
          

          I think this is the reason for the 4 'dd failed' errors:

          sanityn test_77a: @@@@@@ FAIL: dd at 10MB on client failed (2)
          ...
          sanityn test_77a: @@@@@@ FAIL: dd at 15MB on client failed (2)
          ...
          sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2)
          ...
          sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2)
          

          The dd itself didn't actually fail; the pdsh command did. Are there any network-related hardware or configuration changes in our testing system?
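
          One plausible explanation, not confirmed in this ticket, for "ssh_exchange_identification: Connection closed by remote host" together with ssh exit code 255 is sshd closing new connections before the banner exchange, for instance when the number of concurrent unauthenticated connections exceeds its MaxStartups limit; the 16 parallel background dd's issued through pdsh -Rssh could plausibly hit such a limit. A sketch of how one might check and, if necessary, relax that limit on the test nodes (values shown are ordinary OpenSSH defaults and an arbitrary higher setting, not taken from this ticket):

          sshd -T | grep -i maxstartups            # effective limit, typically "maxstartups 10:30:100"
          # if the limit is being hit, one option is to raise it and reload sshd
          echo 'MaxStartups 100:30:200' >> /etc/ssh/sshd_config
          systemctl reload sshd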

          jamesanunez James Nunez (Inactive) added a comment -

          Failures with the debug information from Andreas' patch:
          https://testing.hpdd.intel.com/test_sets/dcf408e8-8e9e-11e6-91aa-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/b63dd56c-8e71-11e6-a9b0-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/6167e6ee-8e95-11e6-91aa-5254006e85c2

          niu Niu Yawei (Inactive) added a comment -

          I didn't find any error messages in the logs. The debug patch from Andreas has been landed; it may help us identify what kind of error the 'dd' got.

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22990/
          Subject: LU-8668 tests: print more information about dd failure
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d3409e4a55c8c9a18a63743014a1e1c1bfc8b86d


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 11

          Dates

            Created:
            Updated:
            Resolved: