[LU-8668] sanityn test 77a, 77b, 77c, 77d, 77e, 77f all fail with 'dd on client failed' Created: 04/Oct/16 Updated: 14/Oct/16 Resolved: 14/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | autotest |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
All of sanityn test_77a, test_77b, test_77c, test_77d, test_77e, and test_77f fail with 'dd on client failed'. The test log for 77a looks like:

== sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
CMD: trevis-66vm4 lctl set_param ost.OSS.*.nrs_policies=fifo ost.OSS.ost.nrs_policies=fifo ost.OSS.ost_create.nrs_policies=fifo ost.OSS.ost_io.nrs_policies=fifo ost.OSS.ost_out.nrs_policies=fifo ost.OSS.ost_seq.nrs_policies=fifo
CMD: trevis-66vm5.trevis.hpdd.intel.com,trevis-66vm6 sync
sanityn test_77a: @@@@@@ FAIL: dd on client failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4835:error()
= /usr/lib64/lustre/tests/sanityn.sh:2975:nrs_write_read()
= /usr/lib64/lustre/tests/sanityn.sh:2999:test_77a()

Since all of the NRS policy tests fail, the problem is in the routine they all call, which is nrs_write_read(). Since dd runs in the background, it is not clear which of the dd commands fail.

From the client console:

23:47:04:[ 8104.314157] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
23:47:04:[ 8105.118942] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_r_trevis-66vm5.trevis.hpdd.intel.com bs=1M count=16
23:47:04:[ 8109.441422] Lustre: DEBUG MARKER: sync
23:47:04:[ 8109.560257] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=10 count=1
23:47:04:[ 8109.575839] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=15 count=1
23:47:04:[ 8109.590028] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=1 count=1
23:47:04:[ 8109.604393] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=9 count=1
23:47:04:[ 8109.619225] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=12 count=1
23:47:04:[ 8109.635401] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=2 count=1
23:47:04:[ 8109.650030] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=4 count=1
23:47:04:[ 8109.665628] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=13 count=1
23:47:04:[ 8109.680064] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=11 count=1
23:47:35:[ 8109.696518] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=7 count=1
23:47:35:[ 8109.716382] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=0 count=1
23:47:35:[ 8109.747046] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=8 count=1
23:47:35:[ 8109.767675] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=5 count=1
23:47:35:[ 8109.784620] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=6 count=1
23:47:35:[ 8109.801728] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=14 count=1
23:47:35:[ 8114.267018] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=0 count=1
23:47:35:[ 8114.467267] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd on client failed
23:47:35:[ 8115.076977] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd on client failed
23:47:35:[ 8115.196188] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=1 count=1
23:47:35:[ 8115.682787] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-10-03/lustre-reviews-el7-x86_64--review-zfs-part-1--1_8_1__41796__-70202352898540-210143/sanityn.test_77a.debug_log.$(hostname -s).1475538429.log;
23:47:35:[ 8115.682787] dmesg > /logdir/test_logs/2016-10-03/lustre-reviews-el7-
23:47:35:[ 8116.138228] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=2 count=1
23:47:35:[ 8116.960193] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=3 count=1
23:47:35:[ 8117.502537] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=4 count=1
23:47:35:[ 8117.900023] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
23:47:35:[ 8118.524133] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=5 count=1
23:47:35:[ 8119.059330] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=6 count=1
23:47:35:[ 8119.505604] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=7 count=1
23:47:35:[ 8119.938421] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=8 count=1
23:47:35:[ 8120.365768] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=9 count=1
23:47:35:[ 8120.859821] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=10 count=1
23:47:35:[ 8121.303947] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=11 count=1
23:47:35:[ 8121.747911] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=12 count=1
23:47:35:[ 8122.180064] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=13 count=1
23:47:35:[ 8122.619751] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=14 count=1
23:47:35:[ 8123.046963] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=15 count=1
23:47:35:[ 8123.590134] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
23:47:35:[ 8124.254671] Lustre: DEBUG MARKER: rc=0;

These tests started failing on October 1, 2016. Logs for failures are at |
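For context on where the generic error comes from: nrs_write_read() issues the dd commands on the remote clients through the test framework's pdsh wrapper and puts them in the background, so only a single "dd on client failed" message is printed. The following is a minimal sketch of that pattern, not the verbatim sanityn.sh code; do_nodes() and error() are the real test-framework helpers, but the loop, variable names, and file names here are simplified assumptions:

# Minimal sketch of the nrs_write_read() pattern (simplified assumption,
# not the verbatim sanityn.sh code). do_nodes(), error(), $CLIENTS and
# $DIR come from test-framework.sh; do_nodes() runs its command via pdsh.
nrs_write_read_sketch() {
	local dir=$DIR/d77a.sanityn

	# 16 MB sequential write from each client first.
	do_nodes $CLIENTS "dd if=/dev/zero of=$dir/nrs_r_\$(hostname) bs=1M count=16" ||
		error "dd on client failed"

	local i
	for i in $(seq 0 15); do
		# One 1 MB write per chunk, launched in the background. If the
		# pdsh invocation returns non-zero, the test fails with the
		# generic message seen above, with no hint of which dd it was.
		do_nodes $CLIENTS "dd if=/dev/zero of=$dir/nrs_w_\$(hostname) bs=1M seek=$i count=1" ||
			error "dd on client failed" &
	done
	wait	# reap all background dd jobs before continuing
}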
| Comments |
| Comment by Oleg Drokin [ 06/Oct/16 ] |
|
So patches landed on Sep 29 are: none of them seems to be the obvious culprit, though. |
| Comment by Gerrit Updater [ 06/Oct/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22990 |
| Comment by Bob Glossman (Inactive) [ 06/Oct/16 ] |
|
More failures on master: |
| Comment by Gerrit Updater [ 08/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22990/ |
| Comment by Niu Yawei (Inactive) [ 09/Oct/16 ] |
|
I didn't find any error message in the logs. The debug patch from Andreas has landed; it may help us identify what kind of error the 'dd' got. |
| Comment by James Nunez (Inactive) [ 10/Oct/16 ] |
|
Failures with the debug information from Andreas' patch: |
| Comment by Niu Yawei (Inactive) [ 11/Oct/16 ] |
|
Ok, take https://testing.hpdd.intel.com/test_logs/66d51444-8e95-11e6-91aa-5254006e85c2/show_text as an example. We can see 4 "ssh_exchange_identification" errors:

CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=5 count=1
CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=4 count=1
CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=0 count=1
trevis-66vm2: 1+0 records in
trevis-66vm2: 1+0 records out
trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0365618 s, 28.7 MB/s
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm2: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm2: ssh exited with exit code 255
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm2: 1+0 records in
trevis-66vm2: 1+0 records out
trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0534754 s, 19.6 MB/s

I think that is the reason for the 4 'dd failed' errors:

sanityn test_77a: @@@@@@ FAIL: dd at 10MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 15MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2)

The dd itself actually didn't fail, but the pdsh command failed. Are there any network-related hardware or configuration changes in our testing system? |
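To make the failure mode concrete: pdsh is run with -S, which makes it return the largest remote exit code, and ssh exits with 255 when the connection is refused, so the caller sees a non-zero status even though dd itself never failed. A small illustration (the host names are just the ones from the log above):

# pdsh -S returns the largest remote return value; when sshd drops the
# connection ("ssh_exchange_identification: Connection closed by remote
# host"), ssh exits with 255, which pdsh then reports to the caller.
pdsh -t 120 -S -Rssh -w trevis-66vm1,trevis-66vm2 \
	"dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_\$(hostname) bs=1M seek=5 count=1"
echo "pdsh exit status: $?"	# 255 here means an ssh/pdsh failure, not a dd failure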
| Comment by Niu Yawei (Inactive) [ 11/Oct/16 ] |
|
All failures happened on 'trevis'. We should probably ask the DCO team to take a look at what kind of change to 'trevis' may be causing such frequent ssh connection failures. |
| Comment by Andreas Dilger [ 12/Oct/16 ] |
|
All of the last 50 failures are on trevis66-71. |
| Comment by Charlie Olmstead [ 13/Oct/16 ] |
|
From the testing I ran, the cause was narrowed down to Autotest changing the rcmd module for pdsh from mrsh to ssh:
PDSH="pdsh -t 120 -S -Rssh -w"
whereas before it was
PDSH="pdsh -t 120 -S -Rmrsh -w"
I can easily change pdsh back to mrsh, but should these 77* tests fail because pdsh is using ssh? |
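For reference, the two settings differ only in the rcmd module, and the ssh path is the one that can trip over sshd's connection throttling (MaxStartups) when many dd commands are launched at nearly the same time, which matches the "ssh_exchange_identification: Connection closed by remote host" messages above. A hedged sketch of the difference and one possible sshd-side mitigation (the MaxStartups change is an assumption, not something that was done here):

# Old Autotest setting: mrsh rcmd module, no sshd involvement.
PDSH="pdsh -t 120 -S -Rmrsh -w"

# New Autotest setting: ssh rcmd module. Every remote command is a separate
# ssh connection, so a burst of background dd's can exceed sshd's MaxStartups
# limit and be dropped before authentication ("ssh_exchange_identification").
PDSH="pdsh -t 120 -S -Rssh -w"

# Possible sshd-side mitigation (assumption): raise the limit on concurrent
# unauthenticated connections in /etc/ssh/sshd_config, e.g.:
#   MaxStartups 100:30:200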
| Comment by Niu Yawei (Inactive) [ 14/Oct/16 ] |
|
Yes, I think so. Lots of test cases rely on pdsh to issue commands remotely; if pdsh returns an error, the test can't tell whether the command was performed successfully, so failing the test looks like a reasonable choice to me. |
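Put another way, a pdsh-based wrapper only sees one aggregate exit status, so treating any non-zero result as a test failure is the conservative option. A tiny hypothetical sketch (run_on_client and RCMD_MODULE are illustrative names, not test-framework functions):

# Hypothetical wrapper illustrating the point above: pdsh -S folds the remote
# command's status and any ssh transport error into one exit code, so a
# non-zero result cannot safely be interpreted as "the command probably ran".
run_on_client() {
	local node=$1; shift
	pdsh -t 120 -S -R${RCMD_MODULE:-ssh} -w "$node" "$@"
}

run_on_client trevis-66vm1 "dd if=/dev/zero of=/tmp/x bs=1M count=1" ||
	echo "remote step failed (dd error OR ssh error): fail the test"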
| Comment by Peter Jones [ 14/Oct/16 ] |
|
So I see that a change has been made to autotest under DCO-6061. Will this change just take effect at the next autotest upgrade? Should we close this LU ticket as a duplicate of the DCO ticket? |
| Comment by Andreas Dilger [ 14/Oct/16 ] |
|
The problem was caused by the test environment. |