[LU-8668] sanityn test 77a, 77b, 77c, 77d, 77e, 77f all fail with 'dd on client failed' Created: 04/Oct/16  Updated: 14/Oct/16  Resolved: 14/Oct/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

autotest


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanityn test_77a, test_77b, test_77c, test_77d, test_77e, and test_77f all fail with

'dd on client failed'

The test log for 77a looks like:

== sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
CMD: trevis-66vm4 lctl set_param ost.OSS.*.nrs_policies=fifo
ost.OSS.ost.nrs_policies=fifo
ost.OSS.ost_create.nrs_policies=fifo
ost.OSS.ost_io.nrs_policies=fifo
ost.OSS.ost_out.nrs_policies=fifo
ost.OSS.ost_seq.nrs_policies=fifo
CMD: trevis-66vm5.trevis.hpdd.intel.com,trevis-66vm6 sync
sanityn test_77a: @@@@@@ FAIL: dd on client failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4835:error()
  = /usr/lib64/lustre/tests/sanityn.sh:2975:nrs_write_read()
  = /usr/lib64/lustre/tests/sanityn.sh:2999:test_77a()

Since all of the NRS policy tests fail, the problem is most likely in the routine they all call, nrs_write_read().
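Roughly, that helper follows the pattern below (a simplified, hypothetical sketch for orientation only, assuming the do_nodes()/error() helpers from test-framework.sh; the real code is nrs_write_read() in sanityn.sh and uses per-client file names like nrs_w_<client hostname>):

# inside nrs_write_read(): write a file from the clients, then overwrite it
# chunk by chunk with background dd commands
local tfile=$DIR/$tdir/nrs_writefile
local pids pid i

# initial 16MB write, then a sync on both clients
do_nodes $CLIENTS "dd if=/dev/zero of=$tfile bs=1M count=16" ||
        error "dd on client failed"
do_nodes $CLIENTS sync

# 16 x 1MB overwrites at shuffled offsets, each issued in the background
for i in $(seq 0 15 | shuf); do
        do_nodes $CLIENTS "dd if=/dev/zero of=$tfile bs=1M seek=$i count=1" &
        pids="$pids $!"
done

# a failure of any background dd (or of the pdsh invocation carrying it)
# becomes the 'dd on client failed' error reported by the test
for pid in $pids; do
        wait $pid || error "dd on client failed"
done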

Since the dd commands run in the background, it is not clear which of them failed. From the client console:

23:47:04:[ 8104.314157] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 23:46:57 (1475538417)
23:47:04:[ 8105.118942] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_r_trevis-66vm5.trevis.hpdd.intel.com bs=1M count=16
23:47:04:[ 8109.441422] Lustre: DEBUG MARKER: sync
23:47:04:[ 8109.560257] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=10 count=1
23:47:04:[ 8109.575839] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=15 count=1
23:47:04:[ 8109.590028] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=1 count=1
23:47:04:[ 8109.604393] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=9 count=1
23:47:04:[ 8109.619225] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=12 count=1
23:47:04:[ 8109.635401] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=2 count=1
23:47:04:[ 8109.650030] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=4 count=1
23:47:04:[ 8109.665628] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=13 count=1
23:47:04:[ 8109.680064] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=11 count=1
23:47:35:[ 8109.696518] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=7 count=1
23:47:35:[ 8109.716382] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=0 count=1
23:47:35:[ 8109.747046] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=8 count=1
23:47:35:[ 8109.767675] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=5 count=1
23:47:35:[ 8109.784620] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=6 count=1
23:47:35:[ 8109.801728] Lustre: DEBUG MARKER: dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com bs=1M seek=14 count=1
23:47:35:[ 8114.267018] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=0 count=1
23:47:35:[ 8114.467267] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_77a: @@@@@@ FAIL: dd on client failed 
23:47:35:[ 8115.076977] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd on client failed
23:47:35:[ 8115.196188] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=1 count=1
23:47:35:[ 8115.682787] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-10-03/lustre-reviews-el7-x86_64--review-zfs-part-1--1_8_1__41796__-70202352898540-210143/sanityn.test_77a.debug_log.$(hostname -s).1475538429.log;
23:47:35:[ 8115.682787]          dmesg > /logdir/test_logs/2016-10-03/lustre-reviews-el7-
23:47:35:[ 8116.138228] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=2 count=1
23:47:35:[ 8116.960193] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=3 count=1
23:47:35:[ 8117.502537] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=4 count=1
23:47:35:[ 8117.900023] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
23:47:35:[ 8118.524133] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=5 count=1
23:47:35:[ 8119.059330] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=6 count=1
23:47:35:[ 8119.505604] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=7 count=1
23:47:35:[ 8119.938421] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=8 count=1
23:47:35:[ 8120.365768] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=9 count=1
23:47:35:[ 8120.859821] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=10 count=1
23:47:35:[ 8121.303947] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=11 count=1
23:47:35:[ 8121.747911] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=12 count=1
23:47:35:[ 8122.180064] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=13 count=1
23:47:35:[ 8122.619751] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=14 count=1
23:47:35:[ 8123.046963] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm5.trevis.hpdd.intel.com of=/dev/zero bs=1M seek=15 count=1
23:47:35:[ 8123.590134] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
23:47:35:[ 8124.254671] Lustre: DEBUG MARKER: rc=0;

These tests started failing on October 1, 2016. Logs for failures are at
2016-10-04 - https://testing.hpdd.intel.com/test_sets/b0ff1952-8a03-11e6-a9b0-5254006e85c2
2016-10-03 - https://testing.hpdd.intel.com/test_sets/371f30a8-89d7-11e6-a9b0-5254006e85c2
2016-10-01 - https://testing.hpdd.intel.com/test_sets/903c4160-87b3-11e6-a9b0-5254006e85c2
2016-10-01 - https://testing.hpdd.intel.com/test_sets/332ebf58-87df-11e6-91aa-5254006e85c2
2016-10-01 - https://testing.hpdd.intel.com/test_sets/e0e6f028-87ff-11e6-a8b7-5254006e85c2
2016-10-01 - https://testing.hpdd.intel.com/test_sets/30dee3fe-87df-11e6-a9b0-5254006e85c2



 Comments   
Comment by Oleg Drokin [ 06/Oct/16 ]

So the patches that landed on Sep 29 are:
f2b457a LU-8632 osc: remove of usage removed 'page_count' local variable
3ed0a4a LU-8579 osd-ldiskfs: code cleanup for osd_ldiskfs_add_entry
5207c48 LU-6401 header: remove assert from interval_set()
4dfa29e LU-3888 utils: print lmm_fid as part of getstripe
7597c67 LU-4474 osd: Add nodelalloc to ldiskfs mount options
091739b LU-6910 osp: add procfs values for OST reserved size
4a5e355 LU-8446 llite: clear inode timestamps after losing UPDATE lock
72ec6eb LU-8544 test: using lfs df in client_up

None of them seems to be an obvious culprit, though.

Comment by Gerrit Updater [ 06/Oct/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/22990
Subject: LU-8668 tests: print more information about dd failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c722e57af3e433b50933f2488dbfcf89d1fa53fa

Comment by Bob Glossman (Inactive) [ 06/Oct/16 ]

More failures on master:
https://testing.hpdd.intel.com/test_sets/14cc482c-8bff-11e6-a9b0-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0422b016-8c2b-11e6-91aa-5254006e85c2

Comment by Gerrit Updater [ 08/Oct/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22990/
Subject: LU-8668 tests: print more information about dd failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d3409e4a55c8c9a18a63743014a1e1c1bfc8b86d

Comment by Niu Yawei (Inactive) [ 09/Oct/16 ]

I didn't find any error messages in the logs. The debug patch from Andreas has landed; it may help us identify what kind of error 'dd' got.

Comment by James Nunez (Inactive) [ 10/Oct/16 ]

Failures with the debug information from Andreas' patch:
https://testing.hpdd.intel.com/test_sets/dcf408e8-8e9e-11e6-91aa-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b63dd56c-8e71-11e6-a9b0-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6167e6ee-8e95-11e6-91aa-5254006e85c2

Comment by Niu Yawei (Inactive) [ 11/Oct/16 ]

Ok, take https://testing.hpdd.intel.com/test_logs/66d51444-8e95-11e6-91aa-5254006e85c2/show_text as an example:

We can see 4 "ssh_exchange_identification" errors:

CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=5 count=1
CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=4 count=1
CMD: trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=0 count=1
trevis-66vm2: 1+0 records in
trevis-66vm2: 1+0 records out
trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0365618 s, 28.7 MB/s
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm2: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm2: ssh exited with exit code 255
trevis-66vm1: ssh_exchange_identification: Connection closed by remote host
pdsh@trevis-66vm1: trevis-66vm1: ssh exited with exit code 255
trevis-66vm2: 1+0 records in
trevis-66vm2: 1+0 records out
trevis-66vm2: 1048576 bytes (1.0 MB) copied, 0.0534754 s, 19.6 MB/s

I think this is the reason for the 4 'dd failed' errors:

sanityn test_77a: @@@@@@ FAIL: dd at 10MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 15MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2)
...
sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2)

The dd itself didn't actually fail; the pdsh command carrying it failed. Are there any network-related hardware or configuration changes in our testing system?
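To make that concrete, here is a hedged reduction of what the test effectively runs for one of those offsets (the dd line is taken from the log above and the PDSH options from the setting quoted later in this ticket; error() stands in for the test-framework helper):

# roughly how one of the background writes is issued (illustrative only):
pdsh -t 120 -S -Rssh -w trevis-66vm1.trevis.hpdd.intel.com,trevis-66vm2 \
        "dd if=/dev/zero of=/mnt/lustre/d77a.sanityn/nrs_w_trevis-66vm1.trevis.hpdd.intel.com bs=1M seek=4 count=1" &
pid=$!
# ... the other offsets are launched the same way ...
wait $pid || error "dd at 4MB on client failed"

# If sshd drops the connection ("ssh_exchange_identification: Connection
# closed by remote host"), ssh exits with code 255; because of -S, pdsh
# passes that non-zero status back, so wait sees a failure and the test
# reports a dd failure even though dd itself never misbehaved.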

Comment by Niu Yawei (Inactive) [ 11/Oct/16 ]

All failures happened on 'trevis'. Perhaps we should ask the DCO team to take a look at what kind of change to 'trevis' may be causing such frequent ssh connection failures?

Comment by Andreas Dilger [ 12/Oct/16 ]

All of the last 50 failures are on trevis66-71.

Comment by Charlie Olmstead [ 13/Oct/16 ]

From the testing I ran, the cause was narrowed down to Autotest changing the rcmd module used by pdsh from mrsh to ssh:

PDSH="pdsh -t 120 -S -Rssh -w"

whereas before it was

PDSH="pdsh -t 120 -S -Rmrsh -w"

I can easily change pdsh back to mrsh, but should these 77* tests fail just because pdsh is using ssh?
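For what it's worth, the timing of the 77* tests makes the ssh errors plausible; the notes below are illustrative assumptions, not changes made as part of this ticket:

# nrs_write_read() launches many pdsh/ssh connections to the same two client
# VMs within a fraction of a second (16 background writes plus the reads, as
# in the console log above). Once the number of unauthenticated connections
# exceeds sshd's MaxStartups limit, further connections are dropped with
# "ssh_exchange_identification: Connection closed by remote host" and ssh
# exits with 255.
#
# If ssh has to remain the rcmd module, one possible (untested) mitigation is
# raising that limit on the test nodes, e.g. in /etc/ssh/sshd_config:
#       MaxStartups 100:30:200
# Otherwise, switching pdsh back to -Rmrsh, as suggested above, avoids the
# per-connection sshd limit entirely.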

Comment by Niu Yawei (Inactive) [ 14/Oct/16 ]

Yes, I think so. Lots of test cases rely on pdsh to issue commands remotely; if pdsh returns an error, the test can't tell whether the command completed successfully, so failing the test looks like a reasonable choice to me.

Comment by Peter Jones [ 14/Oct/16 ]

So I see that a change has been made to autotest under DCO-6061. Will this change just take effect at the next autotest upgrade? Should we close this LU ticket as a duplicate of the DCO ticket?

Comment by Andreas Dilger [ 14/Oct/16 ]

Problem was caused by test environment.
