Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Severity: 3
Description
Looking at the client test_log for the test session
2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc
we see several calls to dd fail:
trevis-52vm2: 1048576 bytes (1.0 MB) copied, 0.00186497 s, 562 MB/s
trevis-52vm1: dd: error writing ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
trevis-52vm1: 1+0 records in
trevis-52vm1: 0+0 records out
trevis-52vm1: 0 bytes (0 B) copied, 99.1917 s, 0.0 kB/s
trevis-52vm1: 1+0 records in
trevis-52vm1: 1+0 records out
trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.0330184 s, 31.8 MB/s
 sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2)
trevis-52vm1: dd: cannot fstat ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown
 sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
…
trevis-52vm1: 0 bytes (0 B) copied, 0.0273896 s, 0.0 kB/s
 sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2)
trevis-52vm1: dd: failed to truncate to 13631488 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
trevis-52vm1: dd: failed to truncate to 4194304 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown
 sanityn test_77a: @@@@@@ FAIL: dd at 13MB on client failed (2)
 sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2)
trevis-52vm1: 1+0 records in
trevis-52vm1: 1+0 records out
trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.00703154 s, 149 MB/s
Trace dump:
 = /usr/lib64/lustre/tests/test-framework.sh:5764:error()
 = /usr/lib64/lustre/tests/sanityn.sh:3084:nrs_write_read()
 = /usr/lib64/lustre/tests/sanityn.sh:3113:test_77a()
 = /usr/lib64/lustre/tests/test-framework.sh:6040:run_one()
 = /usr/lib64/lustre/tests/test-framework.sh:6079:run_one_logged()
 = /usr/lib64/lustre/tests/test-framework.sh:5926:run_test()
 = /usr/lib64/lustre/tests/sanityn.sh:3117:main()
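For context, sanityn test_77a ("check FIFO NRS policy") reaches this failure through nrs_write_read() in sanityn.sh, which has each client write and then read its own nrs_w_<hostname> file with dd at increasing megabyte offsets while the OSS runs the FIFO NRS policy. The following is only a rough, hypothetical approximation of the per-offset write, assuming 1 MB transfers at MB-aligned offsets; the path, loop bounds, and dd flags are illustrative, not the script's exact code:

# Rough approximation of nrs_write_read()'s per-offset writes (illustrative only).
file=/mnt/lustre/d77a.sanityn/nrs_w_$(hostname)
for off in $(seq 0 15); do
        # Write 1 MB at ${off} MB into the existing file; an I/O error here is
        # what the test reports as "dd at ${off}MB on client failed".
        dd if=/dev/zero of=$file bs=1M seek=$off count=1 conv=notrunc ||
                echo "dd at ${off}MB on client failed"
done

So dd itself is not misbehaving; the EIO and "Cannot send after transport endpoint shutdown" errors line up with the client eviction visible in the OSS log below.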
In the dmesg log on the OSS, we see a stack trace and a complaint that there are too many service threads and not enough resources:
[11362.460839] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 05:12:32 (1521263552)
[11362.701936] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_policies=fifo
[11407.968024] LNet: Service thread pid 8563 was inactive for 40.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[11407.973431] Pid: 8563, comm: ll_ost_io00_012
[11407.975284] Call Trace:
[11407.978482] [<ffffffff816b40e9>] schedule+0x29/0x70
[11407.980373] [<ffffffff816b1984>] schedule_timeout+0x174/0x2c0
[11407.982265] [<ffffffff8109c080>] ? process_timeout+0x0/0x10
[11407.984109] [<ffffffffc0ab65f0>] ? ldlm_expired_completion_wait+0x0/0x220 [ptlrpc]
[11407.986141] [<ffffffffc0ab6dc1>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[11407.988081] [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
[11407.989996] [<ffffffffc0ab7c13>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[11407.991934] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[11407.993914] [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[11407.995835] [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
[11407.997834] [<ffffffffc0b4b007>] tgt_extent_lock+0xe7/0x290 [ptlrpc]
[11407.999761] [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[11408.001742] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[11408.003644] [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
[11408.005520] [<ffffffffc0e71c7f>] ofd_punch_hdl+0x30f/0x9c0 [ofd]
[11408.007296] [<ffffffffc0b51eea>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[11408.009176] [<ffffffffc0af7823>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[11408.011048] [<ffffffffc0afafd2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[11408.012880] [<ffffffffc0afa540>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[11408.014611] [<ffffffff810b4031>] kthread+0xd1/0xe0
[11408.016294] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[11408.017934] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[11408.019664] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[11408.022731] LustreError: dumping log to /tmp/lustre-log.1521263598.8563
[11467.000052] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 10.9.6.82@tcp ns: filter-lustre-OST0004_UUID lock: ffff880060b46240/0x6aadc925cc163a6a lrc: 3/0,0 mode: PW/PW res: [0x54e0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000400000020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d1f4 expref: 6 pid: 13081 timeout: 11467 lvb_type: 0
[11467.059728] Lustre: lustre-OST0004: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
[11467.145213] LNet: Service thread pid 8563 completed after 99.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
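Reading the trace: the ll_ost_io thread is blocked in ldlm_completion_ast() under tgt_extent_lock()/ofd_punch_hdl(), i.e. a truncate request waiting for an extent lock, and the waiting_locks_callback() eviction roughly 100s later is consistent with the dd errors seen on the client. When triaging the "system was overloaded (too many service threads ...)" message, a natural first step is to look at the OSS ost_io thread counts and the timeouts that feed that ~100s lock callback timer. Below is a minimal sketch of those queries; these are standard lctl tunables, but the values in effect during this run are not captured in the ticket:

# On the OSS: configured vs. running ost_io service threads.
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max

# Base obd timeout and adaptive-timeout bounds.
lctl get_param timeout at_min at_max

# NRS policy the test set on the OSS (see the DEBUG MARKER above).
lctl get_param ost.OSS.*.nrs_policies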
[11467.488810] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed \(2\)
[11467.509827] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed \(2\)
[11467.702702] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2)
[11467.723042] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
[11467.967245] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log; dmesg > /home/auto
[11468.064728] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log; dmesg > /home/auto
[11472.082190] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[11472.179737] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[11567.000038] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.9.6.82@tcp ns: filter-lustre-OST0005_UUID lock: ffff88004a400240/0x6aadc925cc163b74 lrc: 3/0,0 mode: PW/PW res: [0x54c0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 1048576->2097151) flags: 0x60000400010020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d2e2 expref: 6 pid: 9531 timeout: 11567 lvb_type: 0
[11567.050198] LustreError: 9187:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.9.6.82@tcp arrived at 1521263757 with bad export cookie 7687021303166253843
[11567.055145] Lustre: lustre-OST0005: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
[11567.441575] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed \(2\)
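The trailing markers are just the test framework's cleanup and log collection: it clears any fault injection and dumps the Lustre debug buffer on each node. Run by hand, that corresponds roughly to the following (the output path here is a placeholder; autotest writes into the per-session log directory shown above):

# Clear any leftover fault injection from the test.
lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null

# Dump the in-kernel Lustre debug buffer for offline analysis.
lctl dk > /tmp/sanityn.test_77a.debug_log.$(hostname -s).log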
We’ve seen these tests fail before; see ATM-60/LU-8668. I’ve been told that pdsh still uses mrsh for non-SELinux test sessions.
All of the following test sessions ran on trevis, except for the one on 2017-10-13, which ran on onyx. Some of the failures below may be due to the patch under review; the 2018-03-17 failure, however, was on a no-op patch.
2017-07-27 06:48:11 UTC - https://testing.hpdd.intel.com/test_sets/ff1b47b8-72d4-11e7-a0a2-5254006e85c2
[3 more failures between these dates]
2017-09-27 04:24:43 UTC - https://testing.hpdd.intel.com/test_sets/e142bd58-a393-11e7-b786-5254006e85c2
2017-10-01 18:36:53 UTC - https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2
2017-10-13 02:50:27 UTC on onyx - https://testing.hpdd.intel.com/test_sets/e7457ab4-b019-11e7-8d8d-5254006e85c2
2017-10-18 19:29:07 UTC - https://testing.hpdd.intel.com/test_sets/dd0fe636-b48f-11e7-9eee-5254006e85c2
[16 more failures between these dates]
2018-02-12 11:34:28 UTC - https://testing.hpdd.intel.com/test_sets/7e8bb5f0-1005-11e8-a7cd-52540065bddc
2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc
2018-03-22 11:10:49 UTC - https://testing.hpdd.intel.com/test_sets/524dbdfe-2df0-11e8-b74b-52540065bddc
2018-03-27 22:47:40 UTC - https://testing.hpdd.intel.com/test_sets/d5f150e8-3242-11e8-b6a0-52540065bddc
2018-04-04 18:36:13 UTC - https://testing.hpdd.intel.com/test_sets/6b2266f8-385f-11e8-b45c-52540065bddc
2018-04-05 17:30:51 UTC - https://testing.hpdd.intel.com/test_sets/c91ef600-3921-11e8-960d-52540065bddc