Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Severity: 3
Description
Looking at the client test_log for the test session
2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc
we see several calls to dd fail:
trevis-52vm2: 1048576 bytes (1.0 MB) copied, 0.00186497 s, 562 MB/s
trevis-52vm1: dd: error writing ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
trevis-52vm1: 1+0 records in
trevis-52vm1: 0+0 records out
trevis-52vm1: 0 bytes (0 B) copied, 99.1917 s, 0.0 kB/s
trevis-52vm1: 1+0 records in
trevis-52vm1: 1+0 records out
trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.0330184 s, 31.8 MB/s
 sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2)
trevis-52vm1: dd: cannot fstat ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown
 sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
…
trevis-52vm1: 0 bytes (0 B) copied, 0.0273896 s, 0.0 kB/s
 sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2)
trevis-52vm1: dd: failed to truncate to 13631488 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
trevis-52vm1: dd: failed to truncate to 4194304 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown
 sanityn test_77a: @@@@@@ FAIL: dd at 13MB on client failed (2)
 sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2)
trevis-52vm1: 1+0 records in
trevis-52vm1: 1+0 records out
trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.00703154 s, 149 MB/s
Trace dump:
 = /usr/lib64/lustre/tests/test-framework.sh:5764:error()
 = /usr/lib64/lustre/tests/sanityn.sh:3084:nrs_write_read()
 = /usr/lib64/lustre/tests/sanityn.sh:3113:test_77a()
 = /usr/lib64/lustre/tests/test-framework.sh:6040:run_one()
 = /usr/lib64/lustre/tests/test-framework.sh:6079:run_one_logged()
 = /usr/lib64/lustre/tests/test-framework.sh:5926:run_test()
 = /usr/lib64/lustre/tests/sanityn.sh:3117:main()
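For context, sanityn test_77a ("check FIFO NRS policy") reaches this failure through nrs_write_read() in sanityn.sh, which has each client write and then read its own nrs_w_<hostname> file with dd at increasing megabyte offsets while the OSS runs the FIFO NRS policy. The following is only a rough, hypothetical approximation of the per-offset write, assuming 1 MB transfers at MB-aligned offsets; the path, loop bounds, and dd flags are illustrative, not the script's exact code:

# Rough approximation of nrs_write_read()'s per-offset writes (illustrative only).
file=/mnt/lustre/d77a.sanityn/nrs_w_$(hostname)
for off in $(seq 0 15); do
        # Write 1 MB at ${off} MB into the existing file; an I/O error here is
        # what the test reports as "dd at ${off}MB on client failed".
        dd if=/dev/zero of=$file bs=1M seek=$off count=1 conv=notrunc ||
                echo "dd at ${off}MB on client failed"
done

So dd itself is not misbehaving; the EIO and "Cannot send after transport endpoint shutdown" errors line up with the client eviction visible in the OSS log below.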
In the dmesg log on the OSS, we see a stack trace and a complaint that there are too many service threads and not enough resources:
[11362.460839] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 05:12:32 (1521263552)
[11362.701936] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_policies=fifo
[11407.968024] LNet: Service thread pid 8563 was inactive for 40.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[11407.973431] Pid: 8563, comm: ll_ost_io00_012
[11407.975284] Call Trace:
[11407.978482] [<ffffffff816b40e9>] schedule+0x29/0x70
[11407.980373] [<ffffffff816b1984>] schedule_timeout+0x174/0x2c0
[11407.982265] [<ffffffff8109c080>] ? process_timeout+0x0/0x10
[11407.984109] [<ffffffffc0ab65f0>] ? ldlm_expired_completion_wait+0x0/0x220 [ptlrpc]
[11407.986141] [<ffffffffc0ab6dc1>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[11407.988081] [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
[11407.989996] [<ffffffffc0ab7c13>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
[11407.991934] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[11407.993914] [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[11407.995835] [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
[11407.997834] [<ffffffffc0b4b007>] tgt_extent_lock+0xe7/0x290 [ptlrpc]
[11407.999761] [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[11408.001742] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[11408.003644] [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
[11408.005520] [<ffffffffc0e71c7f>] ofd_punch_hdl+0x30f/0x9c0 [ofd]
[11408.007296] [<ffffffffc0b51eea>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
[11408.009176] [<ffffffffc0af7823>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
[11408.011048] [<ffffffffc0afafd2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[11408.012880] [<ffffffffc0afa540>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[11408.014611] [<ffffffff810b4031>] kthread+0xd1/0xe0
[11408.016294] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[11408.017934] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[11408.019664] [<ffffffff810b3f60>] ? kthread+0x0/0xe0
[11408.022731] LustreError: dumping log to /tmp/lustre-log.1521263598.8563
[11467.000052] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 10.9.6.82@tcp ns: filter-lustre-OST0004_UUID lock: ffff880060b46240/0x6aadc925cc163a6a lrc: 3/0,0 mode: PW/PW res: [0x54e0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000400000020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d1f4 expref: 6 pid: 13081 timeout: 11467 lvb_type: 0
[11467.059728] Lustre: lustre-OST0004: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
[11467.145213] LNet: Service thread pid 8563 completed after 99.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
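Reading the trace: the ll_ost_io thread is blocked in ldlm_completion_ast() under tgt_extent_lock()/ofd_punch_hdl(), i.e. a truncate request waiting for an extent lock, and the waiting_locks_callback() eviction roughly 100s later is consistent with the dd errors seen on the client. When triaging the "system was overloaded (too many service threads ...)" message, a natural first step is to look at the OSS ost_io thread counts and the timeouts that feed that ~100s lock callback timer. Below is a minimal sketch of those queries; these are standard lctl tunables, but the values in effect during this run are not captured in the ticket:

# On the OSS: configured vs. running ost_io service threads.
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max

# Base obd timeout and adaptive-timeout bounds.
lctl get_param timeout at_min at_max

# NRS policy the test set on the OSS (see the DEBUG MARKER above).
lctl get_param ost.OSS.*.nrs_policies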
[11467.488810] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed \(2\)
[11467.509827] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed \(2\)
[11467.702702] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2)
[11467.723042] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
[11467.967245] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log; dmesg > /home/auto
[11468.064728] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log; dmesg > /home/auto
[11472.082190] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[11472.179737] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[11567.000038] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.9.6.82@tcp ns: filter-lustre-OST0005_UUID lock: ffff88004a400240/0x6aadc925cc163b74 lrc: 3/0,0 mode: PW/PW res: [0x54c0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 1048576->2097151) flags: 0x60000400010020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d2e2 expref: 6 pid: 9531 timeout: 11567 lvb_type: 0
[11567.050198] LustreError: 9187:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.9.6.82@tcp arrived at 1521263757 with bad export cookie 7687021303166253843
[11567.055145] Lustre: lustre-OST0005: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
[11567.441575] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed \(2\)
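The trailing markers are just the test framework's cleanup and log collection: it clears any fault injection and dumps the Lustre debug buffer on each node. Run by hand, that corresponds roughly to the following (the output path here is a placeholder; autotest writes into the per-session log directory shown above):

# Clear any leftover fault injection from the test.
lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null

# Dump the in-kernel Lustre debug buffer for offline analysis.
lctl dk > /tmp/sanityn.test_77a.debug_log.$(hostname -s).log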
We’ve seen these tests fail before; see ATM-60/LU-8668. I’ve been told that pdsh still uses mrsh for non-SELinux test sessions.
All of the following test sessions ran on trevis, except for the one on 2017-10-13, which ran on onyx. Some of the failures below may be due to the patch under review; the 2018-03-17 failure, however, was on a no-op patch.
2017-07-27 06:48:11 UTC - https://testing.hpdd.intel.com/test_sets/ff1b47b8-72d4-11e7-a0a2-5254006e85c2
[3 more failures between these dates]
2017-09-27 04:24:43 UTC - https://testing.hpdd.intel.com/test_sets/e142bd58-a393-11e7-b786-5254006e85c2
2017-10-01 18:36:53 UTC - https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2
2017-10-13 02:50:27 UTC on onyx - https://testing.hpdd.intel.com/test_sets/e7457ab4-b019-11e7-8d8d-5254006e85c2
2017-10-18 19:29:07 UTC - https://testing.hpdd.intel.com/test_sets/dd0fe636-b48f-11e7-9eee-5254006e85c2
[16 more failures between these dates]
2018-02-12 11:34:28 UTC - https://testing.hpdd.intel.com/test_sets/7e8bb5f0-1005-11e8-a7cd-52540065bddc
2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc
2018-03-22 11:10:49 UTC - https://testing.hpdd.intel.com/test_sets/524dbdfe-2df0-11e8-b74b-52540065bddc
2018-03-27 22:47:40 UTC - https://testing.hpdd.intel.com/test_sets/d5f150e8-3242-11e8-b6a0-52540065bddc
2018-04-04 18:36:13 UTC - https://testing.hpdd.intel.com/test_sets/6b2266f8-385f-11e8-b45c-52540065bddc
2018-04-05 17:30:51 UTC - https://testing.hpdd.intel.com/test_sets/c91ef600-3921-11e8-960d-52540065bddc