Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10891

sanityn test 77a, 77b, 77c, 77d, 77e, 77f, 77j and 77k all fail after 32a with 'dd at *MB on client failed (2)'

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.11.0, Lustre 2.12.0
    • 3
    • 9223372036854775807

    Description

      Looking at the client test_log for the test session

      2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc

      we see several calls to dd fail

       

      trevis-52vm2: 1048576 bytes (1.0 MB) copied, 0.00186497 s, 562 MB/s
      trevis-52vm1: dd: error writing ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
      trevis-52vm1: 1+0 records in
      trevis-52vm1: 0+0 records out
      trevis-52vm1: 0 bytes (0 B) copied, 99.1917 s, 0.0 kB/s
      trevis-52vm1: 1+0 records in
      trevis-52vm1: 1+0 records out
      trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.0330184 s, 31.8 MB/s 
      sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2) 
      trevis-52vm1: dd: cannot fstat ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown 
      sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
      
      …
      
      trevis-52vm1: 0 bytes (0 B) copied, 0.0273896 s, 0.0 kB/s 
      sanityn test_77a: @@@@@@ FAIL: dd at 6MB on client failed (2) 
      trevis-52vm1: dd: failed to truncate to 13631488 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Input/output error
      trevis-52vm1: dd: failed to truncate to 4194304 bytes in output file ‘/mnt/lustre/d77a.sanityn/nrs_w_trevis-52vm1.trevis.hpdd.intel.com’: Cannot send after transport endpoint shutdown 
      sanityn test_77a: @@@@@@ FAIL: dd at 13MB on client failed (2)  
      sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed (2) 
      trevis-52vm1: 1+0 records in
      trevis-52vm1: 1+0 records out
      trevis-52vm1: 1048576 bytes (1.0 MB) copied, 0.00703154 s, 149 MB/s  
      Trace dump:  = /usr/lib64/lustre/tests/test-framework.sh:5764:error()  
      = /usr/lib64/lustre/tests/sanityn.sh:3084:nrs_write_read()  
      = /usr/lib64/lustre/tests/sanityn.sh:3113:test_77a() 
       = /usr/lib64/lustre/tests/test-framework.sh:6040:run_one()  
      = /usr/lib64/lustre/tests/test-framework.sh:6079:run_one_logged()  
      = /usr/lib64/lustre/tests/test-framework.sh:5926:run_test()  
      = /usr/lib64/lustre/tests/sanityn.sh:3117:main()
      

       

      In the dmesg log on the OSS, we see a stack trace and complaint that there are too many service threads and not enough resources

      [11362.460839] Lustre: DEBUG MARKER: == sanityn test 77a: check FIFO NRS policy =========================================================== 05:12:32 (1521263552)
      [11362.701936] Lustre: DEBUG MARKER: lctl set_param ost.OSS.*.nrs_policies=fifo
      [11407.968024] LNet: Service thread pid 8563 was inactive for 40.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [11407.973431] Pid: 8563, comm: ll_ost_io00_012
      [11407.975284] Call Trace:
      [11407.978482]  [<ffffffff816b40e9>] schedule+0x29/0x70
      [11407.980373]  [<ffffffff816b1984>] schedule_timeout+0x174/0x2c0
      [11407.982265]  [<ffffffff8109c080>] ? process_timeout+0x0/0x10
      [11407.984109]  [<ffffffffc0ab65f0>] ? ldlm_expired_completion_wait+0x0/0x220 [ptlrpc]
      [11407.986141]  [<ffffffffc0ab6dc1>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [11407.988081]  [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      [11407.989996]  [<ffffffffc0ab7c13>] ldlm_cli_enqueue_local+0x233/0x860 [ptlrpc]
      [11407.991934]  [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      [11407.993914]  [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [11407.995835]  [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
      [11407.997834]  [<ffffffffc0b4b007>] tgt_extent_lock+0xe7/0x290 [ptlrpc]
      [11407.999761]  [<ffffffffc0abc070>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [11408.001742]  [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      [11408.003644]  [<ffffffffc0ab9360>] ? ldlm_glimpse_ast+0x0/0x10 [ptlrpc]
      [11408.005520]  [<ffffffffc0e71c7f>] ofd_punch_hdl+0x30f/0x9c0 [ofd]
      [11408.007296]  [<ffffffffc0b51eea>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [11408.009176]  [<ffffffffc0af7823>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [11408.011048]  [<ffffffffc0afafd2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [11408.012880]  [<ffffffffc0afa540>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [11408.014611]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [11408.016294]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [11408.017934]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [11408.019664]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0 
      [11408.022731] LustreError: dumping log to /tmp/lustre-log.1521263598.8563
      [11467.000052] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 10.9.6.82@tcp  ns: filter-lustre-OST0004_UUID lock: ffff880060b46240/0x6aadc925cc163a6a lrc: 3/0,0 mode: PW/PW res: [0x54e0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x60000400000020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d1f4 expref: 6 pid: 13081 timeout: 11467 lvb_type: 0[11467.059728] Lustre: lustre-OST0004: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
      [11467.145213] LNet: Service thread pid 8563 completed after 99.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [11467.488810] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed \(2\) 
      [11467.509827] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed \(2\) 
      [11467.702702] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 5MB on client failed (2)
      [11467.723042] Lustre: DEBUG MARKER: sanityn test_77a: @@@@@@ FAIL: dd at 7MB on client failed (2)
      [11467.967245] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log;         dmesg > /home/auto
      [11468.064728] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest2/autotest/logs/test_logs/2018-03-17/lustre-reviews-el7-x86_64--review-dne-part-1--1_11_1__55338___3714d0fe-68b1-4332-9b84-47eab7aaf171/sanityn.test_77a.debug_log.$(hostname -s).1521263658.log;         dmesg > /home/auto
      [11472.082190] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0        fail_val=0 2>/dev/null
      [11472.179737] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0        fail_val=0 2>/dev/null
      [11567.000038] LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 10.9.6.82@tcp  ns: filter-lustre-OST0005_UUID lock: ffff88004a400240/0x6aadc925cc163b74 lrc: 3/0,0 mode: PW/PW res: [0x54c0:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 1048576->2097151) flags: 0x60000400010020 nid: 10.9.6.82@tcp remote: 0x9b5c5b055d72d2e2 expref: 6 pid: 9531 timeout: 11567 lvb_type: 0
      [11567.050198] LustreError: 9187:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from 10.9.6.82@tcp arrived at 1521263757 with bad export cookie 7687021303166253843
      [11567.055145] Lustre: lustre-OST0005: Connection restored to 822495a8-31b5-9951-2f9c-71e8bb3032cb (at 10.9.6.82@tcp)
      [11567.441575] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_77a: @@@@@@ FAIL: dd at 4MB on client failed \(2\)
      

       

      We’ve seen these tests fail before; see ATM-60/LU-8668. I’ve been told that pdsh still uses mrsh for non-selinux test sessions.

      All of the following test sessions ran on trevis except for one on 10/13. Some of the following failures may be due to the patch being reviewed except for the failures on 2018-03-17 which is a no-op patch.

       

      2017-07-27 06:48:11 UTC – https://testing.hpdd.intel.com/test_sets/ff1b47b8-72d4-11e7-a0a2-5254006e85c2

      [3 more failures between these dates]

      2017-09-27 04:24:43 UTC - https://testing.hpdd.intel.com/test_sets/e142bd58-a393-11e7-b786-5254006e85c2

      2017-10-01 18:36:53 UTC - https://testing.hpdd.intel.com/test_sets/12f0d24e-a732-11e7-b786-5254006e85c2

      2017-10-13 02:50:27 UTC on onyx - https://testing.hpdd.intel.com/test_sets/e7457ab4-b019-11e7-8d8d-5254006e85c2

      2017-10-18 19:29:07 UTC - https://testing.hpdd.intel.com/test_sets/dd0fe636-b48f-11e7-9eee-5254006e85c2

      [16 more failures between these dates]

      2018-02-12 11:34:28 UTC - https://testing.hpdd.intel.com/test_sets/7e8bb5f0-1005-11e8-a7cd-52540065bddc

      2018-03-17 04:24:28 UTC - https://testing.hpdd.intel.com/test_sets/24378326-29c9-11e8-b6a0-52540065bddc

      2018-03-22 11:10:49 UTC - https://testing.hpdd.intel.com/test_sets/524dbdfe-2df0-11e8-b74b-52540065bddc

      2018-03-27 22:47:40 UTC - https://testing.hpdd.intel.com/test_sets/d5f150e8-3242-11e8-b6a0-52540065bddc

      2018-04-04 18:36:13 UTC - https://testing.hpdd.intel.com/test_sets/6b2266f8-385f-11e8-b45c-52540065bddc

      2018-04-05 17:30:51 UTC - https://testing.hpdd.intel.com/test_sets/c91ef600-3921-11e8-960d-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              pfarrell Patrick Farrell (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: