LU-12860: recovery-small test 55 fails with 'dd should be finished!'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.3
    • Environment: failover testing
    • Severity: 3

    Description

      recovery-small test_55 fails with 'dd should be finished!'. This error has been seen only twice this year: https://testing.whamcloud.com/test_sets/63e23090-eb2c-11e9-a197-52540065bddc on 09 OCT 2019 and https://testing.whamcloud.com/test_sets/5ad3e090-d5cc-11e9-a25b-52540065bddc on 12 SEPT 2019, both on b2_12.

      Looking at the suite_log for the first failure listed above (09 OCT 2019), we see

      == recovery-small test 55: ost_brw_read/write drops timed-out read/write request ===================== 12:07:50 (1570648070)
      step1: testing ......
       recovery-small test_55: @@@@@@ FAIL: dd should be finished! 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5864:error()
        = /usr/lib64/lustre/tests/recovery-small.sh:1336:test_55()
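
      The error is raised at recovery-small.sh:1336 in test_55. Roughly, the test starts a dd, injects a fail_loc on the OSS so the bulk read/write request is dropped, and then expects the dd to have completed once the timed-out request has been resent. The following is only a hypothetical sketch of that kind of check, not the actual test code; the fail_loc value, wait period, and variable names are illustrative:

      # Hypothetical sketch of a check that would produce "dd should be finished!"
      # (not the actual recovery-small.sh code).
      dd if=/dev/zero of=$DIR/$tfile bs=1M count=100 &
      DDPID=$!
      # Drop one OST bulk request on the server side (fail_loc value illustrative).
      do_facet ost1 "lctl set_param fail_loc=0x80000214"
      sleep $((TIMEOUT + 10))      # allow the timed-out request to be resent
      do_facet ost1 "lctl set_param fail_loc=0"
      # If dd is still running at this point, the test fails.
      kill -0 $DDPID 2>/dev/null && error "dd should be finished!"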
      

      Looking at the OSS (vm5) console log, we see

      [ 3438.494717] Lustre: DEBUG MARKER: == recovery-small test 55: ost_brw_read/write drops timed-out read/write request ===================== 12:07:50 (1570648070)
      [ 3501.778339] LNet: Service thread pid 6405 was inactive for 40.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 3501.781156] Pid: 6405, comm: ll_ost00_053 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 3501.782806] Call Trace:
      [ 3501.783449]  [<ffffffffc02e1cf5>] do_get_write_access+0x285/0x4d0 [jbd2]
      [ 3501.784645]  [<ffffffffc02e1f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      [ 3501.785941]  [<ffffffffc0ebfc5b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
      [ 3501.787604]  [<ffffffffc0f75099>] osd_ldiskfs_write_record+0xd9/0x410 [osd_ldiskfs]
      [ 3501.789090]  [<ffffffffc0f75510>] osd_write+0x140/0x5c0 [osd_ldiskfs]
      [ 3501.790333]  [<ffffffffc09784b6>] dt_record_write+0x36/0x120 [obdclass]
      [ 3501.791667]  [<ffffffffc0ca8614>] tgt_server_data_write+0x224/0x370 [ptlrpc]
      [ 3501.793309]  [<ffffffffc0ca9d8e>] tgt_server_data_update+0x41e/0x510 [ptlrpc]
      [ 3501.794591]  [<ffffffffc0cabd0d>] tgt_client_del+0x29d/0x6b0 [ptlrpc]
      [ 3501.795814]  [<ffffffffc108e13c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
      [ 3501.797038]  [<ffffffffc0c12077>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 3501.798431]  [<ffffffffc0cb2708>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 3501.799587]  [<ffffffffc0cb736a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 3501.800860]  [<ffffffffc0c5e24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 3501.802264]  [<ffffffffc0c61bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 3501.803477]  [<ffffffffb7ec2e81>] kthread+0xd1/0xe0
      [ 3501.804387]  [<ffffffffb8577c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 3501.805550]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 3501.806504] LustreError: dumping log to /tmp/lustre-log.1570648140.6405
      [ 3502.924108] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_55: @@@@@@ FAIL: dd should be finished! 
      [ 3503.127572] Lustre: DEBUG MARKER: recovery-small test_55: @@@@@@ FAIL: dd should be finished!
      [ 3503.376291] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-sles12sp4-x86_64--failover--1_17__52___0455cc7e-bf9e-4287-89df-0e7b8e9d7341/recovery-small.test_55.debug_log.$(hostname -s).1570648135.log;
      [ 3503.376291]          dmesg > /auto
      [ 3503.570329] LNet: Service thread pid 6183 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 3503.573227] Pid: 6183, comm: ll_ost00_023 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 3503.574895] Call Trace:
      [ 3503.575364]  [<ffffffffc02e1cf5>] do_get_write_access+0x285/0x4d0 [jbd2]
      [ 3503.576655]  [<ffffffffc02e1f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      [ 3503.578032]  [<ffffffffc0ebfc5b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
      [ 3503.579502]  [<ffffffffc0f75099>] osd_ldiskfs_write_record+0xd9/0x410 [osd_ldiskfs]
      [ 3503.580837]  [<ffffffffc0f75510>] osd_write+0x140/0x5c0 [osd_ldiskfs]
      [ 3503.581984]  [<ffffffffc09784b6>] dt_record_write+0x36/0x120 [obdclass]
      [ 3503.583173]  [<ffffffffc0ca8614>] tgt_server_data_write+0x224/0x370 [ptlrpc]
      [ 3503.584458]  [<ffffffffc0ca9d8e>] tgt_server_data_update+0x41e/0x510 [ptlrpc]
      [ 3503.585736]  [<ffffffffc0cabd0d>] tgt_client_del+0x29d/0x6b0 [ptlrpc]
      [ 3503.586885]  [<ffffffffc108e13c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
      [ 3503.588272]  [<ffffffffc0c12077>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 3503.590003]  [<ffffffffc0cb2708>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 3503.591263]  [<ffffffffc0cb736a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 3503.592500]  [<ffffffffc0c5e24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 3503.594004]  [<ffffffffc0c61bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 3503.595138]  [<ffffffffb7ec2e81>] kthread+0xd1/0xe0
      [ 3503.596115]  [<ffffffffb8577c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 3503.597267]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 3503.598221] LustreError: dumping log to /tmp/lustre-log.1570648142.6183
      [ 3505.208266] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [ 3514.343779] Lustre: 6183:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (45:5s); client may timeout.  req@ffff9e61e0f0e050 x1646941249245056/t0(0) o9->93068985-9bd0-151c-638d-51557c0cba70@10.9.5.186@tcp:507/0 lens 224/192 e 0 to 0 dl 1570648147 ref 1 fl Complete:/0/0 rc 0/0
      [ 3514.344067] LNet: Service thread pid 6405 completed after 52.61s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 3514.351372] Lustre: 6183:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
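
      The vm5 trace above shows ll_ost00_053 blocked for ~40 s in jbd2 do_get_write_access() while tgt_client_del() updates the server data in last_rcvd during a client disconnect; the thread eventually completed after 52.61 s. The console log also names the binary debug dumps that were written; assuming those /tmp/lustre-log.* files were collected from the OSS, they can be converted to readable text with lctl, for example:

      # Convert a binary Lustre debug dump (path taken from the console message
      # above; adjust to wherever the dump was actually saved) to text.
      lctl debug_file /tmp/lustre-log.1570648140.6405 /tmp/lustre-log.1570648140.6405.txt
      less /tmp/lustre-log.1570648140.6405.txt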
      

      On the other OSS (vm6) console log, we see

      [ 7059.072359] Lustre: DEBUG MARKER: == recovery-small test 55: ost_brw_read/write drops timed-out read/write request ===================== 12:07:50 (1570648070)
      [ 7122.391257] LNet: Service thread pid 20783 was inactive for 40.07s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 7122.394181] Pid: 20783, comm: ll_ost00_001 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 7122.395901] Call Trace:
      [ 7122.396464]  [<ffffffffc02d1cf5>] do_get_write_access+0x285/0x4d0 [jbd2]
      [ 7122.397807]  [<ffffffffc02d1f67>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      [ 7122.399208]  [<ffffffffc0eabc5b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
      [ 7122.400818]  [<ffffffffc0f61099>] osd_ldiskfs_write_record+0xd9/0x410 [osd_ldiskfs]
      [ 7122.402240]  [<ffffffffc0f61510>] osd_write+0x140/0x5c0 [osd_ldiskfs]
      [ 7122.403504]  [<ffffffffc09644b6>] dt_record_write+0x36/0x120 [obdclass]
      [ 7122.404960]  [<ffffffffc0c93a37>] tgt_client_data_write.isra.18+0x167/0x180 [ptlrpc]
      [ 7122.406800]  [<ffffffffc0c97229>] tgt_client_data_update+0x3a9/0x5e0 [ptlrpc]
      [ 7122.408105]  [<ffffffffc0c97eb9>] tgt_client_del+0x449/0x6b0 [ptlrpc]
      [ 7122.409465]  [<ffffffffc107a13c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
      [ 7122.410677]  [<ffffffffc0bfe077>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 7122.411993]  [<ffffffffc0c9e708>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 7122.413292]  [<ffffffffc0ca336a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 7122.414557]  [<ffffffffc0c4a24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 7122.416009]  [<ffffffffc0c4dbac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 7122.417248]  [<ffffffffa28c2e81>] kthread+0xd1/0xe0
      [ 7122.418162]  [<ffffffffa2f77c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 7122.419471]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 7122.420442] LustreError: dumping log to /tmp/lustre-log.1570648139.20783
      [ 7123.504819] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_55: @@@@@@ FAIL: dd should be finished! 
      [ 7123.703794] Lustre: DEBUG MARKER: recovery-small test_55: @@@@@@ FAIL: dd should be finished!
      [ 7123.955355] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64-vs-lustre-b2_12-sles12sp4-x86_64--failover--1_17__52___0455cc7e-bf9e-4287-89df-0e7b8e9d7341/recovery-small.test_55.debug_log.$(hostname -s).1570648135.log;
      [ 7123.955355]          dmesg > /auto
      [ 7125.207282] LNet: Service thread pid 24813 was inactive for 62.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 7125.210221] Pid: 24813, comm: ll_ost_io00_008 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 7125.211990] Call Trace:
      [ 7125.212538]  [<ffffffffc0f47e75>] osd_trans_stop+0x265/0x8e0 [osd_ldiskfs]
      [ 7125.213833]  [<ffffffffc1081c75>] ofd_trans_stop+0x25/0x60 [ofd]
      [ 7125.215002]  [<ffffffffc1088c34>] ofd_commitrw_write+0x9d4/0x1d40 [ofd]
      [ 7125.216398]  [<ffffffffc108d10c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [ 7125.217505]  [<ffffffffc0ca760b>] tgt_brw_write+0x10cb/0x1cf0 [ptlrpc]
      [ 7125.218795]  [<ffffffffc0ca336a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 7125.220109]  [<ffffffffc0c4a24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 7125.221611]  [<ffffffffc0c4dbac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 7125.222775]  [<ffffffffa28c2e81>] kthread+0xd1/0xe0
      [ 7125.223719]  [<ffffffffa2f77c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 7125.224888]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 7125.225920] LustreError: dumping log to /tmp/lustre-log.1570648142.24813
      [ 7125.314852] Pid: 22522, comm: ll_ost_io00_003 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 7125.317005] Call Trace:
      [ 7125.317553]  [<ffffffffc0f47e75>] osd_trans_stop+0x265/0x8e0 [osd_ldiskfs]
      [ 7125.319059]  [<ffffffffc1081c75>] ofd_trans_stop+0x25/0x60 [ofd]
      [ 7125.320408]  [<ffffffffc1088c34>] ofd_commitrw_write+0x9d4/0x1d40 [ofd]
      [ 7125.321779]  [<ffffffffc108d10c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [ 7125.323114]  [<ffffffffc0ca760b>] tgt_brw_write+0x10cb/0x1cf0 [ptlrpc]
      [ 7125.324597]  [<ffffffffc0ca336a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 7125.326068]  [<ffffffffc0c4a24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 7125.327744]  [<ffffffffc0c4dbac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 7125.329103]  [<ffffffffa28c2e81>] kthread+0xd1/0xe0
      [ 7125.330273]  [<ffffffffa2f77c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 7125.331641]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 7125.719257] Pid: 24812, comm: ll_ost_io00_007 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 7125.721175] Call Trace:
      [ 7125.721644]  [<ffffffffc0f47e75>] osd_trans_stop+0x265/0x8e0 [osd_ldiskfs]
      [ 7125.722893]  [<ffffffffc1081c75>] ofd_trans_stop+0x25/0x60 [ofd]
      [ 7125.723983]  [<ffffffffc1088c34>] ofd_commitrw_write+0x9d4/0x1d40 [ofd]
      [ 7125.725196]  [<ffffffffc108d10c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [ 7125.726663]  [<ffffffffc0ca760b>] tgt_brw_write+0x10cb/0x1cf0 [ptlrpc]
      [ 7125.727903]  [<ffffffffc0ca336a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 7125.729173]  [<ffffffffc0c4a24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 7125.730564]  [<ffffffffc0c4dbac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 7125.731757]  [<ffffffffa28c2e81>] kthread+0xd1/0xe0
      [ 7125.732671]  [<ffffffffa2f77c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 7125.733794]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 7125.734725] LustreError: dumping log to /tmp/lustre-log.1570648143.24812
      [ 7125.822770] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [ 7126.231240] LNet: Service thread pid 24810 was inactive for 62.20s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 7126.234161] LNet: Skipped 2 previous similar messages
      [ 7126.235065] Pid: 24810, comm: ll_ost_io00_005 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Tue Oct 8 13:52:18 UTC 2019
      [ 7126.236828] Call Trace:
      [ 7126.237412]  [<ffffffffc0f47e75>] osd_trans_stop+0x265/0x8e0 [osd_ldiskfs]
      [ 7126.238787]  [<ffffffffc1081c75>] ofd_trans_stop+0x25/0x60 [ofd]
      [ 7126.239919]  [<ffffffffc1088c34>] ofd_commitrw_write+0x9d4/0x1d40 [ofd]
      [ 7126.241178]  [<ffffffffc108d10c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [ 7126.242314]  [<ffffffffc0ca760b>] tgt_brw_write+0x10cb/0x1cf0 [ptlrpc]
      [ 7126.243546]  [<ffffffffc0ca336a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 7126.244851]  [<ffffffffc0c4a24b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 7126.246289]  [<ffffffffc0c4dbac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [ 7126.247487]  [<ffffffffa28c2e81>] kthread+0xd1/0xe0
      [ 7126.248498]  [<ffffffffa2f77c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 7126.249669]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 7126.250647] LNet: Service thread pid 24811 was inactive for 62.22s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 7146.199252] LNet: Service thread pid 22011 was inactive for 62.05s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 7146.201583] LNet: Skipped 3 previous similar messages
      [ 7146.202481] LustreError: dumping log to /tmp/lustre-log.1570648163.22011
      [ 7149.016236] LNet: Service thread pid 24817 was inactive for 62.03s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 7149.018478] LustreError: dumping log to /tmp/lustre-log.1570648166.24817
      [ 7157.692329] LNetError: 2982:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.191@tcp added to recovery queue. Health = 0
      [ 7157.694496] LNetError: 2982:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
      [ 7158.698091] Lustre: 22011:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (6:69s); client may timeout.  req@ffff968afb207050 x1646941249245040/t0(0) o9->93068985-9bd0-151c-638d-51557c0cba70@10.9.5.186@tcp:467/0 lens 224/192 e 0 to 0 dl 1570648107 ref 1 fl Complete:/0/0 rc 0/0
      [ 7158.698315] LNet: Service thread pid 20783 completed after 76.38s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 7158.705670] Lustre: 22011:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1 previous similar message
      [ 7171.117593] LNet: Service thread pid 24813 completed after 108.07s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 7171.121595] LNet: Skipped 1 previous similar message
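
      Mapping the vm6 uptime stamps back to wall-clock time shows how far the I/O overran the test's window: the dump at uptime 7125 went to /tmp/lustre-log.1570648142.24813, so the FAIL marker at uptime 7123.5 lands around epoch 1570648140 (~70 s after the 12:07:50 / 1570648070 test start), while the last service thread completed at uptime 7171, roughly epoch 1570648188 (~118 s after the start). The bulk-write threads (e.g. pid 24813, completed after 108.07 s) were therefore still in flight when the 'dd should be finished!' check fired. The epoch values embedded in the log file names can be converted with date if needed:

      # Convert the epoch timestamps embedded in the log file names to
      # wall-clock times (run in the same timezone as the test nodes).
      date -d @1570648070   # test_55 start marker
      date -d @1570648135   # debug_log collection after the FAIL
      date -d @1570648163   # last vm6 stack dump, ~28 s after the FAIL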
      

            People

              Assignee: wc-triage (WC Triage)
              Reporter: jamesanunez (James Nunez, Inactive)
              Votes: 0
              Watchers: 3
