Lustre / LU-4286

OST - still busy with 1 active RPCs


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: EL6.4
    • Severity: 3
    • Rank: 11764

    Description

      We have hit this at least once a week for the last month. A seemingly random OST gets stuck refusing reconnections, reporting one busy RPC, after an ll_ost thread dumps a stack trace, e.g.:

      Nov 15 07:23:21 boss4 kernel: LustreError: 28716:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
      Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
      Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message
      Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
      Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 6 previous similar messages
      Nov 15 07:44:24 boss4 kernel: LNet: Service thread pid 5365 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov 15 07:44:24 boss4 kernel: Pid: 5365, comm: ll_ost02_083
      Nov 15 07:44:24 boss4 kernel:
      Nov 15 07:44:24 boss4 kernel: Call Trace:
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a6523>] lu_object_find_at+0xb3/0x360 [obdclass]
      Nov 15 07:44:24 boss4 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a67e6>] lu_object_find+0x16/0x20 [obdclass]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e015c5>] ofd_object_find+0x35/0xf0 [ofd]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a73ee>] ? lu_env_init+0x1e/0x30 [obdclass]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e11549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0df9fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dc7c1c>] ost_setattr+0x31c/0x990 [ost]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dcb746>] ost_handle+0x21e6/0x48e0 [ost]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa07423c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa043cd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa074375e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Nov 15 07:44:24 boss4 kernel:
      Nov 15 07:44:24 boss4 kernel: LustreError: dumping log to /tmp/lustre-log.1384501464.5365
      Nov 15 07:57:11 boss4 kernel: Lustre: 5360:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-367), not sending early reply
      Nov 15 07:57:11 boss4 kernel:  req@ffff88324df9f400 x1450789664028660/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1384502236 ref 2 fl Interpret:/0/0 rc 0/0
      Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      

      The filesystem grinds to a halt, and after rebooting the OSS, recovery brings the OST back to the same state: 1 active RPC. We have to mount with abort_recov to bring things back to life. Another example (different OSS/OST):

      Nov  9 01:16:37 boss1 kernel: LNet: Service thread pid 16551 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  9 01:16:37 boss1 kernel: Pid: 16551, comm: ll_ost02_038
      Nov  9 01:16:37 boss1 kernel:
      Nov  9 01:16:37 boss1 kernel: Call Trace:
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
      Nov  9 01:16:37 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:16:37 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Nov  9 01:16:37 boss1 kernel:
      Nov  9 01:16:37 boss1 kernel: LustreError: dumping log to /tmp/lustre-log.1383959797.16551
      Nov  9 01:26:19 boss1 kernel: INFO: task ll_ost02_038:16551 blocked for more than 120 seconds.
      Nov  9 01:26:19 boss1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Nov  9 01:26:19 boss1 kernel: ll_ost02_038  D 0000000000000007     0 16551      2 0x00000080
      Nov  9 01:26:19 boss1 kernel: ffff883f44a259b0 0000000000000046 0000000000000000 ffff883f44a25974
      Nov  9 01:26:19 boss1 kernel: ffff883f44a25960 ffffc901077e7030 0000000000000246 0000000000000246
      Nov  9 01:26:19 boss1 kernel: ffff883f44491ab8 ffff883f44a25fd8 000000000000fb88 ffff883f44491ab8
      Nov  9 01:26:19 boss1 kernel: Call Trace:
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
      Nov  9 01:26:19 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov  9 01:26:19 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Nov  9 01:26:44 boss1 kernel: Lustre: 16528:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
      Nov  9 01:26:44 boss1 kernel:  req@ffff883e8281f400 x1450784570036048/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1383960409 ref 2 fl Interpret:/0/0 rc 0/0
      Nov  9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:28:49 boss1 kernel: Lustre: Skipped 568 previous similar messages
      Nov  9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      Nov  9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
      Nov  9 01:32:34 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
      

      If I can provide any more information the next time it happens, let me know.
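
      For the record, the abort_recov workaround mentioned above is a mount option on the OST device; a minimal sketch (the device path and mount point below are placeholders, not this site's actual values) looks like:

      # Remount the stuck OST with recovery aborted; clients in recovery
      # are evicted and must reconnect from scratch, so expect transient
      # I/O errors on active jobs. Paths below are placeholders.
      umount /mnt/lustre/ost0040
      mount -t lustre -o abort_recov /dev/mapper/ost0040 /mnt/lustre/ost0040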

      Attachments

        Activity

          People

            Assignee: Hongchao Zhang (hongchao.zhang)
            Reporter: Daire Byrne (daire) (Inactive)
            Votes: 2
            Watchers: 9
