Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.4.1
-
None
-
EL6.4
-
3
-
11764
Description
We have hit this at least once a week for the last month. A seemingly random OST gets stuck refusing reconnection with a busy RPC after an ll_ost stack trace e.g.
Nov 15 07:23:21 boss4 kernel: LustreError: 28716:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 6 previous similar messages Nov 15 07:44:24 boss4 kernel: LNet: Service thread pid 5365 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 15 07:44:24 boss4 kernel: Pid: 5365, comm: ll_ost02_083 Nov 15 07:44:24 boss4 kernel: Nov 15 07:44:24 boss4 kernel: Call Trace: Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b6fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a6523>] lu_object_find_at+0xb3/0x360 [obdclass] Nov 15 07:44:24 boss4 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a67e6>] lu_object_find+0x16/0x20 [obdclass] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e015c5>] ofd_object_find+0x35/0xf0 [ofd] Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a73ee>] ? lu_env_init+0x1e/0x30 [obdclass] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e11549>] ofd_lvbo_update+0x6d9/0xea8 [ofd] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0df9fe3>] ofd_setattr+0x7f3/0xbd0 [ofd] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dc7c1c>] ost_setattr+0x31c/0x990 [ost] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dcb746>] ost_handle+0x21e6/0x48e0 [ost] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffffa07423c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Nov 15 07:44:24 boss4 kernel: [<ffffffffa043cd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Nov 15 07:44:24 boss4 kernel: [<ffffffffa074375e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Nov 15 07:44:24 boss4 kernel: Nov 15 07:44:24 boss4 kernel: LustreError: dumping log to /tmp/lustre-log.1384501464.5365 Nov 15 07:57:11 boss4 kernel: Lustre: 5360:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-367), not sending early reply Nov 15 07:57:11 boss4 kernel: req@ffff88324df9f400 x1450789664028660/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1384502236 ref 2 fl Interpret:/0/0 rc 0/0 Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
The filesystem grinds to a halt and after rebooting the OSS, recovery brings the OST back to the same place - 1 active RPC. We have to mount and abort_recov to bring things back to life. Another example (different OSS/OST):
Nov 9 01:16:37 boss1 kernel: LNet: Service thread pid 16551 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 9 01:16:37 boss1 kernel: Pid: 16551, comm: ll_ost02_038 Nov 9 01:16:37 boss1 kernel: Nov 9 01:16:37 boss1 kernel: Call Trace: Nov 9 01:16:37 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass] Nov 9 01:16:37 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Nov 9 01:16:37 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd] Nov 9 01:16:37 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost] Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost] Nov 9 01:16:37 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Nov 9 01:16:37 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Nov 9 01:16:37 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:16:37 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Nov 9 01:16:37 boss1 kernel: Nov 9 01:16:37 boss1 kernel: LustreError: dumping log to /tmp/lustre-log.1383959797.16551 Nov 9 01:26:19 boss1 kernel: INFO: task ll_ost02_038:16551 blocked for more than 120 seconds. Nov 9 01:26:19 boss1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 9 01:26:19 boss1 kernel: ll_ost02_038 D 0000000000000007 0 16551 2 0x00000080 Nov 9 01:26:19 boss1 kernel: ffff883f44a259b0 0000000000000046 0000000000000000 ffff883f44a25974 Nov 9 01:26:19 boss1 kernel: ffff883f44a25960 ffffc901077e7030 0000000000000246 0000000000000246 Nov 9 01:26:19 boss1 kernel: ffff883f44491ab8 ffff883f44a25fd8 000000000000fb88 ffff883f44491ab8 Nov 9 01:26:19 boss1 kernel: Call Trace: Nov 9 01:26:19 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass] Nov 9 01:26:19 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20 Nov 9 01:26:19 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd] Nov 9 01:26:19 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost] Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost] Nov 9 01:26:19 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs] Nov 9 01:26:19 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] Nov 9 01:26:19 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] Nov 9 01:26:19 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Nov 9 01:26:44 boss1 kernel: Lustre: 16528:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply Nov 9 01:26:44 boss1 kernel: req@ffff883e8281f400 x1450784570036048/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1383960409 ref 2 fl Interpret:/0/0 rc 0/0 Nov 9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:28:49 boss1 kernel: Lustre: Skipped 568 previous similar messages Nov 9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting Nov 9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs Nov 9 01:32:34 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
If I can provide you with any more information next time it happens then let me know.