[LU-4286] OST - still busy with 1 active RPCs Created: 21/Nov/13 Updated: 02/Apr/14 Resolved: 02/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Daire Byrne (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None | ||
| Environment: | EL6.4 |
| Severity: | 3 |
| Rank (Obsolete): | 11764 |
| Description |
|
We have hit this at least once a week for the last month. A seemingly random OST gets stuck refusing reconnection with a busy RPC after an ll_ost stack trace, e.g.:

Nov 15 07:23:21 boss4 kernel: LustreError: 28716:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message
Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 6 previous similar messages
Nov 15 07:44:24 boss4 kernel: LNet: Service thread pid 5365 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 15 07:44:24 boss4 kernel: Pid: 5365, comm: ll_ost02_083
Nov 15 07:44:24 boss4 kernel:
Nov 15 07:44:24 boss4 kernel: Call Trace:
Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a6523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a67e6>] lu_object_find+0x16/0x20 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e015c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a73ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e11549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0df9fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dc7c1c>] ost_setattr+0x31c/0x990 [ost]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dcb746>] ost_handle+0x21e6/0x48e0 [ost]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa07423c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa043cd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov 15 07:44:24 boss4 kernel: [<ffffffffa074375e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 15 07:44:24 boss4 kernel:
Nov 15 07:44:24 boss4 kernel: LustreError: dumping log to /tmp/lustre-log.1384501464.5365
Nov 15 07:57:11 boss4 kernel: Lustre: 5360:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-367), not sending early reply
Nov 15 07:57:11 boss4 kernel: req@ffff88324df9f400 x1450789664028660/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1384502236 ref 2 fl Interpret:/0/0 rc 0/0
Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs

The filesystem grinds to a halt and after rebooting the OSS, recovery brings the OST back to the same place - 1 active RPC. We have to mount and abort_recov to bring things back to life.

Another example (different OSS/OST):

Nov 9 01:16:37 boss1 kernel: LNet: Service thread pid 16551 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 9 01:16:37 boss1 kernel: Pid: 16551, comm: ll_ost02_038
Nov 9 01:16:37 boss1 kernel:
Nov 9 01:16:37 boss1 kernel: Call Trace:
Nov 9 01:16:37 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov 9 01:16:37 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov 9 01:16:37 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:16:37 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 9 01:16:37 boss1 kernel:
Nov 9 01:16:37 boss1 kernel: LustreError: dumping log to /tmp/lustre-log.1383959797.16551
Nov 9 01:26:19 boss1 kernel: INFO: task ll_ost02_038:16551 blocked for more than 120 seconds.
Nov 9 01:26:19 boss1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 01:26:19 boss1 kernel: ll_ost02_038 D 0000000000000007 0 16551 2 0x00000080
Nov 9 01:26:19 boss1 kernel: ffff883f44a259b0 0000000000000046 0000000000000000 ffff883f44a25974
Nov 9 01:26:19 boss1 kernel: ffff883f44a25960 ffffc901077e7030 0000000000000246 0000000000000246
Nov 9 01:26:19 boss1 kernel: ffff883f44491ab8 ffff883f44a25fd8 000000000000fb88 ffff883f44491ab8
Nov 9 01:26:19 boss1 kernel: Call Trace:
Nov 9 01:26:19 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov 9 01:26:19 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov 9 01:26:19 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 9 01:26:19 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 9 01:26:44 boss1 kernel: Lustre: 16528:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Nov 9 01:26:44 boss1 kernel: req@ffff883e8281f400 x1450784570036048/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1383960409 ref 2 fl Interpret:/0/0 rc 0/0
Nov 9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:28:49 boss1 kernel: Lustre: Skipped 568 previous similar messages
Nov 9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 9 01:32:34 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting

If I can provide you with any more information next time it happens then let me know. |
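For anyone hitting the same stuck state, a minimal sketch of the "mount and abort_recov" workaround described above; the OST device and mount point names are placeholders, not values taken from this system:

# On the affected OSS: unmount the stuck OST if it is still mounted
umount /mnt/lustre/ost0040

# Remount it with recovery aborted so it stops waiting on the stuck export
mount -t lustre -o abort_recov /dev/mapper/ost0040 /mnt/lustre/ost0040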
| Comments |
| Comment by Daire Byrne (Inactive) [ 08/Jan/14 ] |
|
We are still seeing this around once/twice a week. We have to reboot the OSS server once it happens. Is there anything else we can do to help you root cause the issue? It seems to occur more often when we run a large deletion across the filesystem. Our workload is extremely metadata heavy with millions of links/unlinks per day.

Jan 8 15:24:43 boss2 kernel: LustreError: 26897:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Jan 8 15:31:48 boss2 kernel: LustreError: 18797:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Jan 8 15:46:17 boss2 kernel: LNet: Service thread pid 22361 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 8 15:46:17 boss2 kernel: Pid: 22361, comm: ll_ost02_039
Jan 8 15:46:17 boss2 kernel:
Jan 8 15:46:17 boss2 kernel: Call Trace:
Jan 8 15:46:17 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan 8 15:46:17 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:46:17 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:46:17 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 8 15:46:17 boss2 kernel:
Jan 8 15:46:17 boss2 kernel: LustreError: dumping log to /tmp/lustre-log.1389195977.22361
Jan 8 15:51:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan 8 15:51:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 8 15:51:58 boss2 kernel: ll_ost02_039 D 0000000000000015 0 22361 2 0x00000000
Jan 8 15:51:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan 8 15:51:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan 8 15:51:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan 8 15:51:58 boss2 kernel: Call Trace:
Jan 8 15:51:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan 8 15:51:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:51:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:51:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 8 15:53:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan 8 15:53:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 8 15:53:58 boss2 kernel: ll_ost02_039 D 0000000000000015 0 22361 2 0x00000000
Jan 8 15:53:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan 8 15:53:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan 8 15:53:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan 8 15:53:58 boss2 kernel: Call Trace:
Jan 8 15:53:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan 8 15:53:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:53:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:53:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 8 15:55:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan 8 15:55:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 8 15:55:58 boss2 kernel: ll_ost02_039 D 0000000000000015 0 22361 2 0x00000000
Jan 8 15:55:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan 8 15:55:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan 8 15:55:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan 8 15:55:58 boss2 kernel: Call Trace:
Jan 8 15:55:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan 8 15:55:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:55:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:55:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 8 15:56:24 boss2 kernel: Lustre: 22383:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Jan 8 15:56:24 boss2 kernel: req@ffff883243230c00 x1456592052377504/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1389196589 ref 2 fl Interpret:/0/0 rc 0/0
Jan 8 15:57:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan 8 15:57:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 8 15:57:58 boss2 kernel: ll_ost02_039 D 0000000000000015 0 22361 2 0x00000000
Jan 8 15:57:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan 8 15:57:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan 8 15:57:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan 8 15:57:58 boss2 kernel: Call Trace:
Jan 8 15:57:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan 8 15:57:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:57:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 8 15:57:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 8 15:58:21 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan 8 15:58:21 boss2 kernel: Lustre: Skipped 1 previous similar message
Jan 8 15:58:21 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan 8 15:58:46 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan 8 15:58:46 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan 8 15:59:12 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan 8 15:59:12 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan 8 15:59:37 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan 8 15:59:37 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs |
| Comment by Peter Jones [ 04/Mar/14 ] |
|
Hongchao

Could you please look into this one?

Thanks

Peter |
| Comment by Oleg Drokin [ 04/Mar/14 ] |
|
I think this is almost certainly a dup of |
| Comment by Hongchao Zhang [ 05/Mar/14 ] |
|
Yes, this issue should be the duplicate of

Hi Daire,

Could you please test with the patch in

Thanks |
| Comment by Daire Byrne (Inactive) [ 05/Mar/14 ] |
|
Okay. I will wait for us to hit this again (to make sure something else hasn't changed) and then apply the patch. Cheers. |
| Comment by Daire Byrne (Inactive) [ 07/Mar/14 ] |
|
Hit the bug a couple of times overnight. I've patched in the fix. I will report back if we see this again. |
| Comment by Hongchao Zhang [ 11/Mar/14 ] |
|
Hi Daire, |
| Comment by Daire Byrne (Inactive) [ 11/Mar/14 ] |
|
Well, there have been no further occurrences yet, but it took ~3 weeks between instances last time, so we will have to sit on this for a while and see what happens. Cheers. |
| Comment by Jason Hill (Inactive) [ 14/Mar/14 ] |
|
We are not seeing the ll_ost call traces, but we are seeing frequent hits of this in our syslogs for OSS nodes:

[root@lustre-mgmt1 apps]# grep "Mar 9" lustrekernel | grep "Error -2 syncing" | wc -l

What further debug can we provide? This is on the production filesystems here at ORNL. |
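A rough way to see how often this fires per day (a sketch only; "lustrekernel" here is the aggregated OSS syslog file used in the grep above):

# tally "Error -2 syncing" messages by date across the aggregated syslog
grep "Error -2 syncing" lustrekernel | awk '{print $1, $2}' | sort | uniq -c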
| Comment by Oleg Drokin [ 14/Mar/14 ] |
|
Error -2 syncing itself is a somewhat valid race, I would think, though I am not exactly sure how it would happen. |
| Comment by Hongchao Zhang [ 14/Mar/14 ] |
|
This message in the syslogs is printed in "ost_blocking_ast", and it is not an error case:

int ost_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
                     void *data, int flag)
{
        ...
        if (rc == 0 && flag == LDLM_CB_CANCELING &&
            (lock->l_granted_mode & (LCK_PW|LCK_GROUP)) &&
            (sync_lock_cancel == ALWAYS_SYNC_ON_CANCEL ||
             (sync_lock_cancel == BLOCKING_SYNC_ON_CANCEL &&
              lock->l_flags & LDLM_FL_CBPENDING))) {
                ...
                rc = obd_sync(&env, lock->l_export, oinfo,
                              lock->l_policy_data.l_extent.start,
                              lock->l_policy_data.l_extent.end, NULL);
                if (rc)
                        CERROR("Error %d syncing data on lock cancel\n", rc);

                OBDO_FREE(oa);
                OBD_FREE_PTR(oinfo);
        }
        ...
}

When the object is being deleted, the "LU_OBJECT_HEARD_BANSHEE" flag will be set, and ofd_sync will return -ENOENT (-2):

static int ofd_sync(const struct lu_env *env, struct obd_export *exp,
                    struct obd_info *oinfo, obd_size start, obd_size end,
                    struct ptlrpc_request_set *set)
{
        ...
        if (!ofd_object_exists(fo))
                GOTO(put, rc = -ENOENT);
        ...
}

static inline int ofd_object_exists(struct ofd_object *obj)
{
        LASSERT(obj != NULL);

        if (lu_object_is_dying(obj->ofo_obj.do_lu.lo_header))
                return 0;

        return lu_object_exists(&obj->ofo_obj.do_lu);
}

static inline int lu_object_is_dying(const struct lu_object_header *h)
{
        return test_bit(LU_OBJECT_HEARD_BANSHEE, &h->loh_flags);
}

This issue should be marked as a duplicate of |
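Whether obd_sync() is called at all on lock cancel is governed by the sync_lock_cancel tunable referenced in the code above. A quick way to check the policy in effect on an OSS (a sketch; the parameter name is assumed from the code above and from 2.x releases, so verify it on your version):

# show the sync-on-cancel policy for each OST ("never", "blocking" or "always")
lctl get_param obdfilter.*.sync_lock_cancel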
| Comment by Peter Jones [ 21/Mar/14 ] |
|
Daire

Have you had a chance to try out the patch?

Peter |
| Comment by Daire Byrne (Inactive) [ 22/Mar/14 ] |
|
Peter,

All our servers have been running with the patch from

Cheers, |
| Comment by Peter Jones [ 22/Mar/14 ] |
|
Fair enough - keep us posted! |
| Comment by Daire Byrne (Inactive) [ 02/Apr/14 ] |
|
Well, we normally would have seen an instance of this by now (~3 weeks since we patched). Feel free to close this ticket and I'll re-open it if required. Cheers. |
| Comment by Peter Jones [ 02/Apr/14 ] |
|
ok - thanks Daire! |