[LU-4286] OST - still busy with 1 active RPCs Created: 21/Nov/13  Updated: 02/Apr/14  Resolved: 02/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Daire Byrne (Inactive) Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 2
Labels: None
Environment:

EL6.4


Severity: 3
Rank (Obsolete): 11764

 Description   

We have hit this at least once a week for the last month. A seemingly random OST gets stuck refusing reconnections because it is still busy with 1 active RPC, following an ll_ost stack trace, e.g.:

Nov 15 07:23:21 boss4 kernel: LustreError: 28716:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:23:36 boss4 kernel: LustreError: 28276:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 1 previous similar message
Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Nov 15 07:43:31 boss4 kernel: LustreError: 16252:0:(ost_handler.c:1764:ost_blocking_ast()) Skipped 6 previous similar messages
Nov 15 07:44:24 boss4 kernel: LNet: Service thread pid 5365 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 15 07:44:24 boss4 kernel: Pid: 5365, comm: ll_ost02_083
Nov 15 07:44:24 boss4 kernel:
Nov 15 07:44:24 boss4 kernel: Call Trace:
Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a6523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a67e6>] lu_object_find+0x16/0x20 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e015c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa05a73ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0e11549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0df9fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dc7c1c>] ost_setattr+0x31c/0x990 [ost]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0dcb746>] ost_handle+0x21e6/0x48e0 [ost]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa07423c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa042b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa043cd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0739729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov 15 07:44:24 boss4 kernel: [<ffffffffa074375e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffffa0742c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 07:44:24 boss4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 15 07:44:24 boss4 kernel:
Nov 15 07:44:24 boss4 kernel: LustreError: dumping log to /tmp/lustre-log.1384501464.5365
Nov 15 07:57:11 boss4 kernel: Lustre: 5360:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-367), not sending early reply
Nov 15 07:57:11 boss4 kernel:  req@ffff88324df9f400 x1450789664028660/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1384502236 ref 2 fl Interpret:/0/0 rc 0/0
Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 07:59:32 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 07:59:57 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 08:00:22 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov 15 08:00:47 boss4 kernel: Lustre: bravo-OST0040: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs

The filesystem grinds to a halt, and after rebooting the OSS, recovery brings the OST back to the same state - 1 active RPC. We have to mount with abort_recov to bring things back to life (an example remount is shown after the second trace below). Another example (different OSS/OST):

Nov  9 01:16:37 boss1 kernel: LNet: Service thread pid 16551 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  9 01:16:37 boss1 kernel: Pid: 16551, comm: ll_ost02_038
Nov  9 01:16:37 boss1 kernel:
Nov  9 01:16:37 boss1 kernel: Call Trace:
Nov  9 01:16:37 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov  9 01:16:37 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov  9 01:16:37 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:16:37 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov  9 01:16:37 boss1 kernel:
Nov  9 01:16:37 boss1 kernel: LustreError: dumping log to /tmp/lustre-log.1383959797.16551
Nov  9 01:26:19 boss1 kernel: INFO: task ll_ost02_038:16551 blocked for more than 120 seconds.
Nov  9 01:26:19 boss1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  9 01:26:19 boss1 kernel: ll_ost02_038  D 0000000000000007     0 16551      2 0x00000080
Nov  9 01:26:19 boss1 kernel: ffff883f44a259b0 0000000000000046 0000000000000000 ffff883f44a25974
Nov  9 01:26:19 boss1 kernel: ffff883f44a25960 ffffc901077e7030 0000000000000246 0000000000000246
Nov  9 01:26:19 boss1 kernel: ffff883f44491ab8 ffff883f44a25fd8 000000000000fb88 ffff883f44491ab8
Nov  9 01:26:19 boss1 kernel: Call Trace:
Nov  9 01:26:19 boss1 kernel: [<ffffffffa03ca6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0545523>] lu_object_find_at+0xb3/0x360 [obdclass]
Nov  9 01:26:19 boss1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Nov  9 01:26:19 boss1 kernel: [<ffffffffa05457e6>] lu_object_find+0x16/0x20 [obdclass]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0da05c5>] ofd_object_find+0x35/0xf0 [ofd]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa05463ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0db0549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d98fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d66c1c>] ost_setattr+0x31c/0x990 [ost]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa0d6a746>] ost_handle+0x21e6/0x48e0 [ost]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06d8beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e13c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa03ca5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa03dbd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06d8729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffffa06e1c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov  9 01:26:19 boss1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov  9 01:26:44 boss1 kernel: Lustre: 16528:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Nov  9 01:26:44 boss1 kernel:  req@ffff883e8281f400 x1450784570036048/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1383960409 ref 2 fl Interpret:/0/0 rc 0/0
Nov  9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:28:49 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:28:49 boss1 kernel: Lustre: Skipped 568 previous similar messages
Nov  9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:29:14 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:29:39 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:30:04 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:30:29 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:30:54 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:31:19 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Nov  9 01:31:44 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Nov  9 01:32:34 boss1 kernel: Lustre: bravo-OST000d: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
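
For reference, the abort_recov workaround mentioned above is just a remount of the stuck OST with recovery skipped; the device and mount point below are placeholders for our actual layout:

# unmount the stuck OST, then remount it with recovery aborted
umount /lustre/ost0040
mount -t lustre -o abort_recov /dev/mapper/ost0040 /lustre/ost0040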

If I can provide any more information the next time it happens, please let me know.
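
One thing I can capture next time it happens is the decoded Lustre debug log that the watchdog dumps (the /tmp/lustre-log.* files above); assuming the standard lctl tooling, something like:

# convert the binary debug dump written by the watchdog into readable text
lctl debug_file /tmp/lustre-log.1384501464.5365 /tmp/lustre-log.1384501464.5365.txt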



 Comments   
Comment by Daire Byrne (Inactive) [ 08/Jan/14 ]

We are still seeing this around once or twice a week. We have to reboot the OSS server when it happens. Is there anything else we can do to help you root cause the issue? It seems to occur more often when we run a large deletion across the filesystem. Our workload is extremely metadata-heavy, with millions of links/unlinks per day.

Jan  8 15:24:43 boss2 kernel: LustreError: 26897:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Jan  8 15:31:48 boss2 kernel: LustreError: 18797:0:(ost_handler.c:1764:ost_blocking_ast()) Error -2 syncing data on lock cancel
Jan  8 15:46:17 boss2 kernel: LNet: Service thread pid 22361 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan  8 15:46:17 boss2 kernel: Pid: 22361, comm: ll_ost02_039
Jan  8 15:46:17 boss2 kernel: 
Jan  8 15:46:17 boss2 kernel: Call Trace:
Jan  8 15:46:17 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan  8 15:46:17 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:46:17 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:46:17 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan  8 15:46:17 boss2 kernel: 
Jan  8 15:46:17 boss2 kernel: LustreError: dumping log to /tmp/lustre-log.1389195977.22361
Jan  8 15:51:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan  8 15:51:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  8 15:51:58 boss2 kernel: ll_ost02_039  D 0000000000000015     0 22361      2 0x00000000
Jan  8 15:51:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan  8 15:51:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan  8 15:51:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan  8 15:51:58 boss2 kernel: Call Trace:
Jan  8 15:51:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan  8 15:51:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:51:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:51:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan  8 15:53:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan  8 15:53:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  8 15:53:58 boss2 kernel: ll_ost02_039  D 0000000000000015     0 22361      2 0x00000000
Jan  8 15:53:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan  8 15:53:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan  8 15:53:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan  8 15:53:58 boss2 kernel: Call Trace:
Jan  8 15:53:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan  8 15:53:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:53:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:53:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan  8 15:55:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan  8 15:55:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  8 15:55:58 boss2 kernel: ll_ost02_039  D 0000000000000015     0 22361      2 0x00000000
Jan  8 15:55:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan  8 15:55:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan  8 15:55:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan  8 15:55:58 boss2 kernel: Call Trace:
Jan  8 15:55:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan  8 15:55:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:55:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:55:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan  8 15:56:24 boss2 kernel: Lustre: 22383:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
Jan  8 15:56:24 boss2 kernel:  req@ffff883243230c00 x1456592052377504/t0(0) o2->bravo-MDT0000-mdtlov_UUID@10.21.22.50@tcp:0/0 lens 560/432 e 5 to 0 dl 1389196589 ref 2 fl Interpret:/0/0 rc 0/0
Jan  8 15:57:58 boss2 kernel: INFO: task ll_ost02_039:22361 blocked for more than 120 seconds.
Jan  8 15:57:58 boss2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  8 15:57:58 boss2 kernel: ll_ost02_039  D 0000000000000015     0 22361      2 0x00000000
Jan  8 15:57:58 boss2 kernel: ffff882f016bf9b0 0000000000000046 0000000000000000 ffff882f016bf974
Jan  8 15:57:58 boss2 kernel: ffff882f016bf960 ffffc900cd15a030 0000000000000246 0000000000000246
Jan  8 15:57:58 boss2 kernel: ffff8837d33bd058 ffff882f016bffd8 000000000000fb88 ffff8837d33bd058
Jan  8 15:57:58 boss2 kernel: Call Trace:
Jan  8 15:57:58 boss2 kernel: [<ffffffffa04af6fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa060e523>] lu_object_find_at+0xb3/0x360 [obdclass]
Jan  8 15:57:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:57:58 boss2 kernel: [<ffffffffa060e7e6>] lu_object_find+0x16/0x20 [obdclass]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa0db85c5>] ofd_object_find+0x35/0xf0 [ofd]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa060f3ee>] ? lu_env_init+0x1e/0x30 [obdclass]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa0dc8549>] ofd_lvbo_update+0x6d9/0xea8 [ofd]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa0db0fe3>] ofd_setattr+0x7f3/0xbd0 [ofd]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa0d7ec1c>] ost_setattr+0x31c/0x990 [ost]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa0d82746>] ost_handle+0x21e6/0x48e0 [ost]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07a1beb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07aa3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa04af5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa04c0d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07a1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07ab75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffffa07aac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan  8 15:57:58 boss2 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan  8 15:58:21 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan  8 15:58:21 boss2 kernel: Lustre: Skipped 1 previous similar message
Jan  8 15:58:21 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan  8 15:58:46 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan  8 15:58:46 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan  8 15:59:12 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan  8 15:59:12 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Jan  8 15:59:37 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) reconnecting
Jan  8 15:59:37 boss2 kernel: Lustre: bravo-OST001a: Client bravo-MDT0000-mdtlov_UUID (at 10.21.22.50@tcp) refused reconnection, still busy with 1 active RPCs
Comment by Peter Jones [ 04/Mar/14 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Oleg Drokin [ 04/Mar/14 ]

I think this is almost certainly a dup of LU-4019, there's a patch available in there.

Comment by Hongchao Zhang [ 05/Mar/14 ]

Yes, this issue should be a duplicate of LU-4019.

Hi Daire,

Could you please test with the patch in LU-4019 (http://review.whamcloud.com/#/c/7795/)?

Thanks

Comment by Daire Byrne (Inactive) [ 05/Mar/14 ]

Okay. I will wait for us to hit this again (to make sure something else hasn't changed) and then apply the patch. Cheers.

Comment by Daire Byrne (Inactive) [ 07/Mar/14 ]

Hit the bug a couple of times overnight. I've patched in the fix. I will report back if we see this again.

Comment by Hongchao Zhang [ 11/Mar/14 ]

Hi Daire,
How is the test going? Has the issue occurred again since applying the patch?
Thanks

Comment by Daire Byrne (Inactive) [ 11/Mar/14 ]

Well, there have been no further occurrences yet, but it took ~3 weeks between instances last time, so we will have to sit on this for a while and see what happens.

Cheers.

Comment by Jason Hill (Inactive) [ 14/Mar/14 ]

We are not seeing the ll_ost call traces, but we are seeing frequent hits of this in our syslogs for OSS nodes:

[root@lustre-mgmt1 apps]# grep "Mar 9" lustrekernel | grep "Error -2 syncing" | wc -l
508
[root@lustre-mgmt1 apps]# grep "Mar 10" lustrekernel | grep "Error -2 syncing" | wc -l
3420
[root@lustre-mgmt1 apps]# grep "Mar 11" lustrekernel | grep "Error -2 syncing" | wc -l
1076
[root@lustre-mgmt1 apps]# grep "Mar 12" lustrekernel | grep "Error -2 syncing" | wc -l
154
[root@lustre-mgmt1 apps]# grep "Mar 13" lustrekernel | grep "Error -2 syncing" | wc -l
2645

What further debugging information can we provide? This is on the production filesystems here at ORNL.

Comment by Oleg Drokin [ 14/Mar/14 ]

The "Error -2 syncing" message itself is a somewhat valid race, I would think, though I am not exactly sure how it would happen.
You might want to apply the LU-4019 patch just in case, since it does fix a real bug of course.

Comment by Hongchao Zhang [ 14/Mar/14 ]

This message in the syslogs comes from "ost_blocking_ast", and it is not an error case:

int ost_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
                     void *data, int flag)
{
        ...

        /* on cancel of a PW/GROUP extent lock, sync dirty data if the
         * sync_lock_cancel policy asks for it */
        if (rc == 0 && flag == LDLM_CB_CANCELING &&
            (lock->l_granted_mode & (LCK_PW|LCK_GROUP)) &&
            (sync_lock_cancel == ALWAYS_SYNC_ON_CANCEL ||
             (sync_lock_cancel == BLOCKING_SYNC_ON_CANCEL &&
              lock->l_flags & LDLM_FL_CBPENDING))) {

                ...

                rc = obd_sync(&env, lock->l_export, oinfo,
                              lock->l_policy_data.l_extent.start,
                              lock->l_policy_data.l_extent.end, NULL);
                if (rc)
                        /* this is the "Error -2 syncing data on lock cancel"
                         * message seen in the syslogs */
                        CERROR("Error %d syncing data on lock cancel\n", rc);

                OBDO_FREE(oa);
                OBD_FREE_PTR(oinfo);
        }

        ...
}

When the object is being deleted, the "LU_OBJECT_HEARD_BANSHEE" flag is set, so ofd_sync returns -ENOENT (-2):

static int ofd_sync(const struct lu_env *env, struct obd_export *exp,
                    struct obd_info *oinfo, obd_size start, obd_size end,
                    struct ptlrpc_request_set *set)
{                         
        ...
        if (!ofd_object_exists(fo))
                GOTO(put, rc = -ENOENT);
        ...
} 

static inline int ofd_object_exists(struct ofd_object *obj)
{               
        LASSERT(obj != NULL);
        if (lu_object_is_dying(obj->ofo_obj.do_lu.lo_header))
                return 0;          
        return lu_object_exists(&obj->ofo_obj.do_lu);
} 

static inline int lu_object_is_dying(const struct lu_object_header *h)
{               
        return test_bit(LU_OBJECT_HEARD_BANSHEE, &h->loh_flags);
}
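
For what it's worth, the sync in ost_blocking_ast only runs when the per-target "sync_lock_cancel" policy asks for it; assuming the usual obdfilter parameter namespace for OFD targets in the 2.4 era, the current setting can be checked with:

# show the sync-on-cancel policy for each OST target (never / blocking / always)
lctl get_param obdfilter.*.sync_lock_cancel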

This issue should be marked as a duplicate of LU-4019.

Comment by Peter Jones [ 21/Mar/14 ]

Daire

Have you had a chance to try out the LU-4019 patch?

Peter

Comment by Daire Byrne (Inactive) [ 22/Mar/14 ]

Peter,

All our servers have been running with the patch from LU-4019 for over a week now. Sometimes it took a couple of weeks to see this bug so we are waiting a bit longer before we resolve this one.

Cheers,

Comment by Peter Jones [ 22/Mar/14 ]

Fair enough - keep us posted!

Comment by Daire Byrne (Inactive) [ 02/Apr/14 ]

Well we normally would have seen an instance of this by now (~3 weeks since we patched). Feel free to close this ticket and I'll re-open if required.

Cheers.

Comment by Peter Jones [ 02/Apr/14 ]

ok - thanks Daire!
