[LU-1505] "disconnect stale client" hangs OSS
Created: 11/Jun/12  Updated: 21/Mar/13  Resolved: 21/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.1.1
Fix Version/s: None

Type: Improvement
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Peter Jones
Resolution: Fixed
Votes: 0
Labels: None

Rank (Obsolete): 4046

 Description   

After recovery, "LustreError: 5866:0:(genops.c:1272:class_disconnect_stale_exports()) nbp6-OST004a: disconnect stale client 3e1a1726-5e53-1be8-2c0f-22adae5b4093@<unknown>" messages hang the system while one is printed for every OST and every client.
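
To gauge how many of these messages a node emitted, the log can be tallied per OST target. The following is a minimal sketch, not a Lustre tool: the function name count_stale_disconnects and the regular expression are assumptions based only on the message format quoted above, and other Lustre versions may format the line differently.

```python
import re
from collections import Counter

# Assumed pattern, derived from the single message quoted in this ticket:
#   ...class_disconnect_stale_exports()) <target>: disconnect stale client <client>
STALE_RE = re.compile(
    r"class_disconnect_stale_exports\(\)\)\s+(?P<target>\S+): "
    r"disconnect stale client\s+(?P<client>\S+)"
)


def count_stale_disconnects(lines):
    """Return a Counter mapping each OST target to its stale-disconnect message count."""
    per_target = Counter()
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            per_target[match.group("target")] += 1
    return per_target


# Sample input: the message from this ticket plus one unrelated line.
sample = [
    "LustreError: 5866:0:(genops.c:1272:class_disconnect_stale_exports()) "
    "nbp6-OST004a: disconnect stale client "
    "3e1a1726-5e53-1be8-2c0f-22adae5b4093@<unknown>",
    "INFO: task ll_log_process:6350 blocked for more than 120 seconds.",
]
counts = count_stale_disconnects(sample)
for target, n in sorted(counts.items()):
    print(target, n)
```

In practice the lines would come from the console or syslog capture of the affected OSS rather than the inline sample.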

We also see:
INFO: task ll_log_process:6350 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_log_proces D 0000000000000006 0 6350 2 0x00000080
ffff8808f59b1c10 0000000000000046 ffff8808f59b1ba0 ffffffff81060839
ffff8808f59b1b80 0000000000000282 0000000000800500 0000000000000000
ffff880c13fc5b38 ffff8808f59b1fd8 000000000000f5a0 ffff880c13fc5b38
Call Trace:
[<ffffffff81060839>] ? wake_up_new_task+0xd9/0x130
[<ffffffff81522215>] schedule_timeout+0x215/0x2e0
[<ffffffffa05ecd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff8100c0e2>] ? kernel_thread+0x82/0xe0
[<ffffffffa05ecd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81521e93>] wait_for_common+0x123/0x180
[<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
[<ffffffffa0538c9a>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05f1280>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff81521fad>] wait_for_completion+0x1d/0x20
[<ffffffffa05ebb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0742b57>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05eeef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05edfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0aab6f0>] ? filter_recov_log_mds_ost_cb+0x0/0xb70 [obdfilter]
[<ffffffffa0aab6f0>] ? filter_recov_log_mds_ost_cb+0x0/0xb70 [obdfilter]
[<ffffffffa05ef056>] llog_cat_process+0x16/0x20 [obdclass]
[<ffffffffa05f02a1>] llog_cat_process_thread+0x121/0x850 [obdclass]
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task ll_log_process:6354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_log_proces D 0000000000000006 0 6354 2 0x00000080
ffff8808f5ac1c10 0000000000000046 ffff8808f5ac1bc0 ffffffff810097dc
ffff880bc2006b38 0000000000000000 0000000000ac1bd0 ffff880028393b40
ffff880bc2a8db38 ffff8808f5ac1fd8 000000000000f5a0 ffff880bc2a8db38
Call Trace:
[<ffffffff810097dc>] ? __switch_to+0x1ac/0x320
[<ffffffff815213ae>] ? thread_return+0x4e/0x760
[<ffffffff81522215>] schedule_timeout+0x215/0x2e0
[<ffffffffa05ecd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81521e93>] wait_for_common+0x123/0x180
[<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
[<ffffffffa0538c9a>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05f1280>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff81521fad>] wait_for_completion+0x1d/0x20
[<ffffffffa05ebb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0742b57>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05eeef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05edfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0aab6f0>] ? filter_recov_log_mds_ost_cb+0x0/0xb70 [obdfilter]
[<ffffffffa0aab6f0>] ? filter_recov_log_mds_ost_cb+0x0/0xb70 [obdfilter]
[<ffffffffa05ef056>] llog_cat_process+0x16/0x20 [obdclass]
[<ffffffffa05f02a1>] llog_cat_process_thread+0x121/0x850 [obdclass]
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffffa05f0180>] ? llog_cat_process_thread+0x0/0x850 [obdclass]
[<ffffffff8100c140>] ? child_rip+0x0/0x20



 Comments   
Comment by Peter Jones [ 11/Jun/12 ]

Mahmoud

I believe that this issue has been dealt with as one of the cleanups tracked under LU-1095, and so it should not appear in 2.1.2.

Peter

Comment by Peter Jones [ 21/Mar/13 ]

We can reopen this if it appears on a more current release.

Generated at Sat Feb 10 01:17:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.