[LU-1219] The connection is refused due to still busy with 1 active RPCs Created: 15/Mar/12  Updated: 20/Dec/12  Resolved: 20/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Incomplete Votes: 0
Labels: None
Environment:

server : lustre-1.8.7, client : lustre-1.8.4.ddn2.2


Attachments: File messages.t2s007059    
Severity: 3
Rank (Obsolete): 6434

 Description   

We got the following call traces at the customer site, and one OST refuses connections because it is still busy with 1 active RPC.

Mar 15 00:14:14 t2s007059 kernel: Pid: 13147, comm: ll_ost_io_229
Mar 15 00:14:14 t2s007059 kernel: 
Mar 15 00:14:14 t2s007059 kernel: Call Trace:
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8002e024>] __wake_up+0x38/0x4f
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b1390b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88b54144>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff886da3a6>] lnet_ni_send+0x96/0xe0 [lnet]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88aeeede>] ost_checksum_bulk+0x36e/0x5a0 [ost]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88af5d09>] ost_brw_write+0x1c99/0x2480 [ost]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ebac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887b68b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887f00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88af909e>] ost_handle+0x2bae/0x55b0 [ost]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88741d00>] class_handle2object+0xe0/0x170 [obdclass]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887aa19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887af168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ff6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ffe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff88800dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff887ffe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 15 00:14:14 t2s007059 kernel: 


 Comments   
Comment by Peter Jones [ 15/Mar/12 ]

Hongchao

Could you please help with this one?

Thanks

Peter

Comment by Andreas Dilger [ 15/Mar/12 ]

This is possibly a duplicate of LU-793, for which Oleg already has a patch.

Comment by Oleg Drokin [ 15/Mar/12 ]

The underlying issue is that the write is stuck in jbd somehow. LU-793 would not fix this; it just papers over some of the symptoms.

Comment by Hongchao Zhang [ 20/Mar/12 ]

The journal is stuck committing the previous transaction. Is the process information from this node available?
The stack trace of "kjournald2" should give some indication of where it was stuck. Thanks!

Comment by Shuichi Ihara (Inactive) [ 20/Mar/12 ]

Hongchao,

Unfortunately, we only have /var/log/messages and /tmp/lustre-log.<timestamp>

If the problem happens again, we can collect all the information you want.
Please let us know which commands we should run before rebooting the servers.

Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ]

Hi,
The customer is waiting to hear what they should collect the next time the same problem happens.
Please let me know exactly what we should do.

Comment by Hongchao Zhang [ 31/Mar/12 ]

Hi Shuichi

Sorry for the delayed response.
Please dump the thread info of all tasks via sysrq (echo "t" > /proc/sysrq-trigger) on the node where the OST resides,
and it would also help to capture where the "kjournald2" process is currently running at that moment. Thanks!
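
For reference, a minimal collection sketch along those lines (an assumption-laden example, not an exact procedure: it presumes root access on the OSS node, that sysrq is usable, and that the kernel exposes /proc/<pid>/stack; on this RHEL5-era kernel that file may be absent, in which case the sysrq-t dump captured in syslog/dmesg is the fallback):

# enable sysrq if it is not already on (run as root on the OSS node)
echo 1 > /proc/sys/kernel/sysrq

# dump the stacks of all tasks into the kernel ring buffer / syslog
echo t > /proc/sysrq-trigger

# save the resulting dump alongside the existing logs
dmesg > /tmp/sysrq-t.$(date +%s).log

# if the kernel provides per-task stack files, capture kjournald2 directly
# (on newer kernels the journal thread may be named jbd2/<device> instead)
for pid in $(pgrep kjournald2); do
    echo "== kjournald2 pid $pid =="
    cat /proc/$pid/stack 2>/dev/null
done > /tmp/kjournald2-stacks.log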

Comment by Andreas Dilger [ 20/Dec/12 ]

Closing this old issue until more information is available.
