[LU-1219] The connection is refused due to still busy with 1 active RPCs Created: 15/Mar/12 Updated: 20/Dec/12 Resolved: 20/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server : lustre-1.8.7, client : lustre-1.8.4.ddn2.2 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6434 |
| Description |
|
We got the following call trace at the customer site, and one OST refuses connections because it is still busy with 1 active RPCs.

Mar 15 00:14:14 t2s007059 kernel: Pid: 13147, comm: ll_ost_io_229
Mar 15 00:14:14 t2s007059 kernel:
Mar 15 00:14:14 t2s007059 kernel: Call Trace:
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8002e024>] __wake_up+0x38/0x4f
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff800a2dff>] autoremove_wake_function+0x0/0x2e
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88b1390b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88b54144>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff886da3a6>] lnet_ni_send+0x96/0xe0 [lnet]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88aeeede>] ost_checksum_bulk+0x36e/0x5a0 [ost]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88af5d09>] ost_brw_write+0x1c99/0x2480 [ost]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887ebac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887b68b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8008e7f9>] default_wake_function+0x0/0xe
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887f00a8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88af909e>] ost_handle+0x2bae/0x55b0 [ost]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88741d00>] class_handle2object+0xe0/0x170 [obdclass]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887aa19a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887af168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887ff6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887ffe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88800dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff887ffe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 15 00:14:14 t2s007059 kernel: |
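For triage, it can help to reduce a trace like the one above to a list of function and module names, which makes it easier to see where the I/O thread is waiting (here, in `jbd2_log_wait_commit`). The following is a minimal sketch, not part of the original report: the `parse_frames` helper and the `FRAME_RE` pattern are hypothetical names introduced for illustration, and labelling frames without a module suffix as `kernel` is an assumption about the log format. The sample text is copied from the trace above.

```python
import re

# Matches one kernel stack frame, for example:
#   [<ffffffff889fc7f3>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
# The trailing "[module]" part is optional.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return (function, module) pairs in the order they appear in text.

    Frames without a "[module]" suffix are labelled "kernel" (an
    assumption made for this sketch).
    """
    return [
        (m.group("func"), m.group("module") or "kernel")
        for m in FRAME_RE.finditer(text)
    ]


# Three frames copied from the trace in this ticket.
sample = (
    "Mar 15 00:14:14 t2s007059 kernel: [<ffffffff8002e024>] __wake_up+0x38/0x4f "
    "Mar 15 00:14:14 t2s007059 kernel: [<ffffffff889fc7f3>] "
    "jbd2_log_wait_commit+0xa3/0xf5 [jbd2] "
    "Mar 15 00:14:14 t2s007059 kernel: [<ffffffff88b1390b>] "
    "fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]"
)

frames = parse_frames(sample)
for func, module in frames:
    print(f"{module:16s} {func}")
```

Because the pattern keys on the `[<address>]` marker rather than on line breaks, it should work on both the one-frame-per-line form and on log text that has been flattened into a single line.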
| Comments |
| Comment by Peter Jones [ 15/Mar/12 ] |
|
Hongchao, could you please help with this one? Thanks, Peter |
| Comment by Andreas Dilger [ 15/Mar/12 ] |
|
This is possibly a duplicate of |
| Comment by Oleg Drokin [ 15/Mar/12 ] |
|
The underlying issue is the write stuck in jbd somehow, |
| Comment by Hongchao Zhang [ 20/Mar/12 ] |
|
The journal is stuck while committing the previous transaction. Is information about the processes on this node available? |
| Comment by Shuichi Ihara (Inactive) [ 20/Mar/12 ] |
|
Hongchao, unfortunately we only have /var/log/messages and /tmp/lustre-log.<timestamp>. If the problem happens again, we can collect all the information you need. |
| Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ] |
|
Hi, |
| Comment by Hongchao Zhang [ 31/Mar/12 ] |
|
Hi Shuichi, sorry for the delayed response. |
| Comment by Andreas Dilger [ 20/Dec/12 ] |
|
Close this old issue until there is more information available. |