[LU-1934] still busy with active RPCs for days Created: 13/Sep/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

https://github.com/chaos/lustre/commits/2.1.2-3chaos


Issue Links:
Related
is related to LU-1949 SWL - mds wedges 'still busy with 1 R... Resolved
Severity: 3
Rank (Obsolete): 4414

 Description   

OST refuses reconnection from client with 'still busy with 10 active RPCs'. The OST is failing to time out the active RPCs and has been in this state for over a day. Client continually attempts to reconnect. We have a few client nodes in this state that can be used to investigate further.

LLNL-bugzilla-ID: 1495
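
For background, the refusal comes from the server's reconnect path: as long as the client's export still has requests being processed, the reconnect attempt is rejected with this message, so nothing recovers until those requests complete or are timed out. Below is a minimal stand-alone sketch of that behaviour; the struct fields and function names are simplified placeholders, not the actual Lustre reconnect-handling code.

/* Stand-alone sketch (not the actual Lustre source): a target refuses a
 * reconnect while the client's export still has in-flight requests, which
 * is what produces the "still busy with N active RPCs" refusal.  Field and
 * function names are simplified placeholders. */
#include <stdio.h>

struct export_state {
	int rpc_count;	/* requests from this client still being handled */
};

/* Return 0 if the reconnect is accepted, non-zero if it is refused. */
static int handle_reconnect(struct export_state *exp, const char *client_nid)
{
	if (exp->rpc_count > 0) {
		printf("refused reconnection from %s, still busy with %d active RPCs\n",
		       client_nid, exp->rpc_count);
		return -1;
	}
	return 0;
}

int main(void)
{
	struct export_state exp = { .rpc_count = 10 };

	/* As long as the stuck service threads never finish or time out,
	 * rpc_count never drops and every reconnect attempt is refused. */
	handle_reconnect(&exp, "172.18.110.123@tcp");
	return 0;
}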



 Comments   
Comment by Ned Bass [ 13/Sep/12 ]

Grepping a stuck client NID in /proc/fs/lustre/ost/OSS/ost/req_history shows

1269062278:172.18.102.20@tcp:12345-172.18.110.123@tcp:x1405404304029454:296:Interpret:1347406306:-1347406306s(-1347407432s) opc 101

I believe this is an LDLM lock enqueue.
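
For reference, opcode 101 is LDLM_ENQUEUE, which is consistent with the lock-enqueue reading. Below is a rough stand-alone sketch of how such a history line splits into fields; the field meanings in the comment (sequence number, local NID, client peer ID, XID, request length, phase, arrival time, and two age/deadline deltas) are my reading of the dump format and should be treated as an assumption, not a definitive reference.

/* Stand-alone sketch (not Lustre code): split one req_history line into its
 * fields.  Assumed layout:
 *   seq : local NID : client peer ID : xid : req length : phase :
 *   arrival time : age "s(" deadline delta "s)" opc N */
#include <stdio.h>

int main(void)
{
	const char *line =
	    "1269062278:172.18.102.20@tcp:12345-172.18.110.123@tcp:"
	    "x1405404304029454:296:Interpret:1347406306:"
	    "-1347406306s(-1347407432s) opc 101";
	unsigned long long seq, xid;
	long arrival, age, dl_delta;
	int len, opc;
	char self[64], peer[64], phase[32];

	if (sscanf(line,
		   "%llu:%63[^:]:%63[^:]:x%llu:%d:%31[^:]:%ld:%lds(%lds) opc %d",
		   &seq, self, peer, &xid, &len, phase,
		   &arrival, &age, &dl_delta, &opc) == 10)
		printf("xid %llu from %s, phase %s, opc %d (101 = LDLM_ENQUEUE)\n",
		       xid, peer, phase, opc);
	return 0;
}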

Comment by Ned Bass [ 13/Sep/12 ]

A couple of service threads on the OSS just completed after about 48 hours and the client reconnected. I dumped all thread backtraces while they were hung, and both looked like this (hand-copied):

? ptl_send_rpc
schedule_timeout
? process_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
? default_wake_function
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
? ldlm_server_completion_ast
? ost_blocking_ast
? ldlm_server_glimpse_ast
ost_handle
? lustre_msg_get_transno
ptlrpc_main
child_rip
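
For what it's worth, the stack shows the service thread blocked in ptlrpc_set_wait() while ldlm_server_glimpse_ast() waits for the client's reply to a glimpse callback, so the thread cannot finish until the reply arrives or the RPC is timed out. A Lustre-free sketch of that wait-with-deadline pattern follows; all names in it are hypothetical, and it only models the general shape of the wait.

/* Lustre-free sketch of the wait the stack implies: a service thread sends a
 * callback RPC to a client and blocks until a reply arrives or a deadline
 * passes.  If the deadline never fires (or keeps being pushed out), the
 * thread stays here and the client's export keeps a non-zero RPC count.
 * All names below are hypothetical. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct pending_rpc {
	pthread_mutex_t lock;
	pthread_cond_t  done;
	bool            replied;
	time_t          deadline;	/* absolute time to give up */
};

/* Block until the reply arrives or the deadline expires; true on reply. */
static bool wait_for_reply(struct pending_rpc *rpc)
{
	struct timespec ts = { .tv_sec = rpc->deadline };
	bool ok;

	pthread_mutex_lock(&rpc->lock);
	while (!rpc->replied) {
		if (pthread_cond_timedwait(&rpc->done, &rpc->lock, &ts) != 0)
			break;	/* timed out: stop waiting for this client */
	}
	ok = rpc->replied;
	pthread_mutex_unlock(&rpc->lock);
	return ok;
}

int main(void)
{
	struct pending_rpc rpc = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.done = PTHREAD_COND_INITIALIZER,
		.replied = false,
		.deadline = time(NULL) + 5,	/* sane deadline; the problem case is when this never fires */
	};

	printf("reply %s\n", wait_for_reply(&rpc) ? "received" : "timed out");
	return 0;
}
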
Comment by Oleg Drokin [ 14/Sep/12 ]

When the threads completed, I bet there were some messages about how they took too long, and before completion probably something about a failed glimpse AST or some such. Can you please show an example of those messages?

Comment by Ned Bass [ 14/Sep/12 ]

dusk23 2012-09-13 16:47:29 ... ptlrpc_server_handle_request() @@@ Request took longer than estimated (940:172446s); client may timeout.
req@... x.../t0(0) o101->...@172.18.110.123@tcp:0/0 lens 296/352 e 1 to 0 dl 134707603 ref 1 fl Complete:/0/0 rc 301/301

I think I need to paint a more complete picture here. There seems to have been some interaction between multiple clients and OSTs:

OSS/OST: dusk23/OST0143 (where the above message was logged)
OSS/OST: dusk20/OST0166
client: graph555 (172.18.110.123 from above message)
client: cslic5

graph555 was unable to reconnect to dusk20/OST0143 (busy with 1 active RPCs) and to dusk23/OST0166 (busy with 4 active RPCs)
cslic5 was unable to reconnect to dusk23/OST0166 (busy with 10 active RPCs)

This had been going on for some days.

At 2012-09-13 16:42:30 dusk20 was powered off by its failover partner (it seems to have locked up after I did a sysrq-t).

Exactly 5 minutes later (recovery window for dusk20's failover partner) the above message was logged, and client connections were restored to all OSTs. So it appears that the processing of cslic5's RPCs by dusk23 was blocked somehow by graph555. I confess I don't quite understand how such a dependency comes about. But I guess the question is, why did dusk20 fail to handle or time out the active RPC from graph555?

Comment by Ned Bass [ 14/Sep/12 ]

The req_history entry in my first comment was the active RPC from graph555 on dusk20. I found an entry for that xid in dusk20's logs (preceded several minutes earlier by a watchdog stack trace):

2012-09-11 16:41:46 Lustre: Service thread pid 6655 was inactive for 600.00s ... dumping stack trace
Pid: 6655, comm: ll_ost_186
schedule_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
ost_handle
ptlrpc_main
child_rip

2012-09-11 16:50:22 Lustre: 6866:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-516), not sending early reply
    req@... x1405404304029454/t0(0) o101->...@172.18.110.123@tcp:0/0 lens 296/352 e 4 to 0 dl 134707432 ref 2 fl Interpret:/0/0 rc 0/0

Later, when I dumped stack traces at 2012-09-13 16:41:46, thread ll_ost_186 had basically the exact same stack trace.
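
Regarding the "Couldn't add any time (10/-516)" line: my reading is that the adaptive-timeout code wanted to send an early reply to push the client's deadline out, but decided it could not usefully extend the deadline any further (the negative number suggests it had already passed), so it gave up and left the request sitting. A simplified sketch of that kind of decision is below; the function name and the exact meaning of the two numbers are assumptions, not the real service.c logic.

/* Sketch only, not the real ptlrpc_at_send_early_reply() logic: only send an
 * early reply if the request deadline can actually be pushed further out.
 * Once the deadline has already been missed there is nothing to extend. */
#include <stdbool.h>
#include <stdio.h>

static bool should_send_early_reply(long extra_time, long time_left)
{
	if (time_left < 0)	/* deadline already blown */
		return false;
	if (extra_time <= 0)	/* nothing to add */
		return false;
	return true;
}

int main(void)
{
	/* Roughly the situation in the log line above (assumed meanings). */
	if (!should_send_early_reply(10, -516))
		printf("Couldn't add any time (10/-516), not sending early reply\n");
	return 0;
}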

Comment by Peter Jones [ 14/Sep/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 29/May/17 ]

Client reconnection has been fixed.
