[LU-1934] still busy with active RPCs for days Created: 13/Sep/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 4414 |
| Description |
|
OST refuses reconnection from a client with 'still busy with 10 active RPCs'. The OST is failing to time out the active RPCs and has been in this state for over a day. The client continually attempts to reconnect. We have a few client nodes in this state that can be used to investigate further. LLNL-bugzilla-ID: 1495 |
| Comments |
| Comment by Ned Bass [ 13/Sep/12 ] |
|
Grepping for a stuck client NID in /proc/fs/lustre/ost/OSS/ost/req_history shows:
1269062278:172.18.102.20@tcp:12345-172.18.110.123@tcp:x1405404304029454:296:Interpret:1347406306:-1347406306s(-1347407432s) opc 101
I believe this is an LDLM lock enqueue. |
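For reference, the trailing "opc" field is what identifies the request type, and opcode 101 is LDLM_ENQUEUE in Lustre's wire protocol, which is consistent with the lock-enqueue reading above. Below is a minimal user-space sketch for pulling the opcode out of req_history lines; the opcode constants mirror the LDLM values from Lustre's lustre_idl.h, while the line parsing is only an assumption based on the sample entry above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* LDLM opcode values as defined in Lustre's wire protocol (lustre_idl.h). */
static const char *opc_name(int opc)
{
        switch (opc) {
        case 101: return "LDLM_ENQUEUE";
        case 102: return "LDLM_CONVERT";
        case 103: return "LDLM_CANCEL";
        case 104: return "LDLM_BL_CALLBACK";
        case 105: return "LDLM_CP_CALLBACK";
        case 106: return "LDLM_GL_CALLBACK";
        default:  return "other";
        }
}

int main(void)
{
        char line[1024];

        /* Read req_history lines on stdin, e.g.
         *   cat /proc/fs/lustre/ost/OSS/ost/req_history | ./decode
         * (the program name is hypothetical) and report the opcode of each
         * entry.  Only the trailing "opc N" token is interpreted here. */
        while (fgets(line, sizeof(line), stdin) != NULL) {
                char *p = strstr(line, "opc ");

                if (p != NULL)
                        printf("opc %d = %s\n", atoi(p + 4),
                               opc_name(atoi(p + 4)));
        }
        return 0;
}

For the entry above this prints "opc 101 = LDLM_ENQUEUE".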
| Comment by Ned Bass [ 13/Sep/12 ] |
|
A couple of service threads on the OSS just completed after about 48 hours and the client reconnected. I dumped all thread backtraces while they were hung, and both looked like this (hand-copied):
? ptl_send_rpc
schedule_timeout
? process_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
? default_wake_function
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
? ldlm_server_completion_ast
? ost_blocking_ast
? ldlm_server_glimpse_ast
ost_handle
? lustre_msg_get_transno
ptlrpc_main
child_rip |
| Comment by Oleg Drokin [ 14/Sep/12 ] |
|
When the threads completed I bet there were some messages about how they took too long, and then before completion probably something about failed glimpse AST or some such. Can you please show an example of those messages? |
| Comment by Ned Bass [ 14/Sep/12 ] |
|
dusk23 2012-09-13 16:47:29 ... ptlrpc_server_handle_request() @@@ Request took longer than estimated (940:172446s); client may timeout. req@... x.../t0(0) o101->...@172.18.110.123@tcp:0/0 lens 296/352 e 1 to 0 dl 134707603 ref 1 fl Complete:/0/0 rc 301/301
I think I need to paint a more complete picture here. There seems to have been some interaction between multiple clients and OSTs:
OSS/OST: dusk23/OST0143 (where the above message was logged)
graph555 was unable to reconnect to dusk20/OST0143 (busy with 1 active RPCs) and to dusk23/OST0166 (busy with 4 active RPCs)
This had been going on for some days. At 2012-09-13 16:42:30 dusk20 was powered off by its failover partner (it seems to have locked up after I did a sysrq-t). Exactly 5 minutes later (the recovery window for dusk20's failover partner) the above message was logged, and client connections were restored to all OSTs.
So it appears that the processing of cslic5's RPCs by dusk23 was somehow blocked by graph555. I confess I don't quite understand how such a dependency comes about. But I guess the question is, why did dusk20 fail to handle or time out the active RPC from graph555? |
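For context on the refusal itself: the "refused reconnection, still busy with N active RPCs" message comes from the server-side reconnect path, which turns a reconnect away with -EBUSY as long as the export for that client still has requests being processed. So a single service thread stuck handling a request from a client is enough to lock that client out indefinitely. The sketch below only illustrates the shape of that check; the structure and function names are stand-ins, not the actual reconnect-handling code in ldlm_lib.c.

#include <errno.h>
#include <stdio.h>

/* Illustrative stand-in for the per-client export state the server keeps;
 * the real thing is struct obd_export in the Lustre sources. */
struct client_export {
        const char *nid;
        int         rpc_count;   /* requests from this client still being handled */
};

/* Sketch of the reconnect decision: while any request from the client is
 * still in flight on the server, the reconnect is refused, which is what
 * produces the "still busy with N active RPCs" console message. */
static int handle_reconnect(struct client_export *exp)
{
        if (exp->rpc_count > 0) {
                printf("client %s refused reconnection, still busy with %d active RPCs\n",
                       exp->nid, exp->rpc_count);
                return -EBUSY;
        }
        printf("client %s reconnected\n", exp->nid);
        return 0;
}

int main(void)
{
        struct client_export graph555 = { "172.18.110.123@tcp", 1 };

        /* The enqueue from graph555 is pinned on a service thread that is
         * waiting on a glimpse AST, so rpc_count never drops to zero and
         * every reconnect attempt fails the same way ... */
        handle_reconnect(&graph555);

        /* ... until the stuck request finally completes (in this ticket,
         * after dusk20 was power-cycled), and the reconnect then succeeds. */
        graph555.rpc_count = 0;
        handle_reconnect(&graph555);
        return 0;
}

The point the sketch tries to make is that the client being refused need not itself be misbehaving; it only needs one of its own requests pinned on a stuck service thread.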
| Comment by Ned Bass [ 14/Sep/12 ] |
|
The req_history entry in my first comment was the active RPC from graph555 on dusk20. I found an entry for that xid in dusk20's logs, preceded several minutes earlier by a watchdog stack trace:
2012-09-11 16:41:46 Lustre: Service thread pid 6655 was inactive for 600.00s ... dumping stack trace
Pid: 6655, comm: ll_ost_186
schedule_timeout
cfs_waitq_timedwait
ptlrpc_set_wait
ptlrpc_queue_wait
ldlm_server_glimpse_ast
filter_intent_policy
ldlm_lock_enqueue
ldlm_handle_enqueue0
ldlm_handle_enqueue
ost_handle
ptlrpc_main
child_rip
2012-09-11 16:50:22 Lustre: 6866:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (10/-516), not sending early reply
req@... x1405404304029454/t0(0) o101->...@172.18.110.123@tcp:0/0 lens 296/352 e 4 to 0 dl 134707432 ref 2 fl Interpret:/0/0 rc 0/0
Later, when I dumped stack traces at 2012-09-13 16:41:46, thread ll_ost_186 had basically the exact same stack trace. |
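The common thread in both traces is a service thread parked in ptlrpc_set_wait()/ptlrpc_queue_wait() after sending a glimpse AST to a client: the original enqueue cannot complete, and the request stays "active" against that client's export, until the glimpse reply arrives or the wait gives up. Below is a minimal sketch of that wait-with-deadline pattern; the function names and the polling loop are illustrative only, not the Lustre implementation.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Illustrative stand-in for "has the client answered the glimpse AST yet?".
 * In the hang described in this ticket the answer never becomes yes. */
static int reply_arrived(void)
{
        return 0;
}

/* Wait until the reply arrives or the deadline passes.  The failure mode in
 * this ticket behaves as if the deadline were never enforced (or kept being
 * pushed out), so the service thread sits in a wait like this for ~48 hours
 * and the enqueue it was handling never completes. */
static int wait_for_glimpse_reply(time_t deadline)
{
        while (!reply_arrived()) {
                if (time(NULL) >= deadline) {
                        fprintf(stderr, "glimpse AST timed out; give up\n");
                        return -1;
                }
                sleep(1);
        }
        return 0;
}

int main(void)
{
        /* With an enforced deadline the wait resolves in bounded time. */
        return wait_for_glimpse_reply(time(NULL) + 3) ? 1 : 0;
}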
| Comment by Peter Jones [ 14/Sep/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Client reconnection has been fixed. |