[LU-2621] Single client timeout hangs MDS - related to LU-793 Created: 15/Jan/13  Updated: 05/Mar/13  Resolved: 05/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: MB
Environment:

Hyperion/RHEL6


Issue Links:
Related
is related to LU-793 Reconnections should not be refused w... Resolved
Severity: 3
Rank (Obsolete): 6132

 Description   

Running mdtest, file-per-process. A single client times out a request, then the MDS enters the 'waiting on 1 RPC' state and all clients eventually get -EBUSY. This bug is to document the sequence as I'm currently seeing it on Hyperion.
The first client error:

2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358283526/real 1358283526]  req@ffff88030ca02400 x1424249167482022/t0(0) o101->lustre-MDT0000-mdc-ffff88033c3dac00@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1358284286 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 1 previous similar message
2013-01-15 13:11:26 Lustre: lustre-MDT0000-mdc-ffff88033c3dac00: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
2013-01-15 13:11:26 LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16

The MDS log

Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff88012a324850 x1424249167482022/t0(0) o101->ad6c708f-3715-63c3-9874-6577bf49a8f7@192.168.117.84@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff880161327850 x1424249159092332/t0(0) o101->7ae936ba-6abb-4279-4d8d-6075df2b44ca@192.168.116.112@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
Jan 15 13:09:10 hyperion-rst6 kernel: Lustre: 7251:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Jan 15 13:09:10 hyperion-rst6 kernel: req@ffff88015b7d1050 x1424249191598931/t0(0) o35->67291746-09fb-9e08-cd08-b1a1abb10ad0@192.168.119.118@o2ib1:0/0 lens 392/2024 e 3 to 0 dl 1358284155 ref 2 fl Interpret:/0/0 rc 0/0
Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client ad6c708f-3715-63c3-9874-6577bf49a8f7 (at 192.168.117.84@o2ib1) reconnecting
Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 7ae936ba-6abb-4279-4d8d-6075df2b44ca (at 192.168.116.112@o2ib1) refused reconnection, still busy with 1 active RPCs

Requires restart of MDS to clear.
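
For context, below is a minimal sketch of the failure pattern described above, written in plain C with hypothetical names (client_export, rpcs_in_flight, handle_reconnect); it is not the actual Lustre connect/export code. The assumption it illustrates is that the server counts RPCs still being processed for each client export and refuses reconnection with -EBUSY (-16) while that count is non-zero, so a single handler that never completes keeps refusing every reconnect until the MDS is restarted.

/* Illustrative sketch only, not Lustre source: a per-client export keeps a
 * count of RPCs accepted but not yet replied to.  A hung request handler
 * never drops its count, so every reconnect attempt is refused with -EBUSY,
 * matching the "refused reconnection, still busy with 1 active RPCs" and
 * "mds_connect operation failed with -16" messages in the logs above. */
#include <stdio.h>

#define EBUSY 16

/* Hypothetical per-client export state. */
struct client_export {
        const char *uuid;
        int rpcs_in_flight;     /* RPCs accepted but not yet replied to */
};

/* A handler takes a reference for the lifetime of the request... */
static void rpc_start(struct client_export *exp)  { exp->rpcs_in_flight++; }
/* ...and drops it only when the reply is sent; a stuck handler never does. */
static void rpc_finish(struct client_export *exp) { exp->rpcs_in_flight--; }

/* Reconnection is refused while any RPC for this export is still active. */
static int handle_reconnect(struct client_export *exp)
{
        if (exp->rpcs_in_flight > 0) {
                printf("Client %s refused reconnection, still busy with %d active RPCs\n",
                       exp->uuid, exp->rpcs_in_flight);
                return -EBUSY;  /* client logs "mds_connect ... failed with -16" */
        }
        printf("Client %s reconnecting\n", exp->uuid);
        return 0;
}

int main(void)
{
        struct client_export exp = { "7ae936ba-6abb-4279-4d8d-6075df2b44ca", 0 };

        rpc_start(&exp);        /* request arrives, handler gets stuck */
        handle_reconnect(&exp); /* -> -EBUSY, the state seen on Hyperion */

        rpc_finish(&exp);       /* only if the handler ever completes... */
        handle_reconnect(&exp); /* ...does a later reconnect succeed */
        return 0;
}

In this simplified model the only way out, short of the handler finishing, is to restart the server, which matches the observation that a restart of the MDS is required to clear the condition.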



 Comments   
Comment by Andreas Dilger [ 17/Jan/13 ]

Cliff, are there any debug logs dumped from this case? Also, is this the full MDS log from when the thread first gets stuck? Can you please describe the test load when the MDS thread first gets stuck?

Comment by Andreas Dilger [ 17/Jan/13 ]

I see this is mdtest, one directory per client, so there should be no contention in the filesystem or DLM between the client threads at all; it is therefore entirely unexpected that an MDS thread would become stuck. Can you please also get stack traces from the MDS in this case?

Comment by Cliff White (Inactive) [ 22/Jan/13 ]

Yes, the load was mdtest. Sadly, the lustre-log was not retained - that bit should now be corrected. I will get stack traces if/when the problem repeats.

Comment by Jodi Levi (Inactive) [ 15/Feb/13 ]

Cliff, have you seen this since 1/22? If not, can this ticket be closed?

Comment by Cliff White (Inactive) [ 15/Feb/13 ]

I'd rather leave it open for a bit, until I can repeat a full SWL test w/ldiskfs. This week has been mostly ZFS work.

Comment by Jodi Levi (Inactive) [ 05/Mar/13 ]

Cliff,
Have you seen this one again yet?

Comment by Jodi Levi (Inactive) [ 05/Mar/13 ]

Please reopen if this happens again.
