
LU-2621: Single client timeout hangs MDS - related to LU-793

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • None
    • Fix Version/s: Lustre 2.4.0
    • Environment: Hyperion/RHEL6
    • Severity: 3
    • 6132

    Description

      Running mdtest, file-per-process. A single client times out a request, then the MDS enters the 'waiting on 1 RPC' state, and all clients eventually get -EBUSY. This bug is to show the sequence as I'm currently seeing it on Hyperion.
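      For context, a minimal sketch of the timeout mechanism behind the messages below (illustrative only, not the actual ptlrpc source; all names are invented): each request carries a deadline that the server can push forward by sending early replies, and once that deadline passes with neither a real nor an early reply, the client expires the request and attempts to reconnect.

      #include <stdbool.h>
      #include <time.h>

      struct sketch_request {
              time_t sent;       /* when the request went out */
              time_t deadline;   /* sent + timeout; early replies push this forward */
              bool   replied;    /* set when the real reply arrives */
      };

      /* True once the deadline has passed with no reply and no extension. */
      static bool sketch_request_expired(const struct sketch_request *req, time_t now)
      {
              return !req->replied && now > req->deadline;
      }

      int main(void)
      {
              /* Timestamps echoed from the client log line below:
               * "sent 1358283526 ... dl 1358284286" (Unix seconds). */
              struct sketch_request req = { 1358283526, 1358284286, false };

              return sketch_request_expired(&req, time(NULL)) ? 0 : 1;
      }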
      The first client error:

      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358283526/real 1358283526]  req@ffff88030ca02400 x1424249167482022/t0(0) o101->lustre-MDT0000-mdc-ffff88033c3dac00@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1358284286 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      2013-01-15 13:11:26 Lustre: lustre-MDT0000-mdc-ffff88033c3dac00: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-01-15 13:11:26 LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
      

      The MDS log:

      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff88012a324850 x1424249167482022/t0(0) o101->ad6c708f-3715-63c3-9874-6577bf49a8f7@192.168.117.84@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff880161327850 x1424249159092332/t0(0) o101->7ae936ba-6abb-4279-4d8d-6075df2b44ca@192.168.116.112@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:10 hyperion-rst6 kernel: Lustre: 7251:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:10 hyperion-rst6 kernel: req@ffff88015b7d1050 x1424249191598931/t0(0) o35->67291746-09fb-9e08-cd08-b1a1abb10ad0@192.168.119.118@o2ib1:0/0 lens 392/2024 e 3 to 0 dl 1358284155 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client ad6c708f-3715-63c3-9874-6577bf49a8f7 (at 192.168.117.84@o2ib1) reconnecting
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 7ae936ba-6abb-4279-4d8d-6075df2b44ca (at 192.168.116.112@o2ib1) refused reconnection, still busy with 1 active RPCs
      

      Requires a restart of the MDS to clear.
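
      For context, a minimal sketch of the reconnect gate implied by the "refused reconnection, still busy with 1 active RPCs" message above (illustrative only, not the actual Lustre server code; names are invented): once the MDS can no longer extend a request's deadline via an early reply and the client gives up and reconnects, the server rejects the reconnect with -EBUSY (-16, matching the client-side "mds_connect operation failed with -16") for as long as it still counts an RPC in flight on that client's export, so a single stuck service thread keeps the client locked out until the MDS is restarted.

      #include <errno.h>
      #include <stdio.h>

      struct sketch_export {
              const char *client_uuid;
              int         active_rpcs;   /* RPCs currently being processed for this client */
      };

      /* Decide whether a reconnecting client may be re-admitted. */
      static int sketch_handle_reconnect(const struct sketch_export *exp)
      {
              if (exp->active_rpcs > 0) {
                      printf("Client %s refused reconnection, still busy with %d active RPCs\n",
                             exp->client_uuid, exp->active_rpcs);
                      return -EBUSY;   /* client side reports "failed with -16" */
              }
              return 0;
      }

      int main(void)
      {
              /* One RPC is stuck in a service thread, so every reconnect fails. */
              struct sketch_export exp = { "7ae936ba-6abb-4279-4d8d-6075df2b44ca", 1 };

              return sketch_handle_reconnect(&exp) == -EBUSY ? 0 : 1;
      }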

      Attachments

        Issue Links

          Activity

            [LU-2621] Single client timeout hangs MDS - related to LU-793

            jlevi Jodi Levi (Inactive) added a comment - Please reopen if this happens again.

            jlevi Jodi Levi (Inactive) added a comment - Cliff, have you seen this one again yet?

            cliffw Cliff White (Inactive) added a comment - I'd rather leave it open for a bit, until I can repeat a full SWL test w/ldiskfs. This week has been mostly ZFS work.

            jlevi Jodi Levi (Inactive) added a comment - Cliff, have you not seen this since 1/22? If so, can this ticket be closed?

            cliffw Cliff White (Inactive) added a comment - Yes, the load was mdtest. Sadly, the lustre-log was not retained - that bit should now be corrected. I will get stack traces if/when the problem repeats.
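            For reference, an mdtest run of the shape described in this ticket (file-per-process, one unique working directory per task) would look something like the command below; the process count and per-task file count are placeholders, since the exact Hyperion parameters are not recorded here:

            mpirun -np 64 mdtest -F -u -n 1000 -i 3 -d /mnt/lustre/mdtest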

            adilger Andreas Dilger added a comment - I see this is mdtest, one directory per client, so there should be no contention in the filesystem or DLM between the client threads at all, so it is totally unexpected that some MDS thread should become stuck. Can you please also get stack traces from the MDS in this case?

            adilger Andreas Dilger added a comment - Cliff, are there any debug logs dumped from this case? Also, is this the full MDS log from when the thread first gets stuck? Can you please describe the test load when the MDS thread first gets stuck?

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: