Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Components: None
    • Labels: None
    • Affects Version/s: Lustre 2.4.0-21chaos
    • Severity: 3
    • Rank: 12505

    Description

      It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get worse than that: sometimes we'll see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

      The practical effect of these poorly handled timeouts is that file systems go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.
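
      For reference, at_max is the ceiling on the ptlrpc adaptive timeouts, so no server-side wait should ever exceed it. A trivial reader like the following (a hypothetical helper, not part of Lustre; the sysfs path is the usual location of the ptlrpc module parameter but may vary by version) is enough to confirm the configured ceiling that the 1200+ second sleeps are blowing past:

      /* Hypothetical helper, not part of Lustre: read the configured at_max
       * ceiling so it can be compared against the 1200+ second sleeps seen
       * on the console. The sysfs path is the usual location of the ptlrpc
       * module parameter, but it may differ between versions. */
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              const char *path = "/sys/module/ptlrpc/parameters/at_max";
              FILE *f = fopen(path, "r");
              long at_max;

              if (f == NULL) {
                      perror(path);
                      return EXIT_FAILURE;
              }
              if (fscanf(f, "%ld", &at_max) != 1) {
                      fprintf(stderr, "could not parse %s\n", path);
                      fclose(f);
                      return EXIT_FAILURE;
              }
              fclose(f);

              /* If waits were really bounded by this value (600 here), threads
               * sleeping for 1200+ seconds should be impossible. */
              printf("at_max = %ld seconds\n", at_max);
              return 0;
      }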

    Attachments

    Issue Links

    Activity

            [LU-4579] Timeout system horribly broken
            pjones Peter Jones added a comment -

            Landed for 2.6

            green Oleg Drokin added a comment -

            I extracted timeout fixes to http://review.whamcloud.com/10601

            My testing seems to show it works.

            Without patch:

            [ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
            [ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
            [ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            and with the patch that becomes:

            [ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
            [ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
            [ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            simmonsja James A Simmons added a comment -

            There is also LU-5116, which I saw a few times in testing as well. The patch for that ticket doesn't even solve all the issues.

            green Oleg Drokin added a comment -

            I think this is really several issues in one.
            First, we have the undesirable lock evictions that are likely due to LU-4584 / LU-2827, and then, once that condition hits, the too-long timeouts due to LU-4578 also become an issue.


            morrone Christopher Morrone (Inactive) added a comment -

            James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by having saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees.

            simmonsja James A Simmons added a comment -

            Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.
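
            (For context: an early reply is the server's way of asking a client to keep waiting past its original deadline. The sketch below is illustrative only, not the actual ptlrpc code; the helper name and the 25% growth margin are assumptions. It shows the behaviour that patch is described as restoring: however much the service estimate grows, the timeout sent back to the client stays clamped to at_max.)

            /* Illustrative sketch, not the real ptlrpc implementation: an early
             * reply grows the client's service estimate, but the extended value
             * (and therefore the new deadline) must never exceed at_max. */
            #include <stdio.h>

            #define AT_MAX 600U     /* seconds; the administrator-configured ceiling */

            /* Hypothetical helper: timeout carried back to the client in an early reply. */
            static unsigned int early_reply_timeout(unsigned int current_estimate)
            {
                    /* Grow the estimate (25% here) to buy the server more time... */
                    unsigned int extended = current_estimate + current_estimate / 4;

                    /* ...but never past at_max. Skipping this clamp is what lets
                     * deadlines drift out to the 1200s+ waits seen above. */
                    if (extended > AT_MAX)
                            extended = AT_MAX;
                    return extended;
            }

            int main(void)
            {
                    unsigned int est = 150;

                    /* Repeated extensions converge on at_max instead of growing forever. */
                    for (int i = 0; i < 12; i++) {
                            est = early_reply_timeout(est);
                            printf("extension %2d -> %us\n", i + 1, est);
                    }
                    return 0;
            }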

            morrone Christopher Morrone (Inactive) added a comment -

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            Oleg, I don't think that is likely. We need that. To take it out, you would need to provide us with a fix that works as well.

            green Oleg Drokin added a comment - - edited

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            With ldlm resends not being properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out.
            And the resends are happening because the clients provide too small a buffer.

            If the patch removal is too hard, at least try to increase the minimal allocation size some more. I know it would mean harder work for the kernel during allocations and such, but before we tackle making this resend detection bulletproof, I am interested in seeing the results with resends totally eliminated.

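            (To illustrate the failure mode Oleg is describing, here is a hypothetical sketch; the names and structures are not Lustre's. If the server fails to recognise a resent lock request, it grants a second lock the client never learns about, and it is that orphaned lock whose blocking callback goes unanswered until the waiting-locks timer evicts the client. Robust detection means a resent request id must map back to the lock that was already granted:)

            /* Hypothetical sketch of resend detection, not Lustre code: remember
             * which request xid produced which lock handle, so a resent request
             * gets the existing handle back instead of a duplicate grant. */
            #include <stdio.h>

            #define MAX_GRANTS 128

            struct grant {
                    unsigned long long xid;     /* client request id, reused on a resend */
                    unsigned long long handle;  /* lock handle the server granted */
            };

            static struct grant grants[MAX_GRANTS];
            static int ngrants;
            static unsigned long long next_handle = 0x1000;

            /* Return the handle for this xid, granting a new lock only the first time. */
            static unsigned long long grant_lock(unsigned long long xid)
            {
                    int i;

                    for (i = 0; i < ngrants; i++)
                            if (grants[i].xid == xid)
                                    return grants[i].handle;  /* resend detected */

                    /* First time this xid is seen: grant a fresh lock. If the check
                     * above failed to match a resend, the grant below would be a
                     * duplicate the client never learns about, its blocking callback
                     * would go unanswered, and the client would be evicted. */
                    grants[ngrants].xid = xid;
                    grants[ngrants].handle = next_handle++;
                    return grants[ngrants++].handle;
            }

            int main(void)
            {
                    unsigned long long first  = grant_lock(42);  /* original request */
                    unsigned long long second = grant_lock(42);  /* resend: reply buffer was too small */

                    printf("original %#llx, resend %#llx: %s\n", first, second,
                           first == second ? "no phantom lock" : "phantom lock granted");
                    return 0;
            }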

            nedbass Ned Bass (Inactive) added a comment -

            Attaching sysrq-t output for the client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our Lustre clusters. We rebooted the MDS nodes to recover.

            morrone Christopher Morrone (Inactive) added a comment -

            The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed.

            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22
