Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • None
    • None
    • Affects Version/s: Lustre 2.4.0-21chaos
    • 3
    • 12505

    Description

      It would appear that the timeout system in Lustre 2.4.0-21chaos (see github.com/chaos/lustre) is horribly broken. On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get even worse than that: sometimes we see clients evicted by an mdt after 9000+ seconds. Obviously, that is not acceptable.

      The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours, if not days, at a time.

      We need to work out a plan to fix the timeouts in Lustre.

          Activity

            [LU-4579] Timeout system horribly broken
            pjones Peter Jones added a comment -

            Landed for 2.6

            green Oleg Drokin added a comment -

            I extracted timeout fixes to http://review.whamcloud.com/10601

            My testing seems to show it works.

            Without patch:

            [ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
            [ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
            [ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            and with the patch, that becomes:

            [ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
            [ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
            [ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            
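            As a rough way to quantify the before/after difference, a minimal Python sketch (not part of any patch on this ticket) that pulls the "expired after Ns" values out of console lines like the ones quoted above and shows how far each overshoots at_max; the at_max value of 600 seconds is taken from the description.

            import re

            AT_MAX = 600  # seconds, per the at_max mentioned in the description

            # Abbreviated copies of the console lines quoted in the comment above.
            console = """
            [ 3867.816073] LustreError: ... lock callback timer expired after 2043s: evicting client ...
            [ 4023.816076] LustreError: ... lock callback timer expired after 308s: evicting client ...
            [ 1514.816088] LustreError: ... lock callback timer expired after 151s: evicting client ...
            [ 1670.816067] LustreError: ... lock callback timer expired after 150s: evicting client ...
            """

            # Report each observed timer expiry and how far it exceeds at_max.
            for seconds in map(int, re.findall(r"expired after (\d+)s", console)):
                over = seconds - AT_MAX
                note = f"{over}s over at_max" if over > 0 else "within at_max"
                print(f"{seconds:>5}s  ({note})")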

            simmonsja James A Simmons added a comment -

            There is also LU-5116, which I saw a few times in testing as well. The patch for that ticket does not even solve all the issues.
            green Oleg Drokin added a comment -

            I think this is like several issues in one.
            First, we have those undesirable lock evictions that are likely due to LU-4584 / LU-2827, and then once that condition hits, the too-long timeout due to LU-4578 would also become an issue.


            morrone Christopher Morrone (Inactive) added a comment -

            James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by giving us saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees.

            simmonsja James A Simmons added a comment -

            Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.
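            For context on what "early replies ignoring at_max" means, a hypothetical Python sketch of the intended behaviour: the deadline extension an early reply grants should be clamped at at_max. The function name and numbers here are illustrative assumptions, not Lustre's actual code.

            AT_MAX = 600  # seconds, matching the at_max from the description


            def early_reply_timeout(estimated_service_time, at_extra):
                """Extra time granted by an early reply, capped at at_max."""
                # Without a cap like this, repeated early replies can push the
                # effective deadline far past at_max, consistent with the 1200s+
                # waits and 9000s+ evictions described in this ticket.
                return min(estimated_service_time + at_extra, AT_MAX)


            print(early_reply_timeout(900, 300))  # prints 600, not 1200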

            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22

              Dates

                Created:
                Updated:
                Resolved: