Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Components: None
    • Labels: None
    • Affects Version/s: Lustre 2.4.0-21chaos
    • Severity: 3
    • Rank: 12505

    Description

      It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get worse than that: sometimes we'll see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

      The practical effect of these poorly handled timeouts is that file systems go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.
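
      For reference, at_max is the ceiling on the ptlrpc adaptive timeouts, so no server-side wait should ever exceed it. A trivial reader like the following (a hypothetical helper, not part of Lustre; the sysfs path is the usual location of the ptlrpc module parameter but may vary by version) is enough to confirm the configured ceiling that the 1200+ second sleeps are blowing past:

      /* Hypothetical helper, not part of Lustre: read the configured at_max
       * ceiling so it can be compared against the 1200+ second sleeps seen
       * on the console. The sysfs path is the usual location of the ptlrpc
       * module parameter, but it may differ between versions. */
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              const char *path = "/sys/module/ptlrpc/parameters/at_max";
              FILE *f = fopen(path, "r");
              long at_max;

              if (f == NULL) {
                      perror(path);
                      return EXIT_FAILURE;
              }
              if (fscanf(f, "%ld", &at_max) != 1) {
                      fprintf(stderr, "could not parse %s\n", path);
                      fclose(f);
                      return EXIT_FAILURE;
              }
              fclose(f);

              /* If waits were really bounded by this value (600 here), threads
               * sleeping for 1200+ seconds should be impossible. */
              printf("at_max = %ld seconds\n", at_max);
              return 0;
      }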

    Attachments

    Issue Links

    Activity

            [LU-4579] Timeout system horribly broken
            pjones Peter Jones added a comment -

            Landed for 2.6

            green Oleg Drokin added a comment -

            I extracted timeout fixes to http://review.whamcloud.com/10601

            My testing seems to show it works.

            Without patch:

            [ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
            [ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
            [ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            and with the patch that becomes:

            [ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
            [ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
            [ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            simmonsja James A Simmons added a comment -

            There is also LU-5116, which I saw a few times in testing as well. The patch for that ticket doesn't even solve all the issues.

            green Oleg Drokin added a comment -

            I think this is really several issues in one.
            First, we have the undesirable lock evictions that are likely due to LU-4584 / LU-2827, and then, once that condition hits, the too-long timeouts due to LU-4578 also become an issue.


            morrone Christopher Morrone (Inactive) added a comment -

            James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by having saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees.

            simmonsja James A Simmons added a comment -

            Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.
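
            (For context: an early reply is the server's way of asking a client to keep waiting past its original deadline. The sketch below is illustrative only, not the actual ptlrpc code; the helper name and the 25% growth margin are assumptions. It shows the behaviour that patch is described as restoring: however much the service estimate grows, the timeout sent back to the client stays clamped to at_max.)

            /* Illustrative sketch, not the real ptlrpc implementation: an early
             * reply grows the client's service estimate, but the extended value
             * (and therefore the new deadline) must never exceed at_max. */
            #include <stdio.h>

            #define AT_MAX 600U     /* seconds; the administrator-configured ceiling */

            /* Hypothetical helper: timeout carried back to the client in an early reply. */
            static unsigned int early_reply_timeout(unsigned int current_estimate)
            {
                    /* Grow the estimate (25% here) to buy the server more time... */
                    unsigned int extended = current_estimate + current_estimate / 4;

                    /* ...but never past at_max. Skipping this clamp is what lets
                     * deadlines drift out to the 1200s+ waits seen above. */
                    if (extended > AT_MAX)
                            extended = AT_MAX;
                    return extended;
            }

            int main(void)
            {
                    unsigned int est = 150;

                    /* Repeated extensions converge on at_max instead of growing forever. */
                    for (int i = 0; i < 12; i++) {
                            est = early_reply_timeout(est);
                            printf("extension %2d -> %us\n", i + 1, est);
                    }
                    return 0;
            }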

            morrone Christopher Morrone (Inactive) added a comment -

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            Oleg, I don't think that is likely. We need that. To take it out, you would need to provide us with a fix that works as well.

            green Oleg Drokin added a comment - - edited

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            With ldlm resends not being properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out.
            And the resends are happening because the clients provide too small a buffer.

            If the patch removal is too hard, at least try to increase the minimal allocation size some more. I know it would mean harder work for the kernel during allocations and such, but before we tackle making this resend detection bulletproof, I am interested in seeing the results with resends totally eliminated.

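            (To illustrate the failure mode Oleg is describing, here is a hypothetical sketch; the names and structures are not Lustre's. If the server fails to recognise a resent lock request, it grants a second lock the client never learns about, and it is that orphaned lock whose blocking callback goes unanswered until the waiting-locks timer evicts the client. Robust detection means a resent request id must map back to the lock that was already granted:)

            /* Hypothetical sketch of resend detection, not Lustre code: remember
             * which request xid produced which lock handle, so a resent request
             * gets the existing handle back instead of a duplicate grant. */
            #include <stdio.h>

            #define MAX_GRANTS 128

            struct grant {
                    unsigned long long xid;     /* client request id, reused on a resend */
                    unsigned long long handle;  /* lock handle the server granted */
            };

            static struct grant grants[MAX_GRANTS];
            static int ngrants;
            static unsigned long long next_handle = 0x1000;

            /* Return the handle for this xid, granting a new lock only the first time. */
            static unsigned long long grant_lock(unsigned long long xid)
            {
                    int i;

                    for (i = 0; i < ngrants; i++)
                            if (grants[i].xid == xid)
                                    return grants[i].handle;  /* resend detected */

                    /* First time this xid is seen: grant a fresh lock. If the check
                     * above failed to match a resend, the grant below would be a
                     * duplicate the client never learns about, its blocking callback
                     * would go unanswered, and the client would be evicted. */
                    grants[ngrants].xid = xid;
                    grants[ngrants].handle = next_handle++;
                    return grants[ngrants++].handle;
            }

            int main(void)
            {
                    unsigned long long first  = grant_lock(42);  /* original request */
                    unsigned long long second = grant_lock(42);  /* resend: reply buffer was too small */

                    printf("original %#llx, resend %#llx: %s\n", first, second,
                           first == second ? "no phantom lock" : "phantom lock granted");
                    return 0;
            }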

            nedbass Ned Bass (Inactive) added a comment -

            Attaching sysrq-t output for the client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our Lustre clusters. We rebooted the MDS nodes to recover.

            morrone Christopher Morrone (Inactive) added a comment -

            The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed.

            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22
