Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • None
    • Lustre 2.4.0-21chaos
    • 3
    • 12505

    Description

      It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the MDT threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get worse than that: sometimes we'll see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

      The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.
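
      To make the expectation concrete: no lock wait should ever run past at_max. The following is a minimal standalone C sketch of that clamp (obd_timeout, at_max, and compute_lock_deadline here are illustrative placeholders, not Lustre's actual symbols); it shows the invariant that 1200-second sleeps against an at_max of 600 violate.

      #include <stdio.h>
      #include <time.h>

      /* Illustrative tunables (placeholders, not read from a real configuration);
       * at_max = 600 matches the value mentioned in this report. */
      static const time_t obd_timeout = 100;  /* static fallback timeout, seconds */
      static const time_t at_max      = 600;  /* upper bound for adaptive timeouts */

      /* Hypothetical helper: given an (adaptive) estimate of how long the client
       * needs to respond, return the deadline for a lock callback.  The point is
       * the clamp -- the wait must never exceed at_max. */
      static time_t compute_lock_deadline(time_t now, time_t estimated_wait)
      {
              time_t wait = estimated_wait > obd_timeout ? estimated_wait
                                                         : obd_timeout;
              if (wait > at_max)
                      wait = at_max;  /* the bound this ticket says is not honored */
              return now + wait;
      }

      int main(void)
      {
              time_t now = time(NULL);

              /* Even a wildly inflated estimate stays within at_max seconds. */
              printf("deadline in %lds\n",
                     (long)(compute_lock_deadline(now, 9000) - now));
              return 0;
      }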

      Attachments

        Issue Links

          Activity

            [LU-4579] Timeout system horribly broken
            green Oleg Drokin added a comment -

            I extracted timeout fixes to http://review.whamcloud.com/10601

            My testing seems to show it works.

            Without patch:

            [ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
            [ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
            [ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            and with the patch that becomes:

            [ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
            [ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
            [ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            simmonsja James A Simmons added a comment -

            There is also LU-5116, which I saw a few times in testing as well. The patch for that ticket doesn't even solve all the issues.
            green Oleg Drokin added a comment -

            I think this is really several issues in one.
            First, we have those undesirable lock evictions that are likely due to LU-4584 / LU-2827, and then, once that condition hits, the too-long timeout due to LU-4578 also becomes an issue.


            morrone Christopher Morrone (Inactive) added a comment -

            James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by giving us saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees.

            simmonsja James A Simmons added a comment -

            Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.

            morrone Christopher Morrone (Inactive) added a comment -

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            Oleg, I don't think that is likely. We need that. To take it out, you would need to provide us with a fix that works as well.
            green Oleg Drokin added a comment - - edited

            As I just commented in LU-4584 - Is there any chance you can drop patch for lu-3338 from your tree and see if the issue is still there?

            With ldlm resends not properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out.
            And the resends are happening because the clients provide too small of a buffer.

            If the patch removal is too hard, at least try to increase the minimal allocation size some more. I know it would result in harder work for the kernel during allocations and such, but before we tackle making this resend detection bulletproof, I am interested in seeing the results with resends totally eliminated.
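
            As a rough illustration of the resend pattern described above (a generic sketch, not the Lustre client code; reply_buf, fetch_reply, and min_alloc are made-up names): if the initially allocated reply buffer is too small, the request has to be resent with a larger one, and raising the minimum allocation above the largest expected reply eliminates the resend entirely.

            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>

            /* Toy reply buffer for the sketch; not a Lustre structure. */
            struct reply_buf {
                    char   *data;
                    size_t  size;
            };

            /* Stand-in "server": if the caller's buffer is too small, report how big
             * the reply actually is so the caller can resend with a larger buffer. */
            static int fetch_reply(struct reply_buf *buf, size_t reply_size)
            {
                    if (buf->size < reply_size)
                            return (int)reply_size;  /* too small: caller must resend */
                    memset(buf->data, 0, reply_size);
                    return 0;
            }

            int main(void)
            {
                    size_t min_alloc  = 4096;        /* initial allocation: too small */
                    size_t reply_size = 16384;       /* what the server wants to send */
                    struct reply_buf buf = { malloc(min_alloc), min_alloc };
                    int resends = 0;

                    int rc = fetch_reply(&buf, reply_size);
                    if (rc > 0) {                    /* reply did not fit: grow and resend */
                            free(buf.data);
                            buf.data = malloc((size_t)rc);
                            buf.size = (size_t)rc;
                            resends++;
                            rc = fetch_reply(&buf, reply_size);
                    }

                    /* Raising min_alloc to >= reply_size would make resends == 0, which
                     * is the experiment suggested in the comment above. */
                    printf("rc=%d resends=%d\n", rc, resends);
                    free(buf.data);
                    return 0;
            }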


            nedbass Ned Bass (Inactive) added a comment -

            Attaching sysrq-t output for the client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our Lustre clusters. We rebooted the MDS nodes to recover.

            morrone Christopher Morrone (Inactive) added a comment -

            The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed.
            green Oleg Drokin added a comment -

            So with all of your evicted locks being PR locks, just like in LU-4570, let's try to isolate why readdir locks are so stubborn.

            Next time you see a "refused reconnection, still busy with X RPCs" message, please do a sysrq-t on that client node to see what it is doing.
            If you can pull off a sysrq-t on the MDT to ensure that the mdt_rdpg_ threads are not stuck somewhere, that would be great as well.
            My suspicion is that there is some path on the client that forgets to release a reference to a matched lock (the lock somehow happening to be a PR lock every time) and then it is never cancelled. I tried to review the code and I do not see anything, though. This likely happens on a failure path of some sort, so it only triggers when you have other problems (and I see you also have client evictions for inactivity and whatnot).

            It's interesting that over at ORNL, after applying the lu4008 and lu4719 patches, similar symptoms went away, though it is not entirely clear why. So it still might be worth your while to pull in those patches.
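
            To illustrate the suspected pattern (a toy sketch under that assumption, not Lustre's ldlm code; toy_lock, lock_get, lock_put, and do_operation are invented names): if an error path takes a reference on a matched lock and returns without dropping it, the refcount never reaches zero, the lock can never be cancelled, and the server eventually times out and evicts the client.

            #include <stdio.h>

            /* Toy lock with a reference count; only an illustration of the leak. */
            struct toy_lock {
                    int refs;
                    int cancelled;
            };

            static void lock_get(struct toy_lock *lk) { lk->refs++; }

            static void lock_put(struct toy_lock *lk)
            {
                    if (--lk->refs == 0)
                            lk->cancelled = 1;  /* only an unreferenced lock can be cancelled */
            }

            /* Hypothetical client path: matches (references) a lock, then hits a
             * failure.  The buggy variant returns without dropping the reference. */
            static int do_operation(struct toy_lock *lk, int fail, int buggy)
            {
                    lock_get(lk);
                    if (fail) {
                            if (!buggy)
                                    lock_put(lk);  /* correct error path drops the ref */
                            return -1;             /* buggy error path leaks it */
                    }
                    lock_put(lk);
                    return 0;
            }

            int main(void)
            {
                    struct toy_lock lk = { .refs = 1, .cancelled = 0 };

                    do_operation(&lk, /*fail=*/1, /*buggy=*/1);
                    lock_put(&lk);  /* the final user drops its own reference ... */

                    /* ... but refs stays at 1 because of the leak, so cancelled
                     * remains 0: the lock is "stuck", matching the never-cancelled
                     * behavior described above. */
                    printf("refs=%d cancelled=%d\n", lk.refs, lk.cancelled);
                    return 0;
            }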


            nedbass Ned Bass (Inactive) added a comment -

            The symptoms of this bug returned today, where MDT service threads sleep for 1200 seconds, and clients get evicted for expired lock callback timers after several thousand seconds.

            In desperation, I disabled adaptive timeouts and set the static OBD timeout to 15 seconds. This triggered a few thousand evictions for expired lock callback timers, and all threads stuck in ldlm_completion_ast() got unstuck. Here is some console output following the change:

            2014-03-04 13:32:34 Lustre: lse-MDT0000: Client a53e9892-091f-a3a3-6afb-b33c0ef01937 (at 192.168.113.86@o2ib10) refused reconnection, still busy with 1 active RPCs
            2014-03-04 13:32:34 Lustre: Skipped 15920 previous similar messages
            2014-03-04 13:33:07 Lustre: lock timed out (enqueued at 1393967587, 1200s ago)
            2014-03-04 13:33:07 LustreError: dumping log to /tmp/lustre-log.1393968787.15264
            2014-03-04 13:33:07 Lustre: Skipped 76 previous similar messages
            2014-03-04 13:33:12 LustreError: dumping log to /tmp/lustre-log.1393968792.37816
            2014-03-04 13:33:16 LustreError: dumping log to /tmp/lustre-log.1393968796.15274
            2014-03-04 13:33:19 LustreError: dumping log to /tmp/lustre-log.1393968799.15149
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2f8b36d2-dc81-d77b-ec4c-195d750b1914 (at 192.168.120.5@o2ib64) in 50 seconds. I think it's dead, and I am evicting it. exp ffff8810197a1c00, cur 1393968803 expire 1393968785 last 1393968753
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2db5e0b9-4e5e-15c5-b6dd-1abb3ccf2da7 (at 192.168.120.211@o2ib7) in 44 seconds. I think it's dead, and I am evicting it. exp ffff881017c07400, cur 1393968803 expire 1393968785 last 1393968759
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client lse-MDT0000-lwp-OST0007_UUID (at 172.19.1.173@o2ib100) in 22 seconds. I think it's dead, and I am evicting it. exp ffff88075a2df800, cur 1393968803 expire 1393968785 last 1393968781
            2014-03-04 13:33:23 Lustre: Skipped 461 previous similar messages
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:24 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:25 LustreError: dumping log to /tmp/lustre-log.1393968805.15075
            2014-03-04 13:33:26 LustreError: dumping log to /tmp/lustre-log.1393968806.51024
            2014-03-04 13:33:32 LustreError: dumping log to /tmp/lustre-log.1393968812.55360
            2014-03-04 13:33:33 Lustre: lse-MDT0000: haven't heard from client 3c63d47b-96bd-3e2e-955a-f2730eb5016b (at 192.168.112.131@o2ib6) in 27 seconds. I think it's dead, and I am evicting it. exp ffff88101918fc00, cur 1393968813 expire 1393968795 last 1393968786
            2014-03-04 13:33:33 Lustre: Skipped 1665 previous similar messages
            2014-03-04 13:33:33 LustreError: 0:0:(ldlm_lockd.c:403:waiting_locks_callback()) ### lock callback timer expired after 12563s: evicting client at 172.16.66.59@tcp  ns: mdt-lse-MDT0000_UUID lock: ffff8809ab1e6600/0x423cea3ce22f0633 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1739 type: IBT flags: 0x200400000020 nid: 172.16.66.59@tcp remote: 0x1279e3153f087cb3 expref: 81 pid: 15177 timeout: 5257100077 lvb_type: 0 used 0
            2014-03-04 13:33:33 LNet: Service thread pid 15112 completed after 12530.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            2014-03-04 13:33:33 LNet: Skipped 8 previous similar messages
            2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:36 LustreError: 11-0: MGC172.19.1.165@o2ib100: Communicating with 0@lo, operation obd_ping failed with -107.
            2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection to lse-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            2014-03-04 13:33:36 LustreError: 167-0: lse-MDT0000-lwp-MDT0000: This client was evicted by lse-MDT0000; in progress operations using this service will fail.
            2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection restored to lse-MDT0000 (at 0@lo)
            2014-03-04 13:33:36 LustreError: Skipped 2 previous similar messages
            2014-03-04 13:33:36 LustreError: 166-1: MGC172.19.1.165@o2ib100: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            2014-03-04 13:33:36 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:36 Lustre: Evicted from MGS (at MGC172.19.1.165@o2ib100_0) after server handle changed from 0x423cea2638047ae8 to 0x423cea3cff2214d1
            2014-03-04 13:33:36 Lustre: MGC172.19.1.165@o2ib100: Connection restored to MGS (at 0@lo)
            2014-03-04 13:33:42 LustreError: 15108:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e5ac00 ns: mdt-lse-MDT0000_UUID lock: ffff880048cd1a00/0x423cea3ce30f0fb0 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1724 type: IBT flags: 0x200400000020 nid: 192.168.112.1@o2ib6 remote: 0x713cecc9e87499d expref: 4 pid: 15108 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:43 Lustre: lse-MDT0000: haven't heard from client a80b1a8c-c987-d7fb-6c1c-ce4e5b61f732 (at 192.168.112.60@o2ib6) in 28 seconds. I think it's dead, and I am evicting it. exp ffff881019014400, cur 1393968823 expire 1393968805 last 1393968795
            2014-03-04 13:33:43 Lustre: Skipped 2379 previous similar messages
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e4d400 ns: mdt-lse-MDT0000_UUID lock: ffff880c765e8600/0x423cea3ce4515069 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1519 type: IBT flags: 0x200400000020 nid: 192.168.120.83@o2ib7 remote: 0xf25524bc0806ed2e expref: 2 pid: 15454 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 1 previous similar message
            2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880fff583400 ns: mdt-lse-MDT0000_UUID lock: ffff8809b72fac00/0x423cea3ce6120703 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1510 type: IBT flags: 0x200400000020 nid: 192.168.115.88@o2ib10 remote: 0xb08c1994fd203a3c expref: 6 pid: 15499 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 16 previous similar messages
            2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880786ee3000 ns: mdt-lse-MDT0000_UUID lock: ffff880068d1d600/0x423cea3ceb81db53 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1409 type: IBT flags: 0x200400000020 nid: 172.16.66.63@tcp remote: 0xb4e5531b81ec2e4b expref: 2 pid: 15083 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 31 previous similar messages
            2014-03-04 13:33:53 Lustre: lse-MDT0000: haven't heard from client 99193826-5f5d-6804-2763-bb5bb50685cc (at 192.168.113.194@o2ib5) in 28 seconds. I think it's dead, and I am evicting it. exp ffff880c61172000, cur 1393968833 expire 1393968815 last 1393968805
            2014-03-04 13:33:53 Lustre: Skipped 1771 previous similar messages
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:53 LustreError: dumping log to /tmp/lustre-log.1393968833.55308
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:53 LustreError: 15476:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88052089e400 ns: mdt-lse-MDT0000_UUID lock: ffff88055a52f800/0x423cea3cebaeeab9 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1385 type: IBT flags: 0x200400000020 nid: 192.168.120.68@o2ib7 remote: 0x413fb76118f617e1 expref: 2 pid: 15476 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            

            People

              green Oleg Drokin
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22

              Dates

                Created:
                Updated:
                Resolved: