Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Environment: Lustre 2.4.0-21chaos
    • Severity: 3

    Description

      It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the MDT threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping in excess of 1200 seconds, despite an at_max of 600 seconds.

      The problems get worse than that: sometimes we'll see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

      The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

      We need to work out a plan to fix the timeouts in Lustre.
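      For reference, the timeout tunables mentioned here can be inspected at runtime with lctl. A minimal sketch (parameter names as documented for this era of Lustre; exact proc paths vary by version):

        # on servers and clients: adaptive timeout bounds and the static timeout
        lctl get_param at_min at_max at_history timeout
        # on a client: per-service adaptive timeout estimates for the MDT
        lctl get_param -n mdc.*.timeouts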

      Attachments

      Issue Links

      Activity

            [LU-4579] Timeout system horribly broken

            simmonsja James A Simmons added a comment -

            Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.
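            For anyone wanting to test it, a Gerrit change like that one is usually fetched and applied as sketched below; the patch set number here is a placeholder, so check the review page for the current one:

              # fetch patch set 1 of change 10230 from the lustre-release Gerrit and apply it
              git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/10230/1
              git cherry-pick FETCH_HEAD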

            morrone Christopher Morrone (Inactive) added a comment -

            > As I just commented in LU-4584 - Is there any chance you can drop the patch for LU-3338 from your tree and see if the issue is still there?

            Oleg, I don't think that is likely. We need that patch. To take it out, you would need to provide us with a fix that works as well.
            green Oleg Drokin added a comment (edited) -

            As I just commented in LU-4584 - Is there any chance you can drop the patch for LU-3338 from your tree and see if the issue is still there?

            With ldlm resends not properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out.
            And the resends are happening because the clients provide too small a buffer.

            If removing the patch is too hard, at least try increasing the minimal allocation size some more. I know it would mean harder work for the kernel during allocations and such, but before we tackle making resend detection bulletproof, I am interested in seeing the results with resends totally eliminated.
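            One way to confirm whether resends are in play at all is to grab an RPC trace on a client. A sketch; the exact strings to grep for vary by version, so treat the last line as a starting point:

              lctl set_param debug=+rpctrace   # add RPC tracing to the debug mask
              lctl clear                       # empty the debug buffer
              # ... reproduce the workload ...
              lctl dk /tmp/lustre-debug.log    # dump the kernel debug buffer to a file
              grep -i resent /tmp/lustre-debug.log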


            nedbass Ned Bass (Inactive) added a comment -

            Attaching sysrq-t output for a client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our Lustre clusters. We rebooted the MDS nodes to recover.

            morrone Christopher Morrone (Inactive) added a comment -

            The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed.
            green Oleg Drokin added a comment -

            So with all your evicted locks being PR locks, just like in LU-4570, let's try to isolate how come readdir locks are so stubborn.

            Next time you see a "refused reconnection, still busy with X RPCs" message, please do a sysrq-t on that client node to see what it is doing (see the sketch below).
            If you can pull off a sysrq-t on the MDT to ensure that mdt_rdpg_ threads are not stuck somewhere, that would be great as well.
            My suspicion is there's some path on a client that forgets to release a reference to a matched lock (with the lock somehow happening to be a PR lock at all times), and then it's never cancelled. I tried to review the code and I do not see anything, though. It likely happens on a failure path of some sort, so it only triggers if you have other problems (and I see you also have client evictions for inactivity and whatnot).

            It's interesting that over at ORNL, after applying the lu4008 and lu4719 patches, similar symptoms went away, though it's not entirely clear why. So it still might be worth your while to pull in those patches.
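            A minimal sketch of capturing the requested sysrq-t dump on the client; sysrq may first need to be enabled:

              echo 1 > /proc/sys/kernel/sysrq    # allow all sysrq functions, if restricted
              echo t > /proc/sysrq-trigger       # dump every task's stack to the kernel log
              dmesg > /tmp/sysrq-t.$(hostname)   # save the output to attach to the ticket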


            nedbass Ned Bass (Inactive) added a comment -

            The symptoms of this bug returned today, where MDT service threads sleep for 1200 seconds and clients get evicted for expired lock callback timers after several thousand seconds.

            In desperation, I disabled adaptive timeouts and set the static OBD timeout to 15 seconds (see the sketch after the log below). This triggered a few thousand evictions for expired lock callback timers, and all threads stuck in ldlm_completion_ast() got unstuck. Here is some console output following the change:

            2014-03-04 13:32:34 Lustre: lse-MDT0000: Client a53e9892-091f-a3a3-6afb-b33c0ef01937 (at 192.168.113.86@o2ib10) refused reconnection, still busy with 1 active RPCs
            2014-03-04 13:32:34 Lustre: Skipped 15920 previous similar messages
            2014-03-04 13:33:07 Lustre: lock timed out (enqueued at 1393967587, 1200s ago)
            2014-03-04 13:33:07 LustreError: dumping log to /tmp/lustre-log.1393968787.15264
            2014-03-04 13:33:07 Lustre: Skipped 76 previous similar messages
            2014-03-04 13:33:12 LustreError: dumping log to /tmp/lustre-log.1393968792.37816
            2014-03-04 13:33:16 LustreError: dumping log to /tmp/lustre-log.1393968796.15274
            2014-03-04 13:33:19 LustreError: dumping log to /tmp/lustre-log.1393968799.15149
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2f8b36d2-dc81-d77b-ec4c-195d750b1914 (at 192.168.120.5@o2ib64) in 50 seconds. I think it's dead, and I am evicting it. exp ffff8810197a1c00, cur 1393968803 expire 1393968785 last 1393968753
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2db5e0b9-4e5e-15c5-b6dd-1abb3ccf2da7 (at 192.168.120.211@o2ib7) in 44 seconds. I think it's dead, and I am evicting it. exp ffff881017c07400, cur 1393968803 expire 1393968785 last 1393968759
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client lse-MDT0000-lwp-OST0007_UUID (at 172.19.1.173@o2ib100) in 22 seconds. I think it's dead, and I am evicting it. exp ffff88075a2df800, cur 1393968803 expire 1393968785 last 1393968781
            2014-03-04 13:33:23 Lustre: Skipped 461 previous similar messages
            2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:24 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:25 LustreError: dumping log to /tmp/lustre-log.1393968805.15075
            2014-03-04 13:33:26 LustreError: dumping log to /tmp/lustre-log.1393968806.51024
            2014-03-04 13:33:32 LustreError: dumping log to /tmp/lustre-log.1393968812.55360
            2014-03-04 13:33:33 Lustre: lse-MDT0000: haven't heard from client 3c63d47b-96bd-3e2e-955a-f2730eb5016b (at 192.168.112.131@o2ib6) in 27 seconds. I think it's dead, and I am evicting it. exp ffff88101918fc00, cur 1393968813 expire 1393968795 last 1393968786
            2014-03-04 13:33:33 Lustre: Skipped 1665 previous similar messages
            2014-03-04 13:33:33 LustreError: 0:0:(ldlm_lockd.c:403:waiting_locks_callback()) ### lock callback timer expired after 12563s: evicting client at 172.16.66.59@tcp  ns: mdt-lse-MDT0000_UUID lock: ffff8809ab1e6600/0x423cea3ce22f0633 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1739 type: IBT flags: 0x200400000020 nid: 172.16.66.59@tcp remote: 0x1279e3153f087cb3 expref: 81 pid: 15177 timeout: 5257100077 lvb_type: 0 used 0
            2014-03-04 13:33:33 LNet: Service thread pid 15112 completed after 12530.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            2014-03-04 13:33:33 LNet: Skipped 8 previous similar messages
            2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:36 LustreError: 11-0: MGC172.19.1.165@o2ib100: Communicating with 0@lo, operation obd_ping failed with -107.
            2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection to lse-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            2014-03-04 13:33:36 LustreError: 167-0: lse-MDT0000-lwp-MDT0000: This client was evicted by lse-MDT0000; in progress operations using this service will fail.
            2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection restored to lse-MDT0000 (at 0@lo)
            2014-03-04 13:33:36 LustreError: Skipped 2 previous similar messages
            2014-03-04 13:33:36 LustreError: 166-1: MGC172.19.1.165@o2ib100: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            2014-03-04 13:33:36 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:36 Lustre: Evicted from MGS (at MGC172.19.1.165@o2ib100_0) after server handle changed from 0x423cea2638047ae8 to 0x423cea3cff2214d1
            2014-03-04 13:33:36 Lustre: MGC172.19.1.165@o2ib100: Connection restored to MGS (at 0@lo)
            2014-03-04 13:33:42 LustreError: 15108:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e5ac00 ns: mdt-lse-MDT0000_UUID lock: ffff880048cd1a00/0x423cea3ce30f0fb0 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1724 type: IBT flags: 0x200400000020 nid: 192.168.112.1@o2ib6 remote: 0x713cecc9e87499d expref: 4 pid: 15108 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:43 Lustre: lse-MDT0000: haven't heard from client a80b1a8c-c987-d7fb-6c1c-ce4e5b61f732 (at 192.168.112.60@o2ib6) in 28 seconds. I think it's dead, and I am evicting it. exp ffff881019014400, cur 1393968823 expire 1393968805 last 1393968795
            2014-03-04 13:33:43 Lustre: Skipped 2379 previous similar messages
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e4d400 ns: mdt-lse-MDT0000_UUID lock: ffff880c765e8600/0x423cea3ce4515069 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1519 type: IBT flags: 0x200400000020 nid: 192.168.120.83@o2ib7 remote: 0xf25524bc0806ed2e expref: 2 pid: 15454 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 1 previous similar message
            2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880fff583400 ns: mdt-lse-MDT0000_UUID lock: ffff8809b72fac00/0x423cea3ce6120703 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1510 type: IBT flags: 0x200400000020 nid: 192.168.115.88@o2ib10 remote: 0xb08c1994fd203a3c expref: 6 pid: 15499 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 16 previous similar messages
            2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880786ee3000 ns: mdt-lse-MDT0000_UUID lock: ffff880068d1d600/0x423cea3ceb81db53 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1409 type: IBT flags: 0x200400000020 nid: 172.16.66.63@tcp remote: 0xb4e5531b81ec2e4b expref: 2 pid: 15083 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 31 previous similar messages
            2014-03-04 13:33:53 Lustre: lse-MDT0000: haven't heard from client 99193826-5f5d-6804-2763-bb5bb50685cc (at 192.168.113.194@o2ib5) in 28 seconds. I think it's dead, and I am evicting it. exp ffff880c61172000, cur 1393968833 expire 1393968815 last 1393968805
            2014-03-04 13:33:53 Lustre: Skipped 1771 previous similar messages
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:53 LustreError: dumping log to /tmp/lustre-log.1393968833.55308
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
            2014-03-04 13:33:53 LustreError: 15476:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88052089e400 ns: mdt-lse-MDT0000_UUID lock: ffff88055a52f800/0x423cea3cebaeeab9 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1385 type: IBT flags: 0x200400000020 nid: 192.168.120.68@o2ib7 remote: 0x413fb76118f617e1 expref: 2 pid: 15476 timeout: 0 lvb_type: 0 used 0
            2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
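            For reference, a change like the one described above (disabling adaptive timeouts and setting a 15-second static timeout) would typically be made with lctl, roughly as follows. These are runtime-only settings, and at_max=0 is the documented way to disable adaptive timeouts:

              lctl set_param at_max=0    # disable adaptive timeouts
              lctl set_param timeout=15  # static OBD timeout, in seconds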
            

            vitaly_fertman Vitaly Fertman added a comment -

            Per-export lock callback AT: http://review.whamcloud.com/#/c/9336/
            green Oleg Drokin added a comment -

            The FID will be visible as the resource ID.

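            Assuming the resource ID in these lock dumps is the FID sequence/object pair, it can be mapped back to a pathname with lfs fid2path. For example, for the "res: 8589934599/1" locks in the log above (8589934599 = 0x200000007), with a hypothetical client mount point /mnt/lse:

              lfs fid2path /mnt/lse '[0x200000007:0x1:0x0]'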

            morrone Christopher Morrone (Inactive) added a comment -

            Well, yes, they are mostly PR/PR. But there are a few CR/CR in there as well. That was also true going back a year when we were running 2.1. The server doesn't mention FIDs, so I wouldn't know.

            I don't recall what was happening on the clients; I'll keep an eye out for that the next time I see a problem.

            Frankly, I'm beginning to suspect that ldiskfs caching is the root of our troubles, as it has been in the past. Something Lustre-related is definitely badly fragmenting memory on the servers. If we leave zone_reclaim_mode=1 (the default), the zone reclaim code goes out to lunch for tens of minutes, and possibly hours, making the servers unusable. If we set zone_reclaim_mode=0, then we fragment and exhaust all of the memory on the node, causing order:2 atomic allocation failures in the mlx4 (10GigE Mellanox) driver. That driver doesn't handle failed allocations reliably, which leaves the node in a bad state.

            Of course, I'm making a bit of a guess about why zone reclaim is such a problem, based on my observations of the system with zone_reclaim_mode=0. There could be something else going on there.

            But I don't really know if the memory problems that I am hunting now are the root of all of our problems, or just one problem among many.
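            The zone reclaim setting and the resulting fragmentation can be checked with standard sysctl/procfs interfaces; a sketch:

              sysctl vm.zone_reclaim_mode        # current setting (1 enables per-zone reclaim)
              sysctl -w vm.zone_reclaim_mode=0   # disable zone reclaim at runtime
              cat /proc/buddyinfo                # free pages per allocation order; sparse
                                                 # high-order counts indicate fragmentation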
            green Oleg Drokin added a comment -

            A quick follow-on question based on material in LU-4570: do all/most of the timed-out locks reported on the MDT (with corresponding client evictions) happen to have PR/PR mode? Mostly the same FID? If so, are there any errors reported on the client side coming from dir.c, or otherwise hinting at readdir failures?

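            A quick way to answer the lock-mode question from the MDS console log, plus the client-side check (log paths are site-specific):

              # tally lock modes across the eviction/timeout messages on the MDS
              grep -o 'mode: [A-Z]*/[A-Z]*' /var/log/messages | sort | uniq -c
              # on a client, look for readdir-related errors reported from dir.c
              dmesg | grep dir.c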

            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 22
