[LU-4579] Timeout system horribly broken Created: 03/Feb/14 Updated: 19/Aug/14 Resolved: 23/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre 2.4.0-21chaos |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 12505 |
| Description |
|
It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping for in excess of 1200 seconds, despite an at_max of 600 seconds. The problems get worse than that: sometimes we'll see clients evicted by an mdt after 9000+ seconds. Obviously, that isn't acceptable. The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time. We need to work out a plan to fix the timeouts in Lustre. |
| Comments |
| Comment by Peter Jones [ 03/Feb/14 ] |
|
Oleg will dig into this one |
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
So, looking into it from a simple angle:

    timeout = min_t(int, at_max, timeout + (timeout >> 1)); /* 150% */

We take whatever the adaptive estimate is ( timeout = at_get(ldlm_lock_to_ns_at(lock)) ), add 50% to it, and cap the result at at_max. Then:

    return max(timeout, ldlm_enqueue_min);

so we return the bigger of that value and ldlm_enqueue_min, which defaults to 100 (unless you override it with a module parameter, which you probably don't).

All lock timeout increases go through __ldlm_add_waiting_lock(), which has a second parameter in seconds. It would be pretty easy to add a check there to ensure we are not filing any outrageous timeouts (which I suspect we don't). The next step is the actual lock timeout function, waiting_locks_callback(). There, before we actually evict the lock, we check whether it is currently "busy" with ldlm_lock_busy(). One more possible avenue: if a group lock got into the waiting list, it would stop all waiting-list processing, which sounds like a bug, but you are unlikely to have those?

    if (cfs_time_after(lock->l_callback_timeout, cfs_time_current()) || (lock->l_req_mode == LCK_GROUP)) break;

Adding a check that lock->l_callback_timeout is not, say, more than X*at_max in the future would be good too, just in case.

Now, when you do see the timeouts, do you know whether the node that was evicted was actually alive at the time, by any chance? Do you have a system where we can try the above-mentioned debugging to better narrow down the culprit (also, would you like me to compose a patch)? |
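For illustration, here is a minimal standalone model of the timeout arithmetic quoted at the top of this comment (my own sketch with made-up input values, not the actual Lustre source):

    #include <stdio.h>

    /* Model of the clamp described above: 150% of the adaptive estimate,
     * capped at at_max, with ldlm_enqueue_min (default 100s) as a floor.
     * Illustration only, not the Lustre source. */
    static int enq_timeout_model(int at_estimate, int at_max, int ldlm_enqueue_min)
    {
            int timeout = at_estimate + (at_estimate >> 1);  /* 150% */

            if (timeout > at_max)
                    timeout = at_max;                        /* cap at at_max */
            return timeout > ldlm_enqueue_min ? timeout : ldlm_enqueue_min;
    }

    int main(void)
    {
            printf("%d\n", enq_timeout_model(40, 600, 100));   /* 100: floored at ldlm_enqueue_min */
            printf("%d\n", enq_timeout_model(500, 600, 100));  /* 600: capped at at_max */
            return 0;
    }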
| Comment by Christopher Morrone [ 04/Feb/14 ] |
|
What is a "group lock"? Many, many mdt theads get stuck. I rather suspect that in most cases the client nodes are alive, but I can't say that I have hard proof of that. Yes, please do make a patch. We can install it on the production MDS without too much delay (At worst we install it and wait for the next reboot. At best they let me reboot the MDS sooner onto the debug version.) |
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
After another look through the code, I discovered that l_last_activity, which is used to print that scary thousands-of-seconds number, is only updated on lock enqueue, not when we send the blocking AST, so the big number might be a red herring. As for the stuck MDT threads, if the lock cancel was not delivered, it's only natural that everybody with conflicting locks is waiting for that cancel. The bigger question then is why the cancel is not coming, assuming the client is alive. |
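To make the red herring concrete, here is a minimal standalone sketch with invented timestamps (not the Lustre source, and not values from any real log):

    #include <stdio.h>

    int main(void)
    {
            /* Hypothetical timestamps in seconds. */
            long enqueue_time      = 0;     /* l_last_activity is set here... */
            long blocking_ast_sent = 8000;  /* ...but not refreshed here */
            long timeout_fired     = 8600;

            /* What the current message effectively reports: the lock's age. */
            printf("reported: lock timed out (enqueued %lds ago)\n",
                   timeout_fired - enqueue_time);
            /* What would actually matter: time spent waiting for the cancel. */
            printf("actual wait since blocking AST: %lds\n",
                   timeout_fired - blocking_ast_sent);
            return 0;
    }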
| Comment by James A Simmons [ 04/Feb/14 ] |
|
Just as a note: since ORNL switched to 2.4 clients, we are seeing this as well. |
| Comment by Christopher Morrone [ 04/Feb/14 ] |
|
Sorry, the 1200 second warning is from Lustre. We frequently see the Lustre watchdog trip and complain that "Service thread pid X was inactive for 1200.00s". Are you saying that l_last_activity is completely uninitialized, or that it is a cumulative number? Either way, let's get a fix for that. At least we'll start moving in the direction of understanding what is going on. |
| Comment by Oleg Drokin [ 04/Feb/14 ] |
|
l_last_activity is initialized when the lock is requested (and also when it is converted), so it really shows how long that lock has been alive. We should probably also set it every time we send something to the client so that it shows sensible numbers at other times. The 1200 second "inactivity" timeout is easily explained: it comes from the ldlm enqueue timeout, defined as ldlm_get_enq_timeout(lock) * 2 (== at_max * 2) in ldlm_completion_ast(). Chances are that 600 seconds before the inactivity timer there was some lock timeout and client eviction, then another one shortly around when the inactivity timeout hit? E.g. because there were two locks (from different clients) that somehow could not be canceled? |
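A small worked sketch of that arithmetic, assuming at_max=600 as reported on these servers (a standalone illustration, not the Lustre source):

    #include <stdio.h>

    int main(void)
    {
            int at_max = 600;          /* the at_max reported on these servers */
            int enq_timeout = at_max;  /* assume the adaptive estimate has saturated at at_max */

            /* ldlm_completion_ast() waits for twice the enqueue timeout, which
             * is where the 1200s "inactive" service-thread warnings come from. */
            printf("completion wait: %d seconds\n", 2 * enq_timeout);
            return 0;
    }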
| Comment by Christopher Morrone [ 04/Feb/14 ] |
|
No, I don't see evictions or timeouts 600s before the inactivity timeout. It does seem common to see either a "lock callback timer expired after 1200s: evicting client" or a "lock timed out (enqueued at X, 1200s ago)" message at the beginning of a bunch of 1200s inactivity warnings. |
| Comment by Oleg Drokin [ 05/Feb/14 ] |
|
Ok, I uploaded http://review.whamcloud.com/9127, which I tested lightly here. It fixes the lock eviction message to actually specify how long the wait was since sending the RPC. It only makes sense to install this on servers, primarily the MDS, though I tested that clients don't die either, in case you install it there by mistake. It would be interesting to run with this and, once the first 1200s timeout hits, get a log of that timeout plus the preceding hour of messages. |
| Comment by Christopher Morrone [ 05/Feb/14 ] |
Great, thanks! I'll pull that in and see how quickly I can get the admins to install it. Maybe that will help us start to see where the real problems lie. |
| Comment by Oleg Drokin [ 11/Feb/14 ] |
|
A quick follow-on question based on stuff in |
| Comment by Christopher Morrone [ 12/Feb/14 ] |
|
Well, yes, they are mostly PR/PR, but there are a few CR/CR in there as well. That was also true going back a year when we were running 2.1. The server doesn't mention fids, so I wouldn't know. I don't recall what was happening on the clients; I'll keep an eye out for that the next time I see a problem. Frankly, I'm beginning to suspect that ldiskfs caching is the root of our troubles, as it has been in the past. Something Lustre related is definitely badly fragmenting memory on the servers. If we leave zone_reclaim_mode=1 (the default), the zone reclaim code goes out to lunch for tens of minutes, and possibly hours, making the servers unusable. If we set zone_reclaim_mode=0, then we fragment and exhaust all of the memory on the node, causing order:2 atomic allocation failures in the mlx4 (10GigE Mellanox) driver. That driver doesn't handle failed allocations reliably, which leaves the node in a bad state. Of course, I'm making a bit of a guess about why zone reclaim is such a problem based on my observations of the system with zone_reclaim_mode=0; there could be something else going on there. But I don't really know whether the memory problems I am hunting now are the root of all of our problems, or just one problem among many. |
| Comment by Oleg Drokin [ 12/Feb/14 ] |
|
The fid will be visible as the resource id. |
| Comment by Vitaly Fertman [ 20/Feb/14 ] |
|
per export lock callback AT: http://review.whamcloud.com/#/c/9336/ |
| Comment by Ned Bass [ 04/Mar/14 ] |
|
The symptoms of this bug returned today, where MDT service threads sleep for 1200 seconds and clients get evicted for expired lock callback timers after several thousand seconds. In desperation, I disabled adaptive timeouts and set the static OBD timeout to 15 seconds. This triggered a few thousand evictions for expired lock callback timers, and all threads stuck in ldlm_completion_ast() got unstuck. Here is some console output following the change:

2014-03-04 13:32:34 Lustre: lse-MDT0000: Client a53e9892-091f-a3a3-6afb-b33c0ef01937 (at 192.168.113.86@o2ib10) refused reconnection, still busy with 1 active RPCs
2014-03-04 13:32:34 Lustre: Skipped 15920 previous similar messages
2014-03-04 13:33:07 Lustre: lock timed out (enqueued at 1393967587, 1200s ago)
2014-03-04 13:33:07 LustreError: dumping log to /tmp/lustre-log.1393968787.15264
2014-03-04 13:33:07 Lustre: Skipped 76 previous similar messages
2014-03-04 13:33:12 LustreError: dumping log to /tmp/lustre-log.1393968792.37816
2014-03-04 13:33:16 LustreError: dumping log to /tmp/lustre-log.1393968796.15274
2014-03-04 13:33:19 LustreError: dumping log to /tmp/lustre-log.1393968799.15149
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2f8b36d2-dc81-d77b-ec4c-195d750b1914 (at 192.168.120.5@o2ib64) in 50 seconds. I think it's dead, and I am evicting it. exp ffff8810197a1c00, cur 1393968803 expire 1393968785 last 1393968753
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2db5e0b9-4e5e-15c5-b6dd-1abb3ccf2da7 (at 192.168.120.211@o2ib7) in 44 seconds. I think it's dead, and I am evicting it. exp ffff881017c07400, cur 1393968803 expire 1393968785 last 1393968759
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client lse-MDT0000-lwp-OST0007_UUID (at 172.19.1.173@o2ib100) in 22 seconds. I think it's dead, and I am evicting it. exp ffff88075a2df800, cur 1393968803 expire 1393968785 last 1393968781
2014-03-04 13:33:23 Lustre: Skipped 461 previous similar messages
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:24 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:25 LustreError: dumping log to /tmp/lustre-log.1393968805.15075
2014-03-04 13:33:26 LustreError: dumping log to /tmp/lustre-log.1393968806.51024
2014-03-04 13:33:32 LustreError: dumping log to /tmp/lustre-log.1393968812.55360
2014-03-04 13:33:33 Lustre: lse-MDT0000: haven't heard from client 3c63d47b-96bd-3e2e-955a-f2730eb5016b (at 192.168.112.131@o2ib6) in 27 seconds. I think it's dead, and I am evicting it. exp ffff88101918fc00, cur 1393968813 expire 1393968795 last 1393968786
2014-03-04 13:33:33 Lustre: Skipped 1665 previous similar messages
2014-03-04 13:33:33 LustreError: 0:0:(ldlm_lockd.c:403:waiting_locks_callback()) ### lock callback timer expired after 12563s: evicting client at 172.16.66.59@tcp ns: mdt-lse-MDT0000_UUID lock: ffff8809ab1e6600/0x423cea3ce22f0633 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1739 type: IBT flags: 0x200400000020 nid: 172.16.66.59@tcp remote: 0x1279e3153f087cb3 expref: 81 pid: 15177 timeout: 5257100077 lvb_type: 0 used 0
2014-03-04 13:33:33 LNet: Service thread pid 15112 completed after 12530.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
2014-03-04 13:33:33 LNet: Skipped 8 previous similar messages
2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:36 LustreError: 11-0: MGC172.19.1.165@o2ib100: Communicating with 0@lo, operation obd_ping failed with -107.
2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection to lse-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
2014-03-04 13:33:36 LustreError: 167-0: lse-MDT0000-lwp-MDT0000: This client was evicted by lse-MDT0000; in progress operations using this service will fail.
2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection restored to lse-MDT0000 (at 0@lo)
2014-03-04 13:33:36 LustreError: Skipped 2 previous similar messages
2014-03-04 13:33:36 LustreError: 166-1: MGC172.19.1.165@o2ib100: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
2014-03-04 13:33:36 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:36 Lustre: Evicted from MGS (at MGC172.19.1.165@o2ib100_0) after server handle changed from 0x423cea2638047ae8 to 0x423cea3cff2214d1
2014-03-04 13:33:36 Lustre: MGC172.19.1.165@o2ib100: Connection restored to MGS (at 0@lo)
2014-03-04 13:33:42 LustreError: 15108:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e5ac00 ns: mdt-lse-MDT0000_UUID lock: ffff880048cd1a00/0x423cea3ce30f0fb0 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1724 type: IBT flags: 0x200400000020 nid: 192.168.112.1@o2ib6 remote: 0x713cecc9e87499d expref: 4 pid: 15108 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:43 Lustre: lse-MDT0000: haven't heard from client a80b1a8c-c987-d7fb-6c1c-ce4e5b61f732 (at 192.168.112.60@o2ib6) in 28 seconds. I think it's dead, and I am evicting it. exp ffff881019014400, cur 1393968823 expire 1393968805 last 1393968795
2014-03-04 13:33:43 Lustre: Skipped 2379 previous similar messages
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e4d400 ns: mdt-lse-MDT0000_UUID lock: ffff880c765e8600/0x423cea3ce4515069 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1519 type: IBT flags: 0x200400000020 nid: 192.168.120.83@o2ib7 remote: 0xf25524bc0806ed2e expref: 2 pid: 15454 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 1 previous similar message
2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880fff583400 ns: mdt-lse-MDT0000_UUID lock: ffff8809b72fac00/0x423cea3ce6120703 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1510 type: IBT flags: 0x200400000020 nid: 192.168.115.88@o2ib10 remote: 0xb08c1994fd203a3c expref: 6 pid: 15499 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 16 previous similar messages
2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880786ee3000 ns: mdt-lse-MDT0000_UUID lock: ffff880068d1d600/0x423cea3ceb81db53 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1409 type: IBT flags: 0x200400000020 nid: 172.16.66.63@tcp remote: 0xb4e5531b81ec2e4b expref: 2 pid: 15083 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 31 previous similar messages
2014-03-04 13:33:53 Lustre: lse-MDT0000: haven't heard from client 99193826-5f5d-6804-2763-bb5bb50685cc (at 192.168.113.194@o2ib5) in 28 seconds. I think it's dead, and I am evicting it. exp ffff880c61172000, cur 1393968833 expire 1393968815 last 1393968805
2014-03-04 13:33:53 Lustre: Skipped 1771 previous similar messages
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:53 LustreError: dumping log to /tmp/lustre-log.1393968833.55308
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:53 LustreError: 15476:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88052089e400 ns: mdt-lse-MDT0000_UUID lock: ffff88055a52f800/0x423cea3cebaeeab9 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1385 type: IBT flags: 0x200400000020 nid: 192.168.120.68@o2ib7 remote: 0x413fb76118f617e1 expref: 2 pid: 15476 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline |
| Comment by Oleg Drokin [ 10/Mar/14 ] |
|
So, with all your evicted locks being PR locks just like in
Next time you see a "refused reconnection, still busy with X RPCs" message, please do a sysrq-t on that client node to see what it is doing. It's interesting that over at ORNL, after applying the lu4008 and lu4719 patches, similar symptoms went away, though it's not entirely clear why. So it still might be worth your while to pull in those patches. |
| Comment by Christopher Morrone [ 21/Apr/14 ] |
|
The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed. |
| Comment by Ned Bass [ 10/May/14 ] |
|
Attaching sysrq-t output for a client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our Lustre clusters. We rebooted the MDS nodes to recover. |
| Comment by Oleg Drokin [ 10/May/14 ] |
|
As I just commented in
With ldlm resends not being properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out. If removing the patch is too hard, at least try to increase the minimal allocation size some more. I know it would mean harder work for the kernel during allocations and such, but before we make this resend detection bulletproof, I am interested in seeing the results with resends totally eliminated. |
| Comment by Christopher Morrone [ 12/May/14 ] |
Oleg, I don't think that is likely. We need that. To take it out, you would need to provide us with a fix that works as well. |
| Comment by James A Simmons [ 12/May/14 ] |
|
Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max. |
| Comment by Christopher Morrone [ 12/May/14 ] |
|
James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by giving us saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees. |
| Comment by Oleg Drokin [ 02/Jun/14 ] |
|
I think this is like several issues in one. |
| Comment by James A Simmons [ 02/Jun/14 ] |
|
There is also |
| Comment by Oleg Drokin [ 05/Jun/14 ] |
|
I extracted the timeout fixes to http://review.whamcloud.com/10601. My testing seems to show it works. Without the patch:

[ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
[ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
[ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

and with the patch that becomes:

[ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
[ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
[ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). |
| Comment by Peter Jones [ 23/Jun/14 ] |
|
Landed for 2.6 |