[LU-4579] Timeout system horribly broken Created: 03/Feb/14  Updated: 19/Aug/14  Resolved: 23/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Blocker
Reporter: Christopher Morrone Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.4.0-21chaos


Attachments: Text File sysrq-t.catalyst141.client.txt     Text File sysrq-t.cider-mds1.txt    
Issue Links:
Related
is related to LU-4786 Apparent denial of service from clien... Resolved
is related to LU-4570 Metadata slowdowns on production file... Closed
is related to LU-4942 lock callback timeout is not per-export Resolved
is related to LU-5497 Many MDS service threads blocked in l... Closed
Severity: 3
Rank (Obsolete): 12505

 Description   

It would appear that the timeout system in Lustre is horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see problems where almost all of the mdt threads are stuck waiting in ldlm_completion_ast(). We see warnings on the console from the kernel that these threads have been sleeping in excess of 1200 seconds, despite an at_max of 600 seconds.

The problems get worse than that: sometimes we'll see clients evicted by an MDT after 9000+ seconds. Obviously, that isn't acceptable.

The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

We need to work out a plan to fix the timeouts in Lustre.



 Comments   
Comment by Peter Jones [ 03/Feb/14 ]

Oleg will dig into this one

Comment by Oleg Drokin [ 04/Feb/14 ]

So, looking into it from a simple angle:
The timeout that gets set is determined by ldlm_get_enq_timeout(), and that function is remarkably simple:

        timeout = min_t(int, at_max, timeout + (timeout >> 1)); /* 150% */

So we take whatever the adaptive timeout estimate is ( timeout = at_get(ldlm_lock_to_ns_at(lock)) ), add 50%, and cap the result at at_max.
Even if that timeout were by some miracle negative, it would be caught by the next step:

        return max(timeout, ldlm_enqueue_min);

So we'll return the bigger of that capped value and the default value of ldlm_enqueue_min, which is 100 (unless you override that with a module parameter, which you probably don't).
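
Putting the two quoted fragments together, the computation is roughly this (a sketch assembled from the lines above, not a verbatim copy of the 2.4 source; at_max, ldlm_enqueue_min, at_get() and ldlm_lock_to_ns_at() are the existing tunables/helpers):

        /* sketch of the enqueue timeout computation described above */
        static int enq_timeout_sketch(struct ldlm_lock *lock)
        {
                int timeout = at_get(ldlm_lock_to_ns_at(lock)); /* AT estimate */

                /* pad by 50%, but never beyond at_max */
                timeout = min_t(int, at_max, timeout + (timeout >> 1));
                /* and never below ldlm_enqueue_min (default 100) */
                return max(timeout, ldlm_enqueue_min);
        }

With the defaults discussed in this ticket (at_max = 600, ldlm_enqueue_min = 100), the result is always between 100 and 600 seconds, so nothing in this function by itself explains a multi-thousand-second wait.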

Overall, all lock timeout increases go through __ldlm_add_waiting_lock(), which has a second parameter in seconds. It's pretty easy to add a check there to ensure we are not filing any outrageous timeouts anyway (which I suspect we aren't).
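
For example, a minimal version of such a check might look like this (hypothetical, assuming the existing 'seconds' argument and the at_max module parameter; the factor of 2 and the clamping are my own choice, not the eventual patch):

        /* hypothetical guard at the top of __ldlm_add_waiting_lock():
         * refuse to arm a timer wildly larger than at_max and complain
         * so we can see who asked for it */
        if (seconds > 2 * at_max) {
                CWARN("lock %p: suspicious callback timeout of %d seconds "
                      "(at_max=%u), clamping\n", lock, seconds, at_max);
                seconds = at_max;
        }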

The next step here is in the actual lock timeout function waiting_locks_callback().

There, before we actually evict the lock, we check whether it's currently "busy" with ldlm_lock_busy().
Checking that function, it calls hpreq_lock_match(). Now, at least in the case of the MDT there's no hpreq handler registered, so the only way for it to trigger is if the request is a cancel and it's somehow stuck in the queue. If that does indeed happen, we are in much bigger trouble, because the client did send a cancel and it has actually been sitting right at the server for thousands of seconds, unable to be processed? Does not sound too plausible, I guess.
We could probably debug this by adding a "prolonged" counter, or by printing a warning when the lock has been prolonged past its original timeout for way too long.
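
Something along these lines in the busy branch of waiting_locks_callback(), say (hypothetical; l_prolonged_count is an assumed new field and the threshold is arbitrary):

        /* hypothetical instrumentation: count how many times a "busy" lock
         * escapes eviction and warn once that count gets suspiciously large */
        if (ldlm_lock_busy(lock)) {
                lock->l_prolonged_count++;  /* assumed new counter in the lock */
                if (lock->l_prolonged_count > 100)
                        CWARN("lock %p prolonged %u times past its original "
                              "timeout\n", lock, lock->l_prolonged_count);
        }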

I imagine one more possible avenue here is if we got a group lock in the list; it would stop all waiting-list processing, which sounds like a bug, but you are unlikely to have those?
Or, if a lock with a very long timeout has landed on the list somehow, it would prevent the list from being processed until its time hits. This is probably an artifact of old times when the lock timeout was fixed and we did not want to iterate through tons of locks in the list, assuming they are mostly in time order.
Perhaps adding an extra debug at

                if (cfs_time_after(lock->l_callback_timeout,
                                   cfs_time_current()) ||
                    (lock->l_req_mode == LCK_GROUP))
                        break;

to make sure that lock->l_callback_timeout is not, say, more than X*at_max in the future, just in case, would be good too.
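
For instance (hypothetical, reusing the libcfs time helpers already used in the quoted check; the 2*at_max bound is arbitrary):

        /* hypothetical debug: flag any lock whose callback timeout sits
         * implausibly far in the future before the break above is taken */
        if (cfs_time_after(lock->l_callback_timeout,
                           cfs_time_add(cfs_time_current(),
                                        cfs_time_seconds(2 * at_max))))
                CWARN("lock %p: callback timeout more than 2*at_max in the "
                      "future\n", lock);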

Now, when you do see the timeouts, do you know if the node that was evicted was actually alive at the time, by any chance?

Do you have a system where we can try the above-mentioned debug to better narrow down the culprit? (Also, would you like me to compose a patch?)

Comment by Christopher Morrone [ 04/Feb/14 ]

What is a "group lock"?

Many, many mdt threads get stuck. I rather suspect that in most cases the client nodes are alive, but I can't say that I have hard proof of that.

Yes, please do make a patch. We can install it on the production MDS without too much delay (At worst we install it and wait for the next reboot. At best they let me reboot the MDS sooner onto the debug version.)

Comment by Oleg Drokin [ 04/Feb/14 ]

After another look through the code, I discovered that l_last_activity, which is used to print that scary thousands-of-seconds number, is only updated on lock enqueue, not when we send the blocking AST, so the big number might be a red herring.

As for stuck MDT threads, if the lock cancel was not delivered, it's only natural that everybody with conflicting locks was waiting for the cancel. The bigger question then is why the cancel is not coming, assuming the client is alive.

Comment by James A Simmons [ 04/Feb/14 ]

Just as a note: since ORNL switched to 2.4 clients, we are seeing this as well.

Comment by Christopher Morrone [ 04/Feb/14 ]

Sorry, the 1200 second warning is from Lustre. We frequently see the Lustre watchdog trip and complain that "Service thread pid X was inactive for 1200.00s".

Are you saying that l_last_activity is completely uninitialized, or that it will be a cumulative number? Either way, let's get a fix for that. At least we'll start moving in the direction of understanding what is going on.

Comment by Oleg Drokin [ 04/Feb/14 ]

l_last_activity is initialized when the lock is requested (and also when it is converted), so it really shows how long that lock has been alive. We probably should also set it every time we send something to the client so that it shows sensible numbers at other times.

The 1200 seconds "inactivity" timeout is easily explained: it's due to the ldlm enqueue timeout, defined as ldlm_get_enq_timeout(lock) * 2 (== at_max * 2) in ldlm_completion_ast().
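
In other words, with the at_max = 600 from this ticket: ldlm_get_enq_timeout() is capped at at_max, so the completion wait works out to 2 * 600 = 1200 seconds, which is exactly the "inactive for 1200.00s" number the watchdog prints (a worked illustration of the defaults, not a statement about your exact tunables).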

Chances are that 600 seconds before the inactivity timer there's some lock timeout and client eviction, and then another one shortly around when the inactivity timeout hits? E.g. because there were two locks (from different clients) that somehow could not be canceled?

Comment by Christopher Morrone [ 04/Feb/14 ]

No, I don't see evictions or timeouts 600s before the inactivity timeout.

It does seem common for either a "lock callback timer expired after 1200s: evicting client" or a "lock timed out (enqueued at X, 1200s ago)" message to appear at the beginning of a bunch of 1200s inactivity warnings.

Comment by Oleg Drokin [ 05/Feb/14 ]

Ok, I uploaded http://review.whamcloud.com/9127 that I tested lightly here.

It fixes the lock eviction message to actually specify how long the wait was since sending the RPC.
It adds a few timeout checks just in case (I see some refresh calls from e.g. ofd may not use at_max).
It prints the number of busy lock prolongations (a bit hackish, only for ibit locks, i.e. the MDS).

This only makes sense to install on servers, primarily the MDS, though I tested it and clients don't die either in case you install it there by mistake.

I imagine it would be interesting to run with this and, once the first 1200s timeout hits, get a log of that timeout plus the preceding hour of messages.

Comment by Christopher Morrone [ 05/Feb/14 ]

Ok, I uploaded http://review.whamcloud.com/9127 that I tested lightly here.

Great, thanks! I'll pull that in and see how quickly I can get the admins to install it. Maybe that will help us start to see where the real problems lie.

Comment by Oleg Drokin [ 11/Feb/14 ]

A quick follow-on question based on stuff in LU-4570: do all/most timed-out locks reported on the MDT (with corresponding client eviction) happen to have PR/PR mode? Mostly the same fid? If so, are there any errors reported on the client side coming from dir.c or otherwise hinting at readdir failures?

Comment by Christopher Morrone [ 12/Feb/14 ]

Well, yes, they are mostly PR/PR, but there are a few CR/CR in there as well. That was also true going back a year when we were running 2.1. The server doesn't mention fids, so I wouldn't know.

I don't recall what was happening on the clients; I'll keep an eye out for that the next time I see a problem.

Frankly, I'm beginning to suspect that ldiskfs caching is the root of our troubles, as it has been in the past. Something Lustre related is definitely badly fragmenting memory on the servers. If we leave zone_reclaim_mode=1 (the default), the zone reclaim code goes out to lunch for tens of minutes, possibly hours, making the servers unusable. If we set zone_reclaim_mode=0, then we fragment and exhaust all of the memory on the node, causing order:2 atomic allocation failures in the mlx4 (10GigE Mellanox) driver. That driver doesn't handle failed allocations reliably, which leaves the node in a bad state.

Of course, I'm making a bit of a guess about why the zone reclaim is such a problem based on my observations of the system with zone_reclaim_mode=0. There could be something else going on there.

But I don't really know if the memory problems that I am hunting now are the root of all of our problems, or just one problem among many.

Comment by Oleg Drokin [ 12/Feb/14 ]

fid will be visible as resource id.

Comment by Vitaly Fertman [ 20/Feb/14 ]

per export lock callback AT: http://review.whamcloud.com/#/c/9336/

Comment by Ned Bass [ 04/Mar/14 ]

The symptoms of this bug returned today: MDT service threads sleep for 1200 seconds, and clients get evicted for expired lock callback timers after several thousand seconds.

In desperation, I disabled adaptive timeouts and set the static OBD timeout to 15 seconds. This triggered a few thousand evictions for expired lock callback timers, and all threads stuck in ldlm_completion_ast() got unstuck. Here is some console output following the change:

2014-03-04 13:32:34 Lustre: lse-MDT0000: Client a53e9892-091f-a3a3-6afb-b33c0ef01937 (at 192.168.113.86@o2ib10) refused reconnection, still busy with 1 active RPCs
2014-03-04 13:32:34 Lustre: Skipped 15920 previous similar messages
2014-03-04 13:33:07 Lustre: lock timed out (enqueued at 1393967587, 1200s ago)
2014-03-04 13:33:07 LustreError: dumping log to /tmp/lustre-log.1393968787.15264
2014-03-04 13:33:07 Lustre: Skipped 76 previous similar messages
2014-03-04 13:33:12 LustreError: dumping log to /tmp/lustre-log.1393968792.37816
2014-03-04 13:33:16 LustreError: dumping log to /tmp/lustre-log.1393968796.15274
2014-03-04 13:33:19 LustreError: dumping log to /tmp/lustre-log.1393968799.15149
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2f8b36d2-dc81-d77b-ec4c-195d750b1914 (at 192.168.120.5@o2ib64) in 50 seconds. I think it's dead, and I am evicting it. exp ffff8810197a1c00, cur 1393968803 expire 1393968785 last 1393968753
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client 2db5e0b9-4e5e-15c5-b6dd-1abb3ccf2da7 (at 192.168.120.211@o2ib7) in 44 seconds. I think it's dead, and I am evicting it. exp ffff881017c07400, cur 1393968803 expire 1393968785 last 1393968759
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:23 Lustre: lse-MDT0000: haven't heard from client lse-MDT0000-lwp-OST0007_UUID (at 172.19.1.173@o2ib100) in 22 seconds. I think it's dead, and I am evicting it. exp ffff88075a2df800, cur 1393968803 expire 1393968785 last 1393968781
2014-03-04 13:33:23 Lustre: Skipped 461 previous similar messages
2014-03-04 13:33:23 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:24 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:25 LustreError: dumping log to /tmp/lustre-log.1393968805.15075
2014-03-04 13:33:26 LustreError: dumping log to /tmp/lustre-log.1393968806.51024
2014-03-04 13:33:32 LustreError: dumping log to /tmp/lustre-log.1393968812.55360
2014-03-04 13:33:33 Lustre: lse-MDT0000: haven't heard from client 3c63d47b-96bd-3e2e-955a-f2730eb5016b (at 192.168.112.131@o2ib6) in 27 seconds. I think it's dead, and I am evicting it. exp ffff88101918fc00, cur 1393968813 expire 1393968795 last 1393968786
2014-03-04 13:33:33 Lustre: Skipped 1665 previous similar messages
2014-03-04 13:33:33 LustreError: 0:0:(ldlm_lockd.c:403:waiting_locks_callback()) ### lock callback timer expired after 12563s: evicting client at 172.16.66.59@tcp  ns: mdt-lse-MDT0000_UUID lock: ffff8809ab1e6600/0x423cea3ce22f0633 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1739 type: IBT flags: 0x200400000020 nid: 172.16.66.59@tcp remote: 0x1279e3153f087cb3 expref: 81 pid: 15177 timeout: 5257100077 lvb_type: 0 used 0
2014-03-04 13:33:33 LNet: Service thread pid 15112 completed after 12530.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
2014-03-04 13:33:33 LNet: Skipped 8 previous similar messages
2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:34 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:36 LustreError: 11-0: MGC172.19.1.165@o2ib100: Communicating with 0@lo, operation obd_ping failed with -107.
2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection to lse-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
2014-03-04 13:33:36 LustreError: 167-0: lse-MDT0000-lwp-MDT0000: This client was evicted by lse-MDT0000; in progress operations using this service will fail.
2014-03-04 13:33:36 Lustre: lse-MDT0000-lwp-MDT0000: Connection restored to lse-MDT0000 (at 0@lo)
2014-03-04 13:33:36 LustreError: Skipped 2 previous similar messages
2014-03-04 13:33:36 LustreError: 166-1: MGC172.19.1.165@o2ib100: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
2014-03-04 13:33:36 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:36 Lustre: Evicted from MGS (at MGC172.19.1.165@o2ib100_0) after server handle changed from 0x423cea2638047ae8 to 0x423cea3cff2214d1
2014-03-04 13:33:36 Lustre: MGC172.19.1.165@o2ib100: Connection restored to MGS (at 0@lo)
2014-03-04 13:33:42 LustreError: 15108:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e5ac00 ns: mdt-lse-MDT0000_UUID lock: ffff880048cd1a00/0x423cea3ce30f0fb0 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1724 type: IBT flags: 0x200400000020 nid: 192.168.112.1@o2ib6 remote: 0x713cecc9e87499d expref: 4 pid: 15108 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:43 Lustre: lse-MDT0000: haven't heard from client a80b1a8c-c987-d7fb-6c1c-ce4e5b61f732 (at 192.168.112.60@o2ib6) in 28 seconds. I think it's dead, and I am evicting it. exp ffff881019014400, cur 1393968823 expire 1393968805 last 1393968795
2014-03-04 13:33:43 Lustre: Skipped 2379 previous similar messages
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881017e4d400 ns: mdt-lse-MDT0000_UUID lock: ffff880c765e8600/0x423cea3ce4515069 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1519 type: IBT flags: 0x200400000020 nid: 192.168.120.83@o2ib7 remote: 0xf25524bc0806ed2e expref: 2 pid: 15454 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:43 LustreError: 15454:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 1 previous similar message
2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880fff583400 ns: mdt-lse-MDT0000_UUID lock: ffff8809b72fac00/0x423cea3ce6120703 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1510 type: IBT flags: 0x200400000020 nid: 192.168.115.88@o2ib10 remote: 0xb08c1994fd203a3c expref: 6 pid: 15499 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:45 LustreError: 15499:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 16 previous similar messages
2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880786ee3000 ns: mdt-lse-MDT0000_UUID lock: ffff880068d1d600/0x423cea3ceb81db53 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1409 type: IBT flags: 0x200400000020 nid: 172.16.66.63@tcp remote: 0xb4e5531b81ec2e4b expref: 2 pid: 15083 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:46 LustreError: 15083:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) Skipped 31 previous similar messages
2014-03-04 13:33:53 Lustre: lse-MDT0000: haven't heard from client 99193826-5f5d-6804-2763-bb5bb50685cc (at 192.168.113.194@o2ib5) in 28 seconds. I think it's dead, and I am evicting it. exp ffff880c61172000, cur 1393968833 expire 1393968815 last 1393968805
2014-03-04 13:33:53 Lustre: Skipped 1771 previous similar messages
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:53 LustreError: dumping log to /tmp/lustre-log.1393968833.55308
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
2014-03-04 13:33:53 LustreError: 15476:0:(ldlm_lockd.c:1409:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88052089e400 ns: mdt-lse-MDT0000_UUID lock: ffff88055a52f800/0x423cea3cebaeeab9 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 1385 type: IBT flags: 0x200400000020 nid: 192.168.120.68@o2ib7 remote: 0x413fb76118f617e1 expref: 2 pid: 15476 timeout: 0 lvb_type: 0 used 0
2014-03-04 13:33:53 format at client.c:2860:ptlrpc_abort_cond_resched doesn't end in newline
Comment by Oleg Drokin [ 10/Mar/14 ]

So with all your evicted locks being PR locks, just like in LU-4570, let's try to isolate why readdir locks are so stubborn.

Next time you see a "refused reconnection, still busy with X RPCs" message, please do a sysrq-t on that client node to see what it is doing.
If you can pull off a sysrq-t on the MDT to ensure that the mdt_rdpg_ threads are not stuck somewhere, that would be great as well.
My suspicion is that there's some path on a client that forgets to release a reference to a matched lock (with the lock somehow happening to be a PR lock every time), so it is never cancelled. I tried to review the code and I do not see anything, though. It likely happens on a failure path of some sort, so it only triggers when you have other problems (and I see you also have client evictions for inactivity and whatnot).
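
To make the suspicion concrete, the pattern I'm worried about would look something like this on the client (purely hypothetical code for illustration, not a path I have actually found; use_the_cached_pages() is a made-up helper):

        /* hypothetical client-side bug pattern: ldlm_lock_match() takes a
         * reference on a cached PR lock, and an early error return skips the
         * matching ldlm_lock_decref(), so the lock is never cancelled and the
         * server eventually times out the blocking AST */
        mode = ldlm_lock_match(ns, flags, &res_id, LDLM_IBITS, &policy,
                               LCK_PR, &lockh, 0);
        if (mode != 0) {
                rc = use_the_cached_pages();
                if (rc != 0)
                        return rc;          /* bug: leaks the match reference */
                ldlm_lock_decref(&lockh, mode);
        }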

It's interesting that over at ORNL, after applying the lu4008 and lu4719 patches, similar symptoms went away, though it's not entirely clear why. So it still might be worth your while to pull in those patches.

Comment by Christopher Morrone [ 21/Apr/14 ]

The fix for the very-wrong timeouts in console messages was part of a larger debug patch that is not intended for landing. The timeout fix should be broken out into its own patch and landed.

Comment by Ned Bass [ 10/May/14 ]

Attaching sysrq-t output for client that was being refused reconnection, and for the MDS. The problem with many MDS service threads getting stuck in ldlm_completion_ast() happened again today on two of our lustre clusters. We rebooted the MDS nodes to recover.

Comment by Oleg Drokin [ 10/May/14 ]

As I just commented in LU-4584 - is there any chance you can drop the patch for LU-3338 from your tree and see if the issue is still there?

With ldlm resends not properly detected, you can see this sort of behavior where clients are not dropping locks (that they do not really have) and get timed out.
And the resends are happening because the clients provide too small a buffer.

If the patch removal is too hard, at least try to increase the minimal allocation size some more. I know it would mean harder work for the kernel during allocations and such, but before we tackle making this resend detection bulletproof, I am interested in seeing the results with resends totally eliminated.

Comment by Christopher Morrone [ 12/May/14 ]

As I just commented in LU-4584 - is there any chance you can drop the patch for LU-3338 from your tree and see if the issue is still there?

Oleg, I don't think that is likely. We need that. To take it out, you would need to provide us with a fix that works as well.

Comment by James A Simmons [ 12/May/14 ]

Chris, have you tried http://review.whamcloud.com/#/c/10230? That patch addressed the issue of early replies ignoring at_max.

Comment by Christopher Morrone [ 12/May/14 ]

James, that is on my list of things to try. It will be another thing that helps us make sense of what is going on by giving us saner numbers, but I suspect it won't address our root issue(s). There are so many issues, though, that it is sometimes hard to see the forest for the trees.

Comment by Oleg Drokin [ 02/Jun/14 ]

I think this is several issues in one.
First we have those undesirable lock evictions, which are likely due to LU-4584 / LU-2827; then, once that condition hits, the too-long timeout due to LU-4578 also becomes an issue.

Comment by James A Simmons [ 02/Jun/14 ]

There is also LU-5116, which I saw a few times in testing as well. The patch for that ticket doesn't even solve all the issues.

Comment by Oleg Drokin [ 05/Jun/14 ]

I extracted timeout fixes to http://review.whamcloud.com/10601

My testing seems to show it works.

Without patch:

[ 3867.816073] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 2043s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008bc7adb8/0x4d6bf1df6fe16d9c lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x4d6bf1df6fe16d8e expref: 22 pid: 7000 timeout: 4295859101 lvb_type: 0
[ 3867.820031] LNet: Service thread pid 6487 completed after 150.60s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 4023.816076] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 308s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88009105ddb8/0x4d6bf1df6fe16f71 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0x4d6bf1df6fe16f6a expref: 4 pid: 7010 timeout: 4295898235 lvb_type: 1
[ 4023.820028] LNet: Service thread pid 6833 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

and with the patch that becomes:

[ 1514.816088] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff88008b395db8/0xd330a3da8f272c64 lrc: 3/0,0 mode: PR/PR res: [0x200000400:0x1:0x0].0 bits 0x13 rrc: 5 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0xd330a3da8f272c56 expref: 24 pid: 4414 timeout: 4295270824 lvb_type: 0
[ 1514.828220] LNet: Service thread pid 4604 completed after 150.71s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 1670.816067] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 0@lo  ns: filter-lustre-OST0000_UUID lock: ffff88008af02db8/0xd330a3da8f272e0f lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 0@lo remote: 0xd330a3da8f272e08 expref: 4 pid: 4222 timeout: 4295309985 lvb_type: 1
[ 1670.821319] LNet: Service thread pid 4423 completed after 150.06s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Comment by Peter Jones [ 23/Jun/14 ]

Landed for 2.6
