[LU-12774] Lustre client OST stuck in "Evicted" state Created: 17/Sep/19  Updated: 15/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Tim McMullan Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One of our Lustre clients (this one acting as an NFS gateway) was evicted from an OST and appears to be stuck in that state; it never recovers on its own, and rebooting seems to be required to get it back into operation.

 

We were getting these messages in the kernel log:

kernel: [ 7226.864597] LustreError: 11-0: lustre-OST0090-osc-ffff88103aeb4000: operation ost_read to node 10.11.200.13@o2ib failed: rc = -107
kernel: [ 7226.864606] Lustre: lustre-OST0090-osc-ffff88103aeb4000: Connection to lustre-OST0090 (at 10.11.200.13@o2ib) was lost; in progress operations using this service will wait for recovery to complete
kernel: [ 7226.864772] LustreError: 167-0: lustre-OST0090-osc-ffff88103aeb4000: This client was evicted by lustre-OST0090; in progress operations using this service will fail.

kernel: [ 7226.877968] LustreError: 6866:0:(ldlm_resource.c:1101:ldlm_resource_complain()) lustre-OST0090-osc-ffff88103aeb4000: namespace resource [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount nonzero (1) after lock cleanup; forcing cleanup.
kernel: [ 7226.877972] LustreError: 6866:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount = 2
lstgwbal837 kernel: [ 7226.877973] LustreError: 6866:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):
lstgwbal837 kernel: [ 7226.877978] LustreError: 6866:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x526400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1

kernel: [ 7460.227838] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) extent ffff8807ba25aa98@{[6 -> 9/1023], [3|1|-|active|wiuY|ffff8816a6279180], [40960|4|+|-|ffff88100e7a1c00|1024| (null)]} lustre-OST0090-osc-ffff88103aeb4000: wait ext to 0 timedout, recovery in progress?
kernel: [ 7460.227846] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) ### extent: ffff8807ba25aa98 ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x426400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1
kernel: [ 7460.227848] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) Skipped 1 previous similar message

 

We aren't sure exactly what's going on here, but it looks to us like after being evicted the client tried to clean up its locks, failed to release one of them, and that may be what is preventing it from ever attempting recovery.
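For anyone looking at this, here is roughly how we've been checking the client state while it's wedged (a sketch only; these are standard lctl tunables, and the OSC/namespace names are just the ones from the logs above):

# Import state of each OSC; a healthy import reports FULL, a stuck one
# sits in EVICTED (or loops through CONNECTING/DISCONN).
lctl get_param osc.*.state

# Full import details for the affected target (connection state, connect
# attempts, pending RPCs).
lctl get_param osc.lustre-OST0090-osc-*.import

# DLM locks/resources still held in that OSC namespace; a count that never
# drains matches the "refcount nonzero ... after lock cleanup" messages above.
lctl get_param ldlm.namespaces.lustre-OST0090-osc-*.lock_count
lctl get_param ldlm.namespaces.lustre-OST0090-osc-*.resource_count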

We are currently running the 2.10.8 client and server.  Any help would be appreciated!

Thanks!



 Comments   
Comment by Tim McMullan [ 23/Sep/19 ]

We just hit a similar state, but this time the client wasn't actually evicted.

 

We were only getting these:
[Mon Sep 23 13:41:20 2019] LustreError: 7973:0:(osc_cache.c:952:osc_extent_wait()) extent ffff881972951e88@{[29846 -> 29856/30719], [4|1|-|active|wiuY|ffff8816e4114990], [69632|11|+|+|ffff88190cb0d800|1024| (null)]} lustre-OST008f-osc-ffff88103de9c000: wait ext to 0 timedout, recovery in progress?
[Mon Sep 23 13:41:20 2019] LustreError: 7973:0:(osc_cache.c:952:osc_extent_wait()) ### extent: ffff881972951e88 ns: lustre-OST008f-osc-ffff88103de9c000 lock: ffff88190cb0d800/0x390d8f3984002de2 lrc: 4/0,1 mode: PW/PW res: [0x35b5871:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->49151) flags: 0x20000020000 nid: local remote: 0x24410d5ca118deb2 expref: -99 pid: 7973 timeout: 0 lvb_type: 1

 

That PID showed this backtrace:
[<ffffffffa118648c>] osc_extent_wait+0x49c/0x7c0 [osc]
[<ffffffffa118cf64>] osc_cache_wait_range+0x314/0x970 [osc]
[<ffffffffa117909b>] osc_io_fsync_end+0x7b/0x80 [osc]
[<ffffffffa0cd959e>] cl_io_end+0x4e/0x130 [obdclass]
[<ffffffffa104abeb>] lov_io_end_wrapper+0xcb/0xd0 [lov]
[<ffffffffa104ae87>] lov_io_fsync_end+0x77/0x1a0 [lov]
[<ffffffffa0cd959e>] cl_io_end+0x4e/0x130 [obdclass]
[<ffffffffa0cdb940>] cl_io_loop+0x110/0xc30 [obdclass]
[<ffffffffa10af45a>] cl_sync_file_range+0x28a/0x310 [lustre]
[<ffffffffa10cfc03>] ll_writepages+0x73/0x1d0 [lustre]
[<ffffffff81197717>] __filemap_fdatawrite_range+0xa7/0xe0
[<ffffffff8119784c>] filemap_write_and_wait_range+0x3c/0x80
[<ffffffffa10af56b>] ll_fsync+0x8b/0x5d0 [lustre]
[<ffffffffa131cc84>] nfsd_commit+0x94/0xb0 [nfsd]
[<ffffffffa13253f3>] nfsd3_proc_commit+0x83/0xd0 [nfsd]
[<ffffffffa1317e0c>] nfsd_dispatch+0xcc/0x270 [nfsd]
[<ffffffffa125152f>] svc_process_common+0x42f/0x6c0 [sunrpc]
[<ffffffffa12518bd>] svc_process+0xfd/0x1c0 [sunrpc]
[<ffffffffa131787a>] nfsd+0xea/0x160 [nfsd]
[<ffffffff810a0e29>] kthread+0xc9/0xe0
[<ffffffff8161e1df>] ret_from_fork+0x3f/0x80
[<ffffffff810a0d60>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff

 

We are still trying to track down exactly what is getting us into this state.
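For reference, this is roughly how we're grabbing the backtraces while things are hung (nothing Lustre-specific, just standard kernel facilities; the PID is the one from the osc_extent_wait() message):

# Stack of the PID named in the osc_extent_wait() message.
cat /proc/7973/stack

# Or dump backtraces for all blocked (D-state) tasks, which catches every
# stuck nfsd thread at once, into the kernel log.
echo w > /proc/sysrq-trigger
dmesg | tail -n 200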

Comment by Makia Minich (Inactive) [ 01/Oct/19 ]

A quick question, was this system upgraded to 2.10.8, and if so from what? Have there been any further occurrences of this since reported?

Comment by Tim McMullan [ 01/Oct/19 ]

We have had this happen multiple times since reporting the issue. At this point, we've moved the service off to a different host (one still running the older client) and haven't seen the problem on that older client host.

The client was initially on 2.10.0 but was upgraded to 2.10.8, which is when we started seeing problems.

The servers started off on 2.7 quite a while ago; the system was upgraded to 2.10.0, then to 2.10.4, then to 2.10.8. All of that is relatively ancient history at this point; the client is the only thing to have changed recently.

Comment by Makia Minich (Inactive) [ 01/Oct/19 ]

Thanks for this information. Another question, just to get a clearer picture: the failures above happened multiple times; were they all on the same host, or on different hosts? You migrated to a "new" host (running the older client version), but all previous failures were on the same 2.10.8 host?

Comment by Tim McMullan [ 01/Oct/19 ]

Sure thing, thank you for looking! Yes, all occurrences have been on the same host so far.

Comment by Makia Minich (Inactive) [ 01/Oct/19 ]

At this point, I have two running theories. The first is that, because this has been seen on the same host, something is preventing recovery from fully completing, so the problem keeps popping back up. You may need to look at aborting recovery on mount if you haven't already; the expectation is that this will "ignore" the previous state and let the client move back on with life. Have you attempted to abort recovery on this host to see what happens?

The second theory is a bit more worrisome: that some issue introduced between 2.10.0 and 2.10.8 is causing NFS-exporting clients to be evicted. I don't think we're quite there yet, but I'm writing it down just in case.

Comment by Tim McMullan [ 01/Oct/19 ]

So, a slightly more interesting bit: when we were digging around on the system while it was broken, it looked like the client was trying to flush all of its locks but got stuck flushing a single lock, and it never actually appeared to try entering recovery.

A little more information on what we have seen so far: the issue has cropped up three times, and every occurrence has included the "osc_cache.c:952:osc_extent_wait()" Lustre error. The first and third times, we were actually evicted by the server and, we believe, the client never tried to enter recovery. The second occurrence was a little different: it didn't look like we got evicted, and although we saw several of the "osc_cache.c:952:osc_extent_wait()" errors, every nfsd process was locked up. The trace for every nfsd process looked like the one I included in my first comment.

We haven't yet tried to abort recovery on mount, though I was thinking about trying it if we saw the error again.
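For what it's worth, what we had in mind for the next occurrence was something like the following (a sketch only; the lctl "recover" device command and the forced remount are our assumptions about how to kick the stuck import, not a verified fix, and the MGS NID and mount point are placeholders):

# Find the local device name of the stuck OSC.
lctl dl | grep OST0090

# Ask that import to reconnect (assumption: the generic lctl device
# "recover" command, which attempts to re-establish the connection).
lctl --device lustre-OST0090-osc-ffff88103aeb4000 recover

# If that doesn't help, force-unmount and remount the client rather than
# rebooting the whole NFS gateway (placeholder MGS NID and mount point).
umount -f /mnt/lustre
mount -t lustre mgsnode@o2ib:/lustre /mnt/lustre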

Comment by Makia Minich (Inactive) [ 01/Oct/19 ]

OK, this lends more weight to the first theory and to needing to abort recovery. There has been testing where a client wasn't fully releasing its locks, which caused recovery to fail; by aborting recovery, the client gets past that state, and the system can basically quiet down and move on.

Comment by Tim McMullan [ 01/Oct/19 ]

I'll give that a try if we start seeing errors again, but even if it solves the immediate issue, I'm a little concerned about how we got into this state in the first place.

Comment by Makia Minich (Inactive) [ 01/Oct/19 ]

I will need to dig through Jira to see if there are reported issues related to this. Because this appears to be an issue with unreleased locks, and the client is specifically re-exporting via NFS, we are likely seeing something in that path not quite finishing. This could be an untimely crash of this specific host, or some transaction between the NFS server and its clients not quite completing. We would need to trace back to the beginning of the issue (rather than the later evictions) to see whether there was some kind of event at that time we can narrow in on: crashes/reboots, networking errors/hiccups, that kind of thing. This definitely shouldn't be seen under "normal" conditions.

Comment by Tim McMullan [ 11/Oct/19 ]

We had a repro on a different Lustre system:

 

2019-10-11T16:12:14.325450-04:00 lstgwbal850 kernel: [3460541.106751] LustreError: 67287:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x1240000400:0xd5d2c5:0x0].0x0 (ffff880cd3510400) refcount = 2

2019-10-11T16:12:14.325451-04:00 lstgwbal850 kernel: [3460541.106753] LustreError: 67287:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):

2019-10-11T16:12:14.325452-04:00 lstgwbal850 kernel: [3460541.106758] LustreError: 67287:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: lustre2-OST003d-osc-ffff8817d77a6800 lock: ffff880196a92600/0x42b2f97bf1c27aaf lrc: 4/0,1 mode: PW/PW res: [0x1240000400:0xd5d2c5:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x526400000000 nid: local remote: 0x686ffd0d6ebfa2f7 expref: -99 pid: 4105 timeout: 0 lvb_type: 1

2019-10-11T16:17:15.373431-04:00 lstgwbal850 kernel: [3460842.129241] LustreError: 4101:0:(osc_cache.c:952:osc_extent_wait()) extent ffff88156af6f408@{[2388 -> 2388/3071], [3|1|-|active|wiuY|ffff880dc0ec3180], [28672|1|+|-|ffff880196a92600|1024|          (null)]} lustre2-OST003d-osc-ffff8817d77a6800: wait ext to 0 timedout, recovery in progress?

2019-10-11T16:17:15.373434-04:00 lstgwbal850 kernel: [3460842.129248] LustreError: 4101:0:(osc_cache.c:952:osc_extent_wait()) ### extent: ffff88156af6f408 ns: lustre2-OST003d-osc-ffff8817d77a6800 lock: ffff880196a92600/0x42b2f97bf1c27aaf lrc: 4/0,1 mode: PW/PW res: [0x1240000400:0xd5d2c5:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x426400000000 nid: local remote: 0x686ffd0d6ebfa2f7 expref: -99 pid: 4105 timeout: 0 lvb_type: 1

 

Attempting to abort recovery hasn't seemed to help :/

Comment by Tim McMullan [ 11/Oct/19 ]

Our current (tentative) plan going forward is to build the latest 2.12 client for our system and install it on the problem box. If the issue recurs with the 2.12 client, we plan to start bisecting the 2.10 client, installing builds on our NFS server, and seeing (roughly) where the issue stops appearing.

The thinking here is that our 2.10.0 client seems to run clean, so perhaps we can at least find the bug client side.
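Roughly what we have in mind for the bisect, assuming the lustre-release tags are named after the releases (2.10.0 / 2.10.8); a sketch, not a tested procedure:

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git bisect start 2.10.8 2.10.0    # bad tag, good tag (tag names assumed)

# At each bisect step: build client-only RPMs, install them on the NFS
# gateway, and run the workload until the hang does or does not appear.
sh autogen.sh
./configure --disable-server
make -j"$(nproc)" rpms

git bisect good    # or: git bisect bad, depending on the result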

Comment by Makia Minich (Inactive) [ 12/Oct/19 ]

I'd say this is a good next step to see what happens. Thanks for the update.

Comment by Tim McMullan [ 05/Nov/19 ]

We've had a repro of this on the Lustre 2.12.3 client:

LustreError: 4730:0:(osc_cache.c:955:osc_extent_wait()) extent ffff881fe90ce9b8@{[4153 -> 4178/5119], [3|1|-|active|wiuY|ffff881ff61c3b50], [131072|26|+|-|ffff881f7dd55b80|1024| (null)]} lustre-OST0084-osc-ffff88103b209000: wait ext to 0 timedout, recovery in progress?
LustreError: 4730:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff881fe90ce9b8 ns: lustre-OST0084-osc-ffff88103b209000 lock: ffff881f7dd55b80/0xa275852dbeedc19a lrc: 4/0,1 mode: PW/PW res: [0x39b9f46:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 16777216->17125375) flags: 0x20000020000 nid: local remote: 0x7517e8b592c8f8bb expref: -99 pid: 4729 timeout: 0 lvb_type: 1

The node didn't get evicted, but we had enough of these to hang up the NFS server last night.

Comment by Makia Minich (Inactive) [ 15/Nov/19 ]

You had mentioned that if the issue reoccurred you were going to test a 2.10.0 client; were you able to do that? On the next occurrence, it would also be good to get a full set of Lustre logs attached so we can see fully what is happening.
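For the next occurrence, capturing the client-side debug log could look something like this (a sketch; the debug mask, buffer size, and output path are just examples):

# Widen the debug mask so DLM and RPC activity is recorded, and enlarge the
# in-memory debug buffer (values are examples).
lctl set_param debug="+dlmtrace +rpctrace +vfstrace"
lctl set_param debug_mb=512

# ...after the osc_extent_wait() / eviction messages appear, dump the log
# and attach it to this ticket.
lctl dk /tmp/lustre-debug.$(date +%s).log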
