Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Affects Version: Lustre 2.10.8
Severity: 3
Description
One of our Lustre clients (this one is acting as an NFS gateway) was evicted from an OST, appears to be stuck in that state, and never recovers. A reboot seems to be required to bring it back into operation.
We saw these messages in the kernel log:
kernel: [ 7226.864597] LustreError: 11-0: lustre-OST0090-osc-ffff88103aeb4000: operation ost_read to node 10.11.200.13@o2ib failed: rc = -107
kernel: [ 7226.864606] Lustre: lustre-OST0090-osc-ffff88103aeb4000: Connection to lustre-OST0090 (at 10.11.200.13@o2ib) was lost; in progress operations using this service will wait for recovery to complete
kernel: [ 7226.864772] LustreError: 167-0: lustre-OST0090-osc-ffff88103aeb4000: This client was evicted by lustre-OST0090; in progress operations using this service will fail.
kernel: [ 7226.877968] LustreError: 6866:0:(ldlm_resource.c:1101:ldlm_resource_complain()) lustre-OST0090-osc-ffff88103aeb4000: namespace resource [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount nonzero (1) after lock cleanup; forcing cleanup.
kernel: [ 7226.877972] LustreError: 6866:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount = 2
lstgwbal837 kernel: [ 7226.877973] LustreError: 6866:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):
lstgwbal837 kernel: [ 7226.877978] LustreError: 6866:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x526400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1
kernel: [ 7460.227838] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) extent ffff8807ba25aa98@{[6 -> 9/1023], [3|1|-|active|wiuY|ffff8816a6279180], [40960|4|+|-|ffff88100e7a1c00|1024| (null)]} lustre-OST0090-osc-ffff88103aeb4000: wait ext to 0 timedout, recovery in progress?
kernel: [ 7460.227846] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) ### extent: ffff8807ba25aa98 ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x426400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1
kernel: [ 7460.227848] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) Skipped 1 previous similar message
We aren't sure what's going on here, but it looks to us as if, after being evicted, the client tried to clean up its locks and failed to clean one of them up, and that this is preventing it from attempting recovery?
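To make the two dumps of the stuck lock easier to compare, here is a minimal Python sketch that pulls a few fields out of the lock lines quoted above. The helper name `parse_lock_fields` and the regular expression are our own illustration, not part of Lustre, and the input strings are shortened copies of the log lines with the unrelated fields replaced by `...`.

```python
import re

# Hypothetical pattern: it only mirrors the "key: value" layout seen in the
# lock dump lines above and is not derived from the Lustre source.
LOCK_RE = re.compile(
    r"lock: (?P<lock>[0-9a-f]+)/(?P<handle>0x[0-9a-f]+) .*?"
    r"mode: (?P<mode>\S+) .*?"
    r"flags: (?P<flags>0x[0-9a-f]+) .*?"
    r"pid: (?P<pid>\d+)"
)


def parse_lock_fields(line):
    """Return selected lock fields from one dump line, or None if no match."""
    match = LOCK_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["flags"] = int(fields["flags"], 16)  # hex string -> integer bitmask
    fields["pid"] = int(fields["pid"])
    return fields


# Lock as dumped by ldlm_resource_dump() at 7226.877978.
at_cleanup = parse_lock_fields(
    "lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW ... "
    "flags: 0x526400020000 nid: local ... expref: -99 pid: 5106 timeout: 0"
)
# Same lock as dumped by osc_extent_wait() at 7460.227846.
at_timeout = parse_lock_fields(
    "lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW ... "
    "flags: 0x426400020000 nid: local ... expref: -99 pid: 5106 timeout: 0"
)

same_lock = at_cleanup["handle"] == at_timeout["handle"]
changed_bits = at_cleanup["flags"] ^ at_timeout["flags"]  # bits that differ

print(same_lock)          # True
print(hex(changed_bits))  # 0x100000000000
```

Both messages refer to the same lock (same address and handle, owned by pid 5106, the thread that later times out in osc_extent_wait()), and the only difference between the two flag values is the single bit 0x100000000000. What that bit means would need to be checked against the LDLM flag definitions in the Lustre source.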
We are currently running the 2.10.8 client and server. Any help would be appreciated!
Thanks!