Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.8.0
-
None
-
CentOS 7.2; 2 MDS; 2 OSS
-
3
-
9223372036854775807
Description
As the logs below show, the client was evicted by OST0014. Since then, this client can neither read from nor write to that OST. Reads from it fail with errors such as -108 (ESHUTDOWN).
The corresponding OBD device is listed with status "IN" in the output of "lctl dl".
Unmounting fails unless umount is run with the -l option.
There are also periodic "sluggish" warnings in /var/log/messages, roughly every 10 minutes.
All other clients are behaving normally.
The client had not recovered after several hours, and I believe it would not recover even after days (I did not wait that long).
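For anyone checking for the same state, here is a minimal sketch that picks out devices marked "IN" from "lctl dl" output. It assumes the usual column layout (index, status, type, name, UUID, refcount). The sample text is hypothetical: only the OST0014 device name comes from the client log in this ticket, and the UUIDs are placeholders.

```python
def inactive_devices(lctl_dl_output):
    """Return (type, name) for every device whose status column is "IN"."""
    found = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Assumed columns: index, status, type, name, uuid, refcount.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1] == "IN":
            found.append((fields[2], fields[3]))
    return found


# Hypothetical sample; placeholder UUIDs, device names modelled on the client log.
SAMPLE = """\
  1 UP osc stjfs-OST0013-osc-ffff883fbd33e000 00000000-0000-0000-0000-000000000000 5
  2 IN osc stjfs-OST0014-osc-ffff883fbd33e000 00000000-0000-0000-0000-000000000000 5
"""

if __name__ == "__main__":
    for dev_type, name in inactive_devices(SAMPLE):
        print(dev_type, name)
```

On a live client the input would be the captured output of "lctl dl" rather than the sample string.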
OSS02:
Jun 10 13:12:21 oss02 kernel: LustreError: 10566:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 150s: evicting client at 10.3.28.26@o2ib ns: filter-stjfs-OST0014_UUID lock: ffff8831290af000/0xc76c97daaf26c922 lrc: 3/0,0 mode: PW/PW res: [0x1fe2965:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->32767) flags: 0x60000400010020 nid: 10.3.28.26@o2ib remote: 0x9b6b25718fa9bc69 expref: 250765 pid: 9550 timeout: 5051103759 lvb_type: 0
Jun 10 13:13:48 oss02 kernel: LustreError: 12072:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) ldlm_cancel from 10.3.28.26@o2ib arrived at 1528603939 with bad export cookie 14370027470008155739
Jun 10 13:13:48 oss02 kernel: LustreError: 12072:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) Skipped 10661 previous similar messages
Jun 10 13:16:02 oss02 kernel: Lustre: stjfs-OST0010: haven't heard from client 0a058e08-b52c-08ce-7512-d7389f48fc20 (at 10.3.28.21@o2ib) in 230 seconds. I think it's dead, and I am evicting it. exp ffff8808b1f10800, cur 1528604162 expire 1528604012 last 1528603932
client:
Jun 10 13:12:09 5A301-0308-G5500-25 kernel: LustreError: 167-0: stjfs-OST0014-osc-ffff883fbd33e000: This client was evicted by stjfs-OST0014; in progress operations using this service will fail.
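The two sides of the eviction can be matched up mechanically. The sketch below is only an illustration: the regular expressions are written against the message wording seen in the log lines above and are an assumption about other Lustre versions. The server line in the sample is abbreviated after the lock handle.

```python
import re

# Server side: lock callback timeout followed by the eviction notice.
SERVER_EVICT = re.compile(
    r"lock callback timer expired after (?P<timeout>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: filter-(?P<target>\S+?)_UUID"
)
# Client side: notification that a target evicted this client.
CLIENT_EVICT = re.compile(r"This client was evicted by (?P<target>[^;\s]+);")


def find_evictions(lines):
    """Return (side, target, client_nid, timeout_seconds) tuples."""
    events = []
    for line in lines:
        m = SERVER_EVICT.search(line)
        if m:
            events.append(("server", m["target"], m["nid"], int(m["timeout"])))
            continue
        m = CLIENT_EVICT.search(line)
        if m:
            # The client message carries no NID or timeout.
            events.append(("client", m["target"], None, None))
    return events


# Lines taken from this ticket (server line abbreviated).
LOG = [
    "Jun 10 13:12:21 oss02 kernel: LustreError: 10566:0:(ldlm_lockd.c:342:"
    "waiting_locks_callback()) ### lock callback timer expired after 150s: "
    "evicting client at 10.3.28.26@o2ib ns: filter-stjfs-OST0014_UUID "
    "lock: ffff8831290af000/0xc76c97daaf26c922",
    "Jun 10 13:12:09 5A301-0308-G5500-25 kernel: LustreError: 167-0: "
    "stjfs-OST0014-osc-ffff883fbd33e000: This client was evicted by "
    "stjfs-OST0014; in progress operations using this service will fail.",
]

if __name__ == "__main__":
    for event in find_evictions(LOG):
        print(event)
```

Both events name the same target, stjfs-OST0014, which is consistent with the description: the OSS evicted 10.3.28.26@o2ib after the 150s lock callback timeout, and the client logged the matching eviction.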