Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Affects Version: Lustre 2.12.8
Severity: 3
Description
I am running a large number of deletes on clients, and after a while the clients get evicted. The error on the client is:
/bin/rm: fts_read failed: Cannot send after transport endpoint shutdown
On the MDS, the error is:
un 6 19:28:59 fmds1 kernel: LustreError: 9744:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.21.22.31@tcp ns: mdt-foxtrot-MDT0000_UUID lock: ffff94f72a408480/0xb4442ee3e798319c lrc: 3/0,0 mode: PR/PR res: [0x20009b3c6:0x29eb:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.21.22.31@tcp remote: 0x40ff70b2e6a5419f expref: 147862 pid: 61992 timeout: 6578337 lvb_type: 0
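To see which clients are being evicted and how often, the client NID and the callback timeout can be pulled out of these MDS messages. This is only a rough sketch: the function names `evicted_nid` and `eviction_timeout` are made up for illustration, and the patterns assume the message wording shown in the log line above.

```shell
# Hypothetical helpers (not part of Lustre): both read kernel log lines on stdin.

# Print the NID of each client evicted by an expired lock callback timer.
evicted_nid() {
    sed -n 's/.*lock callback timer expired after [0-9]*s: evicting client at \([^ ]*\) .*/\1/p'
}

# Print the lock callback timeout, in seconds, reported in each eviction message.
eviction_timeout() {
    sed -n 's/.*lock callback timer expired after \([0-9]*\)s:.*/\1/p'
}
```

For example, assuming the MDS kernel messages end up in /var/log/messages, `grep expired_lock_main /var/log/messages | evicted_nid | sort | uniq -c` would count evictions per client.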
I'm running maybe 10-15 recursive rm on each of 3 clients, so 30-45 in total at once.
I've set debugging params as follows:
lctl set_param debug_mb=1024
lctl set_param debug="+dlmtrace +info +rpctrace"
lctl set_param dump_on_eviction=1
on clients and the MDS.
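Since the same settings have to be applied on every client and on the MDS, they could be wrapped in a small helper and run on each node. A minimal sketch, assuming it is run as root on a node with `lctl` installed; the function name `apply_debug_settings` and the `LCTL` override variable are hypothetical and only there to make a dry run possible.

```shell
# Hypothetical helper: apply the debug settings from this ticket on the local node.
# LCTL is an assumed override for the lctl binary (e.g. LCTL=echo for a dry run).
apply_debug_settings() {
    # Enlarge the kernel debug buffer to 1024 MB.
    "${LCTL:-lctl}" set_param debug_mb=1024 || return 1
    # Add lock, info and RPC tracing to the current debug mask.
    "${LCTL:-lctl}" set_param debug="+dlmtrace +info +rpctrace" || return 1
    # Dump the debug log automatically when an eviction happens.
    "${LCTL:-lctl}" set_param dump_on_eviction=1 || return 1
}
```

Running `LCTL=echo apply_debug_settings` prints the three commands without changing anything; `lctl get_param debug debug_mb dump_on_eviction` can be used afterwards to confirm the values took effect.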
Lustre version is 2.12.8_6_g5457c37
Hi Peter, was just about to post an update. No evictions since the patch was applied earlier in the week (Tuesday), so good news on that front. Will keep an eye on it over the weekend. We get the odd soft lockup (e.g., Nov 9 03:11:25 foxtrot3 kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [ptlrpcd_01_10:3531]). I can open a separate ticket for that issue if you like.