Cray just hit this twice in our 2.5, with SLES11SP3 clients and CentOS 6.4 servers.
Both times, it was immediately preceded by a 'double eviction' of a particular client, that is, two lock callback timers expiring within the same second for the same client:
2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8803fa547980/0xd460be1680c1b564 lrc: 3/0,0 mode: PW/PW res: [0x424b478:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e0e5 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0
2014-02-01 13:44:35 LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 54@gni1 ns: filter-esfprod-OST0004_UUID lock: ffff8802c669d100/0xd460be1680c1b580 lrc: 3/0,0 mode: PW/PW res: [0x424b479:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 54@gni1 remote: 0xf4283d6d3305e155 expref: 2679 pid: 22955 timeout: 4994587934 lvb_type: 0
And the other instance of it (on a different system, also running Cray's 2.5):
LustreError: 3489:0:(ldlm_lockd.c:433:ldlm_add_waiting_lock()) Skipped 1 previous similar message
LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni ns: filter-scratch-OST0001_UUID lock: ffff880445551180/0xa9b58315ab2d4d0a lrc: 3/0,0 mode: PW/PW res: [0xb5da7e1:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47938 expref: 1024 pid: 16057 timeout: 4781830324 lvb_type: 0
LustreError: 53:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 113s: evicting client at 62@gni ns: filter-scratch-OST0001_UUID lock: ffff880387977bc0/0xa9b58315ab2d4d6c lrc: 3/0,0 mode: PW/PW res: [0xb5da7e6:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 62@gni remote: 0x330a8b1d9fd47bfb expref: 1030 pid: 16082 timeout: 4781830324 lvb_type: 0
I'm just speculating, but I suspect the double eviction is key.
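For anyone who wants to check their own server logs for the same pattern, here is a minimal Python sketch. The function name find_double_evictions and the regex are my own, written against the message format quoted above, so treat them as assumptions and not as anything from the Lustre tree. It groups "lock callback timer expired" messages by timestamp, client NID and elapsed seconds, and keeps any group with two or more hits. For logs without a timestamp prefix (like the second instance above) it can only group by NID and elapsed seconds, which is a weaker heuristic.

```python
import re
from collections import defaultdict

# Pattern based on the eviction messages quoted in this comment.
# The leading "YYYY-MM-DD HH:MM:SS" timestamp is optional because only
# some of the pasted logs carry one.
EVICT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})?.*?"
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)"
)


def find_double_evictions(lines):
    """Return {(timestamp, nid, secs): [matching lines]} for every group
    that contains two or more eviction messages.

    timestamp is "" when the line has no timestamp prefix; in that case
    the grouping is only by NID and elapsed seconds (an assumption, not
    proof that the timers fired in the same second).
    """
    groups = defaultdict(list)
    for line in lines:
        m = EVICT_RE.search(line)
        if m is None:
            continue
        key = (m.group("ts") or "", m.group("nid"), int(m.group("secs")))
        groups[key].append(line.rstrip("\n"))
    return {key: hits for key, hits in groups.items() if len(hits) >= 2}
```

Usage would be something like find_double_evictions(open(path)) on a console log or syslog file from the OSS (path being whatever file you have), then looking at the returned keys for the client NID in question.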
For some reason, the nid_stats struct pointer in the obd_export is zero, so I wasn't able to confirm whether this export actually belonged to the double-evicted client. (Perhaps this is always the case?)
[Edited to clarify server and client versions. Also, we've seen this several times since. Not sure why we're suddenly seeing it now, as indications are the underlying bug has been in the code for some time.]
I think LU-5116 might be related here; it could explain the resend across eviction.