Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.12.5
Labels: None
Environment: CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
Hi folks,
We seem to be hitting a lock timeout issue affecting some parts of our 2.12.5 filesystems. It results in some clients hanging or being evicted and needing a reboot.
What we're seeing are entries like this:
Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0
When we first investigated, the file behind that FID did indeed appear to be inaccessible:
[root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
/fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini
Running ls on this file hung and resulted in:
Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766] req@ffff88ba05c35100 x1684505381509824/t0(0) o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
This file did not show up as being open, per:
[warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
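In case it helps with collecting more data, the FID lookup above could be scripted along these lines. This is only a rough sketch: the function names are made up for illustration, and the regular expressions assume the "lock timed out" message format shown in the log entry above.

```python
import re

# Assumed format: "... lock timed out (enqueued at <epoch>, <N>s ago) ... res: [<fid>] ..."
FID_RE = re.compile(r"res: \[(0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\]")
ENQ_RE = re.compile(r"enqueued at (\d+), (\d+)s ago")


def parse_lock_timeout(line):
    """Return (fid, enqueued_epoch, waited_seconds), or None if the line does not match."""
    if "lock timed out" not in line:
        return None
    fid = FID_RE.search(line)
    enq = ENQ_RE.search(line)
    if not fid or not enq:
        return None
    return fid.group(1), int(enq.group(1)), int(enq.group(2))


def timed_out_fids(lines):
    """Collect the distinct FIDs from lock timeout messages, in first-seen order."""
    seen = {}
    for line in lines:
        parsed = parse_lock_timeout(line)
        if parsed:
            seen.setdefault(parsed[0], parsed)
    return list(seen)
```

Each FID returned by timed_out_fids() (fed, for example, with the lines of the MDS syslog) can then be passed to lfs fid2path and grepped for in open_files as shown above.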
So far, one particular workflow seems to trigger this. Subsequent investigation shows that unmounting and remounting the MDTs makes the file/dir accessible again.
What steps would you like us to perform to provide additional information to you?
Cheers,
Simon