[LU-14171] Lock timed out & hung clients Created: 02/Dec/20  Updated: 04/Dec/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: SC Admin (Inactive) Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5


Attachments: Text File ldlm_request.log.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi folks,

We seem to be hitting a lock timeout issue on parts of our 2.12.5 filesystems that is resulting in some clients being hung/evicted and requiring a reboot.

What we're seeing are entries like this:

Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0
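When correlating these messages across hosts and days, the two useful fields are the resource FID (`res:`) and the enqueue timestamp. A small sketch for pulling them out of a console line (a hypothetical helper for illustration, not part of any Lustre tooling):

```python
import re

# Example console line, abbreviated from the message above.
LINE = ("LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) "
        "### lock timed out (enqueued at 1606693731, 300s ago); "
        "res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48")

def parse_lock_timeout(line):
    """Return (fid, enqueued_epoch) from an ldlm_expired_completion_wait
    line, or None if the line does not match."""
    m = re.search(r"enqueued at (\d+),.*res: \[([0-9a-fx:]+)\]", line)
    if not m:
        return None
    return m.group(2), int(m.group(1))

fid, enqueued = parse_lock_timeout(LINE)
print(fid, enqueued)  # -> 0x200054b1e:0xfc06:0x0 1606693731
```

The FID can then be fed to `lfs fid2path` and the epoch compared across the client and server logs.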

At the time of first investigating, it appeared that the FID was indeed not accessible:

[root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
/fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini

ls'ing this file hung and resulted in:

Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766]  req@ffff88ba05c35100 x1684505381509824/t0(0) o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1

This file did not show up as being open, per:

[warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files

So far there is one particular workflow that seems to trigger this. Subsequent investigation shows that unmounting and remounting the MDTs makes the file/dir accessible again.
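For reference, the workaround amounts to roughly the following sequence (a sketch only; the MDT device name and mount point here are placeholders, not our actual configuration):

```shell
# Sketch of the MDT remount workaround; device/mountpoint are hypothetical.
umount /mnt/dagg-mdt0000                         # unmount the stuck MDT
lustre_rmmod                                     # unload Lustre kernel modules
mount -t lustre daggpool/mdt0000 /mnt/dagg-mdt0000   # remount (modules reload automatically)
```

After the remount completes recovery, the previously hung file/dir becomes accessible again.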

What steps would you like us to perform to provide additional information to you?

Cheers,
Simon



 Comments   
Comment by Etienne Aujames [ 02/Dec/20 ]

Hello,

Maybe LU-14031 could help you?

Comment by SC Admin (Inactive) [ 03/Dec/20 ]

Hi Etienne,

thanks for looking at the ticket.

I've attached the ~2 days of ldlm_request looping that we saw on the hung FID.

after that the MDS stopped responding to all requests entirely, the fs was hung, and we were forced to umount, lustre_rmmod, and remount the MDT to recover. there weren't any ldlm_lockd.c::expired_lock_main messages in those 2 days.

on the client side, our main symptom was that there were D state processes on several clients - all associated with that single FID. when we rebooted those clients, it didn't affect the hung processes on other nodes.

the MDT totally failed at approx Dec 1 14:50:49
MDTs were "umount complete" at Dec 1 15:01:19

after the total fail we did see a couple of ldlm_lockd.c:256:expired_lock_main messages, but not during the preceding 2 days.
greps for "ldlm_request.c|ldlm_lockd" at the end of the event are below:

...
/var/log/messages-20201202.gz:Dec  1 14:19:32 warble2 kernel: LustreError: 42988:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606792472, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec603688b40/0xe4be9cdddb855e96 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 224 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42988 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606793921, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1e6b402d00/0xe4be9cdde0306879 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43125 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 1 previous similar message
/var/log/messages-20201202.gz:Dec  1 14:45:48 warble2 kernel: LustreError: 43123:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606794048, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1b19948900/0xe4be9cdde086763d lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43123 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:52:30 arkle8 kernel: LustreError: 54274:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 192.168.44.110@o2ib44  ns: filter-dagg-OST000e_UUID lock: ffff8f1d76277bc0/0xc9ea3a0ff004f28e lrc: 3/0,0 mode: PW/PW res: [0xc7844cc:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400020020 nid: 192.168.44.110@o2ib44 remote: 0x6745004c01094c2f expref: 269 pid: 24386 timeout: 226556 lvb_type: 0
/var/log/messages-20201202.gz:Dec  1 14:52:33 arkle8 kernel: LustreError: 163439:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 192.168.44.110@o2ib44 arrived at 1606794753 with bad export cookie 14549505386225834195
/var/log/messages-20201202.gz:Dec  1 15:01:18 warble2 kernel: LustreError: 43038:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8ec9ee5d6800 ns: mdt-dagg-MDT0000_UUID lock: ffff8ec41c8e1440/0xe4be9cdcf8e07aae lrc: 3/0,0 mode: PR/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x1b/0x0 rrc: 227 type: IBT flags: 0x50200400000020 nid: 192.168.44.137@o2ib44 remote: 0xfc2c8df7a2d04940 expref: 7 pid: 43038 timeout: 0 lvb_type: 0

does LU-14031 match those symptoms? TBH it doesn't really look like it to me.

cheers,
robin

Comment by Oleg Drokin [ 03/Dec/20 ]

I don't think LU-14031 matches.

This needs a collection of Lustre debug logs (with an elevated debug level) from both the affected client(s) and servers.

Something like this: https://jira.whamcloud.com/browse/LU-13500?focusedCommentId=269199&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-269199

LU-13500 is a good example ticket to read through to see how the debug info is gathered and what it needs to contain.
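For anyone following along, debug-log collection with `lctl` typically looks something like the sketch below; the exact debug mask and buffer size to use are whatever the linked comment requests, and the paths here are placeholders:

```shell
# Sketch of Lustre debug-log collection on an affected node (run as root).
lctl set_param debug=-1         # enable all debug flags (or a narrower mask, e.g. +dlmtrace)
lctl set_param debug_mb=1024    # enlarge the in-memory debug buffer
lctl clear                      # discard stale debug entries
# ... reproduce the hang (e.g. ls the affected file) ...
lctl dk > /tmp/lustre-debug.$(hostname).log   # dump and decode the debug buffer
```

Collect the dump from both the hung client(s) and the MDS around the same reproduction window, then attach the files to the ticket.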

Generated at Sat Feb 10 03:07:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.