[LU-14171] Lock timed out & hung clients Created: 02/Dec/20 Updated: 04/Dec/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | SC Admin (Inactive) | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi folks,

We seem to be hitting a lock timeout issue on parts of our 2.12.5 filesystems that is resulting in some clients being hung/evicted and requiring a reboot. What we're seeing are entries like this:

Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0

At the time of first investigating, that FID was indeed not accessible:

[root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
/fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini

ls'ing this file hung and resulted in:

Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766] req@ffff88ba05c35100 x1684505381509824/t0(0) o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1

This file did not show up as being open, per:

[warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files

So far there is one particular workflow that seems to trigger this. Subsequent investigation shows that unmounting and remounting the MDTs makes the file/dir accessible again.

What steps would you like us to perform to provide additional information?

Cheers, |
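For reference, the checks above can be bundled into a small script. This is only a sketch using the mount point (/fred) and FID from this report; the fid2path/stat part is meant to run on a client and the open_files grep on the MDS:

```bash
#!/bin/bash
# Sketch of the triage steps described above. The mount point and FID are the
# ones from this report; adjust for any other filesystem.
MNT=/fred
FID=0x200054b1e:0xfc06:0x0

# On a client: map the FID from the "lock timed out" console message to a path.
P=$(lfs fid2path "$MNT" "$FID")
echo "FID $FID -> $P"

# A plain ls/stat hangs on the affected file, so bound the check with a timeout.
timeout 30 stat "$P" || echo "stat timed out or failed"

# On the MDS: check whether any client export still has the file open.
grep "$FID" /proc/fs/lustre/mdt/*/exports/*/open_files
```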
| Comments |
| Comment by Etienne Aujames [ 02/Dec/20 ] |
|
Hello, Maybe the |
| Comment by SC Admin (Inactive) [ 03/Dec/20 ] |
|
Hi Etienne, thanks for looking at the ticket.

I've attached the ~2 days of ldlm_request looping that we saw on the hung FID. After that the MDS stopped responding to all requests, the fs was hung, and we were forced to umount, lustre_rmmod, and mount the MDT to recover. There weren't any ldlm_lockd.c::expired_lock_main messages in those 2 days.

On the client side, our main symptom was that there were D state processes on several clients, all associated with that single FID. When we rebooted those clients, it didn't affect the hung processes on other nodes.

The MDT totally failed at approx Dec 1 14:50:49. After the total fail we did see a couple of ldlm_lockd.c:256:expired_lock_main messages, but not during the preceding 2 days.

...
/var/log/messages-20201202.gz:Dec 1 14:19:32 warble2 kernel: LustreError: 42988:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606792472, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec603688b40/0xe4be9cdddb855e96 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 224 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42988 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec 1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606793921, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1e6b402d00/0xe4be9cdde0306879 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43125 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec 1 14:43:41 warble2 kernel: LustreError: 43125:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 1 previous similar message
/var/log/messages-20201202.gz:Dec 1 14:45:48 warble2 kernel: LustreError: 43123:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606794048, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8f1b19948900/0xe4be9cdde086763d lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 228 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 43123 timeout: 0 lvb_type: 0
/var/log/messages-20201202.gz:Dec 1 14:52:30 arkle8 kernel: LustreError: 54274:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 192.168.44.110@o2ib44 ns: filter-dagg-OST000e_UUID lock: ffff8f1d76277bc0/0xc9ea3a0ff004f28e lrc: 3/0,0 mode: PW/PW res: [0xc7844cc:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400020020 nid: 192.168.44.110@o2ib44 remote: 0x6745004c01094c2f expref: 269 pid: 24386 timeout: 226556 lvb_type: 0
/var/log/messages-20201202.gz:Dec 1 14:52:33 arkle8 kernel: LustreError: 163439:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 192.168.44.110@o2ib44 arrived at 1606794753 with bad export cookie 14549505386225834195
/var/log/messages-20201202.gz:Dec 1 15:01:18 warble2 kernel: LustreError: 43038:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8ec9ee5d6800 ns: mdt-dagg-MDT0000_UUID lock: ffff8ec41c8e1440/0xe4be9cdcf8e07aae lrc: 3/0,0 mode: PR/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x1b/0x0 rrc: 227 type: IBT flags: 0x50200400000020 nid: 192.168.44.137@o2ib44 remote: 0xfc2c8df7a2d04940 expref: 7 pid: 43038 timeout: 0 lvb_type: 0

does

cheers, |
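In case it helps with the next occurrence, a rough sketch (plain Linux, nothing Lustre-specific; needs root) of capturing the D state processes and their kernel stacks on an affected client before rebooting it:

```bash
#!/bin/bash
# Rough sketch: list D (uninterruptible sleep) processes and dump their kernel
# stacks so the hang point is preserved before the client is rebooted.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== pid $pid: $(tr '\0' ' ' < /proc/$pid/cmdline) ==="
    cat /proc/$pid/stack 2>/dev/null    # requires root
done
```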
| Comment by Oleg Drokin [ 03/Dec/20 ] |
|
I don't think

This needs a collection of debug logs (with elevated debug level) from Lustre on both the affected client(s) and servers. Something like this: https://jira.whamcloud.com/browse/LU-13500?focusedCommentId=269199&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-269199

LU-13500 is a good example ticket to read through to see how the debug info is gathered and what it needs to have. |
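For what it's worth, a minimal sketch of what such a collection might look like on one affected node (client or server); the exact debug mask, buffer size and file name here are assumptions, and the LU-13500 comment linked above should be treated as the actual recipe:

```bash
#!/bin/bash
# Sketch of collecting an elevated-level Lustre debug log around a reproduction
# of the hang on one node.
lctl set_param debug=-1        # enable all debug flags (very verbose)
lctl set_param debug_mb=1024   # enlarge the in-memory debug buffer
lctl clear                     # drop anything already buffered
lctl mark "LU-14171 reproducing hang on 0x200054b1e:0xfc06:0x0"

# ... reproduce the problem here, e.g. stat the affected path ...

lctl dk > /tmp/lu-14171-$(hostname).dk   # dump the debug log to a file
```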