LU-14171: Lock timed out & hung clients

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.18.0
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5
    • Severity: 3

    Description

      Hi folks,

      We seem to be hitting a lock timeout issue on some parts of our 2.12.5 filesystems. It leaves some clients hung or evicted, and they then need a reboot.

      What we're seeing are entries like this:

      Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0
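
In case it helps with triage, here is a minimal sketch for pulling the FID out of messages like the one above so it can be passed to lfs fid2path. The helper name fid_from_ldlm is hypothetical (it is not a Lustre tool), and the pattern assumes the FID always appears in the "res: [...]" field in the form shown here.

```shell
# Hypothetical helper (not part of Lustre): read log lines on stdin and
# print the FID found in each "res: [...]" field.
fid_from_ldlm() {
    sed -n 's/.*res: \[\(0x[0-9a-f]*:0x[0-9a-f]*:0x[0-9a-f]*\)\].*/\1/p'
}

# Sample taken from the message above (shortened to the relevant fields).
sample='lock timed out (enqueued at 1606693731, 300s ago); ns: mdt-dagg-MDT0000_UUID mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72'
printf '%s\n' "$sample" | fid_from_ldlm
```

On the MDS this could be fed from syslog, for example grep ldlm_expired_completion_wait /var/log/messages | fid_from_ldlm | sort | uniq -c, where the log path is an assumption about a default CentOS 7 setup.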
      

      When we first investigated, it appeared that this FID was indeed not accessible:

      [root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
      /fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini
      

      Running ls on this file hung and resulted in:

      Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766]  req@ffff88ba05c35100 x1684505381509824/t0(0) o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
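
A rough sketch of how the epoch fields in that message can be turned into readable times. It reads "dl" as the request deadline, which is our interpretation rather than something stated in the log, and the helper name epoch_to_utc is hypothetical. It needs GNU date, as shipped with CentOS.

```shell
# Hypothetical helper: format an epoch timestamp as UTC (needs GNU date).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

sent=1606695766      # "sent" field from the message above
deadline=1606696367  # "dl" field from the message above

echo "sent:     $(epoch_to_utc "$sent") UTC"
echo "deadline: $(epoch_to_utc "$deadline") UTC"
echo "waited:   $((deadline - sent))s"
```

For this request the gap between the two fields works out to 601s.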
      

      This file did not show up as being open, per:

      [warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
      

      So far there is one particular workflow that seems to trigger this. Subsequent investigation shows that unmounting and remounting the MDTs makes the file/dir accessible again.

      What steps would you like us to perform to provide additional information to you?

      Cheers,
      Simon

          People

            Assignee: Peter Jones (pjones)
            Reporter: SC Admin (scadmin)