  1. Lustre
  2. LU-14171

Lock timed out & hung clients

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Lustre 2.12.5
    • Environment: CentOS 7.8, ZFS 0.8.5, Lustre 2.12.5

    Description

      Hi folks,

      We seem to be hitting a lock timeout issue affecting parts of our Lustre 2.12.5 filesystems. It leaves some clients hung or evicted, and they need a reboot to recover.

      What we're seeing are entries like this:

      Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1606693731, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 42898 timeout: 0 lvb_type: 0
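
To make lines like this easier to triage, here is a minimal Python sketch that pulls the namespace, the resource FID, and the enqueue time out of a lock-timeout message. The function name `parse_lock_timeout` and the regular expression are illustrative assumptions based only on the line quoted above, not part of any Lustre tooling, and the sample line is abridged after `type: IBT`.

```python
import re
from datetime import datetime, timezone

# Assumed layout, taken from the single log line quoted in this report.
LOCK_TIMEOUT_RE = re.compile(
    r"enqueued at (?P<enqueued>\d+), (?P<waited>\d+)s ago\)"
    r".*?ns: (?P<ns>\S+)"
    r".*?res: \[(?P<fid>0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+)\]"
)


def parse_lock_timeout(line):
    """Return namespace, FID and enqueue time from a lock-timeout line, or None."""
    m = LOCK_TIMEOUT_RE.search(line)
    if m is None:
        return None
    enqueued = int(m.group("enqueued"))
    return {
        "namespace": m.group("ns"),
        "fid": m.group("fid"),
        # The "enqueued at" value is a Unix epoch timestamp in seconds.
        "enqueued_utc": datetime.fromtimestamp(enqueued, tz=timezone.utc),
        "waited_s": int(m.group("waited")),
    }


LINE = (
    "Nov 30 10:53:51 warble2 kernel: LustreError: 42898:0:"
    "(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out "
    "(enqueued at 1606693731, 300s ago); not entering recovery in server code, "
    "just going back to sleep ns: mdt-dagg-MDT0000_UUID "
    "lock: ffff8ec1cc05a400/0xe4be9cdd1627e166 lrc: 3/1,0 mode: --/PR "
    "res: [0x200054b1e:0xfc06:0x0].0x0 bits 0x13/0x48 rrc: 72 type: IBT"
)

if __name__ == "__main__":
    info = parse_lock_timeout(LINE)
    print(info["namespace"], info["fid"], info["enqueued_utc"].isoformat())
```

The extracted FID can then be passed to `lfs fid2path` on a client to find the affected path.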
      

      When we first investigated, the file behind that FID was indeed inaccessible:

      [root@farnarkle1 ~]# lfs fid2path /fred 0x200054b1e:0xfc06:0x0
      /fred/oz002/bgoncharov/ppta_data_analysis/Datasets/j0437_pdfb234_caspsr_20200928/chains_i6_g10/B_40CM/J0437-4715/chains/B_40CM.properties.ini
      

      Running ls on this file hung, and the client logged:

      Nov 30 11:32:47 farnarkle1 kernel: Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1606695766/real 1606695766]  req@ffff88ba05c35100 x1684505381509824/t0(0) o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 lens 3584/960 e 23 to 1 dl 1606696367 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
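
As a rough aid for reading the client-side message, the sketch below extracts the opcode, the target, and the gap between the send time and the deadline (`dl`). The function name `parse_expired_request` and the regular expression are assumptions derived only from the line above. The comment that opcode 101 is a lock enqueue (`LDLM_ENQUEUE`) reflects my reading of the Lustre headers and should be checked against your source tree.

```python
import re

# Assumed layout, taken from the single ptlrpc_expire_one_request() line above.
EXPIRED_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real \d+\]"
    r".*?\so(?P<opcode>\d+)->(?P<target>[^@\s]+)@"
    r".*?\sdl (?P<deadline>\d+)"
)


def parse_expired_request(line):
    """Return opcode, target and wait time for an expired request, or None."""
    m = EXPIRED_RE.search(line)
    if m is None:
        return None
    sent = int(m.group("sent"))
    deadline = int(m.group("deadline"))
    return {
        # 101 is believed to be LDLM_ENQUEUE, i.e. a lock enqueue request.
        "opcode": int(m.group("opcode")),
        "target": m.group("target"),
        # Both values are Unix epoch seconds, so the difference is in seconds.
        "waited_s": deadline - sent,
    }


LINE = (
    "Lustre: 94436:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request "
    "sent has timed out for slow reply: [sent 1606695766/real 1606695766]  "
    "req@ffff88ba05c35100 x1684505381509824/t0(0) "
    "o101->dagg-MDT0000-mdc-ffff88b8f27e7000@192.168.33.22@o2ib33:12/10 "
    "lens 3584/960 e 23 to 1 dl 1606696367 ref 2"
)

if __name__ == "__main__":
    print(parse_expired_request(LINE))
```

For the line above, the gap between `sent` and `dl` is 601 seconds, which suggests the client's request to dagg-MDT0000 waited roughly ten minutes before it expired.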
      

      This file did not show up as being open, per:

      [warble2]root: grep 0x200054b1e:0xfc06:0x0 /proc/fs/lustre/mdt/*/exports/*/open_files
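
The same check can be scripted so it reports which export, if any, holds the file open. This is a minimal sketch: `find_open_fid` is a hypothetical helper, the default glob pattern is the path from the grep above, and it assumes each `open_files` entry contains the FID as plain text.

```python
import glob


def find_open_fid(fid, pattern="/proc/fs/lustre/mdt/*/exports/*/open_files"):
    """Return the open_files paths that mention the FID (empty list if none)."""
    hits = []
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                # Plain substring match, equivalent to the grep above.
                if any(fid in row for row in fh):
                    hits.append(path)
        except OSError:
            # An export may disappear between the glob and the open; skip it.
            continue
    return sorted(hits)


if __name__ == "__main__":
    # On a host without Lustre MDTs the glob matches nothing and this prints [].
    print(find_open_fid("0x200054b1e:0xfc06:0x0"))
```

An empty list corresponds to the empty grep output: no export reports the file as open.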
      

      So far, one particular workflow seems to trigger this. Further investigation shows that unmounting and remounting the MDTs makes the file/dir accessible again.

      What steps would you like us to perform to provide additional information to you?

      Cheers,
      Simon

          People

            pjones Peter Jones
            scadmin SC Admin (Inactive)