LU-12708: Cannot access directory and lock timed out

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0 + patches
    • Severity: 3

    Description

      On our 2.12 Fir filesystem, a directory hosted on MDT0000 appears to be no longer accessible:

      /fir/users/bjing/caspposes/CASP11/taskdir1
      FID: 0x200029d02:0x1b59c:0x0

      [root@fir-rbh01 ~]# lfs fid2path /fir 0x200029d02:0x1b59c:0x0
      /fir/users/bjing/caspposes/CASP11/taskdir1
      [root@fir-rbh01 ~]# lfs getdirstripe /fir/users/bjing/caspposes/CASP11/taskdir1
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      

      strace of ls:

      stat("/fir/users/bjing/caspposes/CASP11/", {st_mode=S_IFDIR|S_ISGID|0775, st_size=12288, ...}) = 0
      openat(AT_FDCWD, "/fir/users/bjing/caspposes/CASP11/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
      getdents(3,
      

      Logs showing the FID on the MDS of MDT0000:

      [root@fir-md1-s1 ~]# journalctl -n 100000 -k | grep 0x200029d02:0x1b59c:0x0
      Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0
      Aug 11 19:29:47 fir-md1-s1 kernel: LustreError: 23597:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576897, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3483a0ec00/0x5d9ee6c5055796b8 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 34 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23597 timeout: 0 lvb_type: 0
      Aug 11 19:30:50 fir-md1-s1 kernel: LustreError: 21003:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576960, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3245e35580/0x5d9ee6c5057392f2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21003 timeout: 0 lvb_type: 0
      Aug 27 13:10:04 fir-md1-s1 kernel: LustreError: 21452:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566936514, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2d9e083180/0x5d9ee6e65c38e54b lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 28 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21452 timeout: 0 lvb_type: 0
      Aug 27 14:34:43 fir-md1-s1 kernel: LustreError: 23645:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941593, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f28934e18c0/0x5d9ee6e686d8d27d lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23645 timeout: 0 lvb_type: 0
      Aug 27 14:35:13 fir-md1-s1 kernel: LustreError: 50442:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566941623, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ee5499f80/0x5d9ee6e686e6af45 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 30 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50442 timeout: 0 lvb_type: 0
      Aug 27 22:18:01 fir-md1-s1 kernel: LustreError: 10504:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969391, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f1275822640/0x5d9ee6e70bada6a3 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 31 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 10504 timeout: 0 lvb_type: 0
      Aug 27 22:20:00 fir-md1-s1 kernel: LustreError: 20457:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969510, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f07c5f1ba80/0x5d9ee6e70bdaec42 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20457 timeout: 0 lvb_type: 0
      Aug 27 22:20:30 fir-md1-s1 kernel: LustreError: 23607:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1566969540, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ac3114a40/0x5d9ee6e70be681a2 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23607 timeout: 0 lvb_type: 0
      Aug 28 10:28:39 fir-md1-s1 kernel: LustreError: 21681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013229, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f4136e41b00/0x5d9ee6e786713af9 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21681 timeout: 0 lvb_type: 0
      Aug 28 10:29:09 fir-md1-s1 kernel: LustreError: 23681:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567013259, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ace24c140/0x5d9ee6e78694ca10 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 35 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23681 timeout: 0 lvb_type: 0
      Aug 28 10:57:40 fir-md1-s1 kernel: LustreError: 23603:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014969, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2ab0598480/0x5d9ee6e78df1fd5f lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23603 timeout: 0 lvb_type: 0
      Aug 28 10:58:10 fir-md1-s1 kernel: LustreError: 50447:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567014999, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f2920b8cec0/0x5d9ee6e78e1a0d54 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 37 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 50447 timeout: 0 lvb_type: 0
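
The log lines above follow a regular `key: value` layout, so the fields relevant to the question (resource FID, lock mode, NID, and any evicted client) can be pulled out with a small script. The following is only a sketch: the function name `parse_ldlm_line` and the regular expressions are illustrative assumptions derived from the lines shown here, not part of any Lustre tool, and the layout may differ in other Lustre versions. Note that an eviction message identifies a client that failed to answer a callback in time; on its own it does not prove that this client still holds a lock.

```python
import re
from datetime import datetime, timezone

# Two lines copied verbatim from the journalctl output above.
SAMPLE = [
    "Aug 11 19:29:07 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 57s: evicting client at 10.8.26.28@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f32cb2f7740/0x5d9ee6c5054b1779 lrc: 4/0,0 mode: PR/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 40 type: IBT flags: 0x60200400000020 nid: 10.8.26.28@o2ib6 remote: 0xff0b1f607b1120a1 expref: 3155 pid: 97646 timeout: 4692007 lvb_type: 0",
    "Aug 11 19:29:47 fir-md1-s1 kernel: LustreError: 23597:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1565576897, 90s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0000_UUID lock: ffff8f3483a0ec00/0x5d9ee6c5055796b8 lrc: 3/1,0 mode: --/PR res: [0x200029d02:0x1b59c:0x0].0x0 bits 0x13/0x0 rrc: 34 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 23597 timeout: 0 lvb_type: 0",
]

# Patterns inferred from the sample lines only (an assumption, not a spec).
FIELD_RE = re.compile(r"\b(ns|mode|rrc|nid|pid): (\S+)")
RES_RE = re.compile(r"res: \[([0-9a-fx:]+)\]")
ENQUEUED_RE = re.compile(r"enqueued at (\d+)")
EVICT_RE = re.compile(r"evicting client at (\S+)")


def parse_ldlm_line(line):
    """Extract the lock-related fields from one LustreError log line."""
    rec = dict(FIELD_RE.findall(line))
    rec["rrc"] = int(rec["rrc"])  # resource reference count

    res = RES_RE.search(line)
    rec["fid"] = res.group(1) if res else None

    evicted = EVICT_RE.search(line)
    rec["evicted"] = evicted.group(1) if evicted else None

    # "enqueued at" is a Unix epoch; convert it to an aware UTC datetime.
    enq = ENQUEUED_RE.search(line)
    rec["enqueued_utc"] = (
        datetime.fromtimestamp(int(enq.group(1)), tz=timezone.utc) if enq else None
    )
    return rec


if __name__ == "__main__":
    for line in SAMPLE:
        rec = parse_ldlm_line(line)
        print(rec["fid"], rec["mode"], rec["nid"], rec["rrc"], rec["evicted"])
```

Run against the full journal output, the same function could be used to list every distinct `nid` and `evicted` value seen for the FID, which narrows down the clients worth inspecting.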
      

      My guess is that a client is still holding a lock on it. Is there a way to find out which client, given the FID?

      Thanks!
      Stephane

      Attachments

        1. dlmtrace_ls.log.gz
          10.12 MB
        2. dlmtrace.big.log.gz
          59.58 MB
        3. dlmtrace.log.gz
          4.36 MB
        4. fir-md1-s1.full.log
          9.34 MB

          People

            Assignee: Peter Jones (pjones)
            Reporter: Stephane Thiell (sthiell)