Lustre / LU-4607

OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version/s: Lustre 2.4.1
    • Severity: 3

    Description

      These messages appear every few hours on the OSS nodes:

      oss6 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 192.168.224.14@o2ib ns: filter-scratch-OST000b_UUID lock: ffff8804a321f000/0xaa2e9b983dbd2233 lrc: 3/0,0 mode: PW/PW res: [0x4a3a76:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 nid: 192.168.224.14@o2ib remote: 0x55c758a593d4fc6 expref: 24 pid: 12551 timeout: 5161946610 lvb_type: 0

      On the client:

      pod24b14 kernel: : LustreError: 11-0: scratch-OST000b-osc-ffff880312dff800: Communicating with 192.168.254.36@o2ib, operation obd_ping failed with -107.
      pod24b14 kernel: : Lustre: scratch-OST000b-osc-ffff880312dff800: Connection to scratch-OST000b (at 192.168.254.36@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Feb 3 12:21:45 pod24b14 kernel: : Lustre: Skipped 1 previous
      pod24b14 kernel: : LustreError: 167-0: scratch-OST000b-osc-ffff880312dff800: This client was evicted by scratch-OST000b; in progress operations using this service will fail.
      pod24b14 kernel: : Lustre: 6039:0:(llite_lib.c:2506:ll_dirty_page_discard_warn()) scratch: dirty page discard: 192.168.254.41@o2ib:192.168.254.42@o2ib:/scratch/fid: [0x2000020a6:0x130dc:0x0]/ may get corrupted (rc -108)
      pod24b14 kernel: : LustreError: 16480:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 0 of inode ffff880476e1a638 failed -108
      pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) lock@ffff8806063297b8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000b0000:0x4a3a76:0x0] {
      pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) lovsub@ffff8804c673e620: [0 ffff880471ca03a0 W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0]]
      pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) osc@ffff8804901eaf00: ffff8805596246c0 0x20040000001 0x55c758a593d4fc6 3 ffff88044b08cc70 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
      pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) } lock@ffff8806063297b8
      pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) dlmlock returned -5
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1420:cl_unuse_try()) result = -5, this is unlikely!
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) lock@ffff880606329978[1 0 0 1 0 00000000] W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0] {
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) vvp@ffff8805f0daa678:
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) lov@ffff880471ca03a0: 1
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) 0 0: ---
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) } lock@ffff880606329978
      pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) unuse return -5
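
      For reference, the 117s expiry reported by the OSS can be compared against the node's timeout tuning. The following is only a minimal sketch of standard lctl queries (the namespace name is taken from the eviction message above; everything else assumes a stock Lustre 2.4 setup):

      # On the OSS: RPC timeout and adaptive-timeout bounds that govern lock callback expiry
      lctl get_param timeout at_min at_max
      # Lock count in the namespace named in the eviction message
      lctl get_param ldlm.namespaces.filter-scratch-OST000b_UUID.lock_count
      # Recent expiry/eviction messages in the kernel log
      dmesg | grep "lock callback timer expired"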

      Attachments

        1. do_IRQ-errors.txt (3 kB)
        2. kern.log.1 (139 kB)
        3. lustre-log.partaa (0.2 kB)
        4. lustre-log.partab (0.2 kB)


          People

            Assignee: cliffw Cliff White (Inactive)
            Reporter: orentas Oz Rentas
            Votes: 0
            Watchers: 5
