Lustre: LU-6898

ldlm_resource_dump()) Granted locks (in reverse order)

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.5.3
    • None
    • Server: CentOS 6, Lustre 2.5.3, MOFED 2.4
      Client: SLES 11, Lustre 2.5.3, MOFED 3.0
    • 2
    • 9223372036854775807

    Description

      On the client we see the following errors:

      [1437673998.848774] LustreError: 11-0: nbp8-MDT0000-mdc-ffff8806cb247400: Communicating with 10.151.27.60@o2ib, operation obd_ping failed with -107.
      [1437673998.860774] Lustre: nbp8-MDT0000-mdc-ffff8806cb247400: Connection to nbp8-MDT0000 (at 10.151.27.60@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [1437673998.880773] LustreError: 167-0: nbp8-MDT0000-mdc-ffff8806cb247400: This client was evicted by nbp8-MDT0000; in progress operations using this service will fail.
      [1437673998.916773] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) nbp8-MDT0000-mdc-ffff8806cb247400: namespace resource [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [1437673998.940772] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) Skipped 2587 previous similar messages
      [1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount = 2
      [1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
      [1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: nbp8-MDT0000-mdc-ffff8806cb247400 lock: ffff8fc0fa26dbc0/0x7f099458bb92bf52 lrc: 2/0,0 mode: PR/PR res: [0x360375393:0xe66d:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x12e400000000 nid: local remote: 0x551d423294fa4bce expref: -99 pid: 46426 timeout: 0 lvb_type: 3
      [1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) Skipped 3648 previous similar messages
      [1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x3603755cc:0x6454:0x0].0 (ffff8b075a9a8bc0) refcount = 2
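      With thousands of similar lock lines skipped, it may help to pull the fields out of the dumped lock lines so they can be grouped or compared. The following is a minimal sketch, not a Lustre tool: `parse_lock_line` is a hypothetical helper, and it assumes the `key: value` layout seen in the line above (with `bits` printed without a colon).

```python
import re

# One lock line copied from the client log above (timestamp prefix trimmed).
LINE = ("### ### ns: nbp8-MDT0000-mdc-ffff8806cb247400 "
        "lock: ffff8fc0fa26dbc0/0x7f099458bb92bf52 lrc: 2/0,0 mode: PR/PR "
        "res: [0x360375393:0xe66d:0x0].0 bits 0x1b rrc: 2 type: IBT "
        "flags: 0x12e400000000 nid: local remote: 0x551d423294fa4bce "
        "expref: -99 pid: 46426 timeout: 0 lvb_type: 3")


def parse_lock_line(line):
    """Return a dict of the 'key: value' fields in a dumped lock line.

    Hypothetical helper; the field layout is assumed from the log above.
    """
    # Values are kept as strings exactly as they appear in the log.
    fields = dict(re.findall(r"(\w+): (\S+)", line))
    # 'bits' has no colon in the log output, so match it separately
    # and store it as an integer.
    m = re.search(r"\bbits (0x[0-9a-fA-F]+)", line)
    if m:
        fields["bits"] = int(m.group(1), 16)
    return fields


fields = parse_lock_line(LINE)
print(fields["mode"], fields["res"], hex(fields["bits"]), fields["pid"])
```

      Keeping every value except `bits` as a string avoids guessing at how each field should be interpreted.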
      

      On the server:

      Jul 23 10:53:08 nbp8-mds1 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 226s: evicting client at 10.151.63.50@o2ib  ns: mdt-nbp8-MDT0000_UUID lock: ffff882a2c6794c0/0x551d4232f4dfbb5e lrc: 3/0,0 mode: PR/PR res: [0x4976d01:0xe2d4a4f3:0x0].0 bits 0x13 rrc: 848 type: IBT flags: 0x60200000000020 nid: 10.151.63.50@o2ib remote: 0x7f099458bb9c07c4 expref: 9 pid: 9672 timeout: 8029699438 lvb_type: 0
      Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Aborting message for 12345-10.151.12.174@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
      Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Skipped 41 previous similar messages
      Jul 23 10:53:39 nbp8-mds1 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline
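      The client log is stamped in epoch seconds while the server log uses syslog wall-clock time, so lining the two up takes a conversion. The sketch below converts the client's eviction timestamp to UTC. Everything about the server side is an assumption: the report does not state the server's time zone (UTC-7 is assumed here), and syslog omits the year (2015 is inferred from the client timestamp).

```python
from datetime import datetime, timedelta, timezone


def epoch_to_utc(stamp):
    """Convert a 'seconds.micros' log timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(float(stamp), tz=timezone.utc)


# Client line: "This client was evicted by nbp8-MDT0000"
eviction_seen = epoch_to_utc("1437673998.880773")
print(eviction_seen.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-07-23 17:53:18

# ASSUMPTION: server clock is UTC-7; the report does not say.
ASSUMED_SERVER_TZ = timezone(timedelta(hours=-7))
# Server line: "Jul 23 10:53:08 ... lock callback timer expired after 226s"
server_evict = datetime(2015, 7, 23, 10, 53, 8, tzinfo=ASSUMED_SERVER_TZ)

# Seconds between the server evicting and the client noticing,
# valid only if the assumed offset is right.
delta = (eviction_seen - server_evict).total_seconds()
print(round(delta, 1))
```

      If the assumed offset is wrong, the computed gap is off by a whole number of hours, which should be easy to spot.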
      

      On the client, all ldlm_bl threads are stuck. One example:

      0xffff8a0739b06080    21185        2  1  711   R  0xffff8a0739b066f0  ldlm_bl_110
       [<ffffffff814760e8>] _raw_spin_unlock_irqrestore+0x8/0x10
       [<ffffffffa0d7a807>] osc_page_delete+0xe7/0x360 [osc]
       [<ffffffffa0ad14d5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
       [<ffffffffa0ad192a>] cl_page_delete+0x3a/0x120 [obdclass]
       [<ffffffffa0ee16a6>] ll_invalidatepage+0x96/0x160 [lustre]
       [<ffffffffa0ef314d>] vvp_page_discard+0x8d/0x120 [lustre]
       [<ffffffffa0acda58>] cl_page_invoid+0x78/0x170 [obdclass]
       [<ffffffffa0ad490c>] discard_cb+0xbc/0x1e0 [obdclass]
       [<ffffffffa0ad2467>] cl_page_gang_lookup+0x1f7/0x3f0 [obdclass]
       [<ffffffffa0ad471a>] cl_lock_discard_pages+0xfa/0x1d0 [obdclass]
       [<ffffffffa0d7c0d2>] osc_lock_flush+0xf2/0x260 [osc]
       [<ffffffffa0d7c339>] osc_lock_cancel+0xf9/0x1e0 [osc]
       [<ffffffffa0ad2bd5>] cl_lock_cancel0+0x65/0x150 [obdclass]
       [<ffffffffa0ad394b>] cl_lock_cancel+0x14b/0x150 [obdclass]
       [<ffffffffa0d7cc1d>] osc_lock_blocking+0x5d/0xf0 [osc]
       [<ffffffffa0d7dff9>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
       [<ffffffffa0d7e15c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
       [<ffffffffa0be4eef>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
       [<ffffffffa0bf380f>] ldlm_cli_cancel_local+0x7f/0x480 [ptlrpc]
       [<ffffffffa0bf6b82>] ldlm_cli_cancel_list_local+0xf2/0x290 [ptlrpc]
       [<ffffffffa0bfba07>] ldlm_bl_thread_main+0xf7/0x450 [ptlrpc]
       [<ffffffff81083ae6>] kthread+0x96/0xa0
       [<ffffffff8147f164>] kernel_thread_helper+0x4/0x10
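      To check whether the other ldlm_bl threads share this stack, the frames can be reduced to (function, module) pairs and compared across threads. This is a rough sketch under stated assumptions: `parse_frames` and `modules_in_order` are hypothetical helpers, the frame format is assumed from the trace above, and `TRACE` is an abbreviated excerpt of it rather than the full stack.

```python
import re

# Abbreviated excerpt of the ldlm_bl_110 trace above (innermost frame first).
TRACE = """\
 [<ffffffffa0d7a807>] osc_page_delete+0xe7/0x360 [osc]
 [<ffffffffa0ad14d5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
 [<ffffffffa0ee16a6>] ll_invalidatepage+0x96/0x160 [lustre]
 [<ffffffffa0d7c339>] osc_lock_cancel+0xf9/0x1e0 [osc]
 [<ffffffffa0bfba07>] ldlm_bl_thread_main+0xf7/0x450 [ptlrpc]
 [<ffffffff81083ae6>] kthread+0x96/0xa0
"""

# address, function+offset/size, optional [module]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def parse_frames(text):
    """Yield (function, module) pairs; frames without a module tag are
    labeled 'kernel'."""
    for line in text.splitlines():
        m = FRAME.search(line)
        if m:
            yield m.group(1), m.group(2) or "kernel"


def modules_in_order(frames):
    """Return the distinct modules in the order they first appear."""
    return list(dict.fromkeys(module for _, module in frames))


frames = list(parse_frames(TRACE))
print(frames[0])
print(modules_in_order(frames))
```

      Dropping addresses and offsets makes stacks from different threads directly comparable as lists of function names.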
      

      When this happens, I/O on the MDS stops for a few minutes.

      Attachments

        1. btall.gz
          353 kB
          Mahmoud Hanafi
        2. debug.out.mofed.withpatch.1438631502.bz2
          0.3 kB
          Mahmoud Hanafi
        3. debug.out.withpatch.mofed.secondrun.1438631826.bz2
          0.3 kB
          Mahmoud Hanafi
        4. debug.out.withpatch.ofed3.5.2.1438633632.bz2
          0.3 kB
          Mahmoud Hanafi
        5. dmesg.out.gz
          417 kB
          Mahmoud Hanafi

          People

            jay Jinshan Xiong (Inactive)
            mhanafi Mahmoud Hanafi