Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.3
    • Labels: None
    • Environment: RHEL6 w/ patched kernel
    • Severity: 3
    • 14915

    Description

      We had to crash/dump one of our Lustre clients because of a deadlock issue in mdc_close(). PID 5231 was waiting for a lock that it already owned, and a lot of other processes were waiting for this lock as well.

      In the backtrace of the process, we can see two calls to mdc_close(). The second is due to the system reclaiming memory.

      crash> bt 5231
      PID: 5231   TASK: ffff881518308b00  CPU: 2   COMMAND: "code2"
       #0 [ffff88171cb43188] schedule at ffffffff81528a52
       #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
       #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab                  <=== Requires a new lock
       #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
       #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
       #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
       #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
       #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
       #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
       #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
      #10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
      #11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
      #12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
      #13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
      #14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
      #15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
      #16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
      #17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
      #18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
      #19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
      #20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
      #21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
      #22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
      #23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
      #24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
      #25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]   <=== PID has the lock
      #26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
      #27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
      #28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
      #29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
      #30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
      #31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
      #32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
      #33 [ffff88171cb43f10] fput at ffffffff8118ae95
      #34 [ffff88171cb43f20] filp_close at ffffffff811861bd
      #35 [ffff88171cb43f50] sys_close at ffffffff81186295
      #36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
          RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
          RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
          RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
          RBP: 0000000000000000   R8: 00002adaad0ac560   R9: 0000000000000001
          R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
          R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
          ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
      

      We have recursive locking here, which is not permitted: the outer mdc_close() takes a per-device mutex before sending its close RPC (frame #25), the reply buffer allocation then enters direct memory reclaim, shrink_icache_memory() clears a Lustre inode, and the inner mdc_close() tries to take the same mutex again (frame #2).
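
      As an illustration of the general technique (this is a sketch, not the landed patch; see the gerrit links in the comments below), the allocation on the close RPC path can be done with GFP_NOFS so the allocator is not allowed to recurse into filesystem reclaim while the mdc mutex is held. The helper name below is hypothetical, and __vmalloc() is used with the RHEL6-era three-argument signature:

      /*
       * Illustration only.  alloc_repbuf_nofs() stands in for the large
       * reply-buffer allocation on the mdc_close() RPC path.  GFP_NOFS means
       * "may sleep, but must not enter filesystem reclaim", which prevents
       * the shrink_icache_memory() -> clear_inode() -> ll_clear_inode() ->
       * mdc_close() re-entry seen in the backtrace above.
       */
      #include <linux/mm.h>
      #include <linux/gfp.h>
      #include <linux/string.h>
      #include <linux/vmalloc.h>

      static void *alloc_repbuf_nofs(unsigned long size)
      {
              void *buf = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);

              if (buf != NULL)
                      memset(buf, 0, size);   /* or pass __GFP_ZERO instead */

              return buf;
      }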

      Attachments

        Activity

          [LU-5349] Deadlock in mdc_close()
          pjones Peter Jones added a comment -

          Landed for 2.5.4 and 2.7

          green Oleg Drokin added a comment -

          Also, __GFP_ZERO is not really needed in the b2_5 patch because we explicitly zero the allocation with memset() afterwards anyway.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Nice catch, I forgot to mention why! It is because b2_5 uses vmalloc() where master uses vzalloc(), and I wanted to challenge my future reviewers about this ...

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Bruno, why does the b2_5 version lack the __GFP_ZERO flag in the call to __vmalloc() (__OBD_VMALLOC_VERBOSE macro)?

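          For context, a small sketch of the zeroing point discussed above (not the actual patch hunks): vzalloc() zeroes internally, __vmalloc() can be asked to zero via __GFP_ZERO, and a plain __vmalloc() needs an explicit memset(); that is why __GFP_ZERO is redundant whenever the memset() is kept.

          #include <linux/mm.h>
          #include <linux/gfp.h>
          #include <linux/string.h>
          #include <linux/vmalloc.h>

          /* Three ways to obtain a zeroed vmalloc'd buffer. */
          static void *zeroed_a(unsigned long size)
          {
                  return vzalloc(size);           /* zeroes internally (uses GFP_KERNEL) */
          }

          static void *zeroed_b(unsigned long size)
          {
                  return __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
          }

          static void *zeroed_c(unsigned long size)
          {
                  void *buf = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);

                  if (buf != NULL)
                          memset(buf, 0, size);   /* explicit zeroing */
                  return buf;
          }
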
          bfaccini Bruno Faccini (Inactive) added a comment -

          Master patch http://review.whamcloud.com/11190 has landed. b2_5 version is now at http://review.whamcloud.com/11739.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Patch http://review.whamcloud.com/#/c/11183/ has been abandoned.

          bruno.travouillon Bruno Travouillon (Inactive) added a comment - - edited

          There are several Lustre filesystems mounted on this client:

          • a Lustre 2.4 filesystem on the same LNET with 1 MDT and 480 OSTs. We do not use wide striping.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 224 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 56 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 48 OSTs.

          This Lustre client is a login node, with many users working interactively.

          You can find the outputs of sar -B and sar -R in the attached file.

          Hope this helps.

          People

            Assignee:
            bfaccini Bruno Faccini (Inactive)
            Reporter:
            bruno.travouillon Bruno Travouillon (Inactive)
            Votes:
            0
            Watchers:
            9
