Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.3
    • Labels: None
    • Environment: RHEL6 w/ patched kernel
    • Severity: 3
    • 14915

    Description

      We had to crash/dump one of our Lustre clients because of a deadlock in mdc_close(). PID 5231 was waiting for a mutex that it already owned, and many other processes were also waiting for this lock.

      In the backtrace of the process we can see two calls to mdc_close(): the original one from the close() system call, and a second one entered while the system was reclaiming memory.

      crash> bt 5231
      PID: 5231   TASK: ffff881518308b00  CPU: 2   COMMAND: "code2"
       #0 [ffff88171cb43188] schedule at ffffffff81528a52
       #1 [ffff88171cb43250] __mutex_lock_slowpath at ffffffff8152a20e
       #2 [ffff88171cb432c0] mutex_lock at ffffffff8152a0ab                  <=== Requires a new lock
       #3 [ffff88171cb432e0] mdc_close at ffffffffa09176db [mdc]
       #4 [ffff88171cb43330] lmv_close at ffffffffa0b9bcb8 [lmv]
       #5 [ffff88171cb43380] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
       #6 [ffff88171cb43400] ll_md_real_close at ffffffffa0a81afa [lustre]
       #7 [ffff88171cb43430] ll_clear_inode at ffffffffa0a92dee [lustre]
       #8 [ffff88171cb43470] clear_inode at ffffffff811a626c
       #9 [ffff88171cb43490] dispose_list at ffffffff811a6340
      #10 [ffff88171cb434d0] shrink_icache_memory at ffffffff811a6694
      #11 [ffff88171cb43530] shrink_slab at ffffffff81138b7a
      #12 [ffff88171cb43590] zone_reclaim at ffffffff8113b77e
      #13 [ffff88171cb436b0] get_page_from_freelist at ffffffff8112d8dc
      #14 [ffff88171cb437e0] __alloc_pages_nodemask at ffffffff8112f443
      #15 [ffff88171cb43920] alloc_pages_current at ffffffff811680ca
      #16 [ffff88171cb43950] __vmalloc_area_node at ffffffff81159696
      #17 [ffff88171cb439b0] __vmalloc_node at ffffffff8115953d
      #18 [ffff88171cb43a10] vmalloc at ffffffff8115985c
      #19 [ffff88171cb43a20] cfs_alloc_large at ffffffffa03b4b1e [libcfs]
      #20 [ffff88171cb43a30] null_alloc_repbuf at ffffffffa06c4961 [ptlrpc]
      #21 [ffff88171cb43a60] sptlrpc_cli_alloc_repbuf at ffffffffa06b2355 [ptlrpc]
      #22 [ffff88171cb43a90] ptl_send_rpc at ffffffffa068432c [ptlrpc]
      #23 [ffff88171cb43b50] ptlrpc_send_new_req at ffffffffa067879b [ptlrpc]
      #24 [ffff88171cb43bc0] ptlrpc_set_wait at ffffffffa067ddb6 [ptlrpc]
      #25 [ffff88171cb43c60] ptlrpc_queue_wait at ffffffffa067e0df [ptlrpc]   <=== PID has the lock
      #26 [ffff88171cb43c80] mdc_close at ffffffffa0917714 [mdc]
      #27 [ffff88171cb43cd0] lmv_close at ffffffffa0b9bcb8 [lmv]
      #28 [ffff88171cb43d20] ll_close_inode_openhandle at ffffffffa0a80c1e [lustre]
      #29 [ffff88171cb43da0] ll_md_real_close at ffffffffa0a81afa [lustre]
      #30 [ffff88171cb43dd0] ll_md_close at ffffffffa0a81d8a [lustre]
      #31 [ffff88171cb43e80] ll_file_release at ffffffffa0a8233b [lustre]
      #32 [ffff88171cb43ec0] __fput at ffffffff8118ad55
      #33 [ffff88171cb43f10] fput at ffffffff8118ae95
      #34 [ffff88171cb43f20] filp_close at ffffffff811861bd
      #35 [ffff88171cb43f50] sys_close at ffffffff81186295
      #36 [ffff88171cb43f80] system_call_fastpath at ffffffff8100b072
          RIP: 00002adaacdf26d0  RSP: 00007fff9665e238  RFLAGS: 00010246
          RAX: 0000000000000003  RBX: ffffffff8100b072  RCX: 0000000000002261
          RDX: 00000000044a24b0  RSI: 0000000000000001  RDI: 0000000000000005
          RBP: 0000000000000000   R8: 00002adaad0ac560   R9: 0000000000000001
          R10: 00000000000004fd  R11: 0000000000000246  R12: 00000000000004fc
          R13: 00000000ffffffff  R14: 00000000044a23d0  R15: 00000000ffffffff
          ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
      

      This is recursive locking on the same mutex, which is not permitted.
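
      Reduced to its essentials, the pattern looks like the sketch below (placeholder names and illustrative gfp flags; not the actual mdc code):

      #include <linux/mutex.h>
      #include <linux/vmalloc.h>
      #include <linux/gfp.h>

      /* Sketch only: close_rpc_mutex and repsize are placeholders. */
      static DEFINE_MUTEX(close_rpc_mutex);

      static void *close_repbuf_alloc(unsigned long repsize)
      {
              void *repbuf;

              /* mdc_close() serializes close RPCs on a mutex: the lock
               * already held at frame #25 in the backtrace. */
              mutex_lock(&close_rpc_mutex);

              /* The reply buffer is then allocated with filesystem reclaim
               * allowed (__GFP_FS set).  Under memory pressure the allocator
               * enters shrink_icache_memory() -> clear_inode(), which
               * re-enters mdc_close() and blocks on the same mutex
               * (frame #2): a self-deadlock.
               *
               * The fix direction discussed below is to clear __GFP_FS for
               * this allocation, e.g.:
               *      __vmalloc(repsize, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
               *                PAGE_KERNEL);
               */
              repbuf = __vmalloc(repsize, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);

              mutex_unlock(&close_rpc_mutex);
              return repbuf;
      }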

      Attachments

        Activity

          [LU-5349] Deadlock in mdc_close()
          pjones Peter Jones added a comment -

          Landed for 2.5.4 and 2.7

          green Oleg Drokin added a comment -

          Also, __GFP_ZERO is not really needed in the b2_5 patch because we explicitly zero the allocation with memset() afterwards anyway.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          I forgot to mention why, nice catch! It is because b2_5 uses vmalloc() whereas master uses vzalloc(), and I wanted to challenge my future reviewers about this ...

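          For context, a minimal sketch of the two variants being compared (illustrative only; the helper names and gfp flags are placeholders, not the actual __OBD_VMALLOC_VERBOSE code):

          #include <linux/vmalloc.h>
          #include <linux/string.h>
          #include <linux/gfp.h>

          /* master: vzalloc() already returns zero-filled memory. */
          static void *alloc_zeroed_master(unsigned long size)
          {
                  return vzalloc(size);
          }

          /* b2_5: plain __vmalloc() followed by an explicit memset() yields
           * the same zero-filled buffer, so also passing __GFP_ZERO would
           * be redundant. */
          static void *alloc_zeroed_b2_5(unsigned long size)
          {
                  void *buf = __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);

                  if (buf != NULL)
                          memset(buf, 0, size);
                  return buf;
          }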

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Bruno, why does the b2_5 version lack the __GFP_ZERO flag in the call to __vmalloc() (__OBD_VMALLOC_VERBOSE macro)?

          bfaccini Bruno Faccini (Inactive) added a comment -

          Master patch http://review.whamcloud.com/11190 has landed. The b2_5 version is now at http://review.whamcloud.com/11739.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Patch http://review.whamcloud.com/#/c/11183/ has been abandoned.

          bruno.travouillon Bruno Travouillon (Inactive) added a comment - - edited

          There are several Lustre filesystems mounted on this client:

          • a Lustre 2.4 filesystem on the same LNET with 1 MDT and 480 OSTs. We do not use wide striping.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 224 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 56 OSTs.
          • a Lustre 2.1 filesystem on another LNET with 1 MDT and 48 OSTs.

          This Lustre client is a login node, with many users working interactively.

          You can find in the attached file the outputs of sar -B and sar -R.

          Hope this helps.


          adilger Andreas Dilger added a comment -

          The other question here is how many OSTs are in this filesystem, and whether you are using wide striping. I'm trying to figure out why this was using vmalloc() instead of kmalloc(), and whether there is a separate bug to be addressed to reduce the allocation size.

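          For a rough sense of scale, a back-of-the-envelope illustration (it assumes the close reply buffer is sized for the filesystem-wide maximum LOV EA, which grows with the OST count; exact thresholds and additional reply fields vary by version):

          /* Illustration only: 32-byte lov_mds_md_v1 header plus 24 bytes
           * per lov_ost_data_v1 entry. */
          #include <stdio.h>

          int main(void)
          {
                  unsigned int osts = 480;                  /* OST count reported above */
                  unsigned int max_easize = 32 + osts * 24; /* 11552 bytes, ~11.3 KiB */

                  /* The reply also carries the mdt body and other fields,
                   * which can push the buffer past the point where the
                   * large-allocation helper falls back from kmalloc to
                   * vmalloc (cfs_alloc_large() in the backtrace above). */
                  printf("maximum LOV EA for %u OSTs: %u bytes\n", osts, max_easize);
                  return 0;
          }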

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hmm, too bad: __vmalloc_node() is not exported by the kernel. So I am stuck if I also want to fix cfs_cpt_vzalloc() for the fact that vzalloc_node() uses __GFP_FS by default, while still forwarding a NUMA node specification, since I think only __vmalloc_node() would allow both.
          So what should I do in cfs_cpt_vzalloc(): call __vmalloc() and forget about the specified node (which may imply performance issues for the NUMA-aware ptlrpcds that use it), or leave it as is until the kernel exports a suitable entry point, and assume that for now no cfs_cpt_vzalloc() call occurs during any filesystem operation?

          I just pushed patch-set #2 of http://review.whamcloud.com/11190 assuming this last case.

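          A minimal sketch of the trade-off described above (the function name and flags are placeholders, not the actual libcfs code):

          #include <linux/vmalloc.h>
          #include <linux/gfp.h>

          static void *cpt_vzalloc_sketch(unsigned long size, int node)
          {
                  /* Option 1: keep the NUMA node hint, but inherit
                   * vzalloc_node()'s default flags, which include __GFP_FS
                   * (the behaviour this ticket wants to avoid). */
                  return vzalloc_node(size, node);

                  /* Option 2: control the gfp flags (clear __GFP_FS) by
                   * calling __vmalloc() directly, at the cost of losing the
                   * node hint, since __vmalloc_node() is not exported:
                   *
                   *      return __vmalloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
                   *                       PAGE_KERNEL);
                   */
          }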

          bfaccini Bruno Faccini (Inactive) added a comment -

          I just pushed http://review.whamcloud.com/11190 so that vmalloc[_node]()-based allocations no longer use __GFP_FS by default. I also found that this will enable a real NUMA node parameter setting!


          People

            Assignee:
            bfaccini Bruno Faccini (Inactive)
            Reporter:
            bruno.travouillon Bruno Travouillon (Inactive)
            Votes:
            0
            Watchers:
            9
