Lustre / LU-10235

mkdir should check for directory existence on client before taking write lock


Details


    Description

      A "common" hang we have on our filesystems is when some clients experience temporary IB problems and other clients issue an occasional `mkdir -p` on a full tree.

      Taking `/mnt/lustre/path/to/dir` as an example: basically all the active clients have a CR or PR lock on `/mnt/lustre`, but the client issuing `mkdir -p /mnt/lustre/path/to/dir` will try to get a CW lock on each intermediate directory, which makes the server recall all PR locks on heavily accessed base directories.
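
      For reference, here is a rough userspace sketch of what `mkdir -p` effectively does (an illustration only, not the actual coreutils implementation): one mkdir() call per path component, treating EEXIST as success. As described above, on Lustre each of those calls results in a create intent and a CW lock request for the intermediate directory, even when it already exists.

      #include <errno.h>
      #include <limits.h>
      #include <string.h>
      #include <sys/stat.h>

      /* Illustration of the mkdir -p walk: one mkdir() per component,
       * ignoring EEXIST.  Every component of an already existing tree
       * still results in a create attempt. */
      static int mkdir_p(const char *path, mode_t mode)
      {
              char buf[PATH_MAX];
              char *p;

              if (strlen(path) >= sizeof(buf)) {
                      errno = ENAMETOOLONG;
                      return -1;
              }
              strcpy(buf, path);

              /* Intermediate components. */
              for (p = buf + 1; *p != '\0'; p++) {
                      if (*p != '/')
                              continue;
                      *p = '\0';
                      if (mkdir(buf, mode) != 0 && errno != EEXIST)
                              return -1;
                      *p = '/';
              }

              /* Final component. */
              if (mkdir(buf, mode) != 0 && errno != EEXIST)
                      return -1;
              return 0;
      }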

      Under bad weather, that recall can take time (as long as the ldlm timeout), during which clients that did give the lock back will try to re-establish it and get blocked as well until that CW is granted and the no-op mkdir is done. This in turn can starve all the threads on the MDS when many clients are involved, and render the MDS unresponsive for a while even after the IB issues are resolved.

      I think we would be much less likely to experience such hangs if mkdir() would, on the client side, first check with a read lock whether the child directory exists before upgrading to a write lock and sending the request to the server. If the directory already exists we can safely return EEXIST without sending the mkdir to the server; in most other cases we will need to upgrade the lock and send the request as we currently do.
      This incurs a small overhead for the actual mkdir case, but I believe it should be well worth it for large clusters, and in the common case a lock would already be held (e.g. when doing multiple operations on a directory in a row).
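
      To make the ordering concrete, here is a minimal sketch of the proposed check-before-create flow, written as a userspace analogy since I don't have a patch for the llite path: stat() stands in for the lookup that could be satisfied under a CR/PR lock the client may already hold, mkdir() stands in for the current create-intent path that needs the CW lock and an MDS round trip, and mkdir_check_first() is a hypothetical name.

      #include <errno.h>
      #include <sys/stat.h>

      /* Hypothetical check-before-create flow (userspace analogy only). */
      static int mkdir_check_first(const char *path, mode_t mode)
      {
              struct stat st;

              /* Read-side check: if the entry already exists, return the
               * same EEXIST mkdir() would, without ever taking the write
               * path. */
              if (stat(path, &st) == 0) {
                      errno = EEXIST;
                      return -1;
              }

              /* Entry not found (or lookup failed): fall back to the
               * write path as we do today and let the server decide. */
              return mkdir(path, mode);
      }

      The only extra cost is the additional lookup when the directory really is missing, which is the small overhead mentioned above; in the common no-op case a read lock is already held and the check can be answered locally.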

       

      On the last occurrence here we had 324 mdt0[0-3]_[0-9]* threads stuck on this trace:

      #0 schedule
      #1 schedule_timeout
      #2 ldlm_completion_ast
      #3 ldlm_cli_enqueue_local
      #4 mdt_object_lock0
      #5 mdt_object_lock
      #6 mdt_getattr_name_lock
      #7 mdt_intent_getattr
      #8 mdt_intent_policy
      #9 ldlm_lock_enqueue
      #10 ldlm_handle_enqueue0
      #11 tgt_enqueue
      #12 tgt_request_handle
      #13 ptlrpc_main
      #14 kthread
      #15 kernel_thread
      
      

      And 4 threads on this:

      #0 schedule
      #1 ldlm_completion_ast
      #2 ldlm_cli_enqueue_local
      #3 mdt_object_lock0
      #4 mdt_object_lock
      #5 mdt_object_find_lock
      #6 mdt_reint_create
      #7 mdt_reint_rec
      #8 mdt_reint_internal
      #9 mdt_reint
      #10 tgt_request_handle
      #11 ptlrpc_main
      #12 kthread
      #13 kernel_thread
      
      

      A closer examination lets us find a struct ldlm_resource in these threads, which points to a single directory (actually two in this case: we have a directory common to all users just below the root here, so it has just as much contention as the filesystem root itself).

      The ldlm_resource also has lists (lr_granted, lr_waiting) which show a few granted CR locks (not revoked) and a handful of granted PR locks that aren't being given back, plus many waiting PR locks and a handful of CW locks waiting for the last granted PR to go away.

       

      I'm afraid I do not have exact traces of this anymore, as it was on a live system and we didn't crash the MDT (and it was on a black site anyway), but if there are other things to look at I'm sure it will happen again eventually, so please ask.

       

      Thanks,

      Dominique Martinet

      People

        Assignee: Nathaniel Clark (utopiabound)
        Reporter: CEA (cealustre)