
mdt_intent_fixup_resent() cannot find the proper lock in hash

Details


    Description

      If the successful reply to an intent lock request is lost, the MDS will not recover correctly when the client resends the request.

      The cause appears to lie in ldlm_handle_enqueue0() and mdt_intent_fixup_resent():

       int ldlm_handle_enqueue0(struct ldlm_namespace *ns,
                               struct ptlrpc_request *req,
                               const struct ldlm_request *dlm_req,
                               const struct ldlm_callback_suite *cbs)
      {
      ...
              /* The lock's callback data might be set in the policy function */
              lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                      dlm_req->lock_desc.l_resource.lr_type,
                                      dlm_req->lock_desc.l_req_mode,
                                      cbs, NULL, 0);
      ...
              lock->l_export = class_export_lock_get(req->rq_export, lock);
              if (lock->l_export->exp_lock_hash) {
                      cfs_hash_add(lock->l_export->exp_lock_hash,
                                   &lock->l_remote_handle,
                                   &lock->l_exp_hash);
              }
      ...
              err = ldlm_lock_enqueue(ns, &lock, cookie, &flags);
      ...
      }
      
      static void mdt_intent_fixup_resent(struct mdt_thread_info *info,
                                          struct ldlm_lock *new_lock,
                                          struct ldlm_lock **old_lock,
                                          struct mdt_lock_handle *lh)
      {
              struct ptlrpc_request  *req = mdt_info_req(info);
              struct obd_export      *exp = req->rq_export;
              struct lustre_handle    remote_hdl;
              struct ldlm_request    *dlmreq;
              struct ldlm_lock       *lock;
      
              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                      return;
      
              dlmreq = req_capsule_client_get(info->mti_pill, &RMF_DLM_REQ);
              remote_hdl = dlmreq->lock_handle[0];
      
              lock = cfs_hash_lookup(exp->exp_lock_hash, &remote_hdl);
              if (lock) {
                      if (lock != new_lock) {
      ...
      }
      

      On resend, ldlm_handle_enqueue0() adds the new lock to the hash even though a granted lock with the same remote handle is already there. mdt_intent_fixup_resent() then finds the newly added lock in the hash and ignores it. This leads to an enqueue request on the newly created lock, a deadlock, and client eviction.

      Alexey believes the problem has existed since we moved away from the original, correct code:

      static void fixup_handle_for_resent_req(struct ptlrpc_request *req, int offset,
                                              struct ldlm_lock *new_lock,
                                              struct ldlm_lock **old_lock,
                                              struct lustre_handle *lockh)
      {
              struct obd_export *exp = req->rq_export;
              struct ldlm_request *dlmreq =
                      lustre_msg_buf(req->rq_reqmsg, offset, sizeof(*dlmreq));
              struct lustre_handle remote_hdl = dlmreq->lock_handle[0];
              struct list_head *iter;
      
              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                      return;
      
              spin_lock(&exp->exp_ldlm_data.led_lock);
              list_for_each(iter, &exp->exp_ldlm_data.led_held_locks) {
                      struct ldlm_lock *lock;
                      lock = list_entry(iter, struct ldlm_lock, l_export_chain);
                      if (lock == new_lock)
                              continue; <==================== N.B.
                      if (lock->l_remote_handle.cookie == remote_hdl.cookie) {
                              lockh->cookie = lock->l_handle.h_cookie;
                              LDLM_DEBUG(lock, "restoring lock cookie");
                        DEBUG_REQ(D_DLMTRACE, req, "restoring lock cookie "LPX64,
                                  lockh->cookie);
                              if (old_lock)
                                      *old_lock = LDLM_LOCK_GET(lock);
                              spin_unlock(&exp->exp_ldlm_data.led_lock);
                              return;
                      }
              }
      ...
      }
      

      Logs for this issue will follow.

      Attachments

        Issue Links

          Activity

            pjones Peter Jones made changes -
            Link New: This issue is related to NTAP-82 [ NTAP-82 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to LDEV-38 [ LDEV-38 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-39 [ LDEV-39 ]
            yujian Jian Yu made changes -
            Link New: This issue is related to LDEV-38 [ LDEV-38 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to SGI-99 [ SGI-99 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-8 [ LDEV-8 ]

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/14056
            Subject: LU-2827 tests: add version check to recovery-small test 113
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 43d451f52f487780c9c83340574a125c3b6506fe


            bfaccini Bruno Faccini (Inactive) added a comment -
            Antoine,
            You are right, this must be made clear here as well. To be complete: the full list of post-LU-2827 related tickets/patches has been documented in detail by Oleg in his two LU-4584 comments dated 11/Sep/14.
            apercher Antoine Percher made changes -
            Link New: This issue is related to LU-5530 [ LU-5530 ]

            apercher Antoine Percher added a comment -
            The fix from LU-5530 needs to be included together with the fix from LU-2827.

            People

              yujian Jian Yu
              panda Andrew Perepechko
