Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.3
    • 3
    • 13935

    Description

      While performing load testing on one of our filesystems this week, we power cycled the OSSs to test recovery. To my surprise, it ended up taking the OSS several hours to complete recovery, and the vast majority of that time was spent in the lock replay stage.

      What I know for certain is that the OST had roughly 500,000 locks outstanding before it was power cycled. When it came back up, all the clients did properly reconnect to it, but they seem to have decided to replay all of their locks, used and unused. I thought we had fixed this years ago, so I verified that the tunables were set such that we shouldn't replay unused locks. They appeared to be set properly, yet those 500,000 locks were still resent to the OST.

      After the recovery timer dropped to zero and I didn't quickly see a recovery-complete message, I dumped some stacks from the OST. They showed that the tgt_recov thread was in stage two of recovery, sequentially replaying all of those 500,000 locks. Because this was being done sequentially from a single thread, the disk was hardly working and the system looked idle.
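
      (A rough illustration of the scale: if each sequentially replayed lock costs even ~20 ms of RPC and disk latency, 500,000 locks work out to about 10,000 seconds, i.e. close to three hours, which is consistent with the multi-hour recovery described here.)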

      This exact behavior has been reported on our production machines, and I can easily understand why an administrator might think the system was hung or deadlocked and give up on it. Basically, the recovery timer drops to zero and then recovery doesn't actually complete for several hours.

      You should be able to fairly easily reproduce this on any test system. Just ensure your server has a large number of locks enqueued and then power cycle it.

      Attachments

        Activity

          [LU-5042] Recovery Lock Replay
          pjones Peter Jones added a comment - b2_4 version: http://review.whamcloud.com/#/c/11920/
          bfaccini Bruno Faccini (Inactive) added a comment - b2_5 back-port is at http://review.whamcloud.com/11895/ .

          morrone Christopher Morrone (Inactive) added a comment - Version for 2.4/2.5?
          pjones Peter Jones added a comment - Landed for 2.7

          bfaccini Bruno Faccini (Inactive) added a comment -

          More testing of my patch found a flaw when a delayed LVB needs to be allocated, filled, and sent back to a new Client as part of a new (non-replayed) lock request that is not immediately granted ...
          So, patch-sets #4/#5 now also handle the case of new/non-replayed lock requests that can't be granted immediately but where the LVB has to be sent to the Client. I wonder why the LVB is sent back for non-granted locks at all? Wouldn't it be better/more optimized to send the LVB only for granted locks, or upon the completion AST?

          bfaccini Bruno Faccini (Inactive) added a comment -

          1st patch attempt, as testonly, is at http://review.whamcloud.com/10845.

          My local testing is OK but, as indicated in the commit message of the patch, it is still unclear to me whether we correctly handle cases where a request requiring an LVB update is received by the Server between the end of recovery/replays and before the LVB is re-filled upon granting a lock to a new Client ...

          bfaccini Bruno Faccini (Inactive) added a comment -

          Jinshan, thanks for your comments and help. I was already hesitating between doing the LVB setup/fetch in lvbo_init() or in lvbo_fill(), so ... let's give it a try now!

          jay Jinshan Xiong (Inactive) added a comment -

          _ no longer fetch/fill/send the LVB to the Client for already granted and successfully replayed locks on the Server side. Seems this has to occur mainly in ldlm_handle_enqueue0().

          I think we need to revise ldlm_resource_get() to delay the lvbo_init() call for a new resource. Instead, we can call lvbo_init() at the time when the LVB is really needed.

          _ not yet sure which flags/fields I will use to implement this, nor whether I need to focus only on Client/OST replays.

          Clients pack a flag into the replay request to indicate whether the lock is granted or blocked. See the code snippet from ldlm_lock_enqueue() below:

                  } else if (*flags & LDLM_FL_REPLAY) {
                          if (*flags & LDLM_FL_BLOCK_CONV) {
                                  ldlm_resource_add_lock(res, &res->lr_converting, lock);
                                  GOTO(out, rc = ELDLM_OK);
                          } else if (*flags & LDLM_FL_BLOCK_WAIT) {
                                  ldlm_resource_add_lock(res, &res->lr_waiting, lock);
                                  GOTO(out, rc = ELDLM_OK);
                          } else if (*flags & LDLM_FL_BLOCK_GRANTED) {
                                  ldlm_grant_lock(lock, NULL);
                                  GOTO(out, rc = ELDLM_OK);
                          }
                          /* If no flags, fall through to normal enqueue path. */
                  }
          

          For replay locks, the enqueue process can be skipped, and we can easily skip the LVB packing in ldlm_handle_enqueue0().
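
          The sketch below is a hypothetical simplification of the idea in this comment, not the actual LU-5042 patch. Assuming <linux/types.h> and the LDLM_FL_* definitions from the Lustre DLM headers, it shows how the server-side enqueue handler could decide from the replay flags alone that a lock which was already granted before the failure does not need its LVB fetched and packed into the reply:

          static bool replay_reply_needs_lvb(__u64 flags)
          {
                  /* Not a replay: normal enqueue, fill the LVB as usual. */
                  if (!(flags & LDLM_FL_REPLAY))
                          return true;

                  /* Replay of a lock that was already granted before the
                   * failure: the Client still holds valid LVB data, so the
                   * Server can skip the fetch/fill/pack step entirely. */
                  if (flags & LDLM_FL_BLOCK_GRANTED)
                          return false;

                  /* Blocked (waiting/converting) replays fall through to the
                   * normal handling. */
                  return true;
          }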

          bfaccini Bruno Faccini (Inactive) added a comment -

          Sorry, I am a bit late on this.

          Based on the earlier discussion/comments and after looking into the related source files, and since I am not fully aware of the whole replay mechanism, here is what I am planning to implement:

          _ no longer fetch/fill/send the LVB to the Client for already granted and successfully replayed locks on the Server side. Seems this has to occur mainly in ldlm_handle_enqueue0().

          _ on the Client, upon a successful reply from the Server for an already granted+replayed lock, keep going with what we already have. Seems this has to occur mainly in ldlm_cli_enqueue_fini(), but other places, like ldlm_handle_cp_callback()/replay_one_lock()/..., may also need to be investigated.

          _ not yet sure which flags/fields I will use to implement this, nor whether I need to focus only on Client/OST replays.

          _ also, what is still unclear to me is how the Server is able to detect that a replayed lock was already granted to the Client (mainly during recovery after a Server crash/reboot), and how the Server will handle the situation where the LVB content was not fetched during replay but is really needed at some later point in time (a rough sketch of this idea follows below).

          Jinshan, Brian, any comments/additions/no-go on this?
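
          A minimal sketch of the last point above, following Jinshan's earlier suggestion to defer lvbo_init() until the LVB is actually needed. This is illustrative only: it assumes the ldlm_lvbo_init() wrapper from the ldlm internals, the lr_lvb_initialized field and the helper itself are hypothetical, and the serialization a real implementation would need is omitted:

          static int ldlm_res_lvb_fill_lazy(struct ldlm_resource *res)
          {
                  int rc = 0;

                  /* First real use of the LVB after recovery: fetch it from
                   * disk now rather than during lock replay. A real version
                   * must serialize concurrent callers, since lvbo_init() can
                   * block on disk I/O. */
                  if (!res->lr_lvb_initialized) {
                          rc = ldlm_lvbo_init(res);
                          if (rc == 0)
                                  res->lr_lvb_initialized = 1;
                  }
                  return rc;
          }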

          bfaccini Bruno Faccini (Inactive) added a comment - OK, I am currently trying to implement Jinshan's idea. Will try to push a patch soon.

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: behlendorf Brian Behlendorf
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: