
After failing over Lustre MGS node to the secondary, client mount fails with -5

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0
    • 3
    • 12524

    Description

      Following are steps to reproduce the issue reliably:

      1. adjust obd_timeout from the default of 100 to 300
          lctl conf_param <fsname>.sys.timeout=300
      2. mount and umount the client
          mount -t lustre <primary MGS ip>:<secondary MGS ip>:/<fsname> /mnt/lustre
      3. fail over the MGS node to the secondary
      4. mount the client again, using the same command as in step 2


      Then step 4 will fail with EIO.


        Activity

          [LU-4582] After failing over Lustre MGS node to the secondary, client mount fails with -5

          cliffw Cliff White (Inactive) added a comment - Very sorry about the delay, will investigate

          haasken Ryan Haasken added a comment - It has been a while since there has been any activity on this bug. Who is reviewing Cheng's patch?

          cliffw Cliff White (Inactive) added a comment - Reviewers have been assigned.

          denis_kondratenko Denis Kondratenko (Inactive) added a comment - unfortunately Cheng left Xyratex, but we still need to get this landed. Could someone review Cheng's patch?
          cheng_shao Cheng Shao (Inactive) added a comment - New patch is at http://review.whamcloud.com/#/c/9217/ .

          cheng_shao Cheng Shao (Inactive) added a comment - Master is definitely affected as well. Will abandon this patch and submit a new one against master.
          green Oleg Drokin added a comment -

          I wonder why your patch is against b2_5 and not master? Is master not affected?
          We generally prefer to land things to master first.

          cheng_shao Cheng Shao (Inactive) added a comment - Patch is up for review at http://review.whamcloud.com/#/c/9141/
          haasken Ryan Haasken added a comment -

          Cheng, have you uploaded your patch to the whamcloud gerrit review site? If so, please post a link here. Thanks.


          cheng_shao Cheng Shao (Inactive) added a comment -

          Here is the timeline of events according to the Lustre debug log. Each leading number is the offset, in seconds, from the start of the mount operation.

          +0  Client sent an MGS_CONNECT req to the primary MGS node with its timeout set to (obd_timeout/20 + adaptive_timeout), which was 20 seconds in our test case.
          +0  Client sent an LDLM_ENQUEUE req to the MGS node with rq_delay_limit set to 5 seconds. This is for sptlrpc. The send is delayed because the import is still in the connecting state.
          +5  The above req failed after the delayed send expired. This is not fatal.
          +5  Client sent another LDLM_ENQUEUE req to the MGS node with rq_delay_limit set to MGC_ENQUEUE_LIMIT, which is hard-coded to 50 seconds.
          +20 MGS_CONNECT timed out.
          +55 The second LDLM_ENQUEUE req failed after the delayed send expired. This fails the whole client mount with error -5.
          

          The problem is that after MGS_CONNECT failed against the primary MGS, the client never got a chance to connect to the secondary before the mount failed. Selecting a different MGS node is triggered by the pinger, which runs at an (obd_timeout/4) interval. Since we increased obd_timeout to 300, that interval is now 75 seconds, so the connection to the secondary cannot happen before the second LDLM_ENQUEUE req fails at +55.
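
          To make the arithmetic concrete, here is a small, self-contained illustration (plain C written for this ticket, not Lustre source; the adaptive portion of the connect timeout is approximated as 5 seconds to match the 20 seconds observed in our test) of why the second LDLM_ENQUEUE expires before the pinger can switch to the secondary MGS:

          #include <stdio.h>

          /* Hard-coded MGC enqueue delay limit described in the timeline (seconds). */
          #define MGC_ENQUEUE_LIMIT 50

          int main(void)
          {
                  int obd_timeout      = 300;                   /* raised from 100 in step 1 of the reproducer */
                  int connect_timeout  = obd_timeout / 20 + 5;  /* obd_timeout/20 + adaptive part, ~20s here   */
                  int pinger_interval  = obd_timeout / 4;       /* how often the pinger can try the failover   */
                  int enqueue_deadline = 5 + MGC_ENQUEUE_LIMIT; /* second enqueue is sent at +5s, waits 50s    */

                  printf("MGS_CONNECT to the primary times out at +%ds\n", connect_timeout);
                  printf("second LDLM_ENQUEUE gives up at          +%ds\n", enqueue_deadline);
                  printf("pinger tries the secondary MGS at        +%ds\n", pinger_interval);

                  if (enqueue_deadline < pinger_interval)
                          printf("=> the enqueue expires before failover, so the mount returns -5 (EIO)\n");

                  return 0;
          }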

          The solution we propose is to redefine MGC_ENQUEUE_LIMIT relative to obd_timeout instead of using a hard-coded value. That way the second LDLM_ENQUEUE waits long enough to go through once the connection to the secondary MGS node is established; a sketch of the idea follows.
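
          A minimal sketch of the proposed direction (illustrative only; the helper name mgc_enqueue_limit and the exact formula are hypothetical, and the real change is in the Gerrit patches linked in the comments above):

          /* Derive the MGC enqueue delay limit from obd_timeout so that it
           * always outlasts the pinger's obd_timeout/4 failover interval,
           * instead of using a fixed 50 seconds. */
          static inline int mgc_enqueue_limit(int obd_timeout)
          {
                  /* one pinger cycle + one connect attempt + a little slack */
                  return obd_timeout / 4 + obd_timeout / 20 + 10;
          }

          With obd_timeout=300 this formula would allow roughly 100 seconds, comfortably past the 75-second pinger interval, while with the default obd_timeout=100 it stays of the same order as the old 50-second limit.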


          People

            Assignee: cliffw Cliff White (Inactive)
            Reporter: cheng_shao Cheng Shao (Inactive)
            Votes: 0
            Watchers: 11
