Here is the timeline of events according to the Lustre debug log. The number at the start of each entry is the offset in seconds from the start of the mount operation.
+0 Client sent MGS_CONNECT req to primary MGS node with timeout set to (obd_timeout/20 + adaptive_timeout), which was 20 seconds in our test case.
+0 Client sent LDLM_ENQUEUE req to MGS node with rq_delay_limit set to 5 seconds. This request is for sptlrpc. The send is delayed because the import is still in the connecting state.
+5 The above req failed once its delay limit expired (see the rq_delay_limit sketch after this timeline), but this is not fatal.
+5 Client sent another LDLM_ENQUEUE req to MGS node with rq_delay_limit set to MGC_ENQUEUE_LIMIT, which is hard-coded to 50 seconds.
+20 MGS_CONNECT timed out.
+55 The second LDLM_ENQUEUE req failed once its delay limit expired. This fails the whole client mount with error -5.
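For context on rq_delay_limit: our reading is that a request which cannot be sent because its import is not yet fully connected is parked on the delayed list, and once rq_delay_limit seconds pass it is failed rather than resent. The sketch below only paraphrases that behaviour; the struct and function names are made up for illustration and are not the real ptlrpc code.

    #include <time.h>

    /* Hypothetical, simplified model of rq_delay_limit; NOT the real ptlrpc code. */
    struct fake_delayed_req {
            time_t queued_time;   /* when the req was parked on the delayed list         */
            int    delay_limit;   /* 0 = wait forever, otherwise give up after N seconds */
    };

    static int fake_delayed_req_expired(const struct fake_delayed_req *req, time_t now)
    {
            /* A limit of 5 parked at +0 expires at +5; a limit of 50 parked at +5
             * expires at +55, matching the timeline above. */
            return req->delay_limit != 0 &&
                   now >= req->queued_time + req->delay_limit;
    }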
The problem here is that, after MGS_CONNECT failed to reach the primary MGS, the client never got a chance to connect to the secondary before the mount failed. We know that selecting a different MGS node is triggered by the pinger, which runs at an (obd_timeout/4) interval. Since we increased obd_timeout to 300, that interval is now 75 seconds, so the connection to the secondary cannot happen before the second LDLM_ENQUEUE req fails at +55.
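Putting numbers on it (illustrative arithmetic only, assuming obd_timeout = 300 and roughly 5 seconds of adaptive timeout as seen in our test):

    #include <stdio.h>

    int main(void)
    {
            int obd_timeout = 300;                         /* our test setting                     */
            int adaptive    = 5;                           /* approximate adaptive-timeout portion */

            int connect_to  = obd_timeout / 20 + adaptive; /* MGS_CONNECT timeout                  */
            int enqueue_to  = 5 + 50;                      /* 2nd enqueue sent at +5, 50 s limit   */
            int pinger_ivl  = obd_timeout / 4;             /* pinger failover interval             */

            printf("MGS_CONNECT times out at      +%d s\n", connect_to);  /* +20 */
            printf("2nd LDLM_ENQUEUE gives up at  +%d s\n", enqueue_to);  /* +55 */
            printf("pinger tries secondary MGS at +%d s\n", pinger_ivl);  /* +75 */
            /* 75 > 55, so the mount fails with -5 before the secondary MGS is tried. */
            return 0;
    }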
The solution we propose is to define MGC_ENQUEUE_LIMIT relative to obd_timeout instead of as a hard-coded value. With that change, the second LDLM_ENQUEUE will wait long enough for the connection to the secondary MGS node to be established and then go through.
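A minimal sketch of what we have in mind, assuming the limit stays where MGC_ENQUEUE_LIMIT is currently defined in the mgc code; the exact formula and slack below are placeholders, not a final patch:

    /* current value: hard-coded and too short once obd_timeout is raised */
    /* #define MGC_ENQUEUE_LIMIT 50 */

    /* proposed (placeholder formula): give the pinger time to fail over to the
     * secondary MGS (it retries every obd_timeout / 4 seconds) plus some slack
     * for the resent enqueue itself. */
    #define MGC_ENQUEUE_LIMIT (obd_timeout / 4 + obd_timeout / 20 + 10)

With obd_timeout = 300 this works out to 75 + 15 + 10 = 100 seconds, comfortably past the +75 point where the pinger first tries the secondary MGS.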
Very sorry about the delay, will investigate