[LU-4582] After failing over Lustre MGS node to the secondary, client mount fails with -5 Created: 04/Feb/14 Updated: 14/Mar/18 Resolved: 23/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cheng Shao (Inactive) | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 12524 | ||||||||
| Description |
|
The following steps reproduce the issue reliably:
1. Adjust obd_timeout from the default of 100 to 300:
lctl conf_param <fsname>.sys.timeout=300
2. Mount and unmount the client:
mount -t lustre <primary MGS ip>:<secondary MGS ip>:/<fsname> /mnt/lustre
3. Fail over the MGS node to the secondary.
4. Mount the client again using the same command as in step 2.
Step 4 then fails with EIO. |
| Comments |
| Comment by Cheng Shao (Inactive) [ 04/Feb/14 ] |
|
Here is the timeline of events according to the Lustre debug log. The leading number on each line is the time in seconds relative to the start of the mount operation.
+0 Client sent an MGS_CONNECT req to the primary MGS node with the timeout set to (obd_timeout/20 + adaptive_timeout), which was 20 seconds in our test case.
+0 Client sent an LDLM_ENQUEUE req to the MGS node with rq_delay_limit set to 5 seconds. This is for sptlrpc. The send is delayed because the import is still in the connecting state.
+5 The above req failed after the delayed send expired. This is not fatal.
+5 Client sent another LDLM_ENQUEUE req to the MGS node with rq_delay_limit set to MGC_ENQUEUE_LIMIT, which is hard-coded to 50 seconds.
+20 MGS_CONNECT timed out.
+55 The second LDLM_ENQUEUE req failed after the delayed send expired. This fails the whole client mount with error -5.
The problem is that, after MGS_CONNECT failed to connect to the primary MGS, the client did not get a chance to connect to the secondary before the mount failed. Selecting a different MGS node is triggered by the pinger, which runs at an interval of (obd_timeout/4). Since we increased obd_timeout to 300, the interval became 75 seconds, so the connection to the secondary does not happen before the second LDLM_ENQUEUE req fails at +55.
The solution we propose is to redefine MGC_ENQUEUE_LIMIT relative to obd_timeout instead of using a hard-coded value. The second LDLM_ENQUEUE then waits long enough to go through after the connection to the secondary MGS node is established. |
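The arithmetic behind this timeline can be checked with a small model. This is only a sketch built from the numbers quoted in the comment above (5-second first enqueue, 50-second MGC_ENQUEUE_LIMIT, pinger interval of obd_timeout/4); it is not Lustre code. The function names are hypothetical, the model assumes the secondary is tried at the first pinger interval and ignores the time the connection itself takes, and the obd_timeout/2 limit is an illustrative assumption, not the formula used by the patch.

```python
def mount_timeline(obd_timeout, enqueue_limit, first_enqueue_limit=5):
    """Model the client mount race described in the comment.

    Returns (pinger_failover_at, mount_fails_at, mount_succeeds), with
    times in seconds relative to the start of the mount.
    """
    # The pinger, which triggers selection of another MGS node,
    # runs every obd_timeout / 4 seconds (from the comment).
    pinger_failover_at = obd_timeout / 4
    # The second LDLM_ENQUEUE starts when the first one gives up and
    # then waits at most enqueue_limit seconds before failing the mount.
    mount_fails_at = first_enqueue_limit + enqueue_limit
    # Simplifying assumption: the mount succeeds if failover is
    # triggered before the second enqueue gives up.
    return pinger_failover_at, mount_fails_at, pinger_failover_at < mount_fails_at


def relative_enqueue_limit(obd_timeout):
    """Hypothetical limit tied to obd_timeout (NOT the patch's formula)."""
    return obd_timeout / 2


if __name__ == "__main__":
    # Hard-coded 50 s limit with obd_timeout raised to 300.
    print(mount_timeline(300, 50))
    # Same limit with the default obd_timeout of 100.
    print(mount_timeline(100, 50))
    # Limit scaled with obd_timeout.
    print(mount_timeline(300, relative_enqueue_limit(300)))
```

In this model the hard-coded limit only loses the race once obd_timeout/4 exceeds 55 seconds, which is consistent with the issue appearing after the timeout was raised from 100 to 300.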
| Comment by Ryan Haasken [ 05/Feb/14 ] |
|
Cheng, have you uploaded your patch to the whamcloud gerrit review site? If so, please post a link here. Thanks. |
| Comment by Cheng Shao (Inactive) [ 05/Feb/14 ] |
|
Patch is up for review at http://review.whamcloud.com/#/c/9141/ |
| Comment by Oleg Drokin [ 07/Feb/14 ] |
|
I wonder why your patch is against b2_5 and not master. Is master not affected? |
| Comment by Cheng Shao (Inactive) [ 10/Feb/14 ] |
|
Master is definitely affected as well. Will abandon this patch and submit a new one against master. |
| Comment by Cheng Shao (Inactive) [ 11/Feb/14 ] |
|
New patch is at http://review.whamcloud.com/#/c/9217/. |
| Comment by Denis Kondratenko (Inactive) [ 25/Apr/14 ] |
|
Unfortunately, Cheng left Xyratex, but we still need to get this landed. Could someone review Cheng's patch? |
| Comment by Cliff White (Inactive) [ 09/May/14 ] |
|
Reviewers have been assigned. |
| Comment by Ryan Haasken [ 29/May/14 ] |
|
It has been a while since there has been any activity on this bug. Who is reviewing Cheng's patch? |
| Comment by Cliff White (Inactive) [ 30/May/14 ] |
|
Very sorry about the delay; will investigate. |
| Comment by Ryan Haasken [ 04/Jun/14 ] |
|
Thanks, Cliff. http://review.whamcloud.com/#/c/9217/ has landed. |
| Comment by Cliff White (Inactive) [ 09/Jun/14 ] |
|
Is it okay to close this issue? |
| Comment by Ryan Haasken [ 09/Jun/14 ] |
|
Yes. |
| Comment by Gerrit Updater [ 10/Feb/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13718 |