Details
Bug
Resolution: Fixed
Minor
Description
Send OSP_DISCONNECT only on a healthy import. Otherwise, force a local disconnect for unhealthy imports.
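The decision described above can be illustrated with a minimal sketch. This is not the actual Lustre patch: the `sketch_import` struct, the `sketch_imp_state` enum, and the `osp_disconnect_sketch()` function are hypothetical names invented for illustration, and "healthy" is assumed here to mean a fully connected import. The RPC send is represented only by a flag.

```c
#include <stdbool.h>

/* Hypothetical, simplified import states; not Lustre's real state enum. */
enum sketch_imp_state {
    SKETCH_IMP_FULL,          /* fully connected, assumed "healthy" */
    SKETCH_IMP_RECOVERING,    /* peer not currently reachable */
    SKETCH_IMP_DISCONNECTED   /* torn down locally */
};

/* Hypothetical stand-in for an import to a peer target. */
struct sketch_import {
    enum sketch_imp_state state;
    bool rpc_sent;   /* true if a disconnect RPC would have been sent */
    bool connected;  /* local view of the connection */
};

/* Assumption: only a fully connected import counts as healthy. */
static bool sketch_import_is_healthy(const struct sketch_import *imp)
{
    return imp->state == SKETCH_IMP_FULL;
}

/*
 * Send the disconnect RPC only when the import is healthy; otherwise
 * skip the RPC so the caller never waits on an unreachable peer.
 * Either way the import ends up disconnected locally.
 * Returns true if the RPC path was taken.
 */
static bool osp_disconnect_sketch(struct sketch_import *imp)
{
    bool sent = false;

    if (sketch_import_is_healthy(imp)) {
        imp->rpc_sent = true;  /* stand-in for sending OSP_DISCONNECT */
        sent = true;
    }

    /* Local cleanup happens on both paths. */
    imp->connected = false;
    imp->state = SKETCH_IMP_DISCONNECTED;
    return sent;
}
```

In this sketch, an import in `SKETCH_IMP_RECOVERING` is disconnected locally without any RPC, which is the property that would keep an unmount from blocking on an unavailable peer.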
Activity
Fix Version/s | New: Lustre 2.15.0 [ 14791 ] | |
Resolution | New: Fixed [ 1 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Description |
Original:
This is based on observations during failover and failback, both in testing and at customer sites. Often I would see an MDT unmount hang because a peer MDT was unavailable while the first was blocked waiting for a response to OSP_DISCONNECT.
John, 2:46 PM: I have asked about this but haven't gotten a clear answer. It seems bad to have MDT umount block on an RPC (from osp_disconnect()). What is the point of the MDT/OST_DISCONNECT RPC here, and can it be eliminated entirely?
Andreas, 2:53 PM: it definitely shouldn't block forever. I know clients can/should be able to unmount even if the server is down, but newer unmount binaries (for whatever reason) do a 'stat(mountpoint)' and hang there before the unmount even is called in the kernel
Andreas, 2:54 PM: there is https://review.whamcloud.com/43706 which may also fix this issue for regular unmounts. yes, it is horrific, but I didn't have a better suggestion. I guess we could replace the "/sbin/umount" binary to not do "stat"
John, 3:07 PM: "it definitely shouldn't block forever. I know clients can/should be able to unmount even if the server is down, but newer unmount binaries (for whatever reason) do a 'stat(mountpoint)' and hang there before the unmount even is called in the kernel" But my question is "what is the point of osp_disconnect() sending a MDS/OST_DISCONNECT RPC?"
Andreas, 3:07 PM: so that the server can nicely clean up its resources instead of reporting the server being evicted?
Umount --force avoids waiting on the disconnect but causes client eviction issues described in EX-3429. What resources are being released in a server-to-server disconnect? What happens if we don't send a server-to-server disconnect? What is the best way to avoid the hang described above? |
New: Send OSP_DISCONNECT only on a healthy import. Otherwise, force a local disconnect for unhealthy imports. |
Link | New: This issue is related to EX-3792 [ EX-3792 ] |
Component/s | New: Core Lustre [ 12687 ] |
Fixed on master by https://review.whamcloud.com/#/c/44753/