[LU-15020] OSP_DISCONNECT blocking MDT unmount Created: 20/Aug/21  Updated: 17/Jan/22  Resolved: 17/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Send OSP_DISCONNECT only on a healthy import. Otherwise, force a local disconnect for unhealthy imports.



 Comments   
Comment by Mikhail Pershin [ 20/Aug/21 ]

From these reports about MDT hangs, what would be the easiest way to reproduce that issue?

Comment by John Hammond [ 20/Aug/21 ]

Set up a file system with MDTs spread over 2 VMs. Start the FS, do some cross-MDT operations, destroy one VM (no unmount, no shutdown), and try to umount (no --force) an MDT on the other VM.

Comment by Mikhail Pershin [ 23/Aug/21 ]

While I am checking how to make the server disconnect gracefully, one possible way to use --force umount is to set the device read-only beforehand; in that case I think the clients will be preserved on the server.

Comment by Mikhail Pershin [ 26/Aug/21 ]

adilger, I am not sure that this hang during MDT unmount is related to the stat() call you mentioned. That problem and the related patch concern client unmount and client RPCs, but here we have a local mountpoint unmount on the server. I doubt it causes an inter-MDT stat, though there could be some other RPC.

As for osp_disconnect(), the simplest approach would be to call ptlrpc_disconnect_import() with obd_force set if the import is in recovery, so there is no waiting for the import to recover and no disconnect RPC. If the import is healthy, the disconnect is sent so that the other MDT can clean up related resources.

My other question is about the situation as described. The description states that the server hangs waiting for a response to the DISCONNECT RPC, yet this RPC is always sent with the rq_no_resend flag, so it should fail after a timeout rather than hang forever. So was the hang observed by customers just very long, or does it really never end?
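
The osp_disconnect() idea proposed in this comment can be sketched as a small standalone C model. This is only an illustration under stated assumptions: the types and functions below (model_import, model_do_disconnect, model_osp_disconnect) are hypothetical stand-ins, not the real Lustre structures or the actual signature of ptlrpc_disconnect_import(), and the merged patch may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified import states; not the real Lustre enum. */
enum model_imp_state {
	MODEL_IMP_FULL,       /* healthy: connected and usable */
	MODEL_IMP_RECOVERING, /* peer lost, import is trying to recover */
};

/* Hypothetical stand-in for an import; not struct obd_import. */
struct model_import {
	enum model_imp_state state;
	bool force;           /* models the obd_force flag named in the ticket */
	bool rpc_sent;        /* a DISCONNECT RPC was sent to the peer */
	bool cleaned_locally; /* local state was torn down */
};

/*
 * Models the disconnect step (assumption): with force set, skip the
 * DISCONNECT RPC and only tear down local state; otherwise send the
 * RPC so the peer can clean up its side, then tear down local state.
 */
static void model_do_disconnect(struct model_import *imp)
{
	if (!imp->force)
		imp->rpc_sent = true;
	imp->cleaned_locally = true;
}

/* Models the proposed check: force the disconnect on unhealthy imports. */
static void model_osp_disconnect(struct model_import *imp)
{
	assert(imp != NULL);

	if (imp->state != MODEL_IMP_FULL)
		imp->force = true;

	model_do_disconnect(imp);
}
```

In this model a healthy import still notifies the peer, while an import stuck in recovery is cleaned up locally without sending anything, which is the behavior the unmount path needs when the other MDT is gone.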

Comment by Gerrit Updater [ 26/Aug/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44753
Subject: EX-3687 osp: do force disconnect if import is not ready
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1a5d067b340e0b62f5577a20779401427ca0adca

Comment by Gerrit Updater [ 17/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44753/
Subject: EX-3687 osp: do force disconnect if import is not ready
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8203c0f7a043aad9d087018119e278e4279ca8bc

Comment by Peter Jones [ 17/Jan/22 ]

Fix on master by https://review.whamcloud.com/#/c/44753/
