[LU-15020] OSP_DISCONNECT blocking MDT unmount Created: 20/Aug/21 Updated: 17/Jan/22 Resolved: 17/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Send OSP_DISCONNECT only on a healthy import. Otherwise, force a local disconnect for unhealthy imports. |
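A minimal C sketch of that behaviour follows; this is an illustration under assumptions, not the merged patch (the helper name osp_disconnect_sketch() is hypothetical; the actual change is in the Gerrit patch linked in the comments below).

```c
#include <obd_class.h>      /* struct obd_device, client import access */
#include <lustre_import.h>  /* struct obd_import, LUSTRE_IMP_FULL */

/* Hypothetical helper: only a healthy (FULL) import gets a real
 * DISCONNECT RPC; anything else is torn down locally so the MDT
 * umount cannot block waiting on a dead or recovering peer. */
static int osp_disconnect_sketch(struct obd_device *obd)
{
	struct obd_import *imp = obd->u.cli.cl_import;

	if (imp->imp_state == LUSTRE_IMP_FULL)
		return ptlrpc_disconnect_import(imp, 0);

	/* Unhealthy import: mark it invalid and drop in-flight
	 * requests locally instead of waiting for the peer MDT. */
	ptlrpc_deactivate_import(imp);
	ptlrpc_invalidate_import(imp);
	return 0;
}
```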
| Comments |
| Comment by Mikhail Pershin [ 20/Aug/21 ] |
|
From these reports about MDT hangs, what would be the easiest way to reproduce that issue? |
| Comment by John Hammond [ 20/Aug/21 ] |
|
Set up an FS with MDTs spread over two VMs. Start the FS, do some cross-MDT operations, destroy one VM (no unmount, no shutdown), and try to umount (without --force) an MDT on the other VM. |
| Comment by Mikhail Pershin [ 23/Aug/21 ] |
|
While I am checking how to make the server disconnect gracefully, one possible approach is to go with a --force umount after setting the device read-only first; in that case I think the clients will be preserved on the server. |
| Comment by Mikhail Pershin [ 26/Aug/21 ] |
|
adilger, I am not sure this hang during MDT unmount is related to the stat() call you mentioned. That problem and the related patch are about client unmount and a client RPC, but here we have a local mountpoint unmount on the server; I doubt it causes an inter-MDT stat, though there could be some other RPC involved. As for osp_disconnect(), the simplest approach would be to call ptlrpc_disconnect_import() with obd_force set when the import is in recovery, so there is no waiting for the import to recover and no DISCONNECT RPC; if the import is healthy, the disconnect is sent so the other MDT can clean up the related resources. My other question is about the overall situation as described: it states that the server hangs waiting for the response to the DISCONNECT RPC, yet this RPC is always sent with the rq_no_resend flag, so it should fail after a timeout rather than hang forever. So was the hang observed by customers merely very long, or did it really never end? |
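A hedged caller-side sketch of that suggestion (the function name and placement are assumptions used only to illustrate the comment above, not the patch that eventually landed):

```c
#include <obd_class.h>      /* struct obd_device, obd_force flag */
#include <lustre_import.h>  /* struct obd_import */
#include <lustre_net.h>     /* ptlrpc_disconnect_import(), ptlrpc_import_in_recovery() */

/* Hypothetical sketch: if the peer import is still in recovery, set
 * obd_force so that (per the comment above) ptlrpc_disconnect_import()
 * neither waits for recovery nor sends the DISCONNECT RPC.  A healthy
 * import still gets a normal DISCONNECT, which carries rq_no_resend
 * and therefore fails after a timeout instead of being resent. */
static int osp_disconnect_unhealthy_sketch(struct obd_device *obd)
{
	struct obd_import *imp = obd->u.cli.cl_import;

	if (ptlrpc_import_in_recovery(imp))
		obd->obd_force = 1;

	return ptlrpc_disconnect_import(imp, 0);
}
```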
| Comment by Gerrit Updater [ 26/Aug/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44753 |
| Comment by Gerrit Updater [ 17/Sep/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44753/ |
| Comment by Peter Jones [ 17/Jan/22 ] |
|
Fix on master by https://review.whamcloud.com/#/c/44753/ |