[LU-16973] Busy device after successful umount Created: 21/Jul/23 Updated: 19/Dec/23 Resolved: 06/Sep/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
| Severity: | 3 |
| Epic: | server |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have a script that checks the failover process and crashes the node if failover is delayed. Usually 30 seconds after "umount complete" is enough to finish all outstanding work and reach a zero reference count on the RAID device. But high IO and two MDTs on one node make things worse: the delayed work takes much longer, and we have seen it still running 20+ minutes after umount completes. The umount work is split between 1) the umount syscall itself and 2) delayed work that finishes afterwards. The md65 mount has 1747359 references and 2387523 files on the fput list.
[353410.384294] Lustre: server umount kjcf04-MDT0000 complete
[353454.532032] sysrq: SysRq : Trigger a crash
[0]: ffffd193c1a1f584
struct mnt_pcp {
mnt_count = 436042,
mnt_writers = 0
}
crash> mnt_pcp 0x3a1fd1e1f584:1
[1]: ffffd193c1a5f584
struct mnt_pcp {
mnt_count = 436156,
mnt_writers = 0
}
crash> mnt_pcp 0x3a1fd1e1f584:a
[0]: ffffd193c1a1f584
struct mnt_pcp {
mnt_count = 436042,
mnt_writers = 0
}
cat mount_count.pcp | tr -d ,|grep mnt_count|awk '{ sum += $3 } END { print sum }'
1747359
crash> file ffff978de3176300
struct file {
f_u = {
fu_llist = {
next = 0xffff9794e4529e00
},
fu_rcuhead = {
next = 0xffff9794e4529e00,
func = 0x0
}
},
f_path = {
mnt = 0xffff9793b5200da0,
dentry = 0xffff97912a254a80
},
f_inode = 0xffff979ddd8351c0,
f_op = 0xffffffffc1c2b640 <ldiskfs_dir_operations>,
list 0xffff9794e4529e00 | wc -l
2387523
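For reference, the per-cpu mnt_count values above only make sense when summed over all CPUs, which is what the awk pipeline does; the kernel computes the same total roughly as follows (paraphrased from mnt_get_count() in fs/namespace.c, exact code depends on kernel version):
/* Paraphrased sketch of fs/namespace.c:mnt_get_count(): with CONFIG_SMP the
 * mount reference count lives in per-cpu struct mnt_pcp counters (as seen in
 * the crash output above) and must be summed across all possible CPUs. */
static unsigned int mnt_get_count(struct mount *mnt)
{
#ifdef CONFIG_SMP
	unsigned int count = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;

	return count;
#else
	return mnt->mnt_count;
#endif
}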
|
| Comments |
| Comment by Gerrit Updater [ 21/Jul/23 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51731 |
| Comment by Andreas Dilger [ 22/Jul/23 ] |
|
Instead of deferring all of this work to unmount, does it make sense to (somehow) schedule this work more aggressively during runtime, so that it doesn't get backlogged? Having 2.7M files to clean up seems like it is getting behind and unable to keep up with the workload. Can we put more threads to process the workqueue, or increase the thread priority? |
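For illustration only (not taken from the Lustre or ldiskfs code, and the workqueue name below is invented): if the cleanup ran on a dedicated workqueue, its concurrency and priority could in principle be raised with the standard workqueue flags, e.g.:
#include <linux/workqueue.h>
#include <linux/cpumask.h>

/* Hypothetical example: an unbound, high-priority workqueue allowing up to
 * one in-flight work item per online CPU.  "cleanup_wq" is made up here. */
static struct workqueue_struct *cleanup_wq;

static int __init cleanup_wq_init(void)
{
	cleanup_wq = alloc_workqueue("cleanup_wq",
				     WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
				     num_online_cpus());
	return cleanup_wq ? 0 : -ENOMEM;
}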
| Comment by Alexander Boyko [ 24/Jul/23 ] |
|
Unfortunately the kernel is designed with a single delayed_fput_list, and delayed_fput() does not split the list across kworker threads. Probably the reason is that the kernel itself only ever has a small number of files on this list. The best approach would be a per-cpu delayed list; by default the kernel already has two kworker threads per cpu. adilger Is it possible to minimize the calls to alloc_file_pseudo()? |
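For context, the single global list being described looks roughly like this in the kernel (paraphrased from fs/file_table.c; details vary by kernel version): one lock-less list drained by one work item, so every deferred release funnels through the same place.
/* Paraphrased sketch of the upstream mechanism: deferred fput()s are pushed
 * onto one global llist, and a single work item walks the whole list calling
 * __fput() on each file - no splitting across kworker threads. */
static LLIST_HEAD(delayed_fput_list);

static void delayed_fput(struct work_struct *unused)
{
	struct llist_node *node = llist_del_all(&delayed_fput_list);
	struct file *f, *t;

	llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
		__fput(f);	/* serialized in this one worker */
}

static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);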
| Comment by Andreas Dilger [ 26/Jul/23 ] |
|
The use of alloc_file_pseudo() was added for upstream kernel compatibility/cleanliness, but if it is causing problems by generating millions of delayed fput() calls at unmount time, I would rather fix the core problem than make "clean up 2M file descriptors at unmount" work better. Previously we just allocated a "struct file" on the stack and filled in the required fields directly for the few interfaces, such as ioctl() or iterate_dir(), that need a "struct file" argument. In most cases, the struct file returned by alloc_file_pseudo() is only used for the duration of a single function. Rather than allocating and (delayed) freeing millions of file descriptors for these temporary calls, it would be much more efficient to avoid allocating them entirely, if possible. Do you know which function most of the file handles are coming from? Is it mostly the delayed xattr handling? Could we add something to the MDT service threads that calls flush_workqueue(sbi->s_misc_wq) every 5s (or 60s or similar) to avoid accumulating so much delayed work? Looking through the callers of alloc_file_pseudo(), I think some things can be done to improve this:
|
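As a rough illustration of the pattern just described (an invented example, not one of the actual Lustre callers): each temporary directory scan allocates a pseudo file, uses it for one call, and drops it with fput(), which from a kernel service thread lands on the global delayed list.
/* Hypothetical sketch of a short-lived alloc_file_pseudo() user.  The struct
 * file exists only so iterate_dir() has something to operate on; the fput()
 * at the end is deferred when called from a kthread, which is how millions
 * of entries can pile up until umount. */
static int scan_dir_once(struct inode *inode, struct vfsmount *mnt,
			 struct dir_context *ctx)
{
	struct file *filp;
	int rc;

	filp = alloc_file_pseudo(inode, mnt, "", O_RDONLY, inode->i_fop);
	if (IS_ERR(filp))
		return PTR_ERR(filp);

	rc = iterate_dir(filp, ctx);	/* only use of the struct file */
	fput(filp);			/* deferred on a service thread */
	return rc;
}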
| Comment by Andrew Perepechko [ 26/Jul/23 ] |
|
Andreas, delayed fput is not related to the delayed iput hack in the xattr removal path. fput() is designed to be deferred until exit from the syscall, or run as separate work. I'm not sure there is a need for fput to be deferred in Lustre; maybe we can just call the postponed version, ___fput(), directly instead of fput(). I think one of the problems with this approach is that the postponed call is not exported. |
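To expand on why service threads always hit the deferred path: in the upstream fput() (paraphrased below, version-dependent) the task_work shortcut is only taken for ordinary user tasks, so anything running with PF_KTHREAD set falls through to the global delayed list every time.
/* Paraphrased sketch of fs/file_table.c:fput().  For normal tasks the release
 * is queued as task work and runs on return to userspace; kernel threads
 * never return to userspace, so their files go onto delayed_fput_list. */
void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_u.fu_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_u.fu_rcuhead,
					   TWA_RESUME))
				return;
			/* fall through if the task is already exiting */
		}

		if (llist_add(&file->f_u.fu_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}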
| Comment by Andreas Dilger [ 26/Jul/23 ] |
|
This is clearly a problem since the Lustre service threads never "exit from a syscall", so it looks like all fput() operations are being delayed. The "__fput_sync()" function was added in commit v3.5-rc6-284-g4a9d4b024a31 when delayed fput() was added, but it was not exported until commit v5.17-rc5-285-gf00432063db1. We could convert (some/all?) users to use __fput_sync() (with a configure check):
/*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can't use flush_delayed_fput() without
 * risking deadlocks), need to wait for completion of __fput() and know
 * for this specific struct file it won't involve anything that would
 * need them. Use only if you really need it - at the very least,
 * don't blindly convert fput() by kernel thread to that.
 */
void __fput_sync(struct file *file)
For kernels where __fput_sync() is not exported, we could use kallsyms_lookup_name() to access it. There is also a way to force delayed fput() calls to be flushed immediately:
/*
 * If kernel thread really needs to have the final fput() it has done
 * to complete, call this. Please, don't add more callers without
 * very good reasons; in particular, never call that with locks
 * held and never call that from a thread that might need to do
 * some work on any kind of umount.
 */
void flush_delayed_fput(void)
{
	delayed_fput(NULL);
}
EXPORT_SYMBOL_GPL(flush_delayed_fput);
This could be called from ptlrpc_main() before the thread goes to sleep, when there are no more RPCs to process. That would delay the fput work until the system is not totally overloaded, and doing it in the background from an idle thread should make RPC processing a tiny bit more efficient:
 if (ptlrpc_server_request_pending(svcpt, false)) {
 	lu_context_enter(&env->le_ctx);
 	ptlrpc_server_handle_request(svcpt, thread);
 	lu_context_exit(&env->le_ctx);
-}
+} else {
+	flush_delayed_fput();
+}
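A minimal sketch of the configure-guarded conversion mentioned above, assuming a hypothetical HAVE___FPUT_SYNC macro produced by a configure test (not an existing Lustre symbol):
/* Hypothetical wrapper: release the file synchronously where the kernel
 * exports __fput_sync(), otherwise fall back to the regular fput() and
 * accept that it may be deferred to delayed_fput_list. */
static inline void server_fput(struct file *file)
{
#ifdef HAVE___FPUT_SYNC
	__fput_sync(file);	/* immediate release, no delayed work */
#else
	fput(file);		/* may be queued for the delayed worker */
#endif
}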
|
| Comment by Gerrit Updater [ 29/Jul/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51805 |
| Comment by Gerrit Updater [ 19/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51805/ |
| Comment by Gerrit Updater [ 06/Sep/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51731/ |
| Comment by Peter Jones [ 06/Sep/23 ] |
|
Landed for 2.16 |