Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.12.6
-
None
-
CentOS 7.9 clients (3.10.0-1160.24.1.el7.x86_64)
-
3
-
9223372036854775807
Description
We noticed today on a cluster running 2.12.6 clients that a large number of them were blocked at umount when applying a security patch. No jobs were running anymore on these clients, but umount doesn't finish.
Dumping the tasks on a client has revealed that 3 ll_sa threads are still running:
PID: 25857 TASK: ffff9a78cd32b180 CPU: 36 COMMAND: "ll_sa_24325" #0 [ffff9a7519997c98] __schedule at ffffffffa2f8899f #1 [ffff9a7519997d28] schedule at ffffffffa2f88eb9 #2 [ffff9a7519997d38] ll_statahead_thread at ffffffffc133787e [lustre] #3 [ffff9a7519997ec8] kthread at ffffffffa28c5da1 #4 [ffff9a7519997f50] ret_from_fork_nospec_begin at ffffffffa2f95de4 PID: 34631 TASK: ffff9b754a283180 CPU: 10 COMMAND: "ll_sa_30018" #0 [ffff99fd4ebd7c98] __schedule at ffffffffa2f8899f #1 [ffff99fd4ebd7d28] schedule at ffffffffa2f88eb9 #2 [ffff99fd4ebd7d38] ll_statahead_thread at ffffffffc133787e [lustre] #3 [ffff99fd4ebd7ec8] kthread at ffffffffa28c5da1 #4 [ffff99fd4ebd7f50] ret_from_fork_nospec_begin at ffffffffa2f95de4 PID: 41809 TASK: ffff9afeeeb8b180 CPU: 40 COMMAND: "ll_sa_40406" #0 [ffff99fe617c7c98] __schedule at ffffffffa2f8899f #1 [ffff99fe617c7d28] schedule at ffffffffa2f88eb9 #2 [ffff99fe617c7d38] ll_statahead_thread at ffffffffc133787e [lustre] #3 [ffff99fe617c7ec8] kthread at ffffffffa28c5da1 #4 [ffff99fe617c7f50] ret_from_fork_nospec_begin at ffffffffa2f95de4
This is blocking umount/ll_kill_super():
crash> bt -f 19912 PID: 19912 TASK: ffff9b7532fae300 CPU: 44 COMMAND: "umount" #0 [ffff9a002139bd08] __schedule at ffffffffa2f8899f ffff9a002139bd10: 0000000000000082 ffff9a002139bfd8 ffff9a002139bd20: ffff9a002139bfd8 ffff9a002139bfd8 ffff9a002139bd30: 000000000001acc0 ffff9afb94b4e300 ffff9a002139bd40: ffff9b797f2d3940 00000001e05d9d9f ffff9a002139bd50: 0000000000000282 000000006a88e28b ffff9a002139bd60: ffff9a002139bdc0 ffff9b797f313940 ffff9a002139bd70: ffff9a002139bdc0 00000001e05d9e1c ffff9a002139bd80: ffff9b797f313940 0000000000015980 ffff9a002139bd90: ffff9a002139bda0 ffffffffa2f88eb9 #1 [ffff9a002139bd98] schedule at ffffffffa2f88eb9 ffff9a002139bda0: ffff9a002139be48 ffffffffa2f868d8 #2 [ffff9a002139bda8] schedule_timeout at ffffffffa2f868d8 ffff9a002139bdb0: ffff9a0073ffd888 0000000000000282 ffff9a002139bdc0: ffff9b797f314b48 ffff9b797f314b48 ffff9a002139bdd0: 00000001e05d9e1c ffff9b797f313940 ffff9a002139bde0: ffffffffa28ad7f0 ffff9b7532fae300 ffff9a002139bdf0: ffffffffffffffff 0000000000000000 ffff9a002139be00: 0000000000000000 0000000000000000 ffff9a002139be10: 0000000000000282 000000006a88e28b ffff9a002139be20: ffff9b7532fae300 ffff9a0030bf0000 ffff9a002139be30: ffff9b7532fae300 ffffffffa3907570 ffff9a002139be40: ffff9b7532faead0 ffff9a002139be70 ffff9a002139be50: ffffffffc130a449 #3 [ffff9a002139be50] ll_kill_super at ffffffffc130a449 [lustre] ffff9a002139be58: 0000000000000000 ffff9b75f7eba000 ffff9a002139be68: ffffffffc0fa8ae0 ffff9a002139be88 ffff9a002139be78: ffffffffc0f599ad #4 [ffff9a002139be78] lustre_kill_super at ffffffffc0f599ad [obdclass] ffff9a002139be80: ffff9b75f7eba000 ffff9a002139bea8 ffff9a002139be90: ffffffffa2a50fae #5 [ffff9a002139be90] deactivate_locked_super at ffffffffa2a50fae ffff9a002139be98: ffff9b75f7eba000 0000000000000000 ffff9a002139bea8: ffff9a002139bec0 ffffffffa2a51736 #6 [ffff9a002139beb0] deactivate_super at ffffffffa2a51736 ffff9a002139beb8: ffff9a0073ffd680 ffff9a002139bed8 ffff9a002139bec8: ffffffffa2a70dbf #7 [ffff9a002139bec8] cleanup_mnt at ffffffffa2a70dbf ffff9a002139bed0: ffff9a0073ffd6b8 ffff9a002139bee8 ffff9a002139bee0: ffffffffa2a70e52 #8 [ffff9a002139bee0] __cleanup_mnt at ffffffffa2a70e52 ffff9a002139bee8: ffff9a002139bf28 ffffffffa28c28db #9 [ffff9a002139bef0] task_work_run at ffffffffa28c28db ffff9a002139bef8: ffff9b7532faeb04 0000000000000002 ffff9a002139bf08: ffff9a002139bf58 0000000000000000 ffff9a002139bf18: 0000000000000000 0000000000000000 ffff9a002139bf28: ffff9a002139bf48 ffffffffa282cc65 #10 [ffff9a002139bf30] do_notify_resume at ffffffffa282cc65 ffff9a002139bf38: 0000000000000000 0000000000000000 ffff9a002139bf48: 0000000000000000 ffffffffa2f962ef #11 [ffff9a002139bf50] int_signal at ffffffffa2f962ef RIP: 00002b88f55a5a97 RSP: 00007fffaf522e48 RFLAGS: 00000246 RAX: 0000000000000000 RBX: 00005603f7e53040 RCX: ffffffffffffffff RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005603f7e54840 RBP: 00005603f7e54840 R8: 0000000000000000 R9: 0000000000000005 R10: 00007fffaf5228a0 R11: 0000000000000246 R12: 00002b88f4e3ad78 R13: 0000000000000000 R14: 00005603f7e54960 R15: 00005603f7e53040 ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
sb = 0xffff9b75f7eba000 crash> struct super_block.s_fs_info ffff9b75f7eba000 s_fs_info = 0xffff9b75f7eb9000 crash> struct lustre_sb_info.lsi_llsbi 0xffff9b75f7eb9000 lsi_llsbi = 0xffff9a0030bf0000 crash> struct ll_sb_info.ll_umounting 0xffff9a0030bf0000 ll_umounting = 1 crash> struct ll_sb_info.ll_sa_running 0xffff9a0030bf0000 ll_sa_running = { counter = 3 }
Here on this cluster clients are all 2.12.6, and server is Oak which is running 2.12.6 too.
We have one client that is available for live troubleshooting (it's still blocked at umount).
Is there another known bug with such problem in 2.12? Happy to provide more info.
Attachments
Issue Links
- is related to
-
LU-15660 parallel-scale-nfsv4 test_racer_on_nfs: statahead doesn't stop
- Resolved