Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.17.0
-
None
-
3
Description
It looks like mdt_hsm_cdt_stop() may be stopping some kthread too many times:
[27200.621224] Lustre: DEBUG MARKER: == insanity test 2: Second Failure Mode: MDS/OST Sat May 10 10:33:11 PM UTC 2025 ========================================================== 22:33:11 (1746916391)
[27222.571920] Lustre: 4857:0:(client.c:2445:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1746916396/real 1746916396] req@ffff974aa10f9d40 x1831746590615040/t0(0) o400->MGC10.240.40.190@tcp@10.240.40.190@tcp:26/25 lens 224/224 e 0 to 1 dl 1746916412 ref 1 fl Rpc:XNQr/200/ffffffff rc 0/-1 job:'kworker.0' uid:0 gid:0 projid:4294967295
[27222.575331] LustreError: MGC10.240.40.190@tcp: Connection to MGS (at 10.240.40.190@tcp) was lost; in progress operations using this service will fail
[27223.636590] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[27223.979889] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[27224.154704] Lustre: Failing over lustre-MDT0001
[27224.162873] ------------[ cut here ]------------
[27224.163480] refcount_t: addition on 0; use-after-free.
[27224.164165] WARNING: CPU: 0 PID: 743453 at lib/refcount.c:25 refcount_warn_saturate+0x74/0x110
[27224.165168] Modules linked in: dm_flakey tls osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc rfkill intel_rapl_msr intel_rapl_common virtio_balloon i2c_piix4 pcspkr joydev dm_mod drm fuse ext4 mbcache jbd2 ata_generic virtio_net crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ata_piix libata virtio_blk net_failover failover serio_raw [last unloaded: dm_flakey]
[27224.170385] CPU: 0 PID: 743453 Comm: umount Kdump: loaded Tainted: G OE ------- --- 5.14.0-427.42.1.el9_4.x86_64 #1
[27224.171730] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[27224.172408] RIP: 0010:refcount_warn_saturate+0x74/0x110
[27224.173024] Code: 01 01 e8 ff 7b af ff 0f 0b c3 cc cc cc cc 80 3d ca e9 b1 01 00 75 cb 48 c7 c7 80 2a 58 9e c6 05 ba e9 b1 01 01 e8 dc 7b af ff <0f> 0b c3 cc cc cc cc 80 3d a9 e9 b1 01 00 75 a8 48 c7 c7 58 2a 58
[27224.174889] RSP: 0018:ffffafa4c7233af0 EFLAGS: 00010282
[27224.175484] RAX: 0000000000000000 RBX: ffff974ab3d0f000 RCX: 0000000000000027
[27224.176267] RDX: 0000000000000027 RSI: 0000000000027ffb RDI: ffff974b3fc20848
[27224.177050] RBP: ffff974a84d51cc0 R08: 0000000000000000 R09: 00000000ffff7fff
[27224.177820] R10: ffffafa4c7233978 R11: ffffffff9efe7a48 R12: ffff974a84d51cf0
[27224.178605] R13: ffff974aad8b8000 R14: 00000000fffffff4 R15: ffff974a9fdbb27a
[27224.179381] FS: 00007fd43be33540(0000) GS:ffff974b3fc00000(0000) knlGS:0000000000000000
[27224.180274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27224.180928] CR2: 00007ffec7444918 CR3: 000000002638c004 CR4: 00000000000606f0
[27224.181697] Call Trace:
[27224.182052] <TASK>
[27224.182341] ? show_trace_log_lvl+0x1c4/0x2df
[27224.182882] ? show_trace_log_lvl+0x1c4/0x2df
[27224.183432] ? kthread_stop+0x176/0x180
[27224.183936] ? refcount_warn_saturate+0x74/0x110
[27224.184469] ? __warn+0x81/0x110
[27224.184899] ? refcount_warn_saturate+0x74/0x110
[27224.185429] ? report_bug+0x10a/0x140
[27224.185903] ? handle_bug+0x3c/0x70
[27224.186366] ? exc_invalid_op+0x14/0x70
[27224.186826] ? asm_exc_invalid_op+0x16/0x20
[27224.187360] ? refcount_warn_saturate+0x74/0x110
[27224.187916] kthread_stop+0x176/0x180
[27224.188367] mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
[27224.189174] ? lu_context_fini+0xa7/0x190 [obdclass]
[27224.190135] mdt_fini+0x98/0x580 [mdt]
[27224.190609] mdt_device_fini+0x2b/0xc0 [mdt]
[27224.191172] obd_precleanup+0xdc/0x280 [obdclass]
[27224.191789] ? class_disconnect_exports+0x131/0x300 [obdclass]
[27224.192514] class_cleanup+0x2d7/0x600 [obdclass]
[27224.193150] class_process_config+0x1102/0x1ab0 [obdclass]
[27224.193838] ? class_manual_cleanup+0x161/0x7a0 [obdclass]
[27224.194519] class_manual_cleanup+0x43b/0x7a0 [obdclass]
[27224.195217] server_put_super+0x98f/0xb40 [ptlrpc]
[27224.196434] generic_shutdown_super+0x74/0x120
[27224.196986] kill_anon_super+0x14/0x30
[27224.197435] deactivate_locked_super+0x31/0xa0
[27224.197962] cleanup_mnt+0x100/0x160
[27224.198414] task_work_run+0x5c/0x90
[27224.198872] exit_to_user_mode_loop+0x122/0x130
[27224.199416] exit_to_user_mode_prepare+0xb6/0x100
[27224.199965] syscall_exit_to_user_mode+0x12/0x40
[27224.200507] do_syscall_64+0x69/0x90
[27224.200956] entry_SYSCALL_64_after_hwframe+0x77/0xe1
[27224.201550] RIP: 0033:0x7fd43bd0df0b
[27224.202045] Code: 1b bf 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e1 be 0e 00 f7 d8
[27224.203923] RSP: 002b:00007ffec74479f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[27224.204736] RAX: 0000000000000000 RBX: 00005628b4a759d0 RCX: 00007fd43bd0df0b
[27224.205518] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005628b4a75820
[27224.206311] RBP: 00005628b4a6fb30 R08: 0000000000000000 R09: 0000000000000000
[27224.207099] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[27224.207867] R13: 00005628b4a75820 R14: 00005628b4a6fc40 R15: 00005628b4a6fb30
[27224.208641] </TASK>
[27224.208952] ---[ end trace a3f94a89480afc5c ]---
This later culminates in a crash (NULL pointer dereference):
[27224.253517] BUG: kernel NULL pointer dereference, address: 0000000000000000
[27224.254285] #PF: supervisor write access in kernel mode
[27224.254849] #PF: error_code(0x0002) - not-present page
[27224.255436] PGD 0 P4D 0
[27224.255768] Oops: 0002 [#1] PREEMPT SMP PTI
[27224.256246] CPU: 0 PID: 743453 Comm: umount Kdump: loaded Tainted: G W OE ------- --- 5.14.0-427.42.1.el9_4.x86_64 #1
[27224.257429] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[27224.258058] RIP: 0010:kthread_stop+0x45/0x180
[27224.258550] Code: 00 f0 0f c1 45 30 85 c0 0f 84 40 01 00 00 8d 50 01 09 c2 0f 88 05 01 00 00 f6 45 36 20 0f 84 12 01 00 00 48 8b 9d a8 0b 00 00 <f0> 80 0b 02 48 89 ef e8 8f f6 ff ff 48 89 ef e8 07 b6 01 00 48 8d
[27224.260372] RSP: 0018:ffffafa4c7233af8 EFLAGS: 00010246
[27224.260938] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[27224.261701] RDX: 0000000000000027 RSI: 0000000000027ffb RDI: ffff974b3fc20848
[27224.262461] RBP: ffff974a84d51cc0 R08: 0000000000000000 R09: 00000000ffff7fff
[27224.263213] R10: ffffafa4c7233978 R11: ffffffff9efe7a48 R12: ffff974a84d51cf0
[27224.263954] R13: ffff974aad8b8000 R14: 00000000fffffff4 R15: ffff974a9fdbb27a
[27224.264732] FS: 00007fd43be33540(0000) GS:ffff974b3fc00000(0000) knlGS:0000000000000000
[27224.265577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27224.266203] CR2: 0000000000000000 CR3: 000000002638c004 CR4: 00000000000606f0
[27224.266977] Call Trace:
[27224.267298] <TASK>
[27224.267578] ? show_trace_log_lvl+0x1c4/0x2df
[27224.268078] ? show_trace_log_lvl+0x1c4/0x2df
[27224.268570] ? mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
[27224.269160] ? __die_body.cold+0x8/0xd
[27224.269597] ? page_fault_oops+0x134/0x170
[27224.270081] ? kernelmode_fixup_or_oops+0x84/0x110
[27224.270617] ? exc_page_fault+0x62/0x150
[27224.271073] ? asm_exc_page_fault+0x22/0x30
[27224.271548] ? kthread_stop+0x45/0x180
[27224.271984] ? kthread_stop+0x176/0x180
[27224.272443] mdt_hsm_cdt_stop+0x12b/0x250 [mdt]
[27224.273006] ? lu_context_fini+0xa7/0x190 [obdclass]
[27224.273643] mdt_fini+0x98/0x580 [mdt]
[27224.274123] mdt_device_fini+0x2b/0xc0 [mdt]
[27224.274645] obd_precleanup+0xdc/0x280 [obdclass]
[27224.275249] ? class_disconnect_exports+0x131/0x300 [obdclass]
[27224.275945] class_cleanup+0x2d7/0x600 [obdclass]
[27224.276544] class_process_config+0x1102/0x1ab0 [obdclass]
[27224.277223] ? class_manual_cleanup+0x161/0x7a0 [obdclass]
[27224.277888] class_manual_cleanup+0x43b/0x7a0 [obdclass]
[27224.278547] server_put_super+0x98f/0xb40 [ptlrpc]
[27224.279229] generic_shutdown_super+0x74/0x120
[27224.279735] kill_anon_super+0x14/0x30
[27224.280181] deactivate_locked_super+0x31/0xa0
[27224.280684] cleanup_mnt+0x100/0x160
[27224.281115] task_work_run+0x5c/0x90
[27224.281538] exit_to_user_mode_loop+0x122/0x130
[27224.282052] exit_to_user_mode_prepare+0xb6/0x100
[27224.282572] syscall_exit_to_user_mode+0x12/0x40
[27224.283096] do_syscall_64+0x69/0x90
[27224.283522] entry_SYSCALL_64_after_hwframe+0x77/0xe1
[27224.284084] RIP: 0033:0x7fd43bd0df0b
[27224.284513] Code: 1b bf 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e1 be 0e 00 f7 d8
First recorded hit on Oct 8th, 2024, for 2.16.0rc1: https://testing.whamcloud.com/test_sets/d19b194f-c74f-4eba-b81d-406ec4f77cc4
It has hit periodically on master since then (7 occurrences to date).
Merged for 2.17
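
For illustration only, here is a userspace sketch of the general "stop a worker at most once" pattern that the "refcount_t: addition on 0" warning in kthread_stop() points at. This is not the Lustre fix and not kernel code: it uses pthreads as a stand-in for kthreads, and every name in it (struct worker, worker_start, worker_stop) is hypothetical. The assumption is that two teardown paths can both try to stop the same thread, so only the caller that atomically claims the "running" flag is allowed to join it.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a coordinator that owns one worker thread. */
struct worker {
	pthread_t tid;
	atomic_bool running;     /* true while the thread may still be joined */
	atomic_bool should_stop; /* polled by the worker loop */
};

/* Worker body: spin politely until asked to stop. */
static void *worker_main(void *arg)
{
	struct worker *w = arg;

	while (!atomic_load(&w->should_stop))
		sched_yield();
	return NULL;
}

/* Start the worker; returns 0 or a negative errno value. */
static int worker_start(struct worker *w)
{
	int rc;

	assert(w != NULL);
	atomic_store(&w->should_stop, false);
	rc = pthread_create(&w->tid, NULL, worker_main, w);
	if (rc != 0)
		return -rc;
	atomic_store(&w->running, true);
	return 0;
}

/*
 * Stop the worker at most once. Only the caller that flips "running"
 * from true to false joins the thread; any later or concurrent caller
 * gets -EALREADY instead of touching a thread that is already gone.
 */
static int worker_stop(struct worker *w)
{
	assert(w != NULL);
	if (!atomic_exchange(&w->running, false))
		return -EALREADY;
	atomic_store(&w->should_stop, true);
	pthread_join(w->tid, NULL);
	return 0;
}
```

With this guard, calling worker_stop() twice on the same worker is harmless: the first call joins the thread and the second returns -EALREADY. Whether the actual mdt_hsm_cdt_stop() problem is a plain double stop or a stop racing with the thread exiting on its own is not established in this ticket.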