[LU-17097] RCU stall caused by osc_quota_cleanup Created: 07/Sep/23 Updated: 29/Nov/23 Resolved: 29/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sergey Cheremencev | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
sanity-quota@ldiskfs+DNE failed with Timeout(8009s)(Client: RCU stall).
Sep 6 06:54:52 oleg310-client kernel: Lustre: DEBUG MARKER: == sanity-quota test complete, duration 4313 sec ========= 06:54:51 (1693997691)
Sep 6 06:54:52 oleg310-client sshd[20583]: Received disconnect from 192.168.203.10 port 42586:11: disconnected by user
Sep 6 06:54:52 oleg310-client sshd[20583]: Disconnected from 192.168.203.10 port 42586
Sep 6 06:54:52 oleg310-client sshd[20583]: pam_unix(sshd:session): session closed for user root
Sep 6 06:54:52 oleg310-client systemd-logind: Removed session 600.
Sep 6 06:55:07 oleg310-client kernel: ------------[ cut here ]------------
Sep 6 06:55:07 oleg310-client kernel: WARNING: CPU: 2 PID: 20679 at lib/debugobjects.c:286 debug_print_object+0x83/0xa0
Sep 6 06:55:07 oleg310-client kernel: ODEBUG: activate active (active state 1) object type: rcu_head hint: (null)
Sep 6 06:55:07 oleg310-client kernel: Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
Sep 6 06:55:07 oleg310-client kernel: CPU: 2 PID: 20679 Comm: umount Kdump: loaded Tainted: G OE ------------ 3.10.0-7.9-debug #1
Sep 6 06:55:07 oleg310-client kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep 6 06:55:07 oleg310-client kernel: Call Trace:
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff817ded29>] dump_stack+0x19/0x1b
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8108d558>] __warn+0xd8/0x100
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8108d5df>] warn_slowpath_fmt+0x5f/0x80
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81414723>] debug_print_object+0x83/0xa0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff814150af>] debug_object_activate+0x1af/0x210
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa017b26f>] xa_destroy+0xdf/0xf0 [libcfs]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa088d4d5>] osc_quota_cleanup+0x15/0x20 [osc]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa086ef1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f48f9>] class_free_dev+0x219/0x730 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f4ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f6c15>] class_unlink_export+0x125/0x160 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa030ca30>] class_decref+0x80/0x160 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa030ce71>] class_detach+0x1c1/0x310 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0314b2b>] class_process_config+0x163b/0x27c0 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0315e90>] class_manual_cleanup+0x1e0/0x770 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0903955>] lov_tgts_putref+0x385/0xad0 [lov]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0908927>] lov_disconnect+0x237/0x280 [lov]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0e8fa96>] obd_disconnect+0x56/0x300 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0e98ebc>] ll_put_super+0x81c/0xf30 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0ec711b>] lustre_kill_super+0x2b/0x30 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81248616>] deactivate_super+0x46/0x60
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff817f4363>] int_signal+0x12/0x17
Sep 6 06:55:07 oleg310-client kernel: ---[ end trace 37e266df8306097f ]---
Sep 6 06:55:07 oleg310-client kernel: ------------[ cut here ]------------
Sep 6 06:55:07 oleg310-client kernel: WARNING: CPU: 2 PID: 20679 at kernel/rcupdate.c:311 rcuhead_fixup_activate+0x5a/0x70
Sep 6 06:55:07 oleg310-client kernel: Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
Sep 6 06:55:07 oleg310-client kernel: CPU: 2 PID: 20679 Comm: umount Kdump: loaded Tainted: G W OE ------------ 3.10.0-7.9-debug #1
Sep 6 06:55:07 oleg310-client kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep 6 06:55:07 oleg310-client kernel: Call Trace:
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff817ded29>] dump_stack+0x19/0x1b
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8108d558>] __warn+0xd8/0x100
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8108d69d>] warn_slowpath_null+0x1d/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff810b6e8a>] rcuhead_fixup_activate+0x5a/0x70
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff814150cf>] debug_object_activate+0x1cf/0x210
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa017b26f>] xa_destroy+0xdf/0xf0 [libcfs]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa088d4d5>] osc_quota_cleanup+0x15/0x20 [osc]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa086ef1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f48f9>] class_free_dev+0x219/0x730 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f4ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa02f6c15>] class_unlink_export+0x125/0x160 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa030ca30>] class_decref+0x80/0x160 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa030ce71>] class_detach+0x1c1/0x310 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0314b2b>] class_process_config+0x163b/0x27c0 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0315e90>] class_manual_cleanup+0x1e0/0x770 [obdclass]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0903955>] lov_tgts_putref+0x385/0xad0 [lov]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0908927>] lov_disconnect+0x237/0x280 [lov]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0e8fa96>] obd_disconnect+0x56/0x300 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0e98ebc>] ll_put_super+0x81c/0xf30 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffffa0ec711b>] lustre_kill_super+0x2b/0x30 [lustre]
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81248616>] deactivate_super+0x46/0x60
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
Sep 6 06:55:07 oleg310-client kernel: [<ffffffff817f4363>] int_signal+0x12/0x17
Sep 6 06:55:07 oleg310-client kernel: ---[ end trace 37e266df83060980 ]---
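
When comparing the traces in this ticket, it can help to reduce each one to its chain of symbols and modules. The following is a minimal sketch, not part of the original report: the helper name `extract_frames` and the regular expression are assumptions made for illustration, and the pattern only targets the `[<address>] symbol+0xoff/0xsize [module]` frame syntax seen in these 3.10 kernel logs.

```python
import re

# One frame looks like "[<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs]".
# An optional "? " before the symbol marks a frame the kernel considers unreliable.
# Note: "IP:" and "RIP:" lines use the same syntax and are reported as frames too.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def extract_frames(log_text, include_unreliable=False):
    """Return (symbol, module) pairs in the order they appear in log_text.

    Frames without a module suffix are attributed to "kernel".
    """
    frames = []
    for match in FRAME_RE.finditer(log_text):
        unreliable, symbol, module = match.groups()
        if unreliable and not include_unreliable:
            continue
        frames.append((symbol, module or "kernel"))
    return frames


# Frames copied from the logs in this ticket, flattened onto one line
# the same way the exported logs are.
sample = (
    "[<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20 "
    "[<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs] "
    "[<ffffffffa017b26f>] xa_destroy+0xdf/0xf0 [libcfs] "
    "[<ffffffffa088d4d5>] osc_quota_cleanup+0x15/0x20 [osc] "
    "[<ffffffff81220310>] ? __kmalloc+0x1e0/0x340"
)

for symbol, module in extract_frames(sample):
    print(f"{module:10s} {symbol}")
```

Running the same helper over two reports and comparing the resulting lists is a quick way to check whether they share the `xa_destroy` / `osc_quota_cleanup` path, independent of the load addresses, which differ between runs.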
|
| Comments |
| Comment by Sergey Cheremencev [ 07/Sep/23 ] |
|
Hi simmonsja, can you take a look? |
| Comment by Gerrit Updater [ 07/Sep/23 ] |
|
"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52304 |
| Comment by James A Simmons [ 07/Sep/23 ] |
|
I'm looking. Can this be reproduced only on RHEL7? If this is a RHEL7-only problem, I don't think it's worth reverting. |
| Comment by Sergey Cheremencev [ 08/Sep/23 ] |
|
No, I haven't seen such messages anywhere besides RHEL7. I saw these warnings just once during testing. |
| Comment by Sergey Cheremencev [ 11/Sep/23 ] |
| Comment by James A Simmons [ 12/Sep/23 ] |
|
I'm working with Oleg to move to RHEL8 for debug kernel testing. This should go away then. |
| Comment by Sergey Cheremencev [ 12/Sep/23 ] |
|
Hi simmonsja
Thanks for the update. Does this mean there is no issue with your patch and the warning is caused by some problem in the RHEL7 debug kernel? |
| Comment by Gerrit Updater [ 15/Sep/23 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52381 |
| Comment by James A Simmons [ 15/Sep/23 ] |
|
I just pushed a patch to validate the xarray work with a normal RHEL7 kernel. |
| Comment by James A Simmons [ 21/Sep/23 ] |
|
Sorry it took a while to run the patch. As you can see, sanity-quota passes with a normal (non-debug) kernel: https://testing.whamcloud.com/test_sets/921b8bf1-b5ba-4664-9d6c-1c4c164e543c |
| Comment by Alexander Boyko [ 26/Sep/23 ] |
|
close to this issue
[ 5407.737299] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5407.739021] IP: [<ffffffffa016df58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5407.740398] PGD 0
[ 5407.740789] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 5407.741671] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5407.750803] CPU: 0 PID: 9909 Comm: umount Kdump: loaded Tainted: G OE ------------ 3.10.0-7.9-debug #1
[ 5407.753185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5407.755316] task: ffff8800b6498000 ti: ffff8800a804c000 task.ti: ffff8800a804c000
[ 5407.756831] RIP: 0010:[<ffffffffa016df58>] [<ffffffffa016df58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5407.758521] RSP: 0018:ffff8800a804f950 EFLAGS: 00010083
[ 5407.759476] RAX: 0000000000002710 RBX: ffff880136ef66c8 RCX: 0000000000000023
[ 5407.760702] RDX: ffff880136ef5930 RSI: ffff880136ef66e0 RDI: 0000000000000046
[ 5407.761989] RBP: ffff8800a804f978 R08: 0000000000000000 R09: 77b0000000000000
[ 5407.763203] R10: 00000000af0c2d01 R11: ffff8800af0c2f80 R12: 000000000000002a
[ 5407.764433] R13: 0000000000000000 R14: ffff880136ef56d0 R15: ffff8800a804f988
[ 5407.765708] FS: 00007f19ba87f880(0000) GS:ffff88013e200000(0000) knlGS:0000000000000000
[ 5407.767225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5407.768257] CR2: 0000000000000000 CR3: 00000000a4d64000 CR4: 00000000000006f0
[ 5407.769490] Call Trace:
[ 5407.769952] [<ffffffffa016f26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5407.771357] [<ffffffffa08814c5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5407.773200] [<ffffffffa0862f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5407.774839] [<ffffffffa02e88f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5407.776139] [<ffffffffa02e8ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5407.777489] [<ffffffffa02eac15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5407.778876] [<ffffffffa030016e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5407.780371] [<ffffffffa0300ad8>] class_decref+0x48/0xf0 [obdclass]
[ 5407.781595] [<ffffffffa0300ee1>] class_detach+0x1c1/0x310 [obdclass]
[ 5407.782914] [<ffffffffa0308b9b>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5407.784281] [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5407.785303] [<ffffffffa0309f00>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5407.786693] [<ffffffffa08f7955>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5407.787835] [<ffffffffa08fc927>] lov_disconnect+0x237/0x280 [lov]
[ 5407.789099] [<ffffffffa0e82b46>] obd_disconnect+0x56/0x300 [lustre] |
| Comment by Sergey Cheremencev [ 25/Oct/23 ] |
|
Hit the same panic as Alex reported above:
[ 5388.400457] ------------[ cut here ]------------
[ 5388.401251] WARNING: CPU: 2 PID: 12374 at lib/debugobjects.c:286 debug_print_object+0x83/0xa0
[ 5388.402684] ODEBUG: activate active (active state 1) object type: rcu_head hint: (null)
[ 5388.404149] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5388.411563] CPU: 2 PID: 12374 Comm: umount Kdump: loaded Tainted: G W OE ------------ 3.10.0-7.9-debug #1
[ 5388.413343] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5388.414823] Call Trace:
[ 5388.415251] [<ffffffff817ded29>] dump_stack+0x19/0x1b
[ 5388.416116] [<ffffffff8108d558>] __warn+0xd8/0x100
[ 5388.416925] [<ffffffff8108d5df>] warn_slowpath_fmt+0x5f/0x80
[ 5388.417850] [<ffffffff81414723>] debug_print_object+0x83/0xa0
[ 5388.418827] [<ffffffff814150af>] debug_object_activate+0x1af/0x210
[ 5388.419897] [<ffffffffa018be60>] ? xas_alloc+0xd0/0xd0 [libcfs]
[ 5388.420931] [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
[ 5388.421801] [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
[ 5388.422722] [<ffffffffa018bf44>] xas_free_nodes+0xa4/0xf0 [libcfs]
[ 5388.423795] [<ffffffffa018d26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5388.424779] [<ffffffffa08a38a5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5388.425804] [<ffffffffa0884f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5388.426902] [<ffffffffa03068f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5388.428025] [<ffffffffa0306ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5388.429205] [<ffffffffa0308c15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5388.430404] [<ffffffffa031e18e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5388.431549] [<ffffffffa031eaf8>] class_decref+0x48/0xf0 [obdclass]
[ 5388.432643] [<ffffffffa031ef01>] class_detach+0x1c1/0x310 [obdclass]
[ 5388.433728] [<ffffffffa0326bbb>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5388.434945] [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5388.435881] [<ffffffffa0327f20>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5388.437115] [<ffffffffa0919ed5>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5388.438171] [<ffffffffa091eea7>] lov_disconnect+0x237/0x280 [lov]
[ 5388.439178] [<ffffffffa0eb9c96>] obd_disconnect+0x56/0x300 [lustre]
[ 5388.440276] [<ffffffffa0ec30d7>] ll_put_super+0x767/0xce0 [lustre]
[ 5388.441341] [<ffffffff8114df3d>] ? call_rcu_sched+0x1d/0x20
[ 5388.442307] [<ffffffffa0ef185c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 5388.443408] [<ffffffff812639b8>] ? destroy_inode+0x38/0x60
[ 5388.444361] [<ffffffff81263aee>] ? evict+0x10e/0x180
[ 5388.445220] [<ffffffff817e8d7e>] ? _raw_spin_unlock+0xe/0x20
[ 5388.446197] [<ffffffff812907a6>] ? fsnotify_unmount_inodes+0x1d6/0x1e0
[ 5388.447300] [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
[ 5388.448335] [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
[ 5388.449275] [<ffffffffa0ef188b>] lustre_kill_super+0x2b/0x30 [lustre]
[ 5388.450347] [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
[ 5388.451397] [<ffffffff81248616>] deactivate_super+0x46/0x60
[ 5388.452334] [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
[ 5388.453213] [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
[ 5388.454129] [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
[ 5388.455070] [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
[ 5388.456025] [<ffffffff817f4363>] int_signal+0x12/0x17
[ 5388.456860] ---[ end trace a193f2979542d3e3 ]---
[ 5388.457628] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5388.458920] IP: [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.460023] PGD 0
[ 5388.460373] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 5388.461195] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5388.468704] CPU: 2 PID: 12374 Comm: umount Kdump: loaded Tainted: G W OE ------------ 3.10.0-7.9-debug #1
[ 5388.470441] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5388.471875] task: ffff88012b485550 ti: ffff8800a9968000 task.ti: ffff8800a9968000
[ 5388.473121] RIP: 0010:[<ffffffffa018bf58>] [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.474643] RSP: 0018:ffff8800a996b980 EFLAGS: 00010083
[ 5388.475541] RAX: 0000000000002710 RBX: ffff8800b0a26910 RCX: 0000000000000005
[ 5388.476753] RDX: ffff8800b0a26b70 RSI: ffff8800b0a26928 RDI: 0000000000000046
[ 5388.477945] RBP: ffff8800a996b9a8 R08: 0000000000000000 R09: 5130000000000000
[ 5388.479144] R10: 6633393161206563 R11: 61727420646e6520 R12: 0000000000000001
[ 5388.480277] R13: 0000000000000000 R14: ffff8800b0aa06d8 R15: ffff8800a996b9b8
[ 5388.481433] FS: 00007fbcfa47f880(0000) GS:ffff88013e300000(0000) knlGS:0000000000000000
[ 5388.482807] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5388.483784] CR2: 0000000000000000 CR3: 00000000b6230000 CR4: 00000000000006e0
[ 5388.484962] Call Trace:
[ 5388.485372] [<ffffffffa018d26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5388.486415] [<ffffffffa08a38a5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5388.487506] [<ffffffffa0884f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5388.488607] [<ffffffffa03068f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5388.489606] [<ffffffffa0306ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5388.490799] [<ffffffffa0308c15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5388.492043] [<ffffffffa031e18e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5388.493112] [<ffffffffa031eaf8>] class_decref+0x48/0xf0 [obdclass]
[ 5388.494192] [<ffffffffa031ef01>] class_detach+0x1c1/0x310 [obdclass]
[ 5388.495278] [<ffffffffa0326bbb>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5388.496501] [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5388.497465] [<ffffffffa0327f20>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5388.498699] [<ffffffffa0919ed5>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5388.499813] [<ffffffffa091eea7>] lov_disconnect+0x237/0x280 [lov]
[ 5388.500829] [<ffffffffa0eb9c96>] obd_disconnect+0x56/0x300 [lustre]
[ 5388.501904] [<ffffffffa0ec30d7>] ll_put_super+0x767/0xce0 [lustre]
[ 5388.503002] [<ffffffff8114df3d>] ? call_rcu_sched+0x1d/0x20
[ 5388.503986] [<ffffffffa0ef185c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 5388.505124] [<ffffffff812639b8>] ? destroy_inode+0x38/0x60
[ 5388.506008] [<ffffffff81263aee>] ? evict+0x10e/0x180
[ 5388.506891] [<ffffffff817e8d7e>] ? _raw_spin_unlock+0xe/0x20
[ 5388.507887] [<ffffffff812907a6>] ? fsnotify_unmount_inodes+0x1d6/0x1e0
[ 5388.509019] [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
[ 5388.510087] [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
[ 5388.511048] [<ffffffffa0ef188b>] lustre_kill_super+0x2b/0x30 [lustre]
[ 5388.512173] [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
[ 5388.513269] [<ffffffff81248616>] deactivate_super+0x46/0x60
[ 5388.514218] [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
[ 5388.515137] [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
[ 5388.516044] [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
[ 5388.516974] [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
[ 5388.517922] [<ffffffff817f4363>] int_signal+0x12/0x17
[ 5388.518743] Code: 8d 7b 18 48 c7 43 10 01 00 00 00 48 c7 c6 60 be 18 a0 e8 dc 1f fc e0 4c 39 f3 75 b7 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 40 00 <41> 0f b6 4d 00 4c 89 eb e9 5c ff ff ff 0f 1f 00 48 81 fa 00 10
[ 5388.523115] RIP [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.524266] RSP <ffff8800a996b980>
[ 5388.524876] CR2: 0000000000000000
simmonsja, could you comment? Do we need a separate ticket for that? Can we expect that https://review.whamcloud.com/c/fs/lustre-release/+/52381 could solve the problem with the panic also? |
| Comment by James A Simmons [ 27/Oct/23 ] |
|
Yes. Patch 52381 will fix the crash. |
| Comment by Sergey Cheremencev [ 01/Nov/23 ] |
[ 3896.802710] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [ptlrpcd_00_00:1855]
[ 3896.804652] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 3896.814091] CPU: 1 PID: 1855 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G OE ------------ 3.10.0-7.9-debug #1
[ 3896.816533] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 3896.818482] task: ffff88012b245550 ti: ffff8800ae224000 task.ti: ffff8800ae224000
[ 3896.820255] RIP: 0010:[<ffffffffa016e00d>] [<ffffffffa016e00d>] xas_free_nodes+0xdd/0xf0 [libcfs]
[ 3896.822272] RSP: 0018:ffff8800ae2279f0 EFLAGS: 00000282
[ 3896.823454] RAX: ffff880136faefe8 RBX: 0000000000000000 RCX: 000000000000000c
[ 3896.824968] RDX: ffff880136faefea RSI: 0000000000000002 RDI: ffff8800ae227a90
[ 3896.826450] RBP: ffff8800ae227a18 R08: 000000000000000e R09: 0000000000000000
[ 3896.827961] R10: 0000000000000000 R11: 0000000008000000 R12: ffff880136faefe8
[ 3896.829424] R13: ffffffffa016dec6 R14: 0000000000000078 R15: ffff8800ae227fd8
[ 3896.830997] FS: 0000000000000000(0000) GS:ffff88013e280000(0000) knlGS:0000000000000000
[ 3896.832791] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3896.834035] CR2: 00007f3a1172c000 CR3: 00000000a8dcc000 CR4: 00000000000006e0
[ 3896.835631] Call Trace:
[ 3896.836160] [<ffffffffa016fa54>] xas_store+0x184/0x540 [libcfs]
[ 3896.837538] [<ffffffffa017027c>] __xa_insert+0xdc/0x150 [libcfs]
[ 3896.838887] [<ffffffffa08897c0>] osc_quota_setdq+0x220/0x510 [osc]
[ 3896.840285] [<ffffffffa0873adc>] osc_brw_fini_request+0xa9c/0x1b10 [osc]
[ 3896.841694] [<ffffffffa0874ba7>] brw_interpret+0x57/0xdb0 [osc]
[ 3896.843082] [<ffffffffa05b3fc8>] ptlrpc_check_set+0x428/0x2170 [ptlrpc]
[ 3896.844618] [<ffffffffa05e4f94>] ptlrpcd+0xa94/0xb70 [ptlrpc]
[ 3896.845884] [<ffffffff810baff0>] ? abort_exclusive_wait+0xa0/0xa0
[ 3896.847259] [<ffffffffa05e4500>] ? ptlrpcd_partners+0x3a0/0x3a0 [ptlrpc]
[ 3896.848744] [<ffffffff810ba114>] kthread+0xe4/0xf0
[ 3896.849779] [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
[ 3896.851231] [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 3896.852681] [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
[ 3896.854083] Code: 5d c3 0f 1f 40 00 41 0f b6 4d 00 4c 89 eb e9 5c ff ff ff 0f 1f 00 48 81 fa 00 10 00 00 0f 86 6b ff ff ff 48 8d 5a fe 0f b6 4a fe <45> 31 e4 e9 3c ff ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[ 3896.859259] Kernel panic - not syncing: softlockup: hung tasks |
| Comment by Gerrit Updater [ 29/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52381/ |
| Comment by Peter Jones [ 29/Nov/23 ] |
|
Landed for 2.16 |