[LU-17097] RCU stall caused by osc_quota_cleanup Created: 07/Sep/23  Updated: 29/Nov/23  Resolved: 29/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sergey Cheremencev Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11063 RHEL7.[345] RCU breakage Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-quota@ldiskfs+DNE failed with Timeout(8009s)(Client: RCU stall).
https://testing.whamcloud.com/gerrit-janitor/34572/testresults/sanity-quota-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Sep  6 06:54:52 oleg310-client kernel: Lustre: DEBUG MARKER: == sanity-quota test complete, duration 4313 sec ========= 06:54:51 (1693997691)
Sep  6 06:54:52 oleg310-client sshd[20583]: Received disconnect from 192.168.203.10 port 42586:11: disconnected by user
Sep  6 06:54:52 oleg310-client sshd[20583]: Disconnected from 192.168.203.10 port 42586
Sep  6 06:54:52 oleg310-client sshd[20583]: pam_unix(sshd:session): session closed for user root
Sep  6 06:54:52 oleg310-client systemd-logind: Removed session 600.
Sep  6 06:55:07 oleg310-client kernel: ------------[ cut here ]------------
Sep  6 06:55:07 oleg310-client kernel: WARNING: CPU: 2 PID: 20679 at lib/debugobjects.c:286 debug_print_object+0x83/0xa0
Sep  6 06:55:07 oleg310-client kernel: ODEBUG: activate active (active state 1) object type: rcu_head hint:           (null)
Sep  6 06:55:07 oleg310-client kernel: Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
Sep  6 06:55:07 oleg310-client kernel: CPU: 2 PID: 20679 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
Sep  6 06:55:07 oleg310-client kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep  6 06:55:07 oleg310-client kernel: Call Trace:
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff817ded29>] dump_stack+0x19/0x1b
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8108d558>] __warn+0xd8/0x100
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8108d5df>] warn_slowpath_fmt+0x5f/0x80
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81414723>] debug_print_object+0x83/0xa0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff814150af>] debug_object_activate+0x1af/0x210
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa017b26f>] xa_destroy+0xdf/0xf0 [libcfs]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa088d4d5>] osc_quota_cleanup+0x15/0x20 [osc]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa086ef1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f48f9>] class_free_dev+0x219/0x730 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f4ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f6c15>] class_unlink_export+0x125/0x160 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa030ca30>] class_decref+0x80/0x160 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa030ce71>] class_detach+0x1c1/0x310 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0314b2b>] class_process_config+0x163b/0x27c0 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0315e90>] class_manual_cleanup+0x1e0/0x770 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0903955>] lov_tgts_putref+0x385/0xad0 [lov]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0908927>] lov_disconnect+0x237/0x280 [lov]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0e8fa96>] obd_disconnect+0x56/0x300 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0e98ebc>] ll_put_super+0x81c/0xf30 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0ec711b>] lustre_kill_super+0x2b/0x30 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81248616>] deactivate_super+0x46/0x60
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff817f4363>] int_signal+0x12/0x17
Sep  6 06:55:07 oleg310-client kernel: ---[ end trace 37e266df8306097f ]---
Sep  6 06:55:07 oleg310-client kernel: ------------[ cut here ]------------
Sep  6 06:55:07 oleg310-client kernel: WARNING: CPU: 2 PID: 20679 at kernel/rcupdate.c:311 rcuhead_fixup_activate+0x5a/0x70
Sep  6 06:55:07 oleg310-client kernel: Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
Sep  6 06:55:07 oleg310-client kernel: CPU: 2 PID: 20679 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-7.9-debug #1
Sep  6 06:55:07 oleg310-client kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep  6 06:55:07 oleg310-client kernel: Call Trace:
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff817ded29>] dump_stack+0x19/0x1b
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8108d558>] __warn+0xd8/0x100
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8108d69d>] warn_slowpath_null+0x1d/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff810b6e8a>] rcuhead_fixup_activate+0x5a/0x70
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff814150cf>] debug_object_activate+0x1cf/0x210
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0179f44>] xas_free_nodes+0xa4/0xf0 [libcfs]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa017b26f>] xa_destroy+0xdf/0xf0 [libcfs]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa088d4d5>] osc_quota_cleanup+0x15/0x20 [osc]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa086ef1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f48f9>] class_free_dev+0x219/0x730 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f4ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa02f6c15>] class_unlink_export+0x125/0x160 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa030ca30>] class_decref+0x80/0x160 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa030ce71>] class_detach+0x1c1/0x310 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0314b2b>] class_process_config+0x163b/0x27c0 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0315e90>] class_manual_cleanup+0x1e0/0x770 [obdclass]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0903955>] lov_tgts_putref+0x385/0xad0 [lov]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0908927>] lov_disconnect+0x237/0x280 [lov]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0e8fa96>] obd_disconnect+0x56/0x300 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0e98ebc>] ll_put_super+0x81c/0xf30 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffffa0ec711b>] lustre_kill_super+0x2b/0x30 [lustre]
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81248616>] deactivate_super+0x46/0x60
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
Sep  6 06:55:07 oleg310-client kernel: [<ffffffff817f4363>] int_signal+0x12/0x17
Sep  6 06:55:07 oleg310-client kernel: ---[ end trace 37e266df83060980 ]---

 Comments   
Comment by Sergey Cheremencev [ 07/Sep/23 ]

Hi simmonsja,

Can you take a look?
I'll send a revert tagged with this ticket's number to check that https://review.whamcloud.com/c/fs/lustre-release/+/52094 passes initial testing.

Comment by Gerrit Updater [ 07/Sep/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52304
Subject: LU-17097 quota: Revert "LU-8130 osc: convert osc_quota hash to xarray"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb8306699f0f63ce5ee27c42d5ef08e61ee8f2ac

Comment by James A Simmons [ 07/Sep/23 ]

I'm looking. Can this be reproduced only on RHEL7? If this is a RHEL7-only problem, I don't think it's worth reverting.

Comment by Sergey Cheremencev [ 08/Sep/23 ]

No, I haven't seen such messages anywhere besides RHEL7. I saw these warnings just once while testing LU-17034. I'll send https://review.whamcloud.com/c/fs/lustre-release/+/52094 without the revert of "LU-8130 osc: convert osc_quota hash to xarray" to see whether it is reproducible.

Comment by Sergey Cheremencev [ 11/Sep/23 ]

Hit it again: https://testing.whamcloud.com/gerrit-janitor/35272/testresults/sanity-quota-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Comment by James A Simmons [ 12/Sep/23 ]

I'm working with Oleg to move to RHEL8 for debug kernel testing. This should go away then.

Comment by Sergey Cheremencev [ 12/Sep/23 ]

Hi simmonsja,

"I'm working with Oleg to move to RHEL8 for debug kernel testing. This should go away then."

Thanks for the update. Does that mean there is no issue with your patch and the warning is caused by a problem in the RHEL7 debug kernel?

Comment by Gerrit Updater [ 15/Sep/23 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52381
Subject: LU-17097 tests: validate xarray on RHEL7 non debug kernels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f615ad49ca9d4673acdbdafdbea595b911c6269e

Comment by James A Simmons [ 15/Sep/23 ]

I just pushed a patch to validate the xarray work with a normal (non-debug) RHEL7 kernel.

Comment by James A Simmons [ 21/Sep/23 ]

Sorry it took a while to run the patch. As you can see, sanity-quota passes with a normal, non-debug kernel.

https://testing.whamcloud.com/test_sets/921b8bf1-b5ba-4664-9d6c-1c4c164e543c

Comment by Alexander Boyko [ 26/Sep/23 ]

This looks close to this issue:

[ 5407.737299] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 5407.739021] IP: [<ffffffffa016df58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5407.740398] PGD 0 
[ 5407.740789] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 5407.741671] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5407.750803] CPU: 0 PID: 9909 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
[ 5407.753185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5407.755316] task: ffff8800b6498000 ti: ffff8800a804c000 task.ti: ffff8800a804c000
[ 5407.756831] RIP: 0010:[<ffffffffa016df58>]  [<ffffffffa016df58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5407.758521] RSP: 0018:ffff8800a804f950  EFLAGS: 00010083
[ 5407.759476] RAX: 0000000000002710 RBX: ffff880136ef66c8 RCX: 0000000000000023
[ 5407.760702] RDX: ffff880136ef5930 RSI: ffff880136ef66e0 RDI: 0000000000000046
[ 5407.761989] RBP: ffff8800a804f978 R08: 0000000000000000 R09: 77b0000000000000
[ 5407.763203] R10: 00000000af0c2d01 R11: ffff8800af0c2f80 R12: 000000000000002a
[ 5407.764433] R13: 0000000000000000 R14: ffff880136ef56d0 R15: ffff8800a804f988
[ 5407.765708] FS:  00007f19ba87f880(0000) GS:ffff88013e200000(0000) knlGS:0000000000000000
[ 5407.767225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5407.768257] CR2: 0000000000000000 CR3: 00000000a4d64000 CR4: 00000000000006f0
[ 5407.769490] Call Trace:
[ 5407.769952]  [<ffffffffa016f26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5407.771357]  [<ffffffffa08814c5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5407.773200]  [<ffffffffa0862f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5407.774839]  [<ffffffffa02e88f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5407.776139]  [<ffffffffa02e8ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5407.777489]  [<ffffffffa02eac15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5407.778876]  [<ffffffffa030016e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5407.780371]  [<ffffffffa0300ad8>] class_decref+0x48/0xf0 [obdclass]
[ 5407.781595]  [<ffffffffa0300ee1>] class_detach+0x1c1/0x310 [obdclass]
[ 5407.782914]  [<ffffffffa0308b9b>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5407.784281]  [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5407.785303]  [<ffffffffa0309f00>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5407.786693]  [<ffffffffa08f7955>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5407.787835]  [<ffffffffa08fc927>] lov_disconnect+0x237/0x280 [lov]
[ 5407.789099]  [<ffffffffa0e82b46>] obd_disconnect+0x56/0x300 [lustre]

https://testing.whamcloud.com/gerrit-janitor/35992/testresults/sanity-quota-zfs-centos7_x86_64-centos7_x86_64/

Comment by Sergey Cheremencev [ 25/Oct/23 ]

Hit the same panic as Alex reported above:

[ 5388.400457] ------------[ cut here ]------------
[ 5388.401251] WARNING: CPU: 2 PID: 12374 at lib/debugobjects.c:286 debug_print_object+0x83/0xa0
[ 5388.402684] ODEBUG: activate active (active state 1) object type: rcu_head hint:           (null)
[ 5388.404149] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5388.411563] CPU: 2 PID: 12374 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-7.9-debug #1
[ 5388.413343] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5388.414823] Call Trace:
[ 5388.415251]  [<ffffffff817ded29>] dump_stack+0x19/0x1b
[ 5388.416116]  [<ffffffff8108d558>] __warn+0xd8/0x100
[ 5388.416925]  [<ffffffff8108d5df>] warn_slowpath_fmt+0x5f/0x80
[ 5388.417850]  [<ffffffff81414723>] debug_print_object+0x83/0xa0
[ 5388.418827]  [<ffffffff814150af>] debug_object_activate+0x1af/0x210
[ 5388.419897]  [<ffffffffa018be60>] ? xas_alloc+0xd0/0xd0 [libcfs]
[ 5388.420931]  [<ffffffff8114dc8f>] __call_rcu+0x3f/0x2d0
[ 5388.421801]  [<ffffffff8114df3d>] call_rcu_sched+0x1d/0x20
[ 5388.422722]  [<ffffffffa018bf44>] xas_free_nodes+0xa4/0xf0 [libcfs]
[ 5388.423795]  [<ffffffffa018d26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5388.424779]  [<ffffffffa08a38a5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5388.425804]  [<ffffffffa0884f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5388.426902]  [<ffffffffa03068f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5388.428025]  [<ffffffffa0306ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5388.429205]  [<ffffffffa0308c15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5388.430404]  [<ffffffffa031e18e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5388.431549]  [<ffffffffa031eaf8>] class_decref+0x48/0xf0 [obdclass]
[ 5388.432643]  [<ffffffffa031ef01>] class_detach+0x1c1/0x310 [obdclass]
[ 5388.433728]  [<ffffffffa0326bbb>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5388.434945]  [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5388.435881]  [<ffffffffa0327f20>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5388.437115]  [<ffffffffa0919ed5>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5388.438171]  [<ffffffffa091eea7>] lov_disconnect+0x237/0x280 [lov]
[ 5388.439178]  [<ffffffffa0eb9c96>] obd_disconnect+0x56/0x300 [lustre]
[ 5388.440276]  [<ffffffffa0ec30d7>] ll_put_super+0x767/0xce0 [lustre]
[ 5388.441341]  [<ffffffff8114df3d>] ? call_rcu_sched+0x1d/0x20
[ 5388.442307]  [<ffffffffa0ef185c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 5388.443408]  [<ffffffff812639b8>] ? destroy_inode+0x38/0x60
[ 5388.444361]  [<ffffffff81263aee>] ? evict+0x10e/0x180
[ 5388.445220]  [<ffffffff817e8d7e>] ? _raw_spin_unlock+0xe/0x20
[ 5388.446197]  [<ffffffff812907a6>] ? fsnotify_unmount_inodes+0x1d6/0x1e0
[ 5388.447300]  [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
[ 5388.448335]  [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
[ 5388.449275]  [<ffffffffa0ef188b>] lustre_kill_super+0x2b/0x30 [lustre]
[ 5388.450347]  [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
[ 5388.451397]  [<ffffffff81248616>] deactivate_super+0x46/0x60
[ 5388.452334]  [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
[ 5388.453213]  [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
[ 5388.454129]  [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
[ 5388.455070]  [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
[ 5388.456025]  [<ffffffff817f4363>] int_signal+0x12/0x17
[ 5388.456860] ---[ end trace a193f2979542d3e3 ]---
[ 5388.457628] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 5388.458920] IP: [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.460023] PGD 0 
[ 5388.460373] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 5388.461195] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 5388.468704] CPU: 2 PID: 12374 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-7.9-debug #1
[ 5388.470441] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 5388.471875] task: ffff88012b485550 ti: ffff8800a9968000 task.ti: ffff8800a9968000
[ 5388.473121] RIP: 0010:[<ffffffffa018bf58>]  [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.474643] RSP: 0018:ffff8800a996b980  EFLAGS: 00010083
[ 5388.475541] RAX: 0000000000002710 RBX: ffff8800b0a26910 RCX: 0000000000000005
[ 5388.476753] RDX: ffff8800b0a26b70 RSI: ffff8800b0a26928 RDI: 0000000000000046
[ 5388.477945] RBP: ffff8800a996b9a8 R08: 0000000000000000 R09: 5130000000000000
[ 5388.479144] R10: 6633393161206563 R11: 61727420646e6520 R12: 0000000000000001
[ 5388.480277] R13: 0000000000000000 R14: ffff8800b0aa06d8 R15: ffff8800a996b9b8
[ 5388.481433] FS:  00007fbcfa47f880(0000) GS:ffff88013e300000(0000) knlGS:0000000000000000
[ 5388.482807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5388.483784] CR2: 0000000000000000 CR3: 00000000b6230000 CR4: 00000000000006e0
[ 5388.484962] Call Trace:
[ 5388.485372]  [<ffffffffa018d26f>] xa_destroy+0xdf/0xf0 [libcfs]
[ 5388.486415]  [<ffffffffa08a38a5>] osc_quota_cleanup+0x15/0x20 [osc]
[ 5388.487506]  [<ffffffffa0884f1f>] osc_cleanup_common+0xbf/0x1b0 [osc]
[ 5388.488607]  [<ffffffffa03068f9>] class_free_dev+0x219/0x730 [obdclass]
[ 5388.489606]  [<ffffffffa0306ff0>] class_export_put+0x1e0/0x2e0 [obdclass]
[ 5388.490799]  [<ffffffffa0308c15>] class_unlink_export+0x125/0x160 [obdclass]
[ 5388.492043]  [<ffffffffa031e18e>] class_decref_free+0x4e/0x90 [obdclass]
[ 5388.493112]  [<ffffffffa031eaf8>] class_decref+0x48/0xf0 [obdclass]
[ 5388.494192]  [<ffffffffa031ef01>] class_detach+0x1c1/0x310 [obdclass]
[ 5388.495278]  [<ffffffffa0326bbb>] class_process_config+0x163b/0x27c0 [obdclass]
[ 5388.496501]  [<ffffffff81220310>] ? __kmalloc+0x1e0/0x340
[ 5388.497465]  [<ffffffffa0327f20>] class_manual_cleanup+0x1e0/0x770 [obdclass]
[ 5388.498699]  [<ffffffffa0919ed5>] lov_tgts_putref+0x385/0xad0 [lov]
[ 5388.499813]  [<ffffffffa091eea7>] lov_disconnect+0x237/0x280 [lov]
[ 5388.500829]  [<ffffffffa0eb9c96>] obd_disconnect+0x56/0x300 [lustre]
[ 5388.501904]  [<ffffffffa0ec30d7>] ll_put_super+0x767/0xce0 [lustre]
[ 5388.503002]  [<ffffffff8114df3d>] ? call_rcu_sched+0x1d/0x20
[ 5388.503986]  [<ffffffffa0ef185c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[ 5388.505124]  [<ffffffff812639b8>] ? destroy_inode+0x38/0x60
[ 5388.506008]  [<ffffffff81263aee>] ? evict+0x10e/0x180
[ 5388.506891]  [<ffffffff817e8d7e>] ? _raw_spin_unlock+0xe/0x20
[ 5388.507887]  [<ffffffff812907a6>] ? fsnotify_unmount_inodes+0x1d6/0x1e0
[ 5388.509019]  [<ffffffff812476ca>] generic_shutdown_super+0x6a/0xf0
[ 5388.510087]  [<ffffffff81247ac2>] kill_anon_super+0x12/0x20
[ 5388.511048]  [<ffffffffa0ef188b>] lustre_kill_super+0x2b/0x30 [lustre]
[ 5388.512173]  [<ffffffff81247ec9>] deactivate_locked_super+0x49/0x60
[ 5388.513269]  [<ffffffff81248616>] deactivate_super+0x46/0x60
[ 5388.514218]  [<ffffffff81268b1f>] cleanup_mnt+0x3f/0x80
[ 5388.515137]  [<ffffffff81268bb2>] __cleanup_mnt+0x12/0x20
[ 5388.516044]  [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
[ 5388.516974]  [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
[ 5388.517922]  [<ffffffff817f4363>] int_signal+0x12/0x17
[ 5388.518743] Code: 8d 7b 18 48 c7 43 10 01 00 00 00 48 c7 c6 60 be 18 a0 e8 dc 1f fc e0 4c 39 f3 75 b7 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 40 00 <41> 0f b6 4d 00 4c 89 eb e9 5c ff ff ff 0f 1f 00 48 81 fa 00 10 
[ 5388.523115] RIP  [<ffffffffa018bf58>] xas_free_nodes+0xb8/0xf0 [libcfs]
[ 5388.524266]  RSP <ffff8800a996b980>
[ 5388.524876] CR2: 0000000000000000 

simmonsja, could you comment? Do we need a separate ticket for that? Can we expect https://review.whamcloud.com/c/fs/lustre-release/+/52381 to solve the panic as well?

https://testing.whamcloud.com/gerrit-janitor/37062/testresults/sanity-quota-zfs-centos7_x86_64-centos7_x86_64/

Comment by James A Simmons [ 27/Oct/23 ]

Yes. Patch 52381 will fix the crash.

Comment by Sergey Cheremencev [ 01/Nov/23 ]

[ 3896.802710] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [ptlrpcd_00_00:1855]
[ 3896.804652] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs i2c_piix4 i2c_core pcspkr binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata
[ 3896.814091] CPU: 1 PID: 1855 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
[ 3896.816533] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 3896.818482] task: ffff88012b245550 ti: ffff8800ae224000 task.ti: ffff8800ae224000
[ 3896.820255] RIP: 0010:[<ffffffffa016e00d>]  [<ffffffffa016e00d>] xas_free_nodes+0xdd/0xf0 [libcfs]
[ 3896.822272] RSP: 0018:ffff8800ae2279f0  EFLAGS: 00000282
[ 3896.823454] RAX: ffff880136faefe8 RBX: 0000000000000000 RCX: 000000000000000c
[ 3896.824968] RDX: ffff880136faefea RSI: 0000000000000002 RDI: ffff8800ae227a90
[ 3896.826450] RBP: ffff8800ae227a18 R08: 000000000000000e R09: 0000000000000000
[ 3896.827961] R10: 0000000000000000 R11: 0000000008000000 R12: ffff880136faefe8
[ 3896.829424] R13: ffffffffa016dec6 R14: 0000000000000078 R15: ffff8800ae227fd8
[ 3896.830997] FS:  0000000000000000(0000) GS:ffff88013e280000(0000) knlGS:0000000000000000
[ 3896.832791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3896.834035] CR2: 00007f3a1172c000 CR3: 00000000a8dcc000 CR4: 00000000000006e0
[ 3896.835631] Call Trace:
[ 3896.836160]  [<ffffffffa016fa54>] xas_store+0x184/0x540 [libcfs]
[ 3896.837538]  [<ffffffffa017027c>] __xa_insert+0xdc/0x150 [libcfs]
[ 3896.838887]  [<ffffffffa08897c0>] osc_quota_setdq+0x220/0x510 [osc]
[ 3896.840285]  [<ffffffffa0873adc>] osc_brw_fini_request+0xa9c/0x1b10 [osc]
[ 3896.841694]  [<ffffffffa0874ba7>] brw_interpret+0x57/0xdb0 [osc]
[ 3896.843082]  [<ffffffffa05b3fc8>] ptlrpc_check_set+0x428/0x2170 [ptlrpc]
[ 3896.844618]  [<ffffffffa05e4f94>] ptlrpcd+0xa94/0xb70 [ptlrpc]
[ 3896.845884]  [<ffffffff810baff0>] ? abort_exclusive_wait+0xa0/0xa0
[ 3896.847259]  [<ffffffffa05e4500>] ? ptlrpcd_partners+0x3a0/0x3a0 [ptlrpc]
[ 3896.848744]  [<ffffffff810ba114>] kthread+0xe4/0xf0
[ 3896.849779]  [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
[ 3896.851231]  [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 3896.852681]  [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
[ 3896.854083] Code: 5d c3 0f 1f 40 00 41 0f b6 4d 00 4c 89 eb e9 5c ff ff ff 0f 1f 00 48 81 fa 00 10 00 00 0f 86 6b ff ff ff 48 8d 5a fe 0f b6 4a fe <45> 31 e4 e9 3c ff ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
[ 3896.859259] Kernel panic - not syncing: softlockup: hung tasks 

https://testing.whamcloud.com/gerrit-janitor/37274/testresults/sanity-quota-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Comment by Gerrit Updater [ 29/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52381/
Subject: LU-17097 osc: delete items in Xarray before its destroy
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a66daa9c1bf40695e10a283dff40a119dfd060bb

Comment by Peter Jones [ 29/Nov/23 ]

Landed for 2.16
