[LU-16772] Protect lqe_glbl_data in qmt_site_recalc_cb with mutex Created: 25/Apr/23 Updated: 29/Nov/23 Resolved: 14/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sergey Cheremencev | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
lqe_glbl_data should be protected with lqe_glbl_data_lock in qmt_site_recalc_cb, as is already done in other places, to avoid a crash like this one:

Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0005_UUID
Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0006_UUID
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
IP: [<ffffffffc10c81d8>] qmt_site_recalc_cb+0x318/0x7e0 [lquota]
Oops: 0000 [#1] SMP
CPU: 1 PID: 26035 Comm: qsd_reint_qpool Kdump: loaded 3.10.0-1160.53.1.el7.x86_64 #1
Call Trace:
[<ffffffffc09ab7ae>] cfs_hash_for_each_tight+0x11e/0x320 [libcfs]
[<ffffffffc09aba20>] cfs_hash_for_each+0x10/0x20 [libcfs]
[<ffffffffc10c9df4>] qmt_pool_recalc+0xa64/0x11f0 [lquota]
[<ffffffffad4c5e61>] kthread+0xd1/0xe0 |
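The locking rule described above can be illustrated with a minimal userspace sketch. It is not the landed patch and does not use the real Lustre structures: `struct quota_entry`, `struct glbl_data`, and the `entry_*` functions are hypothetical stand-ins, and a pthread mutex stands in for the kernel mutex. The assumption is that one thread may free the per-entry global data while a recalculation callback is reading it, so both paths take the same lock and the reader checks for NULL after acquiring it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the data hanging off a quota entry. */
struct glbl_data {
    int nr_used;
};

/* Hypothetical stand-in for a quota entry; not the real lquota layout. */
struct quota_entry {
    pthread_mutex_t   glbl_data_lock; /* plays the role of lqe_glbl_data_lock */
    struct glbl_data *glbl_data;      /* plays the role of lqe_glbl_data */
};

void entry_init(struct quota_entry *e)
{
    pthread_mutex_init(&e->glbl_data_lock, NULL);
    e->glbl_data = NULL;
}

/* Allocate the data and publish it under the lock. Returns 0 or -1. */
int entry_attach(struct quota_entry *e, int nr_used)
{
    struct glbl_data *gd = malloc(sizeof(*gd));

    if (gd == NULL)
        return -1;
    gd->nr_used = nr_used;

    pthread_mutex_lock(&e->glbl_data_lock);
    free(e->glbl_data);               /* drop any previous copy */
    e->glbl_data = gd;
    pthread_mutex_unlock(&e->glbl_data_lock);
    return 0;
}

/* Detach the pointer under the lock, then free it outside the lock. */
void entry_free_glbl_data(struct quota_entry *e)
{
    struct glbl_data *gd;

    pthread_mutex_lock(&e->glbl_data_lock);
    gd = e->glbl_data;
    e->glbl_data = NULL;
    pthread_mutex_unlock(&e->glbl_data_lock);

    free(gd);
}

/*
 * Analogue of a recalculation callback: the pointer is only read and
 * dereferenced while the lock is held, and a NULL pointer is treated
 * as "nothing to recalculate". Returns nr_used, or -1 if no data.
 */
int entry_recalc(struct quota_entry *e)
{
    int ret = -1;

    pthread_mutex_lock(&e->glbl_data_lock);
    if (e->glbl_data != NULL)
        ret = e->glbl_data->nr_used;
    pthread_mutex_unlock(&e->glbl_data_lock);
    return ret;
}

/* Tear down an entry whose data has already been freed. */
void entry_fini(struct quota_entry *e)
{
    assert(e->glbl_data == NULL);
    pthread_mutex_destroy(&e->glbl_data_lock);
}
```

Without the lock in `entry_recalc`, the reader could load the pointer just before `entry_free_glbl_data` clears and frees it, which is the kind of race that produces a NULL or stale pointer dereference.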
| Comments |
| Comment by Gerrit Updater [ 25/Apr/23 ] |
|
"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50748 |
| Comment by Gerrit Updater [ 14/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50748/ |
| Comment by Peter Jones [ 14/Jun/23 ] |
|
Landed for 2.16 |
| Comment by Stephane Thiell [ 25/Jun/23 ] |
|
Hit the following MDS BUG with 2.15.3 when unmounting MDT0, which looks related:

[ 9119.471374] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Timed out tx for 10.0.10.239@o2ib7: 2 seconds
[ 9119.481457] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Skipped 23 previous similar messages
[ 9663.994337] Lustre: Failing over fir-MDT0000
[ 9663.999333] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 9664.007214] IP: [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.013863] PGD 0
[ 9664.015908] Oops: 0000 [#1] SMP
[ 9664.019191] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) sunrpc vfat fat dm_round_robin dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ses ablk_helper enclosure cryptd pcspkr sg i2c_piix4 k10temp svcrdma(OE) ccp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) dm_multipath iw_cm(OE) dm_mod ip_tables ext4 mbcache jbd2 sd_mod
[ 9664.091933] crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) i2c_algo_bit mlx5_core(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw(OE) psample mpt3sas(OE) auxiliary(OE) devlink libahci crct10dif_pclmul libata drm tg3 mlx_compat(OE) crct10dif_common raid_class ptp crc32c_intel megaraid_sas scsi_transport_sas drm_panel_orientation_quirks pps_core
[ 9664.126449] CPU: 20 PID: 12056 Comm: ldlm_bl_02 Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.90.1.el7_lustre.pl1.x86_64 #1
[ 9664.139307] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.20.0 05/03/2023
[ 9664.146961] task: ffff8cc2f8e73180 ti: ffff8cc2fc534000 task.ti: ffff8cc2fc534000
[ 9664.154440] RIP: 0010:[<ffffffffc15f490a>] [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.163514] RSP: 0018:ffff8cc2fc537c20 EFLAGS: 00010246
[ 9664.168825] RAX: ffff8cc2f8e73180 RBX: ffff8c92cdcc0e70 RCX: ffff8cc2fc537fd8
[ 9664.175959] RDX: 0000000000000000 RSI: ffff8c83d606ca80 RDI: 0000000000000000
[ 9664.183093] RBP: ffff8cc2fc537c28 R08: ffff8cc2fc537c80 R09: ffff8cc2fc537b70
[ 9664.190223] R10: 00000000cfaae101 R11: ffff8c92cfaae6f0 R12: ffff8c83d606ca80
[ 9664.197358] R13: ffff8c92cdcc0f48 R14: 0000000000000000 R15: ffff8ca2e53875c0
[ 9664.204490] FS: 00007fdeb8ba4740(0000) GS:ffff8c92fef40000(0000) knlGS:0000000000000000
[ 9664.212576] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9664.218322] CR2: 0000000000000000 CR3: 0000000c75a10000 CR4: 00000000003407e0
[ 9664.225456] Call Trace:
[ 9664.227918] [<ffffffffc15ebb49>] qmt_lvbo_free+0xd9/0x380 [lquota]
[ 9664.234207] [<ffffffffc1832aab>] mdt_lvbo_free+0x12b/0x150 [mdt]
[ 9664.240344] [<ffffffffc110bda2>] ldlm_resource_putref+0x192/0x260 [ptlrpc]
[ 9664.247338] [<ffffffffc10fff0e>] ldlm_lock_put+0x2fe/0x770 [ptlrpc]
[ 9664.253730] [<ffffffffc1129d42>] ldlm_export_lock_put+0x12/0x20 [ptlrpc]
[ 9664.260525] [<ffffffffc04efbd0>] cfs_hash_for_each_relax+0x270/0x450 [libcfs]
[ 9664.267776] [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
[ 9664.276240] [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
[ 9664.284684] [<ffffffffc04f30e0>] cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
[ 9664.291852] [<ffffffffc1108d22>] ldlm_export_cancel_locks+0xc2/0x1a0 [ptlrpc]
[ 9664.299103] [<ffffffffc11356c0>] ldlm_bl_thread_main+0x7d0/0xb20 [ptlrpc]
[ 9664.305977] [<ffffffffa7acc790>] ? wake_up_atomic_t+0x40/0x40
[ 9664.311843] [<ffffffffc1134ef0>] ? ldlm_handle_bl_callback+0x400/0x400 [ptlrpc]
[ 9664.319231] [<ffffffffa7acb621>] kthread+0xd1/0xe0
[ 9664.324111] [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
[ 9664.330206] [<ffffffffa81c51dd>] ret_from_fork_nospec_begin+0x7/0x21
[ 9664.336643] [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
[ 9664.342735] Code: 00 10 00 00 00 48 c7 05 51 9a 02 00 00 00 00 00 e8 8c 65 ef fe e9 12 ff ff ff 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 <48> 83 3f 00 48 89 fb 0f 84 9c 01 00 00 48 63 57 0c 48 8b 3d 3e
[ 9664.363258] RIP [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.369992] RSP <ffff8cc2fc537c20>
[ 9664.373485] CR2: 0000000000000000

I applied the patch above (https://review.whamcloud.com/50748) on top of 2.15.3 and could no longer reproduce the issue, which is a good sign. |
| Comment by Gerrit Updater [ 29/Nov/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53284 |