[LU-16772] Protect lqe_glbl_data in qmt_site_recalc_cb with mutex Created: 25/Apr/23  Updated: 29/Nov/23  Resolved: 14/Jun/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Sergey Cheremencev Assignee: Sergey Cheremencev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-15021 sanity-quota test_55: crash in qmt_se... Resolved
is related to LU-16725 crash at qmt_free_lqe_gd+0xa/0x1f0 Resolved
is related to LU-14434 parallel-scale-nfsv4 test compilebenc... Resolved
is related to LU-16341 unable to handle kernel NULL in qmt_s... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lqe_glbl_data should be protected with lqe_glbl_data_lock in qmt_site_recalc_cb, as is already done in other places, to avoid a crash like the following:

 Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0005_UUID
 Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0006_UUID
 BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
 IP: [<ffffffffc10c81d8>] qmt_site_recalc_cb+0x318/0x7e0 [lquota]
 Oops: 0000 [#1] SMP 
 CPU: 1 PID: 26035 Comm: qsd_reint_qpool Kdump: loaded 3.10.0-1160.53.1.el7.x86_64 #1
 Call Trace:
  [<ffffffffc09ab7ae>] cfs_hash_for_each_tight+0x11e/0x320 [libcfs]
  [<ffffffffc09aba20>] cfs_hash_for_each+0x10/0x20 [libcfs]
  [<ffffffffc10c9df4>] qmt_pool_recalc+0xa64/0x11f0 [lquota]
  [<ffffffffad4c5e61>] kthread+0xd1/0xe0
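
For illustration only, here is a minimal user-space sketch of the locking pattern described above. It is not the Lustre patch: the struct layout, the function names (entry_init, entry_recalc, entry_free_glbl_data) and the use of pthread mutexes are all hypothetical stand-ins, and it assumes the pointer can be released by another thread while the recalc callback runs (the ticket does not spell out which path races). The actual change in https://review.whamcloud.com/50748 may look different.

```c
#include <pthread.h>
#include <stdlib.h>

#define NR_SLOTS 8

/* Hypothetical stand-in for the per-entry global data array. */
struct glbl_data {
	int  nr_used;            /* number of valid slots */
	long slots[NR_SLOTS];
};

/* Hypothetical stand-in for a quota entry (not the real lquota_entry). */
struct quota_entry {
	pthread_mutex_t   glbl_data_lock; /* plays the role of lqe_glbl_data_lock */
	struct glbl_data *glbl_data;      /* plays the role of lqe_glbl_data */
};

/* Allocate the global data and set up the lock. Returns 0 on success. */
int entry_init(struct quota_entry *e)
{
	if (pthread_mutex_init(&e->glbl_data_lock, NULL) != 0)
		return -1;
	e->glbl_data = calloc(1, sizeof(*e->glbl_data));
	if (e->glbl_data == NULL)
		return -1;
	e->glbl_data->nr_used = NR_SLOTS;
	return 0;
}

/* Free path: detach the pointer under the lock, free it outside. */
void entry_free_glbl_data(struct quota_entry *e)
{
	struct glbl_data *gd;

	pthread_mutex_lock(&e->glbl_data_lock);
	gd = e->glbl_data;
	e->glbl_data = NULL;
	pthread_mutex_unlock(&e->glbl_data_lock);
	free(gd);
}

/*
 * Recalc-style callback: the NULL check and every dereference happen
 * under the same lock, so the data cannot vanish in between.
 * Returns 0 if the slots were updated, -1 if the data was already gone.
 */
int entry_recalc(struct quota_entry *e, long value)
{
	int rc = -1;

	pthread_mutex_lock(&e->glbl_data_lock);
	if (e->glbl_data != NULL) {
		for (int i = 0; i < e->glbl_data->nr_used; i++)
			e->glbl_data->slots[i] = value;
		rc = 0;
	}
	pthread_mutex_unlock(&e->glbl_data_lock);
	return rc;
}
```

Without the lock around the check and the loop, a concurrent free could leave the callback dereferencing a NULL or stale pointer, which is the kind of failure the oops above suggests.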


 Comments   
Comment by Gerrit Updater [ 25/Apr/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50748
Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 16685410130de3a9856fd9fd2891f14afa228e94

Comment by Gerrit Updater [ 14/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50748/
Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 50ff4d1da63e8bc1dba4b6b52219fb7024f8d66f

Comment by Peter Jones [ 14/Jun/23 ]

Landed for 2.16

Comment by Stephane Thiell [ 25/Jun/23 ]

Hit the following MDS BUG with 2.15.3 when unmounting MDT0; it looks like LU-16725 (which is marked as a duplicate of this LU):

[ 9119.471374] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Timed out tx for 10.0.10.239@o2ib7: 2 seconds
[ 9119.481457] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Skipped 23 previous similar messages
[ 9663.994337] Lustre: Failing over fir-MDT0000
[ 9663.999333] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 9664.007214] IP: [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.013863] PGD 0 
[ 9664.015908] Oops: 0000 [#1] SMP 
[ 9664.019191] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) sunrpc vfat fat dm_round_robin dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ses ablk_helper enclosure cryptd pcspkr sg i2c_piix4 k10temp svcrdma(OE) ccp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) dm_multipath iw_cm(OE) dm_mod ip_tables ext4 mbcache jbd2 sd_mod
[ 9664.091933]  crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) i2c_algo_bit mlx5_core(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw(OE) psample mpt3sas(OE) auxiliary(OE) devlink libahci crct10dif_pclmul libata drm tg3 mlx_compat(OE) crct10dif_common raid_class ptp crc32c_intel megaraid_sas scsi_transport_sas drm_panel_orientation_quirks pps_core
[ 9664.126449] CPU: 20 PID: 12056 Comm: ldlm_bl_02 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.90.1.el7_lustre.pl1.x86_64 #1
[ 9664.139307] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.20.0 05/03/2023
[ 9664.146961] task: ffff8cc2f8e73180 ti: ffff8cc2fc534000 task.ti: ffff8cc2fc534000
[ 9664.154440] RIP: 0010:[<ffffffffc15f490a>]  [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.163514] RSP: 0018:ffff8cc2fc537c20  EFLAGS: 00010246
[ 9664.168825] RAX: ffff8cc2f8e73180 RBX: ffff8c92cdcc0e70 RCX: ffff8cc2fc537fd8
[ 9664.175959] RDX: 0000000000000000 RSI: ffff8c83d606ca80 RDI: 0000000000000000
[ 9664.183093] RBP: ffff8cc2fc537c28 R08: ffff8cc2fc537c80 R09: ffff8cc2fc537b70
[ 9664.190223] R10: 00000000cfaae101 R11: ffff8c92cfaae6f0 R12: ffff8c83d606ca80
[ 9664.197358] R13: ffff8c92cdcc0f48 R14: 0000000000000000 R15: ffff8ca2e53875c0
[ 9664.204490] FS:  00007fdeb8ba4740(0000) GS:ffff8c92fef40000(0000) knlGS:0000000000000000
[ 9664.212576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9664.218322] CR2: 0000000000000000 CR3: 0000000c75a10000 CR4: 00000000003407e0
[ 9664.225456] Call Trace:
[ 9664.227918]  [<ffffffffc15ebb49>] qmt_lvbo_free+0xd9/0x380 [lquota]
[ 9664.234207]  [<ffffffffc1832aab>] mdt_lvbo_free+0x12b/0x150 [mdt]
[ 9664.240344]  [<ffffffffc110bda2>] ldlm_resource_putref+0x192/0x260 [ptlrpc]
[ 9664.247338]  [<ffffffffc10fff0e>] ldlm_lock_put+0x2fe/0x770 [ptlrpc]
[ 9664.253730]  [<ffffffffc1129d42>] ldlm_export_lock_put+0x12/0x20 [ptlrpc]
[ 9664.260525]  [<ffffffffc04efbd0>] cfs_hash_for_each_relax+0x270/0x450 [libcfs]
[ 9664.267776]  [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
[ 9664.276240]  [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
[ 9664.284684]  [<ffffffffc04f30e0>] cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
[ 9664.291852]  [<ffffffffc1108d22>] ldlm_export_cancel_locks+0xc2/0x1a0 [ptlrpc]
[ 9664.299103]  [<ffffffffc11356c0>] ldlm_bl_thread_main+0x7d0/0xb20 [ptlrpc]
[ 9664.305977]  [<ffffffffa7acc790>] ? wake_up_atomic_t+0x40/0x40
[ 9664.311843]  [<ffffffffc1134ef0>] ? ldlm_handle_bl_callback+0x400/0x400 [ptlrpc]
[ 9664.319231]  [<ffffffffa7acb621>] kthread+0xd1/0xe0
[ 9664.324111]  [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
[ 9664.330206]  [<ffffffffa81c51dd>] ret_from_fork_nospec_begin+0x7/0x21
[ 9664.336643]  [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
[ 9664.342735] Code: 00 10 00 00 00 48 c7 05 51 9a 02 00 00 00 00 00 e8 8c 65 ef fe e9 12 ff ff ff 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 <48> 83 3f 00 48 89 fb 0f 84 9c 01 00 00 48 63 57 0c 48 8b 3d 3e 
[ 9664.363258] RIP  [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
[ 9664.369992]  RSP <ffff8cc2fc537c20>
[ 9664.373485] CR2: 0000000000000000

I applied the patch above (https://review.whamcloud.com/50748) on top of 2.15.3 and could no longer reproduce the issue, which is a good sign.

Comment by Gerrit Updater [ 29/Nov/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53284
Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 55e7f2a569d33db0f2aea02571e3abadccf6fc11
