Protect lqe_glbl_data in qmt_site_recalc_cb with mutex

    Description

      lqe_glbl_data should be protected with lqe_glbl_data_lock in qmt_site_recalc_cb, as it is in other places, to avoid crashes such as:

       Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0005_UUID
       Lustre: DEBUG MARKER: lctl pool_remove lustre.qpool1 lustre-OST0006_UUID
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
       IP: [<ffffffffc10c81d8>] qmt_site_recalc_cb+0x318/0x7e0 [lquota]
       Oops: 0000 [#1] SMP 
       CPU: 1 PID: 26035 Comm: qsd_reint_qpool Kdump: loaded 3.10.0-1160.53.1.el7.x86_64 #1
       Call Trace:
        [<ffffffffc09ab7ae>] cfs_hash_for_each_tight+0x11e/0x320 [libcfs]
        [<ffffffffc09aba20>] cfs_hash_for_each+0x10/0x20 [libcfs]
        [<ffffffffc10c9df4>] qmt_pool_recalc+0xa64/0x11f0 [lquota]
        [<ffffffffad4c5e61>] kthread+0xd1/0xe0
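
The locking pattern described above can be illustrated with a minimal user-space sketch. This is not the Lustre patch: the struct and function names are hypothetical stand-ins, pthread mutexes replace the kernel mutex API, and it assumes the crash comes from one thread reading the global data while another frees it and resets the pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for lqe_glbl_data; not the real Lustre layout. */
struct glbl_data {
	int num_used;
};

/* Hypothetical stand-in for the quota entry that owns the data. */
struct quota_entry {
	pthread_mutex_t   gd_lock; /* plays the role of lqe_glbl_data_lock */
	struct glbl_data *gd;      /* may be freed and reset to NULL concurrently */
};

void entry_init(struct quota_entry *e, int used)
{
	int rc = pthread_mutex_init(&e->gd_lock, NULL);

	assert(rc == 0);
	e->gd = malloc(sizeof(*e->gd));
	assert(e->gd != NULL);
	e->gd->num_used = used;
}

/* Teardown path: detach the pointer under the lock, then free it. */
void entry_free_gd(struct quota_entry *e)
{
	struct glbl_data *old;

	pthread_mutex_lock(&e->gd_lock);
	old = e->gd;
	e->gd = NULL;
	pthread_mutex_unlock(&e->gd_lock);
	free(old); /* free(NULL) is a no-op, so a second call is harmless */
}

/*
 * Reader path (the role a recalc callback would play): take the same lock
 * and check for NULL before dereferencing. Returns -1 if the data is gone.
 */
int entry_read_used(struct quota_entry *e)
{
	int used = -1;

	pthread_mutex_lock(&e->gd_lock);
	if (e->gd != NULL)
		used = e->gd->num_used;
	pthread_mutex_unlock(&e->gd_lock);
	return used;
}
```

Because the reader and the teardown path take the same lock, the reader sees either valid data or NULL, never a pointer that is freed mid-access.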
      

          Activity

            [LU-16772] Protect lqe_glbl_data in qmt_site_recalc_cb with mutex

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53284/
            Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 197e9aa693ed6eaef7b358a8f4afca4e50e80bd2

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53284
            Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 55e7f2a569d33db0f2aea02571e3abadccf6fc11

            sthiell Stephane Thiell added a comment - - edited

            Hit the following MDS BUG with 2.15.3 when unmounting MDT0; it looks like LU-16725 (which is marked as a duplicate of this LU):

            [ 9119.471374] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Timed out tx for 10.0.10.239@o2ib7: 2 seconds
            [ 9119.481457] LNet: 4131:0:(o2iblnd_cb.c:3418:kiblnd_check_conns()) Skipped 23 previous similar messages
            [ 9663.994337] Lustre: Failing over fir-MDT0000
            [ 9663.999333] BUG: unable to handle kernel NULL pointer dereference at           (null)
            [ 9664.007214] IP: [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
            [ 9664.013863] PGD 0 
            [ 9664.015908] Oops: 0000 [#1] SMP 
            [ 9664.019191] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache libcfs(OE) sunrpc vfat fat dm_round_robin dcdbas amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ses ablk_helper enclosure cryptd pcspkr sg i2c_piix4 k10temp svcrdma(OE) ccp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) dm_multipath iw_cm(OE) dm_mod ip_tables ext4 mbcache jbd2 sd_mod
            [ 9664.091933]  crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) i2c_algo_bit mlx5_core(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci mlxfw(OE) psample mpt3sas(OE) auxiliary(OE) devlink libahci crct10dif_pclmul libata drm tg3 mlx_compat(OE) crct10dif_common raid_class ptp crc32c_intel megaraid_sas scsi_transport_sas drm_panel_orientation_quirks pps_core
            [ 9664.126449] CPU: 20 PID: 12056 Comm: ldlm_bl_02 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.90.1.el7_lustre.pl1.x86_64 #1
            [ 9664.139307] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.20.0 05/03/2023
            [ 9664.146961] task: ffff8cc2f8e73180 ti: ffff8cc2fc534000 task.ti: ffff8cc2fc534000
            [ 9664.154440] RIP: 0010:[<ffffffffc15f490a>]  [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
            [ 9664.163514] RSP: 0018:ffff8cc2fc537c20  EFLAGS: 00010246
            [ 9664.168825] RAX: ffff8cc2f8e73180 RBX: ffff8c92cdcc0e70 RCX: ffff8cc2fc537fd8
            [ 9664.175959] RDX: 0000000000000000 RSI: ffff8c83d606ca80 RDI: 0000000000000000
            [ 9664.183093] RBP: ffff8cc2fc537c28 R08: ffff8cc2fc537c80 R09: ffff8cc2fc537b70
            [ 9664.190223] R10: 00000000cfaae101 R11: ffff8c92cfaae6f0 R12: ffff8c83d606ca80
            [ 9664.197358] R13: ffff8c92cdcc0f48 R14: 0000000000000000 R15: ffff8ca2e53875c0
            [ 9664.204490] FS:  00007fdeb8ba4740(0000) GS:ffff8c92fef40000(0000) knlGS:0000000000000000
            [ 9664.212576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 9664.218322] CR2: 0000000000000000 CR3: 0000000c75a10000 CR4: 00000000003407e0
            [ 9664.225456] Call Trace:
            [ 9664.227918]  [<ffffffffc15ebb49>] qmt_lvbo_free+0xd9/0x380 [lquota]
            [ 9664.234207]  [<ffffffffc1832aab>] mdt_lvbo_free+0x12b/0x150 [mdt]
            [ 9664.240344]  [<ffffffffc110bda2>] ldlm_resource_putref+0x192/0x260 [ptlrpc]
            [ 9664.247338]  [<ffffffffc10fff0e>] ldlm_lock_put+0x2fe/0x770 [ptlrpc]
            [ 9664.253730]  [<ffffffffc1129d42>] ldlm_export_lock_put+0x12/0x20 [ptlrpc]
            [ 9664.260525]  [<ffffffffc04efbd0>] cfs_hash_for_each_relax+0x270/0x450 [libcfs]
            [ 9664.267776]  [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
            [ 9664.276240]  [<ffffffffc1108980>] ? ldlm_cancel_lock_for_export.isra.26+0x370/0x370 [ptlrpc]
            [ 9664.284684]  [<ffffffffc04f30e0>] cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
            [ 9664.291852]  [<ffffffffc1108d22>] ldlm_export_cancel_locks+0xc2/0x1a0 [ptlrpc]
            [ 9664.299103]  [<ffffffffc11356c0>] ldlm_bl_thread_main+0x7d0/0xb20 [ptlrpc]
            [ 9664.305977]  [<ffffffffa7acc790>] ? wake_up_atomic_t+0x40/0x40
            [ 9664.311843]  [<ffffffffc1134ef0>] ? ldlm_handle_bl_callback+0x400/0x400 [ptlrpc]
            [ 9664.319231]  [<ffffffffa7acb621>] kthread+0xd1/0xe0
            [ 9664.324111]  [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
            [ 9664.330206]  [<ffffffffa81c51dd>] ret_from_fork_nospec_begin+0x7/0x21
            [ 9664.336643]  [<ffffffffa7acb550>] ? insert_kthread_work+0x40/0x40
            [ 9664.342735] Code: 00 10 00 00 00 48 c7 05 51 9a 02 00 00 00 00 00 e8 8c 65 ef fe e9 12 ff ff ff 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 <48> 83 3f 00 48 89 fb 0f 84 9c 01 00 00 48 63 57 0c 48 8b 3d 3e 
            [ 9664.363258] RIP  [<ffffffffc15f490a>] qmt_free_lqe_gd+0xa/0x1f0 [lquota]
            [ 9664.369992]  RSP <ffff8cc2fc537c20>
            [ 9664.373485] CR2: 0000000000000000
            

            I applied the patch above ( https://review.whamcloud.com/50748 ) on top of 2.15.3 and I couldn't reproduce the issue anymore, so it's a good sign.

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50748/
            Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 50ff4d1da63e8bc1dba4b6b52219fb7024f8d66f

            gerrit Gerrit Updater added a comment -

            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50748
            Subject: LU-16772 quota: protect lqe_glbl_data in qmt_site_recalc_cb
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 16685410130de3a9856fd9fd2891f14afa228e94

            People

              scherementsev Sergey Cheremencev