[LU-17034] memory corruption caused by bug in qmt_seed_glbe_all Created: 16/Aug/23  Updated: 24/Jan/24  Resolved: 18/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Sergey Cheremencev Assignee: Sergey Cheremencev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-16930 BUG: nid_keycmp+0x6 Resolved
Related
is related to LU-17037 Tests should run with high and sparse... In Progress
is related to LU-17033 Add RCU protect for export nid operation Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The code in qmt_seed_glbe_all doesn't handle the case where an OST index is larger than the number of OSTs. For example, a system may have 4 OSTs with indexes 0001, 0002, 00c9, 00ca. As can be seen from the code below, index 00c9 (201) causes a write outside lqeg_arr, which has 64 elements by default. 

void qmt_seed_glbe_all(const struct lu_env *env, struct lqe_glbl_data *lgd,
                       bool qunit, bool edquot)
{
...
                for (j = 0; j < slaves_cnt; j++) {
                        idx = qmt_sarr_get_idx(qpi, j);
                        LASSERT(idx >= 0);

                        if (edquot) {
                                int lge_edquot, new_edquot, edquot_nu;

                                lge_edquot = lgd->lqeg_arr[idx].lge_edquot;
                                edquot_nu = lgd->lqeg_arr[idx].lge_edquot_nu;
                                new_edquot = lqe->lqe_edquot;

                                if (lge_edquot == new_edquot ||
                                    (edquot_nu && lge_edquot == 1))
                                        goto qunit_lbl;
                                lgd->lqeg_arr[idx].lge_edquot = new_edquot;

Three things are required to make this bug possible:

  • quota enabled (quota_slave.enabled != 0) and quota limits set for at least one ID (user/group/project)
  • at least one OST pool in the system
  • at least one OST in that pool with an index > 64 (QMT_INIT_SLV_CNT)

This bug may cause different kinds of kernel panics; on the system where it occurred frequently, in 80% of all cases it corrupted the UUID and NID rhashtables. All of these panics are described in LU-16930. By default the size of lqeg_arr is 64*16=1024 bytes, so with high probability the out-of-bounds write corrupts a neighbouring kmalloc-1024 slab object. 



 Comments   
Comment by Gerrit Updater [ 25/Aug/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52094
Subject: LU-17034 quota: lqeg_arr memmory corruption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3db2668fd0e161875ed20ac8b14184de1a8046b9

Comment by Jian Yu [ 30/Aug/23 ]

With sparse OST indexes "OST_INDEX_LIST=[0,10,20,40,55,60,80]" (for OSTCOUNT=7) and "ENABLE_QUOTA=yes", performance-sanity test 2 and sanity-benchmark test dbench crashed on master branch:
https://testing.whamcloud.com/test_sets/aa85d42f-f125-48a0-9b9f-c001b6ec3349
https://testing.whamcloud.com/test_sets/2a8e95b6-fb76-40fe-bebc-809f9a5959df

[  265.154037] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ================== 01:14:05 (1693358045)
[  265.448184] LustreError: 16616:0:(qmt_entry.c:865:qmt_adjust_edquot_qunit_notify()) ASSERTION( idx <= lgd->lqeg_num_used ) failed: 
[  265.450565] LustreError: 16616:0:(qmt_entry.c:865:qmt_adjust_edquot_qunit_notify()) LBUG
[  265.452116] Pid: 16616, comm: mdt_rdpg00_003 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Tue Aug 1 06:59:39 UTC 2023
[  265.454013] Call Trace TBD:
[  265.454761] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[  265.455838] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[  265.456807] [<0>] qmt_adjust_edquot_qunit_notify+0x4e1/0x4f0 [lquota]
[  265.458122] [<0>] qmt_dqacq0+0x1b00/0x2430 [lquota]
[  265.459108] [<0>] qmt_intent_policy+0x942/0xfe0 [lquota]
[  265.460151] [<0>] mdt_intent_opc+0xa66/0xc30 [mdt]
[  265.461270] [<0>] mdt_intent_policy+0xe8/0x460 [mdt]
[  265.462259] [<0>] ldlm_lock_enqueue+0x455/0xaf0 [ptlrpc]
[  265.463809] [<0>] ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
[  265.464983] [<0>] tgt_enqueue+0xa8/0x230 [ptlrpc]
[  265.466042] [<0>] tgt_request_handle+0xd20/0x19c0 [ptlrpc]
[  265.467193] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[  265.468460] [<0>] ptlrpc_main+0xc91/0x15a0 [ptlrpc]
[  265.469535] [<0>] kthread+0x134/0x150
[  265.470333] [<0>] ret_from_fork+0x35/0x40
[  265.471167] Kernel panic - not syncing: LBUG
[  265.472006] CPU: 0 PID: 16616 Comm: mdt_rdpg00_003 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.15.1.el8_lustre.x86_64 #1
[  265.474318] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  265.475391] Call Trace:
[  265.475914]  dump_stack+0x41/0x60
[  265.476588]  panic+0xe7/0x2ac
[  265.477194]  ? ret_from_fork+0x35/0x40
[  265.477931]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[  265.478883]  qmt_adjust_edquot_qunit_notify+0x4e1/0x4f0 [lquota]
[  265.480027]  qmt_dqacq0+0x1b00/0x2430 [lquota]
[  265.480909]  ? qmt_intent_policy+0x942/0xfe0 [lquota]
[  265.481906]  qmt_intent_policy+0x942/0xfe0 [lquota]
[  265.482863]  mdt_intent_opc+0xa66/0xc30 [mdt]
[  265.483752]  ? lprocfs_counter_add+0x12a/0x1a0 [obdclass]
[  265.485025]  mdt_intent_policy+0xe8/0x460 [mdt]
[  265.485920]  ldlm_lock_enqueue+0x455/0xaf0 [ptlrpc]
[  265.486933]  ? cfs_hash_bd_add_locked+0x1f/0x90 [libcfs]
[  265.487962]  ? cfs_hash_multi_bd_lock+0xa0/0xa0 [libcfs]
[  265.488978]  ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
[  265.490054]  tgt_enqueue+0xa8/0x230 [ptlrpc]
[  265.490977]  tgt_request_handle+0xd20/0x19c0 [ptlrpc]
[  265.492024]  ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[  265.493246]  ? lprocfs_counter_add+0x12a/0x1a0 [obdclass]
[  265.494312]  ptlrpc_main+0xc91/0x15a0 [ptlrpc]
[  265.495246]  ? __schedule+0x2d9/0x870
[  265.495972]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[  265.497025]  kthread+0x134/0x150
[  265.497677]  ? set_kthread_struct+0x50/0x50
[  265.498474]  ret_from_fork+0x35/0x40
Comment by Gerrit Updater [ 06/Sep/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52293
Subject: LU-17034 tests: memory corruption in PQ
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3cf0ee70e918030f33f2efba4f7a9974afe96c9f

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52094/
Subject: LU-17034 quota: lqeg_arr memmory corruption
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67f90e42889ff22d574e82cc647f6076e48c65a5

Comment by Peter Jones [ 18/Nov/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:32:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.