[LU-17034] memory corruption caused by bug in qmt_seed_glbe_all Created: 16/Aug/23 Updated: 24/Jan/24 Resolved: 18/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sergey Cheremencev | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The code in qmt_seed_glbe_all does not handle the case where an OST index is larger than the number of OSTs. For example, a system may have 4 OSTs with indexes 0001, 0002, 00c9 and 00ca. As can be seen from the code below, index 00c9 causes a write outside lqeg_arr, which holds 64 elements by default.

void qmt_seed_glbe_all(const struct lu_env *env, struct lqe_glbl_data *lgd,
		       bool qunit, bool edquot)
{
	...
	for (j = 0; j < slaves_cnt; j++) {
		idx = qmt_sarr_get_idx(qpi, j);
		LASSERT(idx >= 0);
		if (edquot) {
			int lge_edquot, new_edquot, edquot_nu;

			lge_edquot = lgd->lqeg_arr[idx].lge_edquot;
			edquot_nu = lgd->lqeg_arr[idx].lge_edquot_nu;
			new_edquot = lqe->lqe_edquot;
			if (lge_edquot == new_edquot ||
			    (edquot_nu && lge_edquot == 1))
				goto qunit_lbl;
			lgd->lqeg_arr[idx].lge_edquot = new_edquot;
Three things are required to make this bug possible:
This bug may cause different kinds of kernel panics; on the system where it occurred most often, it corrupted the UUID and NID rhashtables in about 80% of cases. All of these panics are described in |
| Comments |
| Comment by Gerrit Updater [ 25/Aug/23 ] |
|
"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52094 |
| Comment by Jian Yu [ 30/Aug/23 ] |
|
With sparse OST indexes "OST_INDEX_LIST=[0,10,20,40,55,60,80]" (for OSTCOUNT=7) and "ENABLE_QUOTA=yes", performance-sanity test 2 and sanity-benchmark test dbench crashed on the master branch:

[ 265.154037] Lustre: DEBUG MARKER: == sanity-benchmark test dbench: dbench ================== 01:14:05 (1693358045)
[ 265.448184] LustreError: 16616:0:(qmt_entry.c:865:qmt_adjust_edquot_qunit_notify()) ASSERTION( idx <= lgd->lqeg_num_used ) failed:
[ 265.450565] LustreError: 16616:0:(qmt_entry.c:865:qmt_adjust_edquot_qunit_notify()) LBUG
[ 265.452116] Pid: 16616, comm: mdt_rdpg00_003 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Tue Aug 1 06:59:39 UTC 2023
[ 265.454013] Call Trace TBD:
[ 265.454761] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[ 265.455838] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[ 265.456807] [<0>] qmt_adjust_edquot_qunit_notify+0x4e1/0x4f0 [lquota]
[ 265.458122] [<0>] qmt_dqacq0+0x1b00/0x2430 [lquota]
[ 265.459108] [<0>] qmt_intent_policy+0x942/0xfe0 [lquota]
[ 265.460151] [<0>] mdt_intent_opc+0xa66/0xc30 [mdt]
[ 265.461270] [<0>] mdt_intent_policy+0xe8/0x460 [mdt]
[ 265.462259] [<0>] ldlm_lock_enqueue+0x455/0xaf0 [ptlrpc]
[ 265.463809] [<0>] ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
[ 265.464983] [<0>] tgt_enqueue+0xa8/0x230 [ptlrpc]
[ 265.466042] [<0>] tgt_request_handle+0xd20/0x19c0 [ptlrpc]
[ 265.467193] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[ 265.468460] [<0>] ptlrpc_main+0xc91/0x15a0 [ptlrpc]
[ 265.469535] [<0>] kthread+0x134/0x150
[ 265.470333] [<0>] ret_from_fork+0x35/0x40
[ 265.471167] Kernel panic - not syncing: LBUG
[ 265.472006] CPU: 0 PID: 16616 Comm: mdt_rdpg00_003 Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.15.1.el8_lustre.x86_64 #1
[ 265.474318] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 265.475391] Call Trace:
[ 265.475914] dump_stack+0x41/0x60
[ 265.476588] panic+0xe7/0x2ac
[ 265.477194] ? ret_from_fork+0x35/0x40
[ 265.477931] lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[ 265.478883] qmt_adjust_edquot_qunit_notify+0x4e1/0x4f0 [lquota]
[ 265.480027] qmt_dqacq0+0x1b00/0x2430 [lquota]
[ 265.480909] ? qmt_intent_policy+0x942/0xfe0 [lquota]
[ 265.481906] qmt_intent_policy+0x942/0xfe0 [lquota]
[ 265.482863] mdt_intent_opc+0xa66/0xc30 [mdt]
[ 265.483752] ? lprocfs_counter_add+0x12a/0x1a0 [obdclass]
[ 265.485025] mdt_intent_policy+0xe8/0x460 [mdt]
[ 265.485920] ldlm_lock_enqueue+0x455/0xaf0 [ptlrpc]
[ 265.486933] ? cfs_hash_bd_add_locked+0x1f/0x90 [libcfs]
[ 265.487962] ? cfs_hash_multi_bd_lock+0xa0/0xa0 [libcfs]
[ 265.488978] ldlm_handle_enqueue+0x645/0x1870 [ptlrpc]
[ 265.490054] tgt_enqueue+0xa8/0x230 [ptlrpc]
[ 265.490977] tgt_request_handle+0xd20/0x19c0 [ptlrpc]
[ 265.492024] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[ 265.493246] ? lprocfs_counter_add+0x12a/0x1a0 [obdclass]
[ 265.494312] ptlrpc_main+0xc91/0x15a0 [ptlrpc]
[ 265.495246] ? __schedule+0x2d9/0x870
[ 265.495972] ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[ 265.497025] kthread+0x134/0x150
[ 265.497677] ? set_kthread_struct+0x50/0x50
[ 265.498474] ret_from_fork+0x35/0x40 |
| Comment by Gerrit Updater [ 06/Sep/23 ] |
|
"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52293 |
| Comment by Gerrit Updater [ 18/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52094/ |
| Comment by Peter Jones [ 18/Nov/23 ] |
|
Landed for 2.16 |