Details
Bug
Resolution: Fixed
Critical
None
None
3
Description
The code in qmt_seed_glbe_all() doesn't handle the case where an OST index is larger than the number of OSTs.
BUG: unable to handle kernel paging request at ffff91c13a4ad868
IP: [<ffffffffc1001931>] qmt_lvbo_update+0x261/0xe60 [lquota]
CPU: 0 PID: 5100 Comm: qmt_reba_lustre 3.10.0-1160.83.1.el7_lustre.ddn17.x86_64 #1
RIP: 0010:[<ffffffffc1001931>] [<ffffffffc1001931>] qmt_lvbo_update+0x261/0xe60 [lquota]
Call Trace:
 [<ffffffffc13158db>] mdt_lvbo_update+0xbb/0x140 [mdt]
 [<ffffffffc0ba0002>] ldlm_cb_interpret+0x122/0x740 [ptlrpc]
 [<ffffffffc0bbaa67>] ptlrpc_check_set+0x3f7/0x2230 [ptlrpc]
 [<ffffffffc0bbcabb>] ptlrpc_set_wait+0x21b/0x7e0 [ptlrpc]
 [<ffffffffc0b780b5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
 [<ffffffffc0b9b4cb>] ldlm_glimpse_locks+0x3b/0x110 [ptlrpc]
 [<ffffffffc0fffe6f>] qmt_glimpse_lock.isra.15+0x39f/0xa50 [lquota]
 [<ffffffffc10009c4>] qmt_reba_thread+0x4a4/0xa80 [lquota]
 [<ffffffff9c0cb511>] kthread+0xd1/0xe0
For example, suppose the system has 4 OSTs with indexes 0001, 0002, 00c9 and 00ca. As the code below shows, index 00c9 (201 in decimal) causes a write outside lqeg_arr, which has 64 elements by default.
void qmt_seed_glbe_all(const struct lu_env *env, struct lqe_glbl_data *lgd,
		       bool qunit, bool edquot)
{
	...
	for (j = 0; j < slaves_cnt; j++) {
		idx = qmt_sarr_get_idx(qpi, j);
		LASSERT(idx >= 0);

		if (edquot) {
			int lge_edquot, new_edquot, edquot_nu;

			lge_edquot = lgd->lqeg_arr[idx].lge_edquot;
			edquot_nu = lgd->lqeg_arr[idx].lge_edquot_nu;
			new_edquot = lqe->lqe_edquot;

			if (lge_edquot == new_edquot ||
			    (edquot_nu && lge_edquot == 1))
				goto qunit_lbl;
			lgd->lqeg_arr[idx].lge_edquot = new_edquot;
	...
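One way to make a loop like this safe is to check each index against the allocated size of the array before touching it. The following is a minimal sketch of that idea, not the upstream fix: glbl_entry, glbl_data, glbl_entry_get() and glbl_seed_edquot() are hypothetical stand-ins that model only the lge_edquot and lge_edquot_nu fields used in the excerpt, and the count field is an assumption about how the array size would be tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one lqeg_arr element; only the fields
 * read or written in the excerpt are modeled. */
struct glbl_entry {
	unsigned int lge_edquot;
	unsigned int lge_edquot_nu;
};

/* Hypothetical stand-in for struct lqe_glbl_data. */
struct glbl_data {
	int                count; /* number of allocated entries (assumed) */
	struct glbl_entry *arr;   /* plays the role of lqeg_arr */
};

/* Return the entry for an OST index, or NULL when the index is
 * outside the allocated array. */
static struct glbl_entry *glbl_entry_get(struct glbl_data *lgd, int idx)
{
	assert(lgd != NULL && lgd->arr != NULL);
	if (idx < 0 || idx >= lgd->count)
		return NULL;
	return &lgd->arr[idx];
}

/* The edquot branch of the excerpt with a bounds check added.
 * Returns false, without writing anything, when idx does not fit. */
static bool glbl_seed_edquot(struct glbl_data *lgd, int idx,
			     unsigned int new_edquot)
{
	struct glbl_entry *e = glbl_entry_get(lgd, idx);

	if (e == NULL)
		return false;
	/* Same skip condition as the original code. */
	if (e->lge_edquot == new_edquot ||
	    (e->lge_edquot_nu && e->lge_edquot == 1))
		return true;
	e->lge_edquot = new_edquot;
	return true;
}
```

With a 64-entry array, indexes 0x0001 and 0x0002 are updated normally, while 0x00c9 (201) and 0x00ca (202) are rejected instead of overwriting memory past the end. Another option is to size the array from the highest OST index rather than from the number of OSTs; this sketch does not reflect which approach the actual fix took.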
Three conditions are required to trigger this bug:
- quota enabled (quota_slave.enabled != 0) and quota limits set for at least one ID (user, group, or project)
- at least one OST pool in the system
- at least one OST in an OST pool with an index >= 64 (QMT_INIT_SLV_CNT)
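The three conditions can be expressed as a single predicate. The sketch below assumes, as stated in the description, that lqeg_arr keeps its default 64 entries, so any pool member with an index of 64 or more (valid positions are 0 to 63) cannot fit. struct quota_setup, count_overflowing_osts() and bug_preconditions_met() are hypothetical and do not query a real Lustre system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QMT_INIT_SLV_CNT 64 /* default lqeg_arr size, per this ticket */

/* Hypothetical model of the three preconditions. */
struct quota_setup {
	bool       quota_enabled; /* quota_slave.enabled != 0 */
	bool       limits_set;    /* a limit exists for some user/group/project */
	const int *pool_ost_idx;  /* indexes of OSTs that belong to a pool */
	size_t     pool_ost_cnt;  /* 0 when no pool has any members */
};

/* Count pool members whose index cannot fit in a default-sized array. */
static size_t count_overflowing_osts(const struct quota_setup *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (size_t i = 0; i < s->pool_ost_cnt; i++)
		if (s->pool_ost_idx[i] >= QMT_INIT_SLV_CNT)
			n++;
	return n;
}

/* True only when all three conditions hold at the same time. */
static bool bug_preconditions_met(const struct quota_setup *s)
{
	assert(s != NULL);
	return s->quota_enabled && s->limits_set &&
	       count_overflowing_osts(s) > 0;
}
```

For the example system with OSTs 0001, 0002, 00c9 and 00ca all placed in a pool, count_overflowing_osts() reports 2, and the predicate is true once quota is enabled and a limit is set.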
This bug can cause various kinds of kernel panics. On the system where it occurred frequently, in 80% of cases it corrupted the UUID and NID rhashtables. All of these panics are described in LU-16930. By default, lqeg_arr is 64 * 16 = 1024 bytes (64 entries of 16 bytes each), so it is allocated from the kmalloc-1024 slab cache, and an out-of-bounds write is likely to corrupt a neighboring kmalloc-1024 object.
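The size arithmetic can be checked with a short calculation. The sketch assumes 16-byte elements (1024 / 64) and models them with a hypothetical glbl_entry16 type rather than the real Lustre structure. It also assumes kmalloc-1024 objects sit back to back in the slab with no debug redzones. Real slab layouts may differ, so the result only indicates roughly where a stray write lands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QMT_INIT_SLV_CNT 64
#define KMALLOC_OBJ_SIZE 1024 /* object size of the kmalloc-1024 cache */

/* Hypothetical 16-byte stand-in for one lqeg_arr element. */
struct glbl_entry16 {
	uint64_t lge_qunit;
	uint64_t lge_flags; /* edquot and related bits, packed (assumed) */
};

/* Byte offset of element idx from the start of the array. */
static size_t entry_offset(int idx)
{
	assert(idx >= 0);
	return (size_t)idx * sizeof(struct glbl_entry16);
}

/* Which kmalloc-1024 object the write lands in, counting the array's
 * own object as 0 and assuming objects are packed with no redzones. */
static size_t target_object(int idx)
{
	return entry_offset(idx) / KMALLOC_OBJ_SIZE;
}
```

Under these assumptions, OST index 00c9 writes at byte offset 201 * 16 = 3216, which is 2192 bytes past the end of the 1024-byte array and inside the third kmalloc-1024 object after it.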