Lustre / LU-17034

memory corruption caused by bug in qmt_seed_glbe_all


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0

    Description

      The code in qmt_seed_glbe_all() doesn't handle the case where an OST index is larger than the number of OSTs.

      BUG: unable to handle kernel paging request at ffff91c13a4ad868
      IP: [<ffffffffc1001931>] qmt_lvbo_update+0x261/0xe60 [lquota]
      CPU: 0 PID: 5100 Comm: qmt_reba_lustre 3.10.0-1160.83.1.el7_lustre.ddn17.x86_64 #1
      RIP: 0010:[<ffffffffc1001931>]  [<ffffffffc1001931>] qmt_lvbo_update+0x261/0xe60 [lquota]
      Call Trace:
        [<ffffffffc13158db>] mdt_lvbo_update+0xbb/0x140 [mdt]
        [<ffffffffc0ba0002>] ldlm_cb_interpret+0x122/0x740 [ptlrpc]
        [<ffffffffc0bbaa67>] ptlrpc_check_set+0x3f7/0x2230 [ptlrpc]
        [<ffffffffc0bbcabb>] ptlrpc_set_wait+0x21b/0x7e0 [ptlrpc]
        [<ffffffffc0b780b5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
        [<ffffffffc0b9b4cb>] ldlm_glimpse_locks+0x3b/0x110 [ptlrpc]
        [<ffffffffc0fffe6f>] qmt_glimpse_lock.isra.15+0x39f/0xa50 [lquota]
        [<ffffffffc10009c4>] qmt_reba_thread+0x4a4/0xa80 [lquota]
        [<ffffffff9c0cb511>] kthread+0xd1/0xe0
      

      For example, suppose the system has 4 OSTs with indices 0001, 0002, 00c9 and 00ca. As can be seen from the code below, index 00c9 (201 in decimal) causes a write outside lqeg_arr, which has 64 elements by default.

      void qmt_seed_glbe_all(const struct lu_env *env, struct lqe_glbl_data *lgd,
                             bool qunit, bool edquot)
      {
      ...
                      for (j = 0; j < slaves_cnt; j++) {
                              idx = qmt_sarr_get_idx(qpi, j);
                              LASSERT(idx >= 0);
      
                              if (edquot) {
                                      int lge_edquot, new_edquot, edquot_nu;
      
                                      lge_edquot = lgd->lqeg_arr[idx].lge_edquot;
                                      edquot_nu = lgd->lqeg_arr[idx].lge_edquot_nu;
                                      new_edquot = lqe->lqe_edquot;
      
                                      if (lge_edquot == new_edquot ||
                                          (edquot_nu && lge_edquot == 1))
                                              goto qunit_lbl;
                                      lgd->lqeg_arr[idx].lge_edquot = new_edquot;
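
      To make the overflow concrete, here is a minimal standalone sketch (plain C, not Lustre code): the struct below is a simplified stand-in for struct lqe_glbl_entry, and pool_idx[j] plays the role of qmt_sarr_get_idx(qpi, j). With the example pool, the raw OST indices 0x00c9 and 0x00ca exceed the 64-entry default bound; the bounds check in the sketch only marks where the real loop has none.

      #include <stdio.h>

      #define QMT_INIT_SLV_CNT 64   /* default number of lqeg_arr entries */

      /* simplified stand-in for struct lqe_glbl_entry */
      struct lge_sketch {
              int lge_qunit;
              int lge_edquot;
      };

      int main(void)
      {
              struct lge_sketch lqeg_arr[QMT_INIT_SLV_CNT] = {{ 0 }};
              /* OST indices of the pool members from the example above;
               * pool_idx[j] plays the role of qmt_sarr_get_idx(qpi, j) */
              int pool_idx[] = { 0x0001, 0x0002, 0x00c9, 0x00ca };
              int slaves_cnt = 4, j;

              for (j = 0; j < slaves_cnt; j++) {
                      int idx = pool_idx[j];

                      if (idx >= QMT_INIT_SLV_CNT) {
                              /* qmt_seed_glbe_all() has no such check and writes anyway */
                              printf("idx %#06x is outside the %d-entry array\n",
                                     idx, QMT_INIT_SLV_CNT);
                              continue;
                      }
                      lqeg_arr[idx].lge_edquot = 1;
              }
              return 0;
      }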

      Three things are required to make this bug possible (an illustrative bounds-guard sketch follows the list):

      • quota enabled (quota_slave.enabled != 0) and quota limits set for at least one ID (user/group/project)
      • at least one OST pool in the system
      • at least one OST in that pool with index > 64 (QMT_INIT_SLV_CNT)
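
      A guard against the third condition could look like the sketch below, written against simplified stand-in types (lqeg_num_alloc is an assumed name for whatever field tracks the allocated entry count). It only illustrates the kind of bounds check the quoted loop is missing; it is not the patch that actually resolved this ticket.

      #include <stdio.h>

      #define QMT_INIT_SLV_CNT 64

      /* simplified stand-ins for struct lqe_glbl_entry / struct lqe_glbl_data */
      struct lge_sketch { int lge_edquot; int lge_edquot_nu; };
      struct lgd_sketch {
              int                lqeg_num_alloc;   /* entries allocated in lqeg_arr */
              struct lge_sketch *lqeg_arr;
      };

      int main(void)
      {
              struct lge_sketch arr[QMT_INIT_SLV_CNT] = {{ 0 }};
              struct lgd_sketch lgd = { QMT_INIT_SLV_CNT, arr };
              int sarr[] = { 0x0001, 0x0002, 0x00c9, 0x00ca };   /* pool members */
              int slaves_cnt = 4, new_edquot = 1, j;

              for (j = 0; j < slaves_cnt; j++) {
                      int idx = sarr[j];           /* role of qmt_sarr_get_idx(qpi, j) */

                      /* the bounds check the quoted loop is missing */
                      if (idx >= lgd.lqeg_num_alloc) {
                              printf("OST index %#06x skipped: outside %d-entry lqeg_arr\n",
                                     idx, lgd.lqeg_num_alloc);
                              continue;
                      }
                      lgd.lqeg_arr[idx].lge_edquot = new_edquot;
              }
              return 0;
      }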

      This bug may cause different kinds of kernel panics, but on the system where it occurred frequently, in about 80% of all cases it corrupted the UUID and NID rhashtables. All of these panics are described in LU-16930. By default the size of lqeg_arr is 64*16 = 1024 bytes, which means that with high probability the stray write corrupts a neighbouring object in the kmalloc-1024 slab.
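
      A quick back-of-the-envelope sketch of where the stray write lands, using the 16-byte per-entry size implied by 64*16 = 1024 (the numbers are illustrative, not taken from a crash dump):

      #include <stdio.h>

      int main(void)
      {
              /* 64 entries * 16 bytes = 1024 bytes -> the array comes from the
               * kmalloc-1024 slab */
              const int entry_size = 16;
              const int alloc_size = 64 * entry_size;
              const int idx = 0x00c9;              /* OST index from the example */
              const int off = idx * entry_size;    /* byte offset of the stray write */

              printf("write lands at offset %d, i.e. %d bytes past the end of the\n"
                     "1024-byte allocation, most likely inside another kmalloc-1024\n"
                     "object on the same slab page\n", off, off - alloc_size);
              return 0;
      }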

            People

              Assignee: Sergey Cheremencev (scherementsev)
              Reporter: Sergey Cheremencev (scherementsev)
              Votes: 0
              Watchers: 10
