Lustre / LU-17034

memory corruption caused by bug in qmt_seed_glbe_all

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0

    Description

      The code in qmt_seed_glbe_all does not handle the case where an OST index is larger than the number of OSTs. For example, consider a system that has 4 OSTs with indexes 0001, 0002, 00c9, 00ca (hexadecimal; 00c9 is 201 decimal). As the code below shows, index 00c9 causes a write outside lqeg_arr, which has 64 elements by default.

      void qmt_seed_glbe_all(const struct lu_env *env, struct lqe_glbl_data *lgd,
                             bool qunit, bool edquot)
      {
      ...
                      for (j = 0; j < slaves_cnt; j++) {
                              idx = qmt_sarr_get_idx(qpi, j);
                              LASSERT(idx >= 0);
      
                              if (edquot) {
                                      int lge_edquot, new_edquot, edquot_nu;
      
                                      lge_edquot = lgd->lqeg_arr[idx].lge_edquot;
                                      edquot_nu = lgd->lqeg_arr[idx].lge_edquot_nu;
                                      new_edquot = lqe->lqe_edquot;
      
                                      if (lge_edquot == new_edquot ||
                                          (edquot_nu && lge_edquot == 1))
                                              goto qunit_lbl;
                                      lgd->lqeg_arr[idx].lge_edquot = new_edquot;
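
The excerpt above indexes lqeg_arr directly with the OST index. The following is a minimal user-space sketch of the kind of range check that would reject such an index. It is not the patch that fixed this issue. The types glbe_entry and glbe_array and the function glbe_set_edquot are hypothetical stand-ins invented for illustration, not Lustre symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one lqeg_arr element; the real layout lives in
 * the Lustre quota code and is not reproduced here. */
struct glbe_entry {
	unsigned long long qunit;
	unsigned int edquot;
	unsigned int edquot_nu;
};

/* Hypothetical stand-in for the global array and its allocated length. */
struct glbe_array {
	struct glbe_entry *arr;
	size_t count;		/* allocated elements, 64 by default */
};

/* Store new_edquot for the slave with OST index idx.
 * Returns 0 on success, or -ERANGE when idx does not fit in the array,
 * instead of writing past the end of the allocation. */
static int glbe_set_edquot(struct glbe_array *ga, int idx,
			   unsigned int new_edquot)
{
	assert(ga != NULL && ga->arr != NULL);

	if (idx < 0 || (size_t)idx >= ga->count)
		return -ERANGE;

	ga->arr[idx].edquot = new_edquot;
	return 0;
}
```

With a 64-element array, indexes 0001 and 0002 are accepted, while 00c9 and 00ca are rejected rather than written out of bounds.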

      Three conditions are required to trigger this bug:

      • quota is enabled (quota_slave.enabled != 0) and quota limits are set for at least one ID (user/group/project).
      • at least one OST pool in the system
      • at least one OST in the OST pool with index >= 64 (QMT_INIT_SLV_CNT)

      This bug may cause different kinds of kernel panics, but on the system where it occurred often, in 80% of all cases it corrupted the UUID and NID rhashtables. All of these panics are described in LU-16930. By default the size of lqeg_arr is 64*16=1024 bytes, which means that an out-of-range write would with high probability corrupt the neighboring kmalloc-1024 region.
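
The size arithmetic above can be checked with a short sketch. It assumes 16 bytes per element, as implied by the 64*16 figure, and the default count of 64. The constant and function names are hypothetical and chosen for illustration only.

```c
#include <stddef.h>

enum {
	INIT_SLV_CNT = 64,	/* default element count (QMT_INIT_SLV_CNT in the report) */
	ENTRY_SIZE   = 16,	/* assumed bytes per element, from the 64*16 figure */
};

/* Size in bytes of the default allocation. */
static size_t glbe_alloc_size(void)
{
	return (size_t)INIT_SLV_CNT * ENTRY_SIZE;
}

/* Byte offset, from the start of the array, of the element at index idx. */
static size_t glbe_offset(unsigned int idx)
{
	return (size_t)idx * ENTRY_SIZE;
}

/* Nonzero when the element at idx does not lie fully inside the allocation. */
static int glbe_out_of_bounds(unsigned int idx)
{
	return glbe_offset(idx) + ENTRY_SIZE > glbe_alloc_size();
}
```

Under these assumptions the allocation is 1024 bytes, and the element for index 00c9 (201 decimal) would start at byte 201*16=3216, well beyond the end of the array.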

            People

              scherementsev Sergey Cheremencev