Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.10.1
    • Labels: None
    • Environment: CentOS 7.4
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Hi,

      We are seeing quota problems with 2.10.1 where, from time to time, group quotas are generating EDQUOT (users are actually reporting the problem) while there is room left. Some OSTs are seen as full, as shown below, and the rebalancing doesn't seem to work:


      [root@oak-rbh01 ~]# lfs quota -v -g oak_p-cvmed /oak
      Disk quotas for grp oak_p-cvmed (gid 3683):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 96389875036  120000000000 120000000000       - 6955998  18000000 18000000       -
      oak-MDT0000_UUID
                      3248820       -       0       - 6955998       - 8388608       -
      oak-OST0000_UUID
                      1196660392       - 1342177280       -       -       -       -       -
      oak-OST0001_UUID
                      1908235316       - 2147483648       -       -       -       -       -
      oak-OST0002_UUID
                      1225832336       - 1342177280       -       -       -       -       -
      oak-OST0003_UUID
                      1327591244       - 1342177280       -       -       -       -       -
      oak-OST0004_UUID
                      1965895968       - 2147483648       -       -       -       -       -
      oak-OST0005_UUID
                      1159391612       - 1342177280       -       -       -       -       -
      oak-OST0006_UUID
                      1635255040       - 1879048192       -       -       -       -       -
      oak-OST0007_UUID
                      1818596964       - 1879048192       -       -       -       -       -
      oak-OST0008_UUID
                      1872031764       - 1879048192       -       -       -       -       -
      oak-OST0009_UUID
                      2061279604       - 2147483648       -       -       -       -       -
      oak-OST000a_UUID
                      1445543488       - 1610612736       -       -       -       -       -
      oak-OST000b_UUID
                      1875314700       - 1879048192       -       -       -       -       -
      oak-OST000c_UUID
                      1301881412       - 1342177280       -       -       -       -       -
      oak-OST000d_UUID
                      1766688092       - 1879048192       -       -       -       -       -
      oak-OST000e_UUID
                      2005981712       - 2147483648       -       -       -       -       -
      oak-OST000f_UUID
                      1491138396       - 1610612736       -       -       -       -       -
      oak-OST0010_UUID
                      1292096088       - 1342177280       -       -       -       -       -
      oak-OST0011_UUID
                      1222866272       - 1342177280       -       -       -       -       -
      oak-OST0012_UUID
                      1312869104       - 1342177280       -       -       -       -       -
      oak-OST0013_UUID
                      1185445504       - 1342177280       -       -       -       -       -
      oak-OST0014_UUID
                      1315544800       - 1342177280       -       -       -       -       -
      oak-OST0015_UUID
                      2025717256       - 2147483648       -       -       -       -       -
      oak-OST0016_UUID
                      1817010800       - 1879048192       -       -       -       -       -
      oak-OST0017_UUID
                      1699092560       - 1879048192       -       -       -       -       -
      oak-OST0018_UUID
                      1921966992       - 2147483648       -       -       -       -       -
      oak-OST0019_UUID
                      1752975104       - 1879048192       -       -       -       -       -
      oak-OST001a_UUID
                      2022449576       - 2147483648       -       -       -       -       -
      oak-OST001b_UUID
                      1476019956       - 1610612736       -       -       -       -       -
      oak-OST001c_UUID
                      2002420900       - 2147483648       -       -       -       -       -
      oak-OST001d_UUID
                      1175776272       - 1342177280       -       -       -       -       -
      oak-OST001e_UUID
                      1522667428       - 1610612736       -       -       -       -       -
      oak-OST001f_UUID
                      1698940868       - 1879048192       -       -       -       -       -
      oak-OST0020_UUID
                      1418438600       - 1610612736       -       -       -       -       -
      oak-OST0021_UUID
                      1848558676       - 1879048192       -       -       -       -       -
      oak-OST0022_UUID
                      1567670312       - 1610612736       -       -       -       -       -
      oak-OST0023_UUID
                      1755882404       - 1879048192       -       -       -       -       -
      oak-OST0024_UUID
                      1725770704       - 1879048192       -       -       -       -       -
      oak-OST0025_UUID
                      2021496552       - 2147483648       -       -       -       -       -
      oak-OST0026_UUID
                      2340218652       - 2415919104       -       -       -       -       -
      oak-OST0027_UUID
                      2078849960       - 2147483648       -       -       -       -       -
      oak-OST0028_UUID
                      2401223300       - 2415919104       -       -       -       -       -
      oak-OST0029_UUID
                      2255153880       - 2415919104       -       -       -       -       -
      oak-OST002a_UUID
                      2479360100       - 2684354560       -       -       -       -       -
      oak-OST002b_UUID
                      1956889380       - 2147483648       -       -       -       -       -
      oak-OST002c_UUID
                      2336034612       - 2415919104       -       -       -       -       -
      oak-OST002d_UUID
                      1897045500       - 2147483648       -       -       -       -       -
      oak-OST002e_UUID
                      2069066412       - 2147483648       -       -       -       -       -
      oak-OST002f_UUID
                      2668099124       - 2684354560       -       -       -       -       -
      oak-OST0030_UUID
                      302970856       - 536870912       -       -       -       -       -
      oak-OST0031_UUID
                      425767268       - 536870912       -       -       -       -       -
      oak-OST0032_UUID
                      554265344       - 805306368       -       -       -       -       -
      oak-OST0033_UUID
                      616158116       - 805306368       -       -       -       -       -
      oak-OST0034_UUID
                      523406904       - 536870912       -       -       -       -       -
      oak-OST0035_UUID
                      832949332       - 1073741824       -       -       -       -       -
      oak-OST0036_UUID
                      431649588       - 536870912       -       -       -       -       -
      oak-OST0037_UUID
                      335297304       - 536870912       -       -       -       -       -
      oak-OST0038_UUID
                      768953372       - 805306368       -       -       -       -       -
      oak-OST0039_UUID
                      589398720       - 805306368       -       -       -       -       -
      oak-OST003a_UUID
                      822149664       - 1073741824       -       -       -       -       -
      oak-OST003b_UUID
                      246038976       - 268435456       -       -       -       -       -
      oak-OST003c_UUID
                      1002757608       - 1073741824       -       -       -       -       -
      oak-OST003d_UUID
                      655190956       - 805306368       -       -       -       -       -
      oak-OST003e_UUID
                      464755608*      - 464755608       -       -       -       -       -
      oak-OST003f_UUID
                      265537376       - 268435456       -       -       -       -       -
      oak-OST0040_UUID
                      380491764       - 536870912       -       -       -       -       -
      oak-OST0041_UUID
                      628194908       - 805306368       -       -       -       -       -
      oak-OST0042_UUID
                      220394524       - 268435456       -       -       -       -       -
      oak-OST0043_UUID
                      388284936       - 536870912       -       -       -       -       -
      oak-OST0044_UUID
                      429979492       - 536870912       -       -       -       -       -
      oak-OST0045_UUID
                      276764380       - 536870912       -       -       -       -       -
      oak-OST0046_UUID
                      346999308       - 536870912       -       -       -       -       -
      oak-OST0047_UUID
                      408032656       - 536870912       -       -       -       -       -
      oak-OST0048_UUID
                      17760704       - 268435456       -       -       -       -       -
      oak-OST0049_UUID
                      11267016       - 268435456       -       -       -       -       -
      oak-OST004a_UUID
                      12111344       - 268435456       -       -       -       -       -
      oak-OST004b_UUID
                      8750240       - 268435456       -       -       -       -       -
      oak-OST004c_UUID
                      31825656       - 268435456       -       -       -       -       -
      oak-OST004d_UUID
                      67586608       - 268435456       -       -       -       -       -
      Total allocated inode limit: 8388608, total allocated block limit: 106765196184
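
The per-OST lines above show the block limit granted to the group on each OST; a write can hit EDQUOT as soon as it lands on an OST whose usage has reached its granted share (note oak-OST003e, flagged with `*`), even though the filesystem-wide limit still has room. As a rough sketch, OSTs near their granted limit can be picked out of the `lfs quota -v` output with awk (the 1% threshold here is an arbitrary choice, not anything defined by Lustre):

```shell
# Hedged sketch: list OSTs whose group usage is within 1% of the
# block limit granted to them (candidates for returning EDQUOT on
# new writes).  Adjust the group name and mountpoint as needed.
lfs quota -v -g oak_p-cvmed /oak | awk '
    /_UUID/ { ost = $1; next }
    ost && NF >= 3 && $3 + 0 > 0 && $1 + 0 >= 0.99 * ($3 + 0) {
        printf "%s used=%d limit=%d\n", ost, $1 + 0, $3 + 0
    }
    { ost = "" }'
```

On the output quoted above this would flag oak-OST003e, whose usage equals its granted limit.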
      
      

      In some cases I was able to disable/enable group quota to fix the problem, but in this particular case I can't find a way to force a refresh. Any ideas would be welcome! This has a pretty important impact on some groups.
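
For the disable/enable workaround mentioned above, quota enforcement is toggled with `lctl conf_param` on the MGS. A hedged sketch, assuming the filesystem name `oak` from the output above (note this briefly disables group quota enforcement for all users on the OSTs, so it is a last-resort workaround, not a fix):

```shell
# Sketch: force a quota refresh by toggling group quota enforcement
# on the OSTs.  Run on the MGS; enforcement is briefly off for everyone.
lctl conf_param oak.quota.ost=none
lctl conf_param oak.quota.ost=g
```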

       

      Thanks!

      Stephane

       

        Activity

          [LU-10368] disk quota OST rebalancing issues

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32075/
          Subject: LU-10368 mdc: resend quotactl if needed
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: c2c8a3f6dec17144f317aab409f48b862d9aa1b1

          gerrit Gerrit Updater added a comment -

          Hi Peter,

          I'll try the patch at the first opportunity after LUG.

          Please also note that we haven't noticed any new occurrence of this issue in 2.10.3 so far.

          Thanks!

          Stephane

           

          sthiell Stephane Thiell added a comment -

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32075
          Subject: LU-10368 mdc: resend quotactl if needed
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: e90b1a190b38db9573a5284ca53efbff612d0972

          gerrit Gerrit Updater added a comment -
          pjones Peter Jones added a comment -

          Stephane

          I'm not sure whether it is easy for you to test out this fix and confirm whether it resolves the issue, but if you do have a way then that would be much appreciated. For now I'll mark this ticket as resolved - "innocent until proven guilty"  - and we'll queue it up for a future 2.10.x release.

          Peter


          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31773/
          Subject: LU-10368 mdc: resend quotactl if needed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d511918e8eb725abba2561cc493e30651a89ac27

          gerrit Gerrit Updater added a comment -

          Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/31773
          Subject: LU-10368 mdc: resend quotactl if needed
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 30466952d14f2e891f3a1bdd29103ae578f00413

          gerrit Gerrit Updater added a comment -

          the bug in https://testing.hpdd.intel.com/test_sets/9488b58c-f1b6-11e7-a169-52540065bddc is not the same issue:
          it is caused by a delayed response from the MDT followed by the client reconnecting to the MDT, which causes the
          MDS_QUOTACTL request in "mdc_quotactl" to fail because the request is marked as "no resend".

          static int mdc_quotactl(struct obd_device *unused, struct obd_export *exp,
                                  struct obd_quotactl *oqctl)
          {
                  struct ptlrpc_request   *req;
                  struct obd_quotactl     *oqc;
                  int                      rc;
                  ENTRY;
          
                  req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                                  &RQF_MDS_QUOTACTL, LUSTRE_MDS_VERSION,
                                                  MDS_QUOTACTL);
                  if (req == NULL)
                          RETURN(-ENOMEM);
          
                  oqc = req_capsule_client_get(&req->rq_pill, &RMF_OBD_QUOTACTL);
                  *oqc = *oqctl;
          
                  ptlrpc_request_set_replen(req);
                  ptlrpc_at_set_req_timeout(req);
                  req->rq_no_resend = 1;                   <--- here
                  
                  rc = ptlrpc_queue_wait(req);
                  if (rc) 
                          CERROR("ptlrpc_queue_wait failed, rc: %d\n", rc);
                  
                  if (req->rq_repmsg &&
                      (oqc = req_capsule_server_get(&req->rq_pill, &RMF_OBD_QUOTACTL))) {
                          *oqctl = *oqc;
                  } else if (!rc) {
                          CERROR ("Can't unpack obd_quotactl\n");
                          rc = -EPROTO;
                  }
                  ptlrpc_req_finished(req);
          
                  RETURN(rc);
          } 
          

          the corresponding logs are

          [ 8535.569851] Lustre: DEBUG MARKER: == sanity-quota test 8: Run dbench with quota enabled ================================================ 22:08:14 (1515103694)
          [ 9410.140055] Lustre: 19972:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1515103967/real 1515103967]  req@ffff88006518cc00 x1588697858744704/t0(0) o48->lustre-MDT0000-mdc-ffff88006b606800@10.9.4.248@tcp:12/10 lens 336/336 e 21 to 1 dl 1515104568 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          [ 9410.148998] Lustre: lustre-MDT0000-mdc-ffff88006b606800: Connection to lustre-MDT0000 (at 10.9.4.248@tcp) was lost; in progress operations using this service will wait for recovery to complete
          [ 9410.278389] Lustre: lustre-MDT0000-mdc-ffff88006b606800: Connection restored to 10.9.4.248@tcp (at 10.9.4.248@tcp)
          [ 9410.281474] LustreError: 19972:0:(mdc_request.c:1840:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -107
          [ 9410.462846] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-quota test_8: @@@@@@ FAIL: clear quota for [type:-u name:quota_usr] failed
          
          hongchao.zhang Hongchao Zhang added a comment -
          pjones Peter Jones added a comment -

          Hongchao

          Can you please investigate?

          Thanks

          Peter

          mdiep Minh Diep added a comment -

          https://testing.hpdd.intel.com/test_sets/9488b58c-f1b6-11e7-a169-52540065bddc

          see similar message 'can't enable quota enforcement'

          sthiell Stephane Thiell added a comment - - edited

          Just upgraded the servers to 2.10.2 and we'll see how it goes (right now I can't reproduce the issue).

          I'm also noticing the following lquota log messages (also reported in LU-9790):

          00040000:02020000:27.0F:1513208262.246875:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST003c: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272624:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0038: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272627:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0034: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272628:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0042: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272629:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0046: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272630:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST003a: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272631:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST004c: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
          00040000:02020000:27.0:1513208262.272632:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST004a: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already

          Our setup is: space accounting on ug (no project quota, i.e. no ldiskfs project flag) and quota enforcement enabled only for groups (g):

          target name:    oak-OST004c
          pool ID:        0
          type:           dt
          quota enabled:  g
          conn to master: setup
          space acct:     ug
          user uptodate:  glb[0],slv[0],reint[0]
          group uptodate: glb[1],slv[1],reint[0]
          project uptodate: glb[0],slv[0],reint[0]
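
The state dump above and the "space accounting isn't functional" console messages can be cross-checked on the OSS. A hedged sketch of the commands involved (the parameter path assumes an ldiskfs backend and may vary slightly between Lustre versions; the device path is an example only, and `tunefs.lustre --quota` must be run on an unmounted target, as the message says):

```shell
# Show the quota slave state for one target (the source of the
# "target name / quota enabled / space acct" dump above):
lctl get_param osd-ldiskfs.oak-OST004c.quota_slave.info

# If space accounting is genuinely not functional, re-enable it on
# the unmounted target device (device path is an example only):
tunefs.lustre --quota /dev/mapper/oak-OST004c
```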
          
          
          

           

           

          Best,

          Stephane


          People

            Assignee: hongchao.zhang Hongchao Zhang
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 5