[LU-10368] disk quota OST rebalancing issues Created: 11/Dec/17  Updated: 03/May/18  Resolved: 19/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: Lustre 2.12.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.4


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

 

We are seeing quota problems with 2.10.1: from time to time, group quotas generate EDQUOT (users are actually reporting the problem) even though the group still has room left under its limit. Some OSTs are seen as full, as shown below, and the rebalancing of quota space between OSTs doesn't seem to work:

 

[root@oak-rbh01 ~]# lfs quota -v -g oak_p-cvmed /oak
Disk quotas for grp oak_p-cvmed (gid 3683):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
           /oak 96389875036  120000000000 120000000000       - 6955998  18000000 18000000       -
oak-MDT0000_UUID
                3248820       -       0       - 6955998       - 8388608       -
oak-OST0000_UUID
                1196660392       - 1342177280       -       -       -       -       -
oak-OST0001_UUID
                1908235316       - 2147483648       -       -       -       -       -
oak-OST0002_UUID
                1225832336       - 1342177280       -       -       -       -       -
oak-OST0003_UUID
                1327591244       - 1342177280       -       -       -       -       -
oak-OST0004_UUID
                1965895968       - 2147483648       -       -       -       -       -
oak-OST0005_UUID
                1159391612       - 1342177280       -       -       -       -       -
oak-OST0006_UUID
                1635255040       - 1879048192       -       -       -       -       -
oak-OST0007_UUID
                1818596964       - 1879048192       -       -       -       -       -
oak-OST0008_UUID
                1872031764       - 1879048192       -       -       -       -       -
oak-OST0009_UUID
                2061279604       - 2147483648       -       -       -       -       -
oak-OST000a_UUID
                1445543488       - 1610612736       -       -       -       -       -
oak-OST000b_UUID
                1875314700       - 1879048192       -       -       -       -       -
oak-OST000c_UUID
                1301881412       - 1342177280       -       -       -       -       -
oak-OST000d_UUID
                1766688092       - 1879048192       -       -       -       -       -
oak-OST000e_UUID
                2005981712       - 2147483648       -       -       -       -       -
oak-OST000f_UUID
                1491138396       - 1610612736       -       -       -       -       -
oak-OST0010_UUID
                1292096088       - 1342177280       -       -       -       -       -
oak-OST0011_UUID
                1222866272       - 1342177280       -       -       -       -       -
oak-OST0012_UUID
                1312869104       - 1342177280       -       -       -       -       -
oak-OST0013_UUID
                1185445504       - 1342177280       -       -       -       -       -
oak-OST0014_UUID
                1315544800       - 1342177280       -       -       -       -       -
oak-OST0015_UUID
                2025717256       - 2147483648       -       -       -       -       -
oak-OST0016_UUID
                1817010800       - 1879048192       -       -       -       -       -
oak-OST0017_UUID
                1699092560       - 1879048192       -       -       -       -       -
oak-OST0018_UUID
                1921966992       - 2147483648       -       -       -       -       -
oak-OST0019_UUID
                1752975104       - 1879048192       -       -       -       -       -
oak-OST001a_UUID
                2022449576       - 2147483648       -       -       -       -       -
oak-OST001b_UUID
                1476019956       - 1610612736       -       -       -       -       -
oak-OST001c_UUID
                2002420900       - 2147483648       -       -       -       -       -
oak-OST001d_UUID
                1175776272       - 1342177280       -       -       -       -       -
oak-OST001e_UUID
                1522667428       - 1610612736       -       -       -       -       -
oak-OST001f_UUID
                1698940868       - 1879048192       -       -       -       -       -
oak-OST0020_UUID
                1418438600       - 1610612736       -       -       -       -       -
oak-OST0021_UUID
                1848558676       - 1879048192       -       -       -       -       -
oak-OST0022_UUID
                1567670312       - 1610612736       -       -       -       -       -
oak-OST0023_UUID
                1755882404       - 1879048192       -       -       -       -       -
oak-OST0024_UUID
                1725770704       - 1879048192       -       -       -       -       -
oak-OST0025_UUID
                2021496552       - 2147483648       -       -       -       -       -
oak-OST0026_UUID
                2340218652       - 2415919104       -       -       -       -       -
oak-OST0027_UUID
                2078849960       - 2147483648       -       -       -       -       -
oak-OST0028_UUID
                2401223300       - 2415919104       -       -       -       -       -
oak-OST0029_UUID
                2255153880       - 2415919104       -       -       -       -       -
oak-OST002a_UUID
                2479360100       - 2684354560       -       -       -       -       -
oak-OST002b_UUID
                1956889380       - 2147483648       -       -       -       -       -
oak-OST002c_UUID
                2336034612       - 2415919104       -       -       -       -       -
oak-OST002d_UUID
                1897045500       - 2147483648       -       -       -       -       -
oak-OST002e_UUID
                2069066412       - 2147483648       -       -       -       -       -
oak-OST002f_UUID
                2668099124       - 2684354560       -       -       -       -       -
oak-OST0030_UUID
                302970856       - 536870912       -       -       -       -       -
oak-OST0031_UUID
                425767268       - 536870912       -       -       -       -       -
oak-OST0032_UUID
                554265344       - 805306368       -       -       -       -       -
oak-OST0033_UUID
                616158116       - 805306368       -       -       -       -       -
oak-OST0034_UUID
                523406904       - 536870912       -       -       -       -       -
oak-OST0035_UUID
                832949332       - 1073741824       -       -       -       -       -
oak-OST0036_UUID
                431649588       - 536870912       -       -       -       -       -
oak-OST0037_UUID
                335297304       - 536870912       -       -       -       -       -
oak-OST0038_UUID
                768953372       - 805306368       -       -       -       -       -
oak-OST0039_UUID
                589398720       - 805306368       -       -       -       -       -
oak-OST003a_UUID
                822149664       - 1073741824       -       -       -       -       -
oak-OST003b_UUID
                246038976       - 268435456       -       -       -       -       -
oak-OST003c_UUID
                1002757608       - 1073741824       -       -       -       -       -
oak-OST003d_UUID
                655190956       - 805306368       -       -       -       -       -
oak-OST003e_UUID
                464755608*      - 464755608       -       -       -       -       -
oak-OST003f_UUID
                265537376       - 268435456       -       -       -       -       -
oak-OST0040_UUID
                380491764       - 536870912       -       -       -       -       -
oak-OST0041_UUID
                628194908       - 805306368       -       -       -       -       -
oak-OST0042_UUID
                220394524       - 268435456       -       -       -       -       -
oak-OST0043_UUID
                388284936       - 536870912       -       -       -       -       -
oak-OST0044_UUID
                429979492       - 536870912       -       -       -       -       -
oak-OST0045_UUID
                276764380       - 536870912       -       -       -       -       -
oak-OST0046_UUID
                346999308       - 536870912       -       -       -       -       -
oak-OST0047_UUID
                408032656       - 536870912       -       -       -       -       -
oak-OST0048_UUID
                17760704       - 268435456       -       -       -       -       -
oak-OST0049_UUID
                11267016       - 268435456       -       -       -       -       -
oak-OST004a_UUID
                12111344       - 268435456       -       -       -       -       -
oak-OST004b_UUID
                8750240       - 268435456       -       -       -       -       -
oak-OST004c_UUID
                31825656       - 268435456       -       -       -       -       -
oak-OST004d_UUID
                67586608       - 268435456       -       -       -       -       -
Total allocated inode limit: 8388608, total allocated block limit: 106765196184
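To spot exhausted targets in a listing this long, the per-OST rows can be parsed mechanically. The following is a minimal Python sketch, not part of any Lustre tooling. The function names (`parse_ost_quota`, `exhausted`) are made up for illustration. It assumes the two-lines-per-target layout shown above, that the third value on each row is the limit currently granted to that target, and that a trailing `*` flags a usage value that has hit that limit. The sample embeds three rows and the group-wide figures copied from the output above.

```python
import re

# Three target entries copied from the `lfs quota -v` output above.
SAMPLE = """\
oak-OST003d_UUID
                655190956       - 805306368       -       -       -       -       -
oak-OST003e_UUID
                464755608*      - 464755608       -       -       -       -       -
oak-OST003f_UUID
                265537376       - 268435456       -       -       -       -       -
"""

# Group-wide figures from the same output, in kbytes.
GROUP_LIMIT_KB = 120000000000
GROUP_USED_KB = 96389875036
GRANTED_TOTAL_KB = 106765196184  # "total allocated block limit"

OST_NAME = re.compile(r"\S+-OST[0-9a-f]{4}_UUID")


def parse_ost_quota(text):
    """Return {target: (used_kb, granted_kb, flagged)} for every OST entry."""
    rows = {}
    lines = text.splitlines()
    # Each target name sits on its own line, followed by a line of values.
    for name_line, value_line in zip(lines, lines[1:]):
        name = name_line.strip()
        if not OST_NAME.fullmatch(name):
            continue
        fields = value_line.split()
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        used = fields[0]
        rows[name] = (int(used.rstrip("*")), int(fields[2]), used.endswith("*"))
    return rows


def exhausted(rows):
    """Targets flagged with '*' or whose usage has reached the granted limit."""
    return sorted(name for name, (used, granted, flagged) in rows.items()
                  if flagged or used >= granted)


if __name__ == "__main__":
    print("exhausted targets:", exhausted(parse_ost_quota(SAMPLE)))
    print("group headroom (kB):", GROUP_LIMIT_KB - GROUP_USED_KB)
    print("not yet granted (kB):", GROUP_LIMIT_KB - GRANTED_TOTAL_KB)
```

On this sample only oak-OST003e_UUID is reported as exhausted, while the group as a whole is 23610124964 kB under its limit. If the "total allocated block limit" is read as the sum of the per-target grants, 13234803816 kB of the group limit has not been granted to any target.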

In some cases I was able to disable/enable group quota to fix the problem, but in this particular case I can't find a way to force a refresh. Any ideas would be welcome. This has a pretty important impact on some groups.

 

Thanks!

Stephane

 



 Comments   
Comment by Peter Jones [ 12/Dec/17 ]

Thanks Stephane

Comment by Stephane Thiell [ 14/Dec/17 ]

Just upgraded the servers to 2.10.2 and we'll see how it goes (right now I can't reproduce the issue).

I'm also noticing the following lquota log messages (also reported in LU-9790):

00040000:02020000:27.0F:1513208262.246875:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST003c: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272624:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0038: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272627:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0034: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272628:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0042: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272629:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST0046: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272630:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST003a: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272631:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST004c: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
00040000:02020000:27.0:1513208262.272632:0:257335:0:(qsd_config.c:202:qsd_process_config()) 0-0: oak-OST004a: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
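To collect the affected targets from a larger debug log, the message can be matched with a regular expression. This is a small Python sketch under the assumption that every relevant line has the same shape as the ones above. `affected_targets` is a hypothetical helper name, and the two sample lines are copied from the log excerpt.

```python
import re

# Two lines copied from the debug log excerpt above.
SAMPLE_LOG = [
    "00040000:02020000:27.0:1513208262.272624:0:257335:0:(qsd_config.c:202:"
    "qsd_process_config()) 0-0: oak-OST0038: can't enable quota enforcement "
    "since space accounting isn't functional. Please run tunefs.lustre --quota "
    "on an unmounted filesystem if not done already",
    "00040000:02020000:27.0:1513208262.272631:0:257335:0:(qsd_config.c:202:"
    "qsd_process_config()) 0-0: oak-OST004c: can't enable quota enforcement "
    "since space accounting isn't functional. Please run tunefs.lustre --quota "
    "on an unmounted filesystem if not done already",
]

# Capture the target name printed just before the message text.
PATTERN = re.compile(
    r"qsd_process_config\(\)\)\s+\S+:\s+(\S+):\s+can't enable quota enforcement"
)


def affected_targets(lines):
    """Return the sorted, de-duplicated target names that logged the message."""
    found = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    print(affected_targets(SAMPLE_LOG))
```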





Our setup is: space accounting on ug (no project quota, i.e. no ldiskfs project flag) and quota enforcement enabled only for groups (g):

target name:    oak-OST004c
pool ID:        0
type:           dt
quota enabled:  g
conn to master: setup
space acct:     ug
user uptodate:  glb[0],slv[0],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[0],slv[0],reint[0]
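One way to cross-check this status text against the log message is to compare the enforced quota types with the accounted ones. The sketch below is illustrative Python, not a Lustre utility, and its helper names are hypothetical. It assumes the `key: value` layout shown above, that `none` is printed when no type is enabled, and that `glb[1]` and `slv[1]` mean the corresponding copy is up to date.

```python
import re

# Status text copied from the comment above.
SAMPLE_INFO = """\
target name:    oak-OST004c
pool ID:        0
type:           dt
quota enabled:  g
conn to master: setup
space acct:     ug
user uptodate:  glb[0],slv[0],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[0],slv[0],reint[0]
"""

TYPE_NAMES = {"u": "user", "g": "group", "p": "project"}


def parse_info(text):
    """Split 'key: value' lines into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def quota_types(value):
    """Turn a value such as 'ug' into {'u', 'g'}; 'none' (assumed) means empty."""
    return set() if value == "none" else set(value)


def enforced_without_accounting(info):
    """Quota types that are enforced but have no space accounting."""
    return quota_types(info["quota enabled"]) - quota_types(info["space acct"])


def enforced_not_uptodate(info):
    """Enforced types whose glb/slv flags are not both set to 1."""
    stale = []
    for letter in sorted(quota_types(info["quota enabled"])):
        key = TYPE_NAMES[letter] + " uptodate"
        flags = dict(re.findall(r"(\w+)\[(\d)\]", info[key]))
        if flags.get("glb") != "1" or flags.get("slv") != "1":
            stale.append(TYPE_NAMES[letter])
    return stale


if __name__ == "__main__":
    info = parse_info(SAMPLE_INFO)
    print("enforced without accounting:", enforced_without_accounting(info))
    print("enforced but not up to date:", enforced_not_uptodate(info))
```

For oak-OST004c both checks come back empty: group quota is the only enforced type, group usage is accounted, and the group flags are set.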


 

 

Best,

Stephane

Comment by Minh Diep [ 05/Jan/18 ]

https://testing.hpdd.intel.com/test_sets/9488b58c-f1b6-11e7-a169-52540065bddc

This test run shows a similar message: 'can't enable quota enforcement'

Comment by Peter Jones [ 21/Mar/18 ]

Hongchao

Can you please investigate?

Thanks

Peter

Comment by Hongchao Zhang [ 22/Mar/18 ]

The failure in https://testing.hpdd.intel.com/test_sets/9488b58c-f1b6-11e7-a169-52540065bddc is not the same issue.
It is caused by a delayed response from the MDT: the client reconnects to the MDT, and the MDS_QUOTACTL
request in "mdc_quotactl" then fails because it is marked as "no resend".

static int mdc_quotactl(struct obd_device *unused, struct obd_export *exp,
                        struct obd_quotactl *oqctl)
{
        struct ptlrpc_request   *req;
        struct obd_quotactl     *oqc;
        int                      rc;
        ENTRY;

        req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
                                        &RQF_MDS_QUOTACTL, LUSTRE_MDS_VERSION,
                                        MDS_QUOTACTL);
        if (req == NULL)
                RETURN(-ENOMEM);

        oqc = req_capsule_client_get(&req->rq_pill, &RMF_OBD_QUOTACTL);
        *oqc = *oqctl;

        ptlrpc_request_set_replen(req);
        ptlrpc_at_set_req_timeout(req);
        req->rq_no_resend = 1;  /* <--- here: the request is never resent */
        
        rc = ptlrpc_queue_wait(req);
        if (rc) 
                CERROR("ptlrpc_queue_wait failed, rc: %d\n", rc);
        
        if (req->rq_repmsg &&
            (oqc = req_capsule_server_get(&req->rq_pill, &RMF_OBD_QUOTACTL))) {
                *oqctl = *oqc;
        } else if (!rc) {
                CERROR ("Can't unpack obd_quotactl\n");
                rc = -EPROTO;
        }
        ptlrpc_req_finished(req);

        RETURN(rc);
} 

the corresponding logs are

[ 8535.569851] Lustre: DEBUG MARKER: == sanity-quota test 8: Run dbench with quota enabled ================================================ 22:08:14 (1515103694)
[ 9410.140055] Lustre: 19972:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1515103967/real 1515103967]  req@ffff88006518cc00 x1588697858744704/t0(0) o48->lustre-MDT0000-mdc-ffff88006b606800@10.9.4.248@tcp:12/10 lens 336/336 e 21 to 1 dl 1515104568 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 9410.148998] Lustre: lustre-MDT0000-mdc-ffff88006b606800: Connection to lustre-MDT0000 (at 10.9.4.248@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 9410.278389] Lustre: lustre-MDT0000-mdc-ffff88006b606800: Connection restored to 10.9.4.248@tcp (at 10.9.4.248@tcp)
[ 9410.281474] LustreError: 19972:0:(mdc_request.c:1840:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -107
[ 9410.462846] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-quota test_8: @@@@@@ FAIL: clear quota for [type:-u name:quota_usr] failed
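The effect of the "no resend" flag can be illustrated with a toy user-space model. This is a Python sketch for illustration only: `FlakyTransport` and `quotactl` are hypothetical names, the retry loop is an assumed policy, and none of it reflects the actual ptlrpc code or the change made in the patch. The rc of -107 in the log above corresponds to ENOTCONN on Linux.

```python
import errno


class FlakyTransport:
    """Hypothetical stand-in for the RPC layer.

    The first `failures` sends return -ENOTCONN, as if the connection
    had been lost and then restored; later sends succeed.
    """

    def __init__(self, failures):
        self.failures = failures
        self.sent = 0

    def send(self, payload):
        self.sent += 1
        if self.sent <= self.failures:
            return -errno.ENOTCONN
        return 0


def quotactl(transport, payload, no_resend, max_attempts=3):
    """Send once, and resend after -ENOTCONN only if resending is allowed."""
    rc = transport.send(payload)
    attempts = 1
    while rc == -errno.ENOTCONN and not no_resend and attempts < max_attempts:
        rc = transport.send(payload)
        attempts += 1
    return rc


if __name__ == "__main__":
    # With no_resend set, the single failed send is reported to the caller.
    print(quotactl(FlakyTransport(failures=1), b"quotactl", no_resend=True))
    # Without it, the request goes out again once the connection is back.
    print(quotactl(FlakyTransport(failures=1), b"quotactl", no_resend=False))
```

With `no_resend=True` the caller sees the connection error even though the connection is restored moments later, which is the pattern in the sanity-quota log above.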

Comment by Gerrit Updater [ 26/Mar/18 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/31773
Subject: LU-10368 mdc: resend quotactl if needed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 30466952d14f2e891f3a1bdd29103ae578f00413

Comment by Gerrit Updater [ 19/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31773/
Subject: LU-10368 mdc: resend quotactl if needed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d511918e8eb725abba2561cc493e30651a89ac27

Comment by Peter Jones [ 19/Apr/18 ]

Stephane

I'm not sure whether it is easy for you to test out this fix and confirm whether it resolves the issue, but if you do have a way then that would be much appreciated. For now I'll mark this ticket as resolved ("innocent until proven guilty") and we'll queue it up for a future 2.10.x release.

Peter

Comment by Gerrit Updater [ 19/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32075
Subject: LU-10368 mdc: resend quotactl if needed
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: e90b1a190b38db9573a5284ca53efbff612d0972

Comment by Stephane Thiell [ 19/Apr/18 ]

Hi Peter,

I'll try the patch at the first opportunity after LUG.

Please also note that we haven't noticed any new occurrence of this issue in 2.10.3 so far.

Thanks!

Stephane

 

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32075/
Subject: LU-10368 mdc: resend quotactl if needed
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: c2c8a3f6dec17144f317aab409f48b862d9aa1b1

Generated at Sat Feb 10 02:34:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.