[LU-1720] Quota doesn't work over 4TB on single OST Created: 08/Aug/12  Updated: 24/Nov/17  Resolved: 01/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: Lustre 2.1.4, Lustre 1.8.9

Type: Bug Priority: Major
Reporter: Shuichi Ihara (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS5.8 Lustre-1.8.8-wc1


Attachments: File debuglog.txt.gz     File reproducer.sh     Text File setlimit_err_msg.patch     Text File setlimit_err_msg.patch    
Issue Links:
Related
Severity: 2
Rank (Obsolete): 4061

 Description   

We set quota_type "ug3" on all OSTs and the MDT, and then set a 5TB quota limit for a user. However, if user1 writes files to a single OST, the quota is reported as exceeded once the total file size reaches 4TB.

# lfs quota -v -u user1 /lustre/
Disk quotas for user user1 (uid 1000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/ 4295057504       0 5368709120       -      13       0       0       -
lustre-MDT0000_UUID
                      4       -    1024       -      13       -       0       -
lustre-OST0000_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0001_UUID
                4295057500*      - 4294959104       -       -       -       -       -
lustre-OST0002_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0003_UUID
                      0       -    1024       -       -       -       -       -
..
..
# lctl get_param lquota.*.quota_type
lquota.lustre-OST0001.quota_type=ug3
lquota.lustre-OST0004.quota_type=ug3
lquota.lustre-OST0008.quota_type=ug3
lquota.lustre-OST000c.quota_type=ug3
lquota.lustre-OST0011.quota_type=ug3
lquota.lustre-OST0015.quota_type=ug3
lquota.lustre-OST0019.quota_type=ug3
lquota.lustre-OST001d.quota_type=ug3
lquota.lustre-OST0021.quota_type=ug3
lquota.lustre-OST0025.quota_type=ug3
lquota.lustre-OST0028.quota_type=ug3
lquota.lustre-OST002d.quota_type=ug3
lquota.lustre-OST0031.quota_type=ug3
lquota.lustre-OST0035.quota_type=ug3
lquota.lustre-OST0039.quota_type=ug3

# lctl get_param lquota.*.quota_type
lquota.lustre-MDT0000.quota_type=ug3
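
To make the `lfs quota -v` listing above easier to read, here is a minimal Python sketch that parses a trimmed copy of that output and prints usage and limit per target in TiB. It assumes each target name line is followed by exactly one line of columns and that a trailing `*` marks usage over the limit; `parse_targets` and `KB_PER_TIB` are illustrative names, not part of any Lustre tool.

```python
# Sample copied from the `lfs quota -v` output in this report (three targets only).
SAMPLE = """\
lustre-MDT0000_UUID
                      4       -    1024       -      13       -       0       -
lustre-OST0000_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0001_UUID
                4295057500*      - 4294959104       -       -       -       -       -
"""

KB_PER_TIB = 1024 ** 3  # 1 TiB expressed in 1 KB units


def parse_targets(text):
    """Return {target: (used_kb, limit_kb, over_limit)} from name/value line pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    result = {}
    # Assumption: every target name line is followed by one line of columns.
    for name, values in zip(lines[0::2], lines[1::2]):
        cols = values.split()
        used = cols[0]
        over = used.endswith("*")  # assumed meaning: usage exceeds the limit
        result[name] = (int(used.rstrip("*")), int(cols[2]), over)
    return result


targets = parse_targets(SAMPLE)
for name, (used_kb, limit_kb, over) in targets.items():
    flag = "OVER" if over else "ok"
    print(f"{name}: used={used_kb / KB_PER_TIB:.3f} TiB "
          f"limit={limit_kb / KB_PER_TIB:.3f} TiB {flag}")
```

With the sample data, the limit granted to lustre-OST0001 (4294959104 KB) sits just below 2^32 KB, which is 4 TiB, while the filesystem-wide limit is 5368709120 KB (5 TiB).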


 Comments   
Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

Hi, Ihara

Could you collect the messages on the OSTs? I'm afraid that the quota file for the local fs (operational quota file) has not been converted to 64bit yet, just like LU-1584.

Comment by Shuichi Ihara (Inactive) [ 08/Aug/12 ]

Hi Niu, I will send you the messages later (the system was shut down and will boot up soon), but this is a completely new test system. The filesystem is formatted with {ost,mdt}.quota_type=ug3.

Comment by Shuichi Ihara (Inactive) [ 08/Aug/12 ]

Hi Niu,

I tested again with a very simple configuration:
1 x OSS, 1 x OST and 1 x Client. user1's quota limit is 5TB, and user1 writes 4.5TB to a single OST. The filesystem is formatted with {ost,mdt}.quota_type=ug3, but the problem occurs again: the quota is reported as exceeded at 4TB.

# lfs quotacheck -ug /lustre/
# lfs setquota -B 5368709120 -u user1 /lustre
# su - user1
user1 writes 10 x 450GB files to /lustre

dd: writing `/lustre/quota_test/file-10': Disk quota exceeded
dd: closing output file `/lustre/quota_test/file-10': Input/output error

$ lfs quota -v -u user1 /lustre/
Disk quotas for user user1 (uid 1000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/ 4294935256       0 5368709120       -      13       0       0       -
lustre-MDT0000_UUID
                      4       -    1024       -      13       -       0       -
lustre-OST0000_UUID
                4294935252*      - 4294933504       -       -       -       -       -

OSS's messages when quota is exceeded.

Aug  8 23:55:51 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) error set quota fs limit! (rc:-34)
Aug  8 23:55:51 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) error set quota fs limit! (rc:-34)
Aug  8 23:55:51 s02 kernel: Lustre: 19756:0:(quota_interface.c:491:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
Aug  8 23:55:52 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) error set quota fs limit! (rc:-34)
Aug  8 23:55:52 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) Skipped 2473 previous similar messages
Aug  8 23:55:53 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) error set quota fs limit! (rc:-34)
Aug  8 23:55:53 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) Skipped 1814 previous similar messages
Aug  8 23:55:55 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) error set quota fs limit! (rc:-34)
Aug  8 23:55:55 s02 kernel: LustreError: 18960:0:(quota_context.c:685:dqacq_completion()) Skipped 3868 previous similar messages
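
As a reading aid for the log lines above, the return code `rc:-34` can be decoded with Python's standard `errno` module. This is only a sketch: it assumes the kernel's negative return value maps onto the usual Linux errno numbering.

```python
import errno
import os

rc = -34  # value printed as "rc:-34" in the OSS log

# Kernel code returns negated errno values, so flip the sign before the lookup.
name = errno.errorcode[-rc]
print(name, "-", os.strerror(-rc))
```

On Linux, errno 34 is ERANGE, i.e. a value outside the representable range, which fits a limit that is too large for the field it is stored in.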

Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

Hi, Ihara

This looks like the same problem that was reported on lustre-discuss. In that case, the problem went away when they formatted the filesystem with e2fsprogs-1.41.90.wc4 (they could reproduce it with e2fsprogs-1.41.90.wc3). Could you check the e2fsprogs version on the server? Thanks.

Comment by Shuichi Ihara (Inactive) [ 08/Aug/12 ]

Hi Niu,

e2fsprogs-1.42.3.wc1 is installed and the filesystem was formatted with it.
I ran the same test on the master branch, and the problem didn't happen there.

Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

It's too bad that the error message only reports an error code. Could you apply this patch (which prints more information) and enable D_QUOTA while running the tests? Thanks.

Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

Print more information when setting the local limit fails.

Comment by Shuichi Ihara (Inactive) [ 08/Aug/12 ]

Niu, could you please check the patch again? The build seems to fail with the patch applied.

/usr/src/lustre-1.8.8/lustre/quota/quota_context.c: In function 'dqacq_completion':
/usr/src/lustre-1.8.8/lustre/quota/quota_context.c:688: warning: format '%d' expects type 'int', but argument 13 has type 'long unsigned int'
make[6]: *** [/usr/src/lustre-1.8.8/lustre/quota/quota_context.o] Error 1
make[5]: *** [/usr/src/lustre-1.8.8/lustre/quota] Error 2
make[4]: *** [/usr/src/lustre-1.8.8/lustre] Error 2

Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

Fix the compile error.

Comment by Niu Yawei (Inactive) [ 08/Aug/12 ]

sorry, please use the updated one.

Comment by Shuichi Ihara (Inactive) [ 09/Aug/12 ]

Debug log with D_QUOTA enabled.

Comment by Shuichi Ihara (Inactive) [ 09/Aug/12 ]

Syslog on the OSS after the patches were applied.

Aug  9 13:57:37 s02 kernel: LustreError: 8891:0:(quota_context.c:691:dqacq_completion()) error set quota fs limit! rc:-34, count:1024, hardlimit:4294967296 isblk:b
Aug  9 13:57:37 s02 kernel: LustreError: 8891:0:(quota_context.c:691:dqacq_completion()) Skipped 115657 previous similar messages
Aug  9 13:57:37 s02 kernel: Lustre: 9294:0:(quota_interface.c:475:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)

Comment by Niu Yawei (Inactive) [ 09/Aug/12 ]

It looks like the 'hardlimit' and 'count' values are sane, so I suspect that the local quota file was somehow created as 32bit... Could you mount the OST device as ldiskfs and check the quota file names on it? The names should be "lquota.user" & "lquota.group" for 32bit, or "lquota_v2.user" & "lquota_v2.group" for 64bit. Thanks.
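
The suspicion about a 32bit limit can be checked with simple arithmetic on the `hardlimit:4294967296` value from the syslog. The Python sketch below assumes that the hard limit is counted in 1 KB units, like the kbytes columns of `lfs quota`; the variable names are illustrative only.

```python
hardlimit_kb = 4294967296  # "hardlimit" value from the OSS syslog (unit assumed: 1 KB)

U32_MAX = 2**32 - 1  # largest value an unsigned 32-bit field can hold
U64_MAX = 2**64 - 1  # largest value an unsigned 64-bit field can hold

fits_32 = hardlimit_kb <= U32_MAX
fits_64 = hardlimit_kb <= U64_MAX
tib = hardlimit_kb * 1024 / 1024**4  # convert KB to bytes, then to TiB

print(fits_32, fits_64, tib)  # False True 4.0
```

Under that assumption, the rejected limit is exactly 2^32 KB (4 TiB), one more than a 32-bit field can hold, which matches the point at which writes start to fail.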

Comment by Shuichi Ihara (Inactive) [ 09/Aug/12 ]

Niu,
Yes, I had thought of that and checked before, and lquota_v2.{user,group} existed. That was odd, so I filed the problem here.
Here is what I just checked. There are no "lquota.user" & "lquota.group" files; the v2 files exist instead.

# mount -t ldiskfs /dev/mapper/LUN00 /mnt/lustre/LUN00/
# ls /mnt/lustre/LUN00/
CONFIGS  O  health_check  last_rcvd  lost+found  lquota_v2.group  lquota_v2.user

Comment by Shuichi Ihara (Inactive) [ 09/Aug/12 ]

Tested with 1.8.7, but still hit the same problem.

Comment by Shuichi Ihara (Inactive) [ 09/Aug/12 ]

I have tested with lustre-2.1.2 on RHEL5, and we hit the same 4TB quota limitation.

Comment by Shuichi Ihara (Inactive) [ 10/Aug/12 ]

This is a reproducer for this problem.
And here is what I did to set up Lustre and run the test.

# MDS
# mkfs.lustre --reformat --mgs --mdt --param mdt.quota_type=ug3 /dev/sdb1
# mount -t lustre /dev/sdb1 /mnt/lustre/MDT
# lctl get_param lquota.*.quota_type
lquota.mdd_obd-lustre-MDT0000.quota_type=ug3


# OSS
# mkfs.lustre --reformat --ost --mgsnode=192.168.100.129@o2ib --param ost.quota_type=ug3 /dev/mapper/LUN59
# mount -t lustre /dev/mapper/LUN59 /mnt/lustre/LUN59
# lctl get_param lquota.*.quota_type
lquota.lustre-OST0000.quota_type=ug3


# Client
# mount -t lustre 192.168.100.129@o2ib:/lustre /lustre
# nohup /tmp/reproducer.sh &

# lfs quota -u user1 -v /lustre/
Disk quotas for user user1 (uid 1000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/ 4295204904       0 5368709120       -     100       0       0       -
lustre-MDT0000_UUID
                      0       -    1024       -     100       -       0       -
lustre-OST0000_UUID
                4295204904*      - 4294966272 

Comment by Niu Yawei (Inactive) [ 10/Aug/12 ]

Thanks for your update, Ihara!

I finally found the reason: the kernel patch quota-large-limits-rhel5.patch (which makes the kernel's do_set_dqblk() support 64-bit limits) was mis-updated in ba5dd769f66194a80920cf93d6014c78729efaae (LU-674 kernel update RHEL5.7 [2.6.18-274.3.1.el5]).

Yangshen, could you take a look at this and fix the patch? Thanks.

Comment by Shuichi Ihara (Inactive) [ 10/Aug/12 ]

Niu, thanks for the analysis. What do you mean by mis-updated?

Comment by Niu Yawei (Inactive) [ 10/Aug/12 ]

The patch was updated incorrectly when support for the new kernel was added. Please check: http://review.whamcloud.com/#change,3599

Comment by Yang Sheng [ 10/Aug/12 ]

patch for b2_1: http://review.whamcloud.com/3600

Comment by Yang Sheng [ 10/Aug/12 ]

patch for b1_8: http://review.whamcloud.com/3599

Comment by Shuichi Ihara (Inactive) [ 19/Aug/12 ]

I reproduced this problem with lustre-1.8.8 on RHEL5 and confirmed that it is fixed by the kernel patch from LU-1720. I also confirmed the fix on lustre-2.1.2 with RHEL5. The problem doesn't happen on RHEL6, even without the patches.

Comment by Shuichi Ihara (Inactive) [ 23/Aug/12 ]

Hi, I've confirmed that the LU-1720 patch solves this problem. Would you please merge it once the review is finished?

Comment by Peter Jones [ 31/Aug/12 ]

Landed to b1_8. Still needs to land to 2.x branches

Comment by Yang Sheng [ 01/Nov/12 ]

Patch landed to all branches. Closing bug.
