Details
- Type: Bug
- Resolution: Unresolved
- Priority: Blocker
- Affects Version/s: Lustre 2.16.1
- Environment: Lustre 2.15 / EL7 or Lustre 2.16 / EL9
- Severity: 2
Description
Hello! I just realized that posix_fallocate() allows users to bypass Lustre quota enforcement. Tested against Lustre 2.15.5 and 2.16.1.
I'm attaching a simple fallocate helper fallocate.c that I used below.
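For context, such a helper is essentially a thin wrapper around posix_fallocate(). A minimal sketch of what it might look like (an approximation for illustration only, not the exact attached file):

/* Hypothetical sketch of a fallocate.c-style helper:
 * preallocate <size> bytes for <file> with posix_fallocate(). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <size-in-bytes>\n", argv[0]);
                return 1;
        }

        off_t size = (off_t)strtoll(argv[2], NULL, 10);
        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* posix_fallocate() returns the error number directly (not via errno) */
        int rc = posix_fallocate(fd, 0, size);
        if (rc != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
                close(fd);
                return 1;
        }

        printf("Successfully allocated %lld bytes for file %s\n",
               (long long)size, argv[1]);
        close(fd);
        return 0;
}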
We assume quota enforcement is enabled here (osd-ldiskfs.lustre-OST*.quota_slave_dt.enabled=up):
User quota enforcement:
$ lfs setquota -u sthiell -B 1T /elm
$ ./fallocate bigfalloc-user $(( 2 * 1024 * 1024 * 1024 * 1024 ))
$ stat bigfalloc-user
  File: bigfalloc-user
  Size: 2199023255552   Blocks: 4294968080   IO Block: 4194304   regular file
Device: 899af214h/2308633108d   Inode: 144115441093058211   Links: 1
Access: (0644/-rw-r--r--)  Uid: (282232/ sthiell)   Gid: (  100/   users)
Access: 2025-05-14 15:39:29.000000000 -0700
Modify: 2025-05-14 15:39:29.000000000 -0700
Change: 2025-05-14 15:39:29.000000000 -0700
 Birth: 2025-05-14 15:39:29.000000000 -0700
$ lfs quota -u sthiell -h /elm/
Disk quotas for usr sthiell (uid 282232):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
          /elm/      2T*     0k      1T       -       5       0       0       -
On another system, I was able to bypass project quotas in the same way.
$ ./fallocate /scratch/users/sthiell/bigfalloc $(( 10 * 1024 * 1024 * 1024 * 1024 ))
Successfully allocated 10995116277760 bytes for file /scratch/users/sthiell/bigfalloc
$ lfs quota -p 282232 -h /scratch/
Disk quotas for prj 282232 (pid 282232):
     Filesystem    used   quota   limit   grace   files   quota       limit   grace
      /scratch/   112T*     0k    100T       -    4849       0   104857600       -
pid 282232 is using default block quota setting
Originally discovered with dtar which makes use of posix_fallocate() to preallocate the resulting tar file. It seems pretty important to fix this ASAP.
Attachments
- dd.run.gz
- 552 kB
- fallocate_mod.c
- 1.0 kB
- fallocate.c
- 0.9 kB
- fallocate.run.gz
- 447 kB
- lfallocate.sh
- 0.7 kB
Activity
If I understand correctly, only the first fallocate beyond the quota limit will succeed, but further calls to fallocate will not. I also tried to write to the fallocated file afterwards and got EDQUOT. So it might not be as bad as I originally thought! Still the initial fallocate allows users to get past their quota limits, at least logically.
Thanks for your report. Yes, I also understand this is the current behaviour, and I think this is a known limitation in Lustre (looking at how the sanity-quota/1h test case is done). Please note that if fallocate is called in several smaller incremental steps it honours EDQUOT, as demonstrated by lfallocate.sh (attached). I have also updated the title to reflect the bug more clearly; please feel free to correct the title.
Thanks Arshad and Sergey for looking into this! If I understand correctly, only the first fallocate beyond the quota limit will succeed, but further calls to fallocate will not. I also tried to write to the fallocated file afterwards and got EDQUOT. So it might not be as bad as I originally thought! Still the initial fallocate allows users to get past their quota limits, at least logically.
I wanted to note that this also impacts the DDN ExaScaler stack; it is easily reproduced with the standard fallocate command (with or without -x). This is on our NVIDIA SuperPod:
sthiell@login-01:/scratch/m000001/sthiell/tmp$ lfs setstripe -c 10 teststripe
sthiell@login-01:/scratch/m000001/sthiell/tmp$ fallocate -x -l 50T teststripe
sthiell@login-01:/scratch/m000001/sthiell/tmp$ stat teststripe
  File: teststripe
  Size: 54975581388800   Blocks: 107374201760   IO Block: 4194304   regular file
Device: 2015dafeh/538303230d   Inode: 360288091639906923   Links: 1
Access: (0644/-rw-r--r--)  Uid: (1695273714/ sthiell)   Gid: (851715518/marlowe-m000001)
Access: 2025-06-06 12:11:59.000000000 -0700
Modify: 2025-06-06 12:11:53.000000000 -0700
Change: 2025-06-06 12:11:53.000000000 -0700
 Birth: 2025-06-06 12:11:53.000000000 -0700
sthiell@login-01:/scratch/m000001/sthiell/tmp$ lfs quota -h -p 851715518 /scratch/
Disk quotas for prj 851715518 (pid 851715518):
     Filesystem     used   quota   limit   grace    files   quota   limit   grace
      /scratch/   50.13T*     0k     15T       -   641726       0  831488       -
Then, a write to this file quickly generates EDQUOT, and the overall number of allocated blocks is somehow reduced:
sthiell@login-01:/scratch/m000001/sthiell/tmp$ dd if=/dev/zero of=teststripe bs=16M count=1000 seek=1M
dd: error writing 'teststripe': Disk quota exceeded
5+0 records in
4+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.0529419 s, 1.3 GB/s
sthiell@login-01:/scratch/m000001/sthiell/tmp$ stat teststripe
  File: teststripe
  Size: 17592253153280   Blocks: 34359826608   IO Block: 4194304   regular file
Device: 2015dafeh/538303230d   Inode: 360288091639907030   Links: 1
Access: (0644/-rw-r--r--)  Uid: (1695273714/ sthiell)   Gid: (851715518/marlowe-m000001)
Access: 2025-06-06 13:23:23.000000000 -0700
Modify: 2025-06-06 13:23:23.000000000 -0700
Change: 2025-06-06 13:23:23.000000000 -0700
 Birth: 2025-06-06 13:20:36.000000000 -0700
It's not great if a successfully fallocated file cannot be written to afterwards; the point of using fallocate is to make sure the blocks are reserved. It would be better if the initial fallocate could fail with EDQUOT, I think. Thanks again.
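For reference, the behaviour above can be reproduced with a few lines of C. This is a minimal sketch assuming a Lustre client mount at /mnt/lustre and a user whose block quota is well below 2T; the path and sizes are placeholders, not taken from this ticket:

/* Hedged sketch of the reported behaviour: a single large posix_fallocate()
 * past the quota limit succeeds, but subsequent writes into the preallocated
 * range fail with EDQUOT. Paths and sizes are placeholders. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char buf[1 << 20];   /* 1 MiB of zeroes */

int main(void)
{
        const char *path = "/mnt/lustre/bigfalloc";       /* placeholder path */
        off_t size = 2LL * 1024 * 1024 * 1024 * 1024;     /* 2T, above the quota */

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        int rc = posix_fallocate(fd, 0, size);
        printf("posix_fallocate: %s\n",
               rc ? strerror(rc) : "succeeded (EDQUOT would be expected here)");

        /* With the current behaviour the allocation above succeeds even though
         * it is far beyond the quota; writing into the preallocated range then
         * fails with EDQUOT once the client can no longer cache the dirty data. */
        for (int i = 0; i < 256; i++) {
                if (write(fd, buf, sizeof(buf)) < 0) {
                        printf("write failed after %d MiB: %s\n", i, strerror(errno));
                        break;
                }
        }

        close(fd);
        return 0;
}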
Hi Stephane,
My first reaction was that this was a quota enforcement problem, which led me down a different path of analysis; the quota internals are not very well known to me, and I was wrong. The sanity-quota/1h test case does check for EDQUOT, however it does so by dividing the allocation into several parts. I could make your fallocate.c work by calling fallocate in steps (please see Sergey's answer/comments in this ticket). My max_dirty_mb was set to 100M and I was testing with files smaller than 100M, and I could still see qsd_op_begin0() correctly hit EDQUOT.
fallocate_mod.c (gcc -o posix_fallocate fallocate_mod.c) is your fallocate.c, slightly modified to accept a start offset instead of always using 0. lfallocate.sh is adapted from sanity-quota/1h to call the fallocate_mod binary in steps.
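For illustration, a rough sketch of such an offset-aware helper (an approximation, not the exact attached fallocate_mod.c). Because the wrapper script issues the total allocation as several smaller posix_fallocate() calls, EDQUOT is honoured once the limit is crossed:

/* Approximate sketch of a fallocate_mod.c-style helper (not the attached file):
 * allocate <len> bytes starting at <offset>, so a wrapper script can issue the
 * allocation in several smaller steps instead of one big call. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <file> <offset> <len>\n", argv[0]);
                return 1;
        }

        off_t offset = (off_t)strtoll(argv[2], NULL, 10);
        off_t len = (off_t)strtoll(argv[3], NULL, 10);

        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        int rc = posix_fallocate(fd, offset, len);
        if (rc != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
                close(fd);
                return 1;
        }

        printf("Successfully allocated %lld bytes for file %s\n",
               (long long)len, argv[1]);
        close(fd);
        return 0;
}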
This is what I am getting:
Quota set to 14M
$ lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/      4k      0k     14M       -       1       0       0       -
Write 15M.
$ ./lfallocate.sh /mnt/lustre/dir/f1 15728640
Successfully allocated 7864320 bytes for file /mnt/lustre/dir/f1
posix_fallocate: Disk quota exceeded
./lfallocate.sh: failed for /mnt/lustre/dir/f1
Write 12M
./lfallocate.sh /mnt/lustre/dir/f1 12582912
Successfully allocated 6291456 bytes for file /mnt/lustre/dir/f1
Successfully allocated 6291456 bytes for file /mnt/lustre/dir/f1
Verify
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     12M      0k     14M       -       2       0       0       -
Thanks scherementsev for your inputs.
It does it in a reporting manner, i.e. it tells the QMT "the usage has been increased", but doesn't ask "can I increase the usage?"
This implies that it allocates first and reports the usage afterwards? From what I understand, fallocate declares quota usage through osd_declare_inode_qid(), but it may not query the QMT (master) before proceeding?
I traced through osd_declare_inode_qid() and it does seem to call qsd_op_begin().
osd_declare_inode_qid
->qsd_op_begin()
Small snippet of the call graph (tracing just lquota.ko):
# tracer: function_graph
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.235 us    |  qsd_op_begin [lquota]();
 1)   0.114 us    |  qsd_op_begin [lquota]();
 1)   0.124 us    |  qsd_op_begin [lquota]();
 1)   0.111 us    |  qsd_op_begin [lquota]();
 1)   0.146 us    |  qsd_op_end [lquota]();
...
Note that it is possible to go over the quota limit with fallocate only when the file is first allocated; the second time, if you try to fallocate a larger size, it fails.
I can confirm I have hit this too. This means it is querying the QMT and reporting the error. What conditions are met here? Any hints?
Hi arshad512,
Regarding your example with dd: I think this is well-known behaviour where the data is first stored in a cache on the client. If the "cache size" is big enough, it can cause the limit to be exceeded. You could try the same after setting max_dirty_mb to 10M, or just try to write the same amount of data divided into several parts: 40M + 20M + 10M - I expect you will get -EDQUOT when trying to write the 20M. I think this is not related to the fallocate problem from the description.
I'm pretty busy right now and can't spend time looking into the details. But at first look, the problem with fallocate is that the QSD doesn't acquire the quota during ofd_object_fallocate. It does it in a reporting manner, i.e. it tells the QMT "the usage has been increased", but doesn't ask "can I increase the usage?" before allocating. This behaviour might have strong reasons that I'm not aware of, but at first look it can be fixed if required.
Note that it is possible to go over the quota limit with fallocate only when the file is first allocated. The second time, if you try to fallocate a larger size, it fails:
[root@vm1 tests]# lfs quota -u quota_usr /mnt/lustre
Disk quotas for usr quota_usr (uid 1000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
    /mnt/lustre       0       0   20480       -       0       0       0       -
[root@vm1 tests]# ./runas -u quota_usr fallocate -l100M /mnt/lustre/ff
running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
 [fallocate] [-l100M] [/mnt/lustre/ff]
[root@vm1 tests]# ./runas -u quota_usr fallocate -l110M /mnt/lustre/ff
running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
 [fallocate] [-l110M] [/mnt/lustre/ff]
fallocate: fallocate failed: Disk quota exceeded
Hi scherementsev, something does not seem right - maybe it is my understanding. Could you please confirm whether this is the correct behavior for quota enforcement?
It seems that quota is not getting enforced even for a generic write (dd). I have to redo some of the log gathering and will be uploading it soon.
Here is the test-case. (This is with latest master)
mkdir /mnt/lustre/dir
chown quota_usr /mnt/lustre/dir
lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/dir
runuser -l quota_usr -c 'dd if=/dev/zero of=/mnt/lustre/dir/d1 bs=1M count=70'
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     68M*     25M     25M       -      2*       0       0       -
... few seconds later...
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     70M*     25M     25M       -      2*       0       0       -
With the same test case as above, switching dd to fallocate, we see the same output. As it stands, we think this is incorrect. Therefore, is the behavior for the above 'dd' run also incorrect?
runuser -l quota_usr -c 'fallocate -l70M /mnt/lustre/dir/f1'
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     70M*     25M     25M       -      2*       0       0       -
Under the setattr/fallocate path, osd_declare_inode_qid() is correctly set. On my test system it is failing under qmt_adjust_edquot(): lqe_revoke_time (not sure why this is not set) is always 0, which leads to the quota limit not being honored.
May 16 03:27:41 rocky9a kernel: LustreError: 143883:0:(qmt_entry.c:552:qmt_adjust_edquot()) $$$ set revoke_time explicitly qmt:lustre-QMT0000 pool:dt-0x0 id:1002 enforced:1 hard:25600 soft:25600 granted:34820 time:1747985261 qunit: 1024 edquot:0 may_rel:0 revoke:0 default:no
I will need time to debug this.
Recreation steps:
lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
runas -u quota_usr -g quota_usr fallocate -l50M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
Output:
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     50M*     25M     25M       -      2*       0       0       -
Expected Output:
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     50M*     25M     25M       -      2*       0       0       -   <<<<<<<<<< 50M should be less than 25M
Hi Andreas, Stephane
This is indeed a bug. I could recreate this with master + Rocky 9.3 and a modified sanity-quota/78A.
runas -u quota_usr -g quota_usr fallocate -x -l30M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
running as uid/gid/euid/egid 1002/1002/1002/1002, groups: 1002
 [fallocate] [-x] [-l30M] [/mnt/lustre/d78A.sanity-quota/f78A.sanity-quota]
Disk quotas for usr quota_usr (uid 1002):
     Filesystem  kbytes  bquota  blimit  bgrace   files  iquota  ilimit  igrace
    /mnt/lustre  30724*   25600   25600       -      2*       0       0       -   <<<<<<<<<<<<<< xxxx
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     30M*     25M     25M       -      2*       0       0       -   <<<<<<<<<<<<<<<
While we do have a test case for fallocate + quota, the overflow was not getting checked. Sorry for that. I am looking into this now.
The problem here is that fallocate doesn't take into account the quota overflow flags at the client side. It only fails if you allocate new parts of the file one by one, e.g. fallocate -l10M f1, then fallocate -o10M -l10M f1, then fallocate -o20M -l10M f1, and so on. Besides that, it is still possible to fallocate a new file when the client is over quota:
That said, two things should be fixed here: (1) fallocate should honour the client-side over-quota flags, and (2) a single fallocate request larger than the remaining quota should fail with -EDQUOT instead of succeeding.
I'm not sure yet how this would be implemented. The 1st issue could be fixed pretty easily - something like osc_quota_chkdq should be called to check that we are not over quota yet. Another approach, which can probably be combined with the 1st one, is to return -EDQUOT from the server side.
The 2nd problem looks a bit more complicated to fix. Probably we should retrieve the quotas (like we do for a regular lfs quota) in the fallocate handler to estimate whether the requested fallocate size fits within the remaining quota. Or we could check that at the server side and return -EDQUOT if the OST cannot supply the requested amount of space.
I haven't looked at the code, so there could be better solutions.