Details
- Type: Bug
- Resolution: Unresolved
- Priority: Blocker
- Affects Version/s: Lustre 2.16.1
- Environment: Lustre 2.15 / EL7 or Lustre 2.16 / EL9
- Severity: 2
Description
Hello! I just realized that posix_fallocate() allows users to bypass Lustre quota enforcement. Tested against Lustre 2.15.5 and 2.16.1.
I'm attaching a simple fallocate helper fallocate.c that I used below.
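For context, such a helper is essentially a thin wrapper around posix_fallocate(). A minimal sketch of what it might look like (an approximation for illustration only, not the exact attached file):

/* Hypothetical sketch of a fallocate.c-style helper:
 * preallocate <size> bytes for <file> with posix_fallocate(). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <size-in-bytes>\n", argv[0]);
                return 1;
        }

        off_t size = (off_t)strtoll(argv[2], NULL, 10);
        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* posix_fallocate() returns the error number directly (not via errno) */
        int rc = posix_fallocate(fd, 0, size);
        if (rc != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
                close(fd);
                return 1;
        }

        printf("Successfully allocated %lld bytes for file %s\n",
               (long long)size, argv[1]);
        close(fd);
        return 0;
}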
We assume quota enforcement is enabled here (osd-ldiskfs.lustre-OST*.quota_slave_dt.enabled=up):
User quota enforcement:
$ lfs setquota -u sthiell -B 1T /elm
$ ./fallocate bigfalloc-user $(( 2 * 1024 * 1024 * 1024 * 1024 ))
$ stat bigfalloc-user
  File: bigfalloc-user
  Size: 2199023255552   Blocks: 4294968080   IO Block: 4194304   regular file
Device: 899af214h/2308633108d   Inode: 144115441093058211   Links: 1
Access: (0644/-rw-r--r--)  Uid: (282232/ sthiell)   Gid: (  100/   users)
Access: 2025-05-14 15:39:29.000000000 -0700
Modify: 2025-05-14 15:39:29.000000000 -0700
Change: 2025-05-14 15:39:29.000000000 -0700
 Birth: 2025-05-14 15:39:29.000000000 -0700
$ lfs quota -u sthiell -h /elm/
Disk quotas for usr sthiell (uid 282232):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
          /elm/      2T*     0k      1T       -       5       0       0       -
On another system, I was able to bypass project quotas in the same way.
$ ./fallocate /scratch/users/sthiell/bigfalloc $(( 10 * 1024 * 1024 * 1024 * 1024 ))
Successfully allocated 10995116277760 bytes for file /scratch/users/sthiell/bigfalloc
$ lfs quota -p 282232 -h /scratch/
Disk quotas for prj 282232 (pid 282232):
     Filesystem    used   quota   limit   grace   files   quota       limit   grace
      /scratch/   112T*     0k    100T       -    4849       0   104857600       -
pid 282232 is using default block quota setting
Originally discovered with dtar which makes use of posix_fallocate() to preallocate the resulting tar file. It seems pretty important to fix this ASAP.
Attachments
- dd.run.gz
- 552 kB
- fallocate_mod.c
- 1.0 kB
- fallocate.c
- 0.9 kB
- fallocate.run.gz
- 447 kB
- lfallocate.sh
- 0.7 kB
Activity
If I understand correctly, only the first fallocate beyond the quota limit will succeed, but further calls to fallocate will not. I also tried to write to the fallocated file afterwards and got EDQUOT. So it might not be as bad as I originally thought! Still the initial fallocate allows users to get past their quota limits, at least logically.
Thanks for your report. Yes, I also understand this is the current behaviour, and I think this is a known limitation in Lustre (looking at how the sanity-quota/1h test case is done). Please note that if fallocate is called in several smaller incremental steps it honours EDQUOT, as demonstrated by lfallocate.sh (attached). I have also updated the title to reflect the bug more clearly; please feel free to correct the title.
Thanks Arshad and Sergey for looking into this! If I understand correctly, only the first fallocate beyond the quota limit will succeed, but further calls to fallocate will not. I also tried to write to the fallocated file afterwards and got EDQUOT. So it might not be as bad as I originally thought! Still the initial fallocate allows users to get past their quota limits, at least logically.
I wanted to note that this also impacts the DDN ExaScaler stack; it is easily reproduced with the standard fallocate command (with or without -x). This is on our NVIDIA SuperPod:
sthiell@login-01:/scratch/m000001/sthiell/tmp$ lfs setstripe -c 10 teststripe
sthiell@login-01:/scratch/m000001/sthiell/tmp$ fallocate -x -l 50T teststripe
sthiell@login-01:/scratch/m000001/sthiell/tmp$ stat teststripe
  File: teststripe
  Size: 54975581388800   Blocks: 107374201760   IO Block: 4194304   regular file
Device: 2015dafeh/538303230d   Inode: 360288091639906923   Links: 1
Access: (0644/-rw-r--r--)  Uid: (1695273714/ sthiell)   Gid: (851715518/marlowe-m000001)
Access: 2025-06-06 12:11:59.000000000 -0700
Modify: 2025-06-06 12:11:53.000000000 -0700
Change: 2025-06-06 12:11:53.000000000 -0700
 Birth: 2025-06-06 12:11:53.000000000 -0700
sthiell@login-01:/scratch/m000001/sthiell/tmp$ lfs quota -h -p 851715518 /scratch/
Disk quotas for prj 851715518 (pid 851715518):
     Filesystem     used   quota   limit   grace    files   quota   limit   grace
      /scratch/   50.13T*     0k     15T       -   641726       0  831488       -
Then, a write to this file quickly generates EDQUOT, and the overall number of allocated blocks is somehow reduced:
sthiell@login-01:/scratch/m000001/sthiell/tmp$ dd if=/dev/zero of=teststripe bs=16M count=1000 seek=1M
dd: error writing 'teststripe': Disk quota exceeded
5+0 records in
4+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.0529419 s, 1.3 GB/s
sthiell@login-01:/scratch/m000001/sthiell/tmp$ stat teststripe
  File: teststripe
  Size: 17592253153280   Blocks: 34359826608   IO Block: 4194304   regular file
Device: 2015dafeh/538303230d   Inode: 360288091639907030   Links: 1
Access: (0644/-rw-r--r--)  Uid: (1695273714/ sthiell)   Gid: (851715518/marlowe-m000001)
Access: 2025-06-06 13:23:23.000000000 -0700
Modify: 2025-06-06 13:23:23.000000000 -0700
Change: 2025-06-06 13:23:23.000000000 -0700
 Birth: 2025-06-06 13:20:36.000000000 -0700
It's not great if a successfully fallocated file cannot be written to afterwards; the point of using fallocate is to make sure the blocks are reserved. It would be better if the initial fallocate could fail with EDQUOT, I think. Thanks again.
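For reference, the behaviour above can be reproduced with a few lines of C. This is a minimal sketch assuming a Lustre client mount at /mnt/lustre and a user whose block quota is well below 2T; the path and sizes are placeholders, not taken from this ticket:

/* Hedged sketch of the reported behaviour: a single large posix_fallocate()
 * past the quota limit succeeds, but subsequent writes into the preallocated
 * range fail with EDQUOT. Paths and sizes are placeholders. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char buf[1 << 20];   /* 1 MiB of zeroes */

int main(void)
{
        const char *path = "/mnt/lustre/bigfalloc";       /* placeholder path */
        off_t size = 2LL * 1024 * 1024 * 1024 * 1024;     /* 2T, above the quota */

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        int rc = posix_fallocate(fd, 0, size);
        printf("posix_fallocate: %s\n",
               rc ? strerror(rc) : "succeeded (EDQUOT would be expected here)");

        /* With the current behaviour the allocation above succeeds even though
         * it is far beyond the quota; writing into the preallocated range then
         * fails with EDQUOT once the client can no longer cache the dirty data. */
        for (int i = 0; i < 256; i++) {
                if (write(fd, buf, sizeof(buf)) < 0) {
                        printf("write failed after %d MiB: %s\n", i, strerror(errno));
                        break;
                }
        }

        close(fd);
        return 0;
}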
Hi Stephane,
My first reaction was that this was a quota enforcement problem, which led me down a different path of analysis; the quota internals are not very well known to me, and I was wrong. The sanity-quota/1h test case does check for EDQUOT, however it does so by dividing the allocation into several parts. I could make your fallocate.c work by calling fallocate in steps (please see Sergey's answer/comments in this ticket). My max_dirty_mb was set to 100M and I was testing with files smaller than 100M, and I could still see qsd_op_begin0() correctly hit EDQUOT.
fallocate_mod.c (gcc -o posix_fallocate fallocate_mod.c) is your fallocate.c, slightly modified to accept a start offset instead of always using 0. lfallocate.sh is adapted from sanity-quota/1h to call the fallocate_mod binary in steps.
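For illustration, a rough sketch of such an offset-aware helper (an approximation, not the exact attached fallocate_mod.c). Because the wrapper script issues the total allocation as several smaller posix_fallocate() calls, EDQUOT is honoured once the limit is crossed:

/* Approximate sketch of a fallocate_mod.c-style helper (not the attached file):
 * allocate <len> bytes starting at <offset>, so a wrapper script can issue the
 * allocation in several smaller steps instead of one big call. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc != 4) {
                fprintf(stderr, "usage: %s <file> <offset> <len>\n", argv[0]);
                return 1;
        }

        off_t offset = (off_t)strtoll(argv[2], NULL, 10);
        off_t len = (off_t)strtoll(argv[3], NULL, 10);

        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        int rc = posix_fallocate(fd, offset, len);
        if (rc != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
                close(fd);
                return 1;
        }

        printf("Successfully allocated %lld bytes for file %s\n",
               (long long)len, argv[1]);
        close(fd);
        return 0;
}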
This is what I am getting:
Quota set to 14M
$ lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/      4k      0k     14M       -       1       0       0       -
Write 15M.
$ ./lfallocate.sh /mnt/lustre/dir/f1 15728640
Successfully allocated 7864320 bytes for file /mnt/lustre/dir/f1
posix_fallocate: Disk quota exceeded
./lfallocate.sh: failed for /mnt/lustre/dir/f1
Write 12M
./lfallocate.sh /mnt/lustre/dir/f1 12582912
Successfully allocated 6291456 bytes for file /mnt/lustre/dir/f1
Successfully allocated 6291456 bytes for file /mnt/lustre/dir/f1
Verify
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     12M      0k     14M       -       2       0       0       -
Thanks scherementsev for your inputs.
It does it in a reporting manner, i.e. it tells the QMT "the usage has been increased", but doesn't ask "can I increase the usage?"
This implies that it allocates first and reports the usage afterwards? From what I understand, fallocate declares quota usage through osd_declare_inode_qid(), but it may not query the QMT (master) before proceeding?
I traced through osd_declare_inode_qid() and it does seem to call qsd_op_begin().
osd_declare_inode_qid
->qsd_op_begin()
Small snippet of the call graph (tracing just lquota.ko):
# tracer: function_graph
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.235 us    |  qsd_op_begin [lquota]();
 1)   0.114 us    |  qsd_op_begin [lquota]();
 1)   0.124 us    |  qsd_op_begin [lquota]();
 1)   0.111 us    |  qsd_op_begin [lquota]();
 1)   0.146 us    |  qsd_op_end [lquota]();
...
Note that it is possible to go over the quota limit with fallocate only when the file is first allocated; the second time, if you try to fallocate a larger size, it fails.
I can confirm I have hit this too. This means it is querying the QMT and reporting the error. What conditions are met here? Any hints?
Hi arshad512,
Regarding your example with dd: I think this is well-known behaviour where the data is first stored in a cache on the client. If the "cache size" is big enough, it can cause the limit to be exceeded. You could try the same after setting max_dirty_mb to 10M, or just try to write the same amount of data divided into several parts: 40M + 20M + 10M - I expect you will get -EDQUOT when trying to write the 20M. I think this is not related to the fallocate problem from the description.
I'm pretty busy right now and can't spend time looking into the details. But at first look, the problem with fallocate is that the QSD doesn't acquire the quota during ofd_object_fallocate. It does it in a reporting manner, i.e. it tells the QMT "the usage has been increased", but doesn't ask "can I increase the usage?" before allocating. This behaviour might have strong reasons that I'm not aware of, but at first look it can be fixed if required.
Note that it is possible to go over the quota limit with fallocate only when the file is first allocated. The second time, if you try to fallocate a larger size, it fails:
[root@vm1 tests]# lfs quota -u quota_usr /mnt/lustre
Disk quotas for usr quota_usr (uid 1000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
    /mnt/lustre       0       0   20480       -       0       0       0       -
[root@vm1 tests]# ./runas -u quota_usr fallocate -l100M /mnt/lustre/ff
running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
 [fallocate] [-l100M] [/mnt/lustre/ff]
[root@vm1 tests]# ./runas -u quota_usr fallocate -l110M /mnt/lustre/ff
running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
 [fallocate] [-l110M] [/mnt/lustre/ff]
fallocate: fallocate failed: Disk quota exceeded
Hi scherementsev, something does not seem right - maybe it is my understanding. Could you please confirm whether this is the correct behavior for quota enforcement?
It seems that quota is not getting enforced even for a generic write (dd). I have to redo some of the log gathering and will be uploading it soon.
Here is the test-case. (This is with latest master)
mkdir /mnt/lustre/dir
chown quota_usr /mnt/lustre/dir
lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/dir
runuser -l quota_usr -c 'dd if=/dev/zero of=/mnt/lustre/dir/d1 bs=1M count=70'
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     68M*     25M     25M       -      2*       0       0       -
... few seconds later...
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     70M*     25M     25M       -      2*       0       0       -
With the same test case as above, switching dd to fallocate, we see the same output. As it stands, we think this is incorrect. Therefore, is the behavior for the above 'dd' run also incorrect?
runuser -l quota_usr -c 'fallocate -l70M /mnt/lustre/dir/f1'
lfs quota -u quota_usr -h /mnt/lustre/dir/
Disk quotas for usr quota_usr (uid 1002):
      Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/dir/     70M*     25M     25M       -      2*       0       0       -
Under the setattr/fallocate path, osd_declare_inode_qid() is correctly set. On my test system it is failing under qmt_adjust_edquot(): lqe_revoke_time (not sure why this is not set) is always 0, which leads to the quota limit not being honored.
May 16 03:27:41 rocky9a kernel: LustreError: 143883:0:(qmt_entry.c:552:qmt_adjust_edquot()) $$$ set revoke_time explicitly qmt:lustre-QMT0000 pool:dt-0x0 id:1002 enforced:1 hard:25600 soft:25600 granted:34820 time:1747985261 qunit: 1024 edquot:0 may_rel:0 revoke:0 default:no
I will need time to debug this.
Recreation steps:
lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
runas -u quota_usr -g quota_usr fallocate -l50M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
Output:
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     50M*     25M     25M       -      2*       0       0       -
Expected Output:
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     50M*     25M     25M       -      2*       0       0       -   <<<<<<<<<< 50M should be less than 25M
Hi Andreas, Stephane
This is indeed a bug. I could recreate this with master + Rocky 9.3 and a modified sanity-quota/78A.
runas -u quota_usr -g quota_usr fallocate -x -l30M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
running as uid/gid/euid/egid 1002/1002/1002/1002, groups: 1002
 [fallocate] [-x] [-l30M] [/mnt/lustre/d78A.sanity-quota/f78A.sanity-quota]
Disk quotas for usr quota_usr (uid 1002):
     Filesystem  kbytes  bquota  blimit  bgrace   files  iquota  ilimit  igrace
    /mnt/lustre  30724*   25600   25600       -      2*       0       0       -   <<<<<<<<<<<<<< xxxx
Disk quotas for usr quota_usr (uid 1002):
                   Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace
/mnt/lustre/d78A.sanity-quota     30M*     25M     25M       -      2*       0       0       -   <<<<<<<<<<<<<<<
While we do have a test case for fallocate + quota, the overflow was not getting checked. Sorry for that. I am looking into this now.
The problem here is that fallocate doesn't take into account the quota overflow flags at the client side. It only fails if you allocate new parts of the file one by one, e.g. fallocate -l10M f1, then fallocate -o10M -l10M f1, then fallocate -o20M -l10M f1, and so on. Besides that, it is still possible to fallocate a new file when the client is over quota:
That said, two things should be fixed here: (1) fallocate should honour the client-side over-quota flags, and (2) a single fallocate request larger than the remaining quota should fail with -EDQUOT instead of succeeding.
I'm not sure yet how this would be implemented. The 1st issue could be fixed pretty easily - something like osc_quota_chkdq should be called to check that we are not over quota yet. Another approach, which can probably be combined with the 1st one, is to return -EDQUOT from the server side.
The 2nd problem looks a bit more complicated to fix. Probably we should retrieve the quotas (like we do for a regular lfs quota) in the fallocate handler to estimate whether the requested fallocate size fits within the remaining quota. Or we could check that at the server side and return -EDQUOT if the OST cannot supply the requested amount of space.
I haven't looked at the code, so there could be better solutions.