  Lustre / LU-19018

fallocate bypasses quota limits initially, subsequent fallocate fails correctly with EDQUOT

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.1
    • Component/s: None
    • Environment: Lustre 2.15 / EL7 or Lustre 2.16 / EL9
    • Severity: 2

    Description

      Hello! I just realized that posix_fallocate() allows users to bypass Lustre quota enforcement. Tested against Lustre 2.15.5 and 2.16.1.

      I'm attaching a simple fallocate helper fallocate.c that I used below.
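      For reference, a minimal sketch of what such a helper could look like (an assumption for illustration, not necessarily identical to the attached fallocate.c; build with e.g. cc -D_FILE_OFFSET_BITS=64 -o fallocate fallocate.c):

      /* fallocate.c (sketch) - preallocate <size> bytes for <file> using
       * posix_fallocate(), the same call dtar relies on. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          if (argc != 3) {
              fprintf(stderr, "usage: %s <file> <size_in_bytes>\n", argv[0]);
              return 1;
          }

          off_t size = (off_t)strtoll(argv[2], NULL, 0);
          int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          /* posix_fallocate() returns 0 on success or an errno value. */
          int rc = posix_fallocate(fd, 0, size);
          if (rc != 0) {
              fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
              close(fd);
              return 1;
          }

          printf("Successfully allocated %lld bytes for file %s\n",
                 (long long)size, argv[1]);
          close(fd);
          return 0;
      }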

      We assume quota enforcement is enabled here (osd-ldiskfs.lustre-OST*.quota_slave_dt.enabled=up):

      User quota enforcement:

      $ lfs setquota -u sthiell -B 1T /elm
      
      $ ./fallocate bigfalloc-user $(( 2 * 1024 * 1024 * 1024 * 1024 )) 
      
      $ stat bigfalloc-user
        File: bigfalloc-user
        Size: 2199023255552	Blocks: 4294968080 IO Block: 4194304 regular file
      Device: 899af214h/2308633108d	Inode: 144115441093058211  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (282232/ sthiell)   Gid: (  100/   users)
      Access: 2025-05-14 15:39:29.000000000 -0700
      Modify: 2025-05-14 15:39:29.000000000 -0700
      Change: 2025-05-14 15:39:29.000000000 -0700
       Birth: 2025-05-14 15:39:29.000000000 -0700
      
      $ lfs quota -u sthiell -h  /elm/
      Disk quotas for usr sthiell (uid 282232):
            Filesystem    used   quota   limit   grace   files   quota   limit   grace
                 /elm/      2T*     0k      1T       -       5       0       0       -
      

      On another system, I was able to bypass project quotas in the same way.

      $ ./fallocate /scratch/users/sthiell/bigfalloc $(( 10 * 1024 * 1024 * 1024 * 1024 ))
      Successfully allocated 10995116277760 bytes for file /scratch/users/sthiell/bigfalloc
      
      
      $ lfs quota -p 282232 -h /scratch/
      Disk quotas for prj 282232 (pid 282232):
           Filesystem    used   quota   limit   grace   files   quota   limit   grace
            /scratch/    112T*     0k    100T       -    4849       0 104857600       -
      pid 282232 is using default block quota setting
      

      Originally discovered with dtar, which uses posix_fallocate() to preallocate the resulting tar file. It seems pretty important to fix this ASAP.

      Attachments

        1. dd.run.gz
          552 kB
        2. fallocate_mod.c
          1.0 kB
        3. fallocate.c
          0.9 kB
        4. fallocate.run.gz
          447 kB
        5. lfallocate.sh
          0.7 kB

        Activity

          [LU-19018] fallocate bypasses quota limits initially, subsequent fallocate fails correctly with EDQUOT

          arshad512 Arshad Hussain added a comment -

          Thanks scherementsev for your inputs.

          > It does it in the manner of reporting, i.e. it says to the QMT "the usage has been increased", but doesn't ask "can I increase the usage?"
          

          This implies that it allocates first and then reports the usage afterward? From what I understand, fallocate declares quota usage through osd_declare_inode_qid(), but it may not query the QMT (master) before proceeding?

          I am tracing through osd_declare_inode_qid() and it did seem to call qsd_op_begin().
          osd_declare_inode_qid
          ->qsd_op_begin()

          Small snippet of the call graph (tracing just lquota.ko):

          # tracer: function_graph                                                        
          CPU  DURATION                  FUNCTION CALLS                                 
          # |     |   |                     |   |   |   |                                 
           1)   0.235 us    |  qsd_op_begin [lquota]();                                   
           1)   0.114 us    |  qsd_op_begin [lquota]();                                   
           1)   0.124 us    |  qsd_op_begin [lquota]();                                   
           1)   0.111 us    |  qsd_op_begin [lquota]();                                   
           1)   0.146 us    |  qsd_op_end [lquota]();     
          ...
          ...   

           

          > Note that it is only possible to hit quota limits with fallocate once the file has already been allocated: the second time, if you try to fallocate a larger size, it fails.

          I can confirm I have hit this too. This means it is querying the QMT and reporting an error. What conditions are met here? Any hints?


          scherementsev Sergey Cheremencev added a comment -

          Hi arshad512,

          Regarding your example with dd: I think it is a well-known behaviour where the data is first stored in a cache on the client. If the "cache size" is big enough, it may cause the limit to be exceeded. You could try the same after setting max_dirty_mb to 10M, or just try to write the same amount of data divided into several parts: 40M + 20M + 10M - I expect you will get -EDQUOT when trying to write the 20M. I think this is not related to the fallocate problem from the description.

          I'm pretty busy right now and can't spend time looking into the details. But at first look, the problem with fallocate is that the QSD doesn't acquire quota during ofd_object_fallocate(). It does it in the manner of reporting, i.e. it says to the QMT "the usage has been increased", but doesn't ask "can I increase the usage?" before allocating. This behaviour might have some strong reasons that I'm not aware of, but at first look it can be fixed if required.

          Note that it is only possible to hit quota limits with fallocate once the file has already been allocated: the second time, if you try to fallocate a larger size, it fails:

          [root@vm1 tests]# lfs quota -u quota_usr /mnt/lustre
          Disk quotas for usr quota_usr (uid 1000):
               Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
              /mnt/lustre       0       0   20480       -       0       0       0       -
          [root@vm1 tests]# ./runas -u quota_usr fallocate -l100M /mnt/lustre/ff
          running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
           [fallocate] [-l100M] [/mnt/lustre/ff]
          [root@vm1 tests]# ./runas -u quota_usr fallocate -l110M /mnt/lustre/ff
          running as uid/gid/euid/egid 1000/1000/1000/1000, groups: 1000
           [fallocate] [-l110M] [/mnt/lustre/ff]
          fallocate: fallocate failed: Disk quota exceeded 
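
          For completeness, a small user-space reproducer for this two-step behaviour might look like the sketch below. It assumes a ~20M block hard limit is already in place for the calling user (as in the quota output above); the file name edquot_repro.c, the path and the sizes are only examples:

          /* edquot_repro.c (sketch) - the first over-quota allocation is
           * reported to succeed, a second larger one should fail with EDQUOT. */
          #define _GNU_SOURCE
          #include <errno.h>
          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          int main(void)
          {
              const char *path = "/mnt/lustre/ff";  /* example path */
              int fd = open(path, O_CREAT | O_WRONLY, 0644);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }

              /* 100M: well over the 20M limit, yet currently succeeds. */
              if (fallocate(fd, 0, 0, 100 * 1024 * 1024) < 0)
                  printf("first fallocate failed: %s\n", strerror(errno));
              else
                  printf("first fallocate (100M) succeeded despite the quota\n");

              /* 110M: only this second, larger allocation gets -EDQUOT. */
              if (fallocate(fd, 0, 0, 110 * 1024 * 1024) < 0)
                  printf("second fallocate failed: %s (expected EDQUOT)\n",
                         strerror(errno));
              else
                  printf("second fallocate (110M) unexpectedly succeeded\n");

              close(fd);
              return 0;
          }

          Run as the quota-limited user, this mirrors the fallocate -l100M / -l110M sequence shown above.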
          arshad512 Arshad Hussain added a comment - edited

          Hi scherementsev, something does not seem right, or maybe it is just my understanding. Could you please confirm whether this is the correct behavior for quota enforcement?
          It seems that quotas are not being enforced even for a generic write (dd). I have to redo some of the log gathering and will upload it soon.

          Here is the test case (this is with latest master):

          mkdir /mnt/lustre/dir
          chown quota_usr /mnt/lustre/dir
          lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/dir
          runuser -l quota_usr -c 'dd if=/dev/zero of=/mnt/lustre/dir/d1 bs=1M count=70'
          lfs quota -u quota_usr -h /mnt/lustre/dir/
          Disk quotas for usr quota_usr (uid 1002):
               Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
          /mnt/lustre/dir/    68M*     25M     25M       -      2*       0       0       - 

          ... few seconds later...

          lfs quota -u quota_usr -h /mnt/lustre/dir/
          Disk quotas for usr quota_usr (uid 1002):
               Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
          /mnt/lustre/dir/    70M*     25M     25M       -      2*       0       0       - 

           

          With the same test case as above, but switching dd to fallocate, we see the same output. As it stands, we think this is incorrect. So is the behavior of the 'dd' run above also incorrect?

           

          runuser -l quota_usr -c 'fallocate -l70M /mnt/lustre/dir/f1'
          lfs quota -u quota_usr -h /mnt/lustre/dir/
          Disk quotas for usr quota_usr (uid 1002):
               Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
          /mnt/lustre/dir/    70M*     25M     25M       -      2*       0       0       - 
          

           

           


          arshad512 Arshad Hussain added a comment -

          Under the setattr/fallocate path, osd_declare_inode_qid() is called correctly. On my test system it is failing under qmt_adjust_edquot(): lqe_revoke_time (I am not sure why this is not set) is always 0, which leads to the quota limit not being honored.

          May 16 03:27:41 rocky9a kernel: LustreError: 143883:0:(qmt_entry.c:552:qmt_adjust_edquot()) $$$ set revoke_time explicitly  qmt:lustre-QMT0000 pool:dt-0x0 id:1002 enforced:1 hard:25600 soft:25600 granted:34820 time:1747985261 qunit: 1024 edquot:0 may_rel:0 revoke:0 default:no

          I will need time to debug this.

           

          Recreation steps:

          lfs setquota -u quota_usr -b25M -B25M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
          runas -u quota_usr -g quota_usr fallocate -l50M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota

          Output:

          Disk quotas for usr quota_usr (uid 1002):
               Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
          /mnt/lustre/d78A.sanity-quota    50M*     25M     25M       -      2*       0       0       - 
           
          

          Expected: the allocation should be refused with EDQUOT, i.e. used should never be allowed to exceed the 25M limit (instead of the 50M* reported above).

           

          arshad512 Arshad Hussain added a comment - edited

          Hi Andreas, Stephane

          This is indeed a bug. I could recreate it with master + Rocky 9.3 and a modified sanity-quota/78A.

           

          runas -u quota_usr -g quota_usr fallocate -x -l30M /mnt/lustre/d78A.sanity-quota/f78A.sanity-quota
          running as uid/gid/euid/egid 1002/1002/1002/1002, groups: 1002
           [fallocate] [-x] [-l30M] [/mnt/lustre/d78A.sanity-quota/f78A.sanity-quota]
          Disk quotas for usr quota_usr (uid 1002):
               Filesystem  kbytes  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
              /mnt/lustre  30724*   25600   25600       -      2*       0       0       -  <<<<<<<<<<<<<<
          xxxx
          Disk quotas for usr quota_usr (uid 1002):
               Filesystem    used  bquota  blimit  bgrace   files  iquota  ilimit  igrace 
          /mnt/lustre/d78A.sanity-quota    30M*     25M     25M       -      2*       0       0       - <<<<<<<<<<<<<<<

          While we do have a test case for fallocate + quota, this overflow was not being checked. Sorry for that. I am looking into this now.

           

           


          adilger Andreas Dilger added a comment -

          arshad512 would you be able to take a look at this?


          People

            Assignee: arshad512 Arshad Hussain
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated: