[LU-16771] statfs_max_age not used with statfs() project quotas? Created: 25/Apr/23  Updated: 08/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Stephane Thiell
Assignee: Stephane Thiell
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

2.12 servers with 2.15 clients


Issue Links:
Related
is related to LU-17395 df -h is limited by project quota eve... Open
is related to LU-15721 projid quota limit statfs() on direct... Resolved
is related to LU-17191 sanity-quota test_1b, 1d, 1f, 1i: FAI... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have noticed that smbd can issue many statfs() calls. On our Lustre filesystem with project quotas enabled, each call gathers project quota information, which takes time.
I tried tuning Lustre's statfs caching (statfs_max_age), but it doesn't seem to help. Is the cache not used when project quotas are set?

For example, on Oak we have 464 HDD-based OSTs. When they are loaded, gathering quota info can take a while, and raising statfs_max_age doesn't seem to help:

[root@oak-h04v20 pmischel]# lctl set_param llite.oak-ffff999f20cf9800.statfs_max_age=600
llite.oak-ffff999f20cf9800.statfs_max_age=600
[root@oak-h04v20 pmischel]# time df .; sleep 1; time df .; sleep 1; time df .
Filesystem                              1K-blocks      Used    Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 250000000000 349118136 249650881864   1% /oak

real	0m1.786s
user	0m0.002s
sys	0m0.006s
Filesystem                              1K-blocks      Used    Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 250000000000 349118136 249650881864   1% /oak

real	0m1.253s
user	0m0.000s
sys	0m0.008s
Filesystem                              1K-blocks      Used    Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 250000000000 349118136 249650881864   1% /oak

real	0m0.272s
user	0m0.002s
sys	0m0.006s

We use the "get quota command" option defined in smb.conf to return filesystem quotas to SMB clients. For us this is a user-space program that does its own caching, so we don't need quota info from statfs() at all. Is there a way to either disable the project quota behavior of statfs() or make statfs() caching work with 2.15 clients and project quotas?



 Comments   
Comment by Stephane Thiell [ 25/Apr/23 ]

As a performance comparison, this is what statfs() gives on the filesystem's root directory, which has no project quota (projid 0):

[root@oak-h04v20 pmischel]# cd /oak
[root@oak-h04v20 oak]#  time df .; sleep 1; time df .; sleep 1; time df .
Filesystem                                1K-blocks           Used      Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 50078673614848 32180974619712 17392664012044  65% /oak

real	0m0.001s
user	0m0.000s
sys	0m0.001s
Filesystem                                1K-blocks           Used      Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 50078673614848 32180974619712 17392664012044  65% /oak

real	0m0.001s
user	0m0.000s
sys	0m0.001s
Filesystem                                1K-blocks           Used      Available Use% Mounted on
10.0.2.51@o2ib5:10.0.2.52@o2ib5:/oak 50078673614848 32180974619712 17392664012044  65% /oak

real	0m0.001s
user	0m0.001s
sys	0m0.000s
Comment by Andreas Dilger [ 26/Apr/23 ]

Stephane, I had a brief look at the code and thought this might be easily fixed in ll_statfs() just by adding a check for max_age. However, I realized a complication is that the project quota is (as expected) project/directory-specific, so it isn't safe to cache the statfs data between calls for the whole filesystem. The project quota usage would need to be cached either in the ll_inode_info directory union (which has unused space vs. the file union) or on a per-project basis (e.g. in a hash table).
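
To make the per-project option concrete, here is a minimal user-space sketch of a cache keyed by project ID with a maximum age. It is not Lustre code: every name in it (proj_cache, proj_statfs, PROJ_CACHE_SLOTS and the functions) is hypothetical, the table is a simple direct-mapped array rather than a real hash table, and locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PROJ_CACHE_SLOTS 64        /* hypothetical table size */

struct proj_statfs {               /* hypothetical subset of statfs fields */
    uint64_t blocks;
    uint64_t bfree;
};

struct proj_cache_entry {
    uint32_t projid;               /* project ID owning this slot */
    int valid;                     /* non-zero once the slot is filled */
    time_t stamp;                  /* time the data was stored */
    struct proj_statfs data;
};

struct proj_cache {
    time_t max_age;                /* seconds an entry stays usable */
    struct proj_cache_entry slot[PROJ_CACHE_SLOTS];
};

void proj_cache_init(struct proj_cache *c, time_t max_age)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
    c->max_age = max_age;
}

/* Return 1 and copy the cached data to *out on a fresh hit, else 0.
 * A slot holding another project or an expired entry counts as a miss. */
int proj_cache_lookup(const struct proj_cache *c, uint32_t projid,
                      time_t now, struct proj_statfs *out)
{
    const struct proj_cache_entry *e = &c->slot[projid % PROJ_CACHE_SLOTS];

    if (!e->valid || e->projid != projid)
        return 0;
    if (now - e->stamp > c->max_age)
        return 0;
    *out = e->data;
    return 1;
}

/* Store fresh data for projid, overwriting whatever shared the slot. */
void proj_cache_store(struct proj_cache *c, uint32_t projid,
                      time_t now, const struct proj_statfs *in)
{
    struct proj_cache_entry *e = &c->slot[projid % PROJ_CACHE_SLOTS];

    e->projid = projid;
    e->valid = 1;
    e->stamp = now;
    e->data = *in;
}
```

Because each entry records its project ID, results for one project are never returned for another, which is the property a single filesystem-wide cache cannot provide. The caller passes the current time explicitly so that the expiry logic is easy to exercise.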

Comment by Stephane Thiell [ 26/Apr/23 ]

It makes sense, thanks Andreas. Do you think it would make sense to add a client tunable to disable project-enabled statfs() for such use cases (like our SMB gateways)? Maybe that would be the easiest thing to do and I could try to have a look. Don't get me wrong, we like the statfs() feature in 2.15 on clients with actual users – we don't want to disable it everywhere.

Comment by Andreas Dilger [ 27/Apr/23 ]

Yes, that would be reasonably straightforward to do, maybe "llite.*.statfs_project"?

Something like the following (totally untested) would enable the statfs_project feature by default, but allow it to be turned off if needed:

        set_bit(LL_SBI_TINY_WRITE, sbi->ll_flags);
        set_bit(LL_SBI_PARALLEL_DIO, sbi->ll_flags);
+       set_bit(LL_SBI_STATFS_PROJECT, sbi->ll_flags);

int ll_statfs(struct dentry *de, struct kstatfs *sfs)
{
        struct ll_sb_info *sbi = ll_s2sbi(sb);
        :
-       rc = ll_statfs_internal(ll_s2sbi(sb), &osfs, OBD_STATFS_SUM);
+       rc = ll_statfs_internal(sbi, &osfs, OBD_STATFS_SUM);
        :

        if (ll_i2info(de->d_inode)->lli_projid &&
+           test_bit(LL_SBI_STATFS_PROJECT, sbi->ll_flags) &&
            test_bit(LLIF_PROJECT_INHERIT, &ll_i2info(de->d_inode)->lli_flags))
                return ll_statfs_project(de->d_inode, sfs);
}

                case LL_SBI_LAZYSTATFS:
                case LL_SBI_VERBOSE:
+               case LL_SBI_STATFS_PROJECT:
                        if (turn_off)
                                clear_bit(token, sbi->ll_flags);
                        else
                                set_bit(token, sbi->ll_flags);
                        break;


        {LL_SBI_ENCRYPT,                "noencrypt"},
        {LL_SBI_FOREIGN_SYMLINK,        "foreign_symlink=%s"},
+       {LL_SBI_STATFS_PROJECT,         "statfs_project"},
+       {LL_SBI_STATFS_PROJECT,         "nostatfs_project"},
        {LL_SBI_NUM_MOUNT_OPT,          NULL},


        LL_SBI_FOREIGN_SYMLINK,         /* foreign fake-symlink support */
        LL_SBI_FOREIGN_SYMLINK_UPCALL,  /* foreign fake-symlink upcall set */
+       LL_SBI_STATFS_PROJECT,          /* statfs returns project quota */
        LL_SBI_NUM_MOUNT_OPT,

along with suitable code copied in lustre/llite/lprocfs_llite.c to add the statfs_project parameter (e.g. from statahead_agl).

This should allow using "mount -t lustre -o nostatfs_project" or "lctl set_param llite.FSNAME-*.statfs_project=0" to disable it.
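
The flag logic in the untested diff above can be illustrated in user space. The sketch below is only a model of the idea, not the landed patch: the sk_ names, bit numbers and option table are hypothetical, and a plain unsigned long stands in for sbi->ll_flags. It shows the feature enabled by default, a "no" prefix clearing the bit, and the project-specific statfs path being taken only when the project ID is non-zero, the flag is set, and the directory inherits its project ID.

```c
#include <assert.h>
#include <string.h>

enum { SK_STATFS_PROJECT = 0, SK_VERBOSE = 1 };   /* hypothetical bits */

/* Feature is on by default, as in the proposed set_bit() at setup time. */
#define SK_DEFAULT_FLAGS (1UL << SK_STATFS_PROJECT)

struct sk_opt {
    int bit;
    const char *name;
};

static const struct sk_opt sk_opts[] = {
    { SK_STATFS_PROJECT, "statfs_project" },
    { SK_VERBOSE,        "verbose" },
};

/* Apply one option token to *flags. "name" sets the bit, "noname" clears
 * it. Returns 0 on success, -1 if the token matches no known option. */
int sk_apply_option(unsigned long *flags, const char *token)
{
    size_t n = sizeof(sk_opts) / sizeof(sk_opts[0]);
    int turn_off;

    assert(flags != NULL && token != NULL);
    turn_off = strncmp(token, "no", 2) == 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(token, sk_opts[i].name) == 0) {
            *flags |= 1UL << sk_opts[i].bit;
            return 0;
        }
        if (turn_off && strcmp(token + 2, sk_opts[i].name) == 0) {
            *flags &= ~(1UL << sk_opts[i].bit);
            return 0;
        }
    }
    return -1;
}

/* Mirror of the three-way check in the diff: returns 1 when the
 * project-specific statfs should be used, 0 for the regular statfs. */
int sk_use_project_statfs(unsigned long flags, unsigned int projid,
                          int inherit)
{
    return projid != 0 &&
           (flags & (1UL << SK_STATFS_PROJECT)) != 0 &&
           inherit != 0;
}
```

With the flag cleared, every directory falls back to the regular filesystem-wide statfs, which is the path that benefits from statfs_max_age caching.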

Comment by Gerrit Updater [ 28/Oct/23 ]

"Stephane Thiell <sthiell@stanford.edu>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52872
Subject: LU-16771 llite: add statfs_project tunable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 784280bc2682647cebd44e0c97fec65eb15a57d9

Comment by Stephane Thiell [ 28/Oct/23 ]

Thanks Andreas for the guidance! Sorry that I lost track of this, but the problem resurfaced recently. In this case, we noticed a large number of mds_quotactl Lustre RPCs coming from some MPI jobs, and this was apparently due to statfs() being called by PMPI_File_delete():

Thread 1 (Thread 0x7f9f899b3740 (LWP 24341)):
#0  0x00007f9f88046657 in statfs64 () from /lib64/libc.so.6
#1  0x00007f9f880466cb in statvfs64 () from /lib64/libc.so.6
#2  0x00007f9f87a15927 in opal_path_nfs () from /share/software/user/open/openmpi/4.1.2/lib/libopen-pal.so.40
#3  0x00007f9f88eed83d in mca_fs_base_get_fstype () from /share/software/user/open/openmpi/4.1.2/lib/libmpi.so.40
#4  0x00007f9f701ca3ef in mca_fs_lustre_component_file_query () from /share/software/user/open/openmpi/4.1.2/lib/openmpi/mca_fs_lustre.so
#5  0x00007f9f88eed1c6 in mca_fs_base_file_select () from /share/software/user/open/openmpi/4.1.2/lib/libmpi.so.40
#6  0x00007f9f7a46d24c in mca_common_ompio_file_delete () from /share/software/user/open/openmpi/4.1.2/lib/libmca_common_ompio.so.41
#7  0x00007f9f7b94d5de in delete_select () from /share/software/user/open/openmpi/4.1.2/lib/openmpi/mca_io_ompio.so
#8  0x00007f9f88eef1c5 in mca_io_base_delete () from /share/software/user/open/openmpi/4.1.2/lib/libmpi.so.40
#9  0x00007f9f88e99fd3 in PMPI_File_delete () from /share/software/user/open/openmpi/4.1.2/lib/libmpi.so.40
<private code>

The jobs in question were looping over thousands of files (of course), calling PMPI_File_delete() on each. statfs() is called by opal_path_nfs() to check "whether fname is on network file system"... Having a tunable to turn statfs_project off should help until some kind of caching is implemented.

Comment by Andreas Dilger [ 28/Oct/23 ]

Honestly, I would also file a bug with whoever implemented that for PMPI_File_delete() to see if it can be changed or removed. Why do they care if the file is on a network filesystem, and is sending thousands of extra RPCs on a network filesystem a good way to improve performance? At a minimum it would make sense to do statfs on the parent directories and not the files, and cache this across calls if the files are being deleted in the same directory (or have the same parent, and statfs that once). That will save everyone's time...

Failing that, you could add a per-projid statfs cache that is subject to the same 1-second expiry as regular statfs? That avoids the need to disable this feature just because of a misfeature in a userspace tool, and would solve the problem for everyone (including those who will likely not know about this tunable, or will not be able to turn it off themselves).

Comment by Stephane Thiell [ 28/Oct/23 ]

You're totally right; I will investigate where it would be best to report this. However, I also fear it is not an isolated occurrence with PMPI_File_delete(), as from what I gather, MPI developers seem to really like statfs()!! See LU-15788 for another example of statfs() being used in some MPI-IO implementation when opening a file... I am actually a bit surprised this problem is not reported more often, but if you don't use project quotas, Lustre's statfs() with the caching seems VERY fast. So OK, I will also start to look at how to add a per-projid statfs cache. It would be ideal to have that available on compute nodes, to keep the nice per-project statfs feature while reducing overall RPCs.
For my original problem reported in this ticket, which was related to the use of Samba re-exporting Lustre on dedicated servers, statfs_project=0 will be sufficient.

Comment by Gerrit Updater [ 03/Jan/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52872/
Subject: LU-16771 llite: add statfs_project tunable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d0a968312aab883e4a0b3f40816e881b93b9dd2c

Comment by Peter Jones [ 03/Jan/24 ]

Landed for 2.16

Comment by Andreas Dilger [ 03/Jan/24 ]

Peter, I don't think that the landed patch really fixes the issue properly. It adds a knob to disable the project quota specific statfs, but that only helps if the admin changes the parameter. It would be better to have a cache on the client for this info so that it "just works".

Generated at Sat Feb 10 03:29:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.