[LU-13856] ost00 100% full Created: 05/Aug/20  Updated: 15/Mar/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Ryan Seal Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None

Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

I am unable to write data to the lustre file system due to ost00 being 100% full. I am receiving the following error:

State of repository file is unknown due to error while truncating file: Error writing iobuffer for '<file>': No space left on device



 Comments   
Comment by Andreas Dilger [ 05/Aug/20 ]

Typically it is best to keep the filesystem below 90% full to avoid sudden large application IO causing the filesystem to run out of space, and to avoid performance loss as the slowest inner tracks of the disk are used and free space fragmentation results in poor allocations.

Several things to do in this case:

  • run "lfs df" and "lfs df -i" to see if it is only OST0000 that is full, or if all OSTs are full. If it is only OST0000 that is full, this is likely due to a misconfiguration of the default file layout that is forcing all files to be allocated on that OST. Otherwise, the MDS would normally avoid OSTs that are becoming more full than others.
  • run "lfs getstripe <lustre_mountpoint>" to see what the filesystem-wide default layout is (lmm_stripe_offset and lmm_stripe_count are particularly important here), and possibly the same on some subdirectories and files, to see if they require files to use OST index 0 instead of "-1", which means "use any OST"
  • run "lfs find <lustre_dir> [...] --ost 0" to find files on this OST index, and see if there are disproportionate numbers of files in a specific directory using OST index 0 vs. the normal round-robin distribution that would be expected (i.e. only 1/num_osts of files should be allocated on OST0000 by default). You could also use this list to delete files that are no longer needed, if any.
  • run something like "lfs find <lustre_subdir> -ost 0 -mtime +10 -size +128M -print0 | lfs_migrate -y -0 -i -1" to find larger/older files in <lustre_subdir> and migrate them to a different OST (assuming there is space on the other OSTs)
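The first check in the list above can be scripted. This is a minimal sketch that flags OSTs at or above a usage threshold; the `lfs df` sample output here is fabricated for illustration (on a live client you would pipe the real `lfs df` instead of the here-string):

```shell
#!/bin/sh
# Hypothetical "lfs df" output; columns are UUID, 1K-blocks, Used,
# Available, Use%, "Mounted on". On a real client, replace the echo
# with:  lfs df | awk ...
sample='UUID                 1K-blocks        Used   Available Use% Mounted on
data-MDT0000_UUID      1000000      100000      900000  10% /data[MDT:0]
data-OST0000_UUID     10000000    10000000           0 100% /data[OST:0]
data-OST0001_UUID     10000000     7000000     3000000  70% /data[OST:1]'

# Print any OST whose Use% is at or above 90% (the rule of thumb above).
echo "$sample" | awk '$6 ~ /OST/ { gsub(/%/,"",$5); if ($5+0 >= 90) print $6, $5"%" }'
```

With the sample above this prints only the full OST, `/data[OST:0] 100%`.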
Comment by Ryan Seal [ 05/Aug/20 ]

ost00 is the only one that is full. The others are around 70% or less. I noticed that "pcs cluster status" shows pcsd as offline for the 4 OSS servers I have, but crm_mon.

When I run lfs df -i or lfs df -h I get no output.

For "lfs getstripe", where can I find the <lustre_mountpoint>?

Comment by Andreas Dilger [ 05/Aug/20 ]

The "<lustre_mountpoint>" is the directory where Lustre is mounted on the client. I don't know what that is, since there is exceedingly little in this ticket for me to work with. The "lfs df" and "lfs df -i" and "lfs getstripe" commands need to be run on a client node.

When you write that these commands "get no output", does that mean "they return immediately without printing anything"? That probably means that they are being run on the server instead of a client. Or do you mean "they hang forever and do not print anything", which probably means that the servers are not working properly, which would match your comments that report there are OSS nodes offline.

That said, if there are OSTs which are not working properly, that is useful information to have. Are there errors reported on the console of the client or OSS nodes, beyond the "no space left on device" error?

Comment by Ryan Seal [ 06/Aug/20 ]

I was able to get pcsd back online this morning. Currently I have set ost0000 to inactive on the primary MDS by doing the following:

lctl --device 8 deactivate

lctl set_param osp.data-OST0000*.active=0

Despite this being set, it is still trying to write to ost00.

 

I was running lfs on the servers, which explains why the commands were not working.

"lfs df" reports that ost0000 is mounted on "/data[OST:0]". We have many clients. Does it matter which client the migrate is run on?

I have 4 OSSs with 3 OSTs mounted on each. Could you help with how to migrate data off of ost00 to the others? Also, will the migration be done on the servers or the clients?

 

Comment by Andreas Dilger [ 06/Aug/20 ]

If you are using Lustre 2.10.7 or later on the servers then the preferred mechanism to stop file creation on the OST is "lctl set_param osp.data-OST0000*.max_create_count=0", which will stop file creation but not deactivate the OST completely. You can set "lctl set_param osp.data-OST0000*.active=1" to reactivate that OST.
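The set/verify/reactivate sequence above can be sketched as follows. This only prints the commands rather than running them, since they must be executed on the MDS; the fsname "data" and index 0000 are this ticket's values, so adjust them for other systems:

```shell
#!/bin/sh
# Dry-run sketch: emit the MDS-side commands to stop (and later resume)
# object creation on one OST, without executing anything.
quiesce_ost() {
    fsname=$1; ostidx=$2
    # Stop new object creation on the OST (preferred over active=0 on 2.10.7+).
    echo "lctl set_param osp.${fsname}-OST${ostidx}*.max_create_count=0"
    # Verify the setting took effect (the default value is 20000).
    echo "lctl get_param osp.${fsname}-OST${ostidx}*.max_create_count"
}
quiesce_ost data 0000
```

To resume creation later, the corresponding command would restore max_create_count (or, as noted above, "active=1" reactivates a fully deactivated OST).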

Could you help with how to migrate data off of ost00 to the others?

The mechanism to migrate files off OST0000 was already listed in my previous comment:

  • run something like "lfs find /data/<somedir> -ost 0 -mtime +10 -size +128M -print0 | lfs_migrate -y -0 -i -1" to find larger/older files in <somedir> and migrate them to a different OST (assuming there is space on the other OSTs)

Also will the migration be done on the servers or clients?

On the clients. If the "lfs find" is run on different subdirectories then you could run a few of them in parallel on different clients.

Does it matter which client the migrate is run on?

Not really.

Comment by Peter Jones [ 08/Aug/20 ]

Hi Ryan

I'm just checking in to see how things are progressing

Peter

Comment by Ryan Seal [ 13/Aug/21 ]

Is there a way to use lctl get_param to show the max_create_count?

Comment by Andreas Dilger [ 13/Aug/21 ]

Yes, "lctl get_param osp.*.max_create_count" on the MDS. This will normally default to 20000.

Comment by Ryan Seal [ 14/Sep/21 ]

It looks like the root of this issue is the striping of the data across all the OSTs. How do I configure Lustre to stripe data evenly across all the OSTs? Is this done on the primary MDS? How do I see the current striping configuration?

Comment by Andreas Dilger [ 14/Sep/21 ]

You can get the filesystem-wide default striping by running "lfs getstripe -d <root directory>" on a client. The default is:

stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1

The MDS will normally balance new files across all OSTs pretty evenly, unless told otherwise. It can happen that OSTs become imbalanced if there are very large 1-stripe files created on one OST, or if the default striping for a directory incorrectly uses "--stripe-index=0" to force creation on OST0000 instead of "--stripe-index=-1" that allows the MDS to select any OST.

Note that it is also possible to set a different default file layout on any subdirectory, or on a per-file basis, so the cause of this imbalance may be in a specific subdirectory.

Comment by Ryan Seal [ 29/Sep/21 ]

How would I set the striping to the default setting? The output from getstripe returned:

stripe_count: -1 stripe_size: 4194304 pattern: raid0 stripe_offset: 0

Could this be the reason ost00 is filling up? Would I set the stripe on a client?

Comment by Andreas Dilger [ 30/Sep/21 ]

Yes, setting "stripe_offset: 0" means "put all files onto OST0000". Also, the "stripe_count: -1" means "stripe across all OSTs", which is probably also not what you want, since this adds significant overhead for small files, and consumes a lot of objects needlessly on every OST.

You can fix both of these issues by running the following command as the root user (or via sudo) on any client:

# lfs setstripe --stripe-size=4M --stripe-index=-1 --stripe-count=1 <root_directory>

If you have a significant number of large files, it would be much better to set a default layout that is using the PFL feature:

# lfs setstripe -E 256M -c 1 -E 16G -c 4 -E eof -S 4M -c 40 <root_directory>

In this example, files up to 256MB will use a single OST, files up to 16GB will stripe across 4 OSTs, and anything larger than 16GB will be striped across (up to) 40 OSTs (if your filesystem has fewer than 40 OSTs, it will use the number available). See https://wiki.lustre.org/Configuring_Lustre_File_Striping for details.
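To make the component boundaries concrete, here is a small arithmetic illustration (not a Lustre command) of how the suggested PFL layout maps a file's size to the stripe count of the component covering its end; the thresholds mirror the "-E 256M -c 1 -E 16G -c 4 -E eof -c 40" example above:

```shell
#!/bin/sh
# Map a file size (in MB) to the stripe count of the PFL component that
# covers its tail, for the example layout suggested in this ticket.
stripes_for_size() {
    size_mb=$1
    if [ "$size_mb" -le 256 ]; then
        echo 1          # first component: up to 256M on a single OST
    elif [ "$size_mb" -le 16384 ]; then
        echo 4          # second component: up to 16G across 4 OSTs
    else
        echo 40         # final component: beyond 16G across up to 40 OSTs
    fi
}
stripes_for_size 100     # a 100MB file ends in the 1-stripe component
stripes_for_size 20480   # a 20GB file ends in the 40-stripe component
```

Note that a large file uses all components up to its size, so its total object count is the sum across the components it spans.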

Comment by Ryan Seal [ 17/Nov/21 ]

I set the striping to the recommended setting above.

# lfs setstripe -E 256M -c 1 -E 16G -c 4 -E eof -S 4M -c 40 <root_directory>

After doing this when trying to migrate data from one ost to another by:

# lfs find /data/example/ -ost 13 -mtime +10 -size +128M -print0 | lfs_migrate -y -0 -i -1

I get the following error:

lfs migrate migrate: unrecognized option '-1'

Is this due to the new striping configuration? If so, how do I manually migrate data in the event 1 ost becomes almost full?

 

Comment by Ryan Seal [ 07/Dec/21 ]

Any updates?

Comment by Andreas Dilger [ 09/Dec/21 ]

It might be that you need to specify "-i-1" (no space) to the lfs_migrate script, but in any case this should not be needed. However, if the files are large and striped across all OSTs, then migrating them will not actually reduce space usage, since the same amount of data will be on OST0000 after the migration.

My recommendation would be to migrate the largest files off OST0000, but reduce the stripe count to slightly below the actual number of OSTs, so that the full OST can be skipped, like:

client# lfs find -ost 13 -size +16G -mtime +10 -print0 |
        xargs -0 lfs migrate -c<ost_count - 1>

The other option would be to migrate a lot of small files that only have data on OST0000, since that avoids moving a lot of extra data, like:


client# lfs find -ost 13 -size -256M -mtime +10 -print0 |
        xargs -0 lfs migrate -c1

Comment by Ryan Seal [ 15/Mar/22 ]

I ran "lfs find -ost 13 -size -256M -mtime +10 -print0 | xargs -0 lfs migrate -c1" on one of the clients and I am getting the following error:

 

lfs_migrate is currently NOT SAFE for moving in-use files.
Use it only when you are sure migrated files are not in use.

If emptying an OST that is active on the MDS, new files may
use it. To stop allocating any new objects on OSTNNNN run:

   lctl set_param osp.<fname>-OSTNNNN*.max_create_count=0

on each MDS using the OST(s) being emptied.

Continue? (y/n)

 

I only have 1 active MDS with 1 MDT mounted and am only trying to migrate objects off ost00 across the other 10 OSTs. Before running the migrate I set max_create_count=0 for ost00 and verified the change. After receiving this error, I also set active=0, with the same result. I have confirmed that no new objects have been created since setting max_create_count=0 by checking the in-use size of ost00 over the last 24 hours. Is it safe to continue, or does the syntax of the lfs_migrate command need to be adjusted? If so, could you provide that as well?

Comment by Andreas Dilger [ 15/Mar/22 ]

You should not set active=0 on the MDS, since that will prevent it from destroying objects on OST0000. That was the old mechanism (before Lustre 2.4) and is no longer needed with max_create_count=0.

The "NOT SAFE" warning is a bit old, but has not been removed yet for a couple of reasons (LU-13475), but migrating files found with "-mtime +10" should be fine. You can use "lfs_migrate -y ..." to quiet this warning.
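Pulling the thread of this ticket together, the overall drain procedure can be sketched as a dry run that only prints the commands (the real ones need an MDS shell and a client mount; the fsname "data", mountpoint /data, and OST index are this ticket's values):

```shell
#!/bin/sh
# Print, without executing, the drain sequence discussed in this ticket.
drain_plan() {
    echo "# on the MDS: stop new object creation on the full OST"
    echo "lctl set_param osp.data-OST0000*.max_create_count=0"
    echo "# on a client: migrate small, old files off OST0000; -y skips the"
    echo "# 'NOT SAFE' prompt for files known not to be in use"
    echo "lfs find /data -ost 0 -size -256M -mtime +10 -print0 | lfs_migrate -y -0"
    echo "# on the MDS: restore creation once the OST has free space again"
    echo "lctl set_param osp.data-OST0000*.max_create_count=20000"
}
drain_plan
```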

Generated at Sat Feb 10 03:04:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.