  Lustre / LU-9341

PFL: append should not instantiate full layout

Details


    Description

      Appending to a PFL file will cause all layout components to be instantiated because it isn't possible to know what the ending offset is at the time the write is started.

      It would be better to avoid this, potentially by locking/instantiating some larger (but not gigantic) range beyond the current EOF, and retrying the layout intent if that fails. The client currently has to be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry.
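
      For illustration, the behaviour can be observed from any client with a few commands; the path and the three-component layout below are only examples, not a recommended configuration:

      # create a file with an explicit three-component PFL layout (example values)
      lfs setstripe -E 64M -c 1 -E 1G -c 4 -E -1 -c 16 /lustre/testfs/append_demo

      # a single appended line ...
      echo "one line" >> /lustre/testfs/append_demo

      # ... instantiates every component, not just the first:
      lfs getstripe /lustre/testfs/append_demo | grep lcme_flags: | grep -c init
      # prints 3, even though the file is only a few bytes long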

        Activity

            [LU-9341] PFL: append should not instantiate full layout

            Hello! I wanted to clarify something regarding the file size distribution of our O_APPEND files. I'm sorry, but I originally only scanned for files <= 128KB. I redid a partial scan last night and these are the new results:

            [root@fir-rbh01 data_hacks]# cat /tmp/sz | ./histogram.py -l -b 20 -p
            # NumSamples = 226682; Min = 0.00; Max = 517320932.00
            # Mean = 21389.221213; Variance = 1342429682961.335938; SD = 1158632.678186; Median 1218.000000
            # each ∎ represents a count of 1457
                0.0000 -   493.3562 [ 50221]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (22.15%)
              493.3562 -  1480.0685 [109348]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (48.24%)
             1480.0685 -  3453.4931 [ 17414]: ∎∎∎∎∎∎∎∎∎∎∎ (7.68%)
             3453.4931 -  7400.3424 [  2219]: ∎ (0.98%)
             7400.3424 - 15294.0409 [ 43661]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (19.26%)
            15294.0409 - 31081.4379 [  1009]:  (0.45%)
            31081.4379 - 62656.2319 [   777]:  (0.34%)
            62656.2319 - 125805.8200 [   614]:  (0.27%)
            125805.8200 - 252104.9961 [   372]:  (0.16%)
            252104.9961 - 504703.3483 [   266]:  (0.12%)
            504703.3483 - 1009900.0527 [   307]:  (0.14%)
            1009900.0527 - 2020293.4616 [   215]:  (0.09%)
            2020293.4616 - 4041080.2794 [    44]:  (0.02%)
            4041080.2794 - 8082653.9150 [    27]:  (0.01%)
            8082653.9150 - 16165801.1862 [   186]:  (0.08%)
            16165801.1862 - 32332095.7286 [     1]:  (0.00%)
            32332095.7286 - 64664684.8134 [     0]:  (0.00%)
            64664684.8134 - 129329862.9829 [     0]:  (0.00%)
            129329862.9829 - 258660219.3219 [     0]:  (0.00%)
            258660219.3219 - 517320932.0000 [     1]:  (0.00%)
            

            Just to clarify: we do still have files > 128KB like that, but the fact remains that most of the files suspected to be opened with O_APPEND are small files. I'm for a simple solution that can land quickly, rather than nothing or a complex one. And I don't mind impacting the performance of our users doing open(O_APPEND) anyway.

             

            @dauchy,

            If you want to check the distribution of such files, this is how I do it: determine the file size in your PFL setting at which all components are instantiated, say it's 100GB. Then determine the maximum number of components (in the example below, 6). We want to scan all files that are smaller than that and yet have all of their components initialized (lcme_flags showing "init"). Those files are either files that were opened with O_APPEND or files that were big and then truncated, but I assume here that the latter is rare.

            Then, I run something like this:

            $ find /lustre -size -100G -type f -exec ./chkstripe.sh 6 {} \;

            chkstripe.sh being:

            #!/bin/bash
            # chkstripe.sh: print "<size> <path>" for files whose number of
            # instantiated PFL components matches the expected count.
            
            initcnt=$1    # expected number of components flagged "init"
            path=$2
            
            # count layout components whose lcme_flags include "init"
            c=$(lfs getstripe "$path" | grep lcme_flags: | grep -c init)
            
            if [[ $c -eq $initcnt ]]; then
                sz=$(stat -c '%s' "$path")
                echo "$sz" "$path"
            fi
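
            (As an aside: on a client whose lfs find handles composite layouts correctly (see the utils patch further down in this ticket), roughly the same scan can be expressed without the helper script. The component count, size limit and path below are examples only, and note that --component-flags matches files with at least one "init" component, so this is an approximation of the script's exact-count check; the sizes can then be gathered with stat as in chkstripe.sh.)

            $ lfs find /lustre -type f --size -100G \
                  --component-count 6 --component-flags init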
            

             

            sthiell Stephane Thiell added a comment

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35617
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0d002cacdf58c13a0d5dfe3681536013cf529da4

            gerrit Gerrit Updater added a comment

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35611
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f322f65a011c9e888b28981b1eee1bd34d0d93ae

            gerrit Gerrit Updater added a comment

            Andreas, no, I don't have a size distribution for O_APPEND files, sorry.  The users scatter and name their Slurm job output files in various ways, so I would have to scan the whole file system and guess at naming conventions.  And even that might not catch other files that happened to be created with O_APPEND.

            If you can suggest a way to scan the filesystem for files which are not using their last instantiated extent, I'm happy to try to provide more data.

            Capping the size on O_APPEND files is potentially useful, but also violates the principle of least surprise on a POSIX-like filesystem, and would lead to very unhappy and confused users if writes fail unexpectedly.  Hence my suggestion of "truncating" the PFL layout to N extents, and keeping the extent end of the last component.  Hopefully it would be fairly easy to take the layout that would otherwise be created on O_APPEND and just set the layout to the first N components, modifying the last component end to be the original layout's last component end.  No additional pool specification needed, no max size limit surprises, the system "just works" for the users.
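
            To make the "truncate the layout to N components" idea concrete, here is a hypothetical sketch (the component sizes and stripe counts are made up, not a recommendation). If the filesystem default were the first layout, an O_APPEND create with N=2 would use the second, keeping only the first two components and moving the last component's end out to the original layout's end:

            # hypothetical default PFL layout on the parent directory
            lfs setstripe -E 64M -c 1 -E 1G -c 4 -E -1 -c 16 <dir>
            
            # layout an O_APPEND create would get with N=2: the first two
            # components kept, the last extent end stretched to -1 (EOF)
            lfs setstripe -E 64M -c 1 -E -1 -c 4 <file>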

            dauchy Nathan Dauchy (Inactive) added a comment

            Rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N" where N=0 is to disable and use the filesystem default? That would allow admins to set whatever count fits best with the general workload. It would also fit better with my other concern about controlling what pool that (single) stripe is on, so have "create_append_stripe_pool=S". (Otherwise it would use any OST, and a large log file single-striped could fill up an expensive SSD target. Or if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small write performance!) And I guess "create_append_stripe_width=N" for completeness.

            My goal here is to keep the interface as simple as possible so that it can be implemented, tested, landed, backported, and deployed in a reasonable time, since the more complex it is the more chance of bugs, longer testing, more conflicts/dependencies, etc.

            I guess I would be OK with append_layout_enable=<stripe_count> instead of only 0, 1 (0 has the added benefit of meaning "use the default stripe count" anyway), though there are performance optimizations that can be done for one-stripe files (e.g. patch https://review.whamcloud.com/31553 "LU-10782 llite: simple layout append tiny write") that would be lost if the files are striped across OSTs. I guess append_layout_pool=<pool> wouldn't be too much more complex. I'd rather avoid the full spectrum of "create complex file layout via tunable parameters" for this issue. I think the stripe_width is so rarely used, and is irrelevant for 1-stripe files, that it doesn't warrant a separate parameter - we can use the filesystem default stripe_width for this.

            If we want to allow a complex file layout at some point in the future, this could be done with append_layout_enable=[FID] and LU-9982, but I don't think that is needed for most cases. We could also leverage patch https://review.whamcloud.com/33126 "LU-11234 lod: add pool selection layout_policy" to specify a different pool for specific file extensions like *.log *.out, etc., and that could also integrate with LU-9982 to specify a complex layout, but neither of those patches is itself finished, they don't necessarily catch all log files (which overwhelmingly use O_APPEND), and they aren't necessarily simple enough to make a trivial backport to 2.10.

            Nathan, did you ever check/post the file size distribution for O_APPEND log files at your site? I see comment-245248 from Stephane, but nothing from you here or in the other DDN tickets in Jira. If the log files are all small (Stephane reported 92% smaller than 13KB and 100% smaller than 128KB), then the OST they are located on doesn't really matter, and storing them on flash might make sense anyway because that is faster and applications will eventually be blocked on their stdout/stderr if it is taking a long time. If there is an upper limit on the size of such log files, that would avoid filling flash OSTs/MDTs, and be a generally useful feature as well (IMHO at least).

            adilger Andreas Dilger added a comment

            dauchy,

            Unfortunately, choosing how many components to instantiate isn't an option - if it were easy to do that, we probably wouldn't be having this conversation.  Not instantiating the whole layout for appends is quite challenging, for subtle reasons around eliminating the possibility of write-splitting in the scenario where more than one client is writing at once and some writes span the boundary between instantiated and uninstantiated components.  The only ways we came up with of doing that which are sure to work are fairly heavy-handed.  (The details and the things we tried are buried in the ~40 previous comments on this bug.)

            So Andreas suggested we sidestep this by creating a special layout, as almost all O_APPEND users probably prefer this anyway, and this is straightforward to implement.

            pfarrell Patrick Farrell (Inactive) added a comment

            "as the lower write latency could improve application performance if they are blocked on the log writes"
            I don't think in general that DOM has lower write latency?  It's got massively better latency for the first write to a newly created file, but after the first write, don't we expect the latency to be the same?  Even if the MDT is flash and the OST is not, I think we expect it to go in cache anyway, right?

            But, then again, I suppose APPEND is being used because multiple threads might be writing to this file, and in that case, we're going to spam syncs from lock contention, so it would be very good to be on flash rather than spinning disk.  Hmm.

            It seems like a "nice to have", but it does seem useful.

            pfarrell Patrick Farrell (Inactive) added a comment
            dauchy Nathan Dauchy (Inactive) added a comment (edited)

            Andreas, Mike,

            Rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N" where N=0 is to disable and use the filesystem default?  That would allow admins to set whatever count fits best with the general workload.  It would also fit better with my other concern about controlling what pool that (single) stripe is on, so have "create_append_stripe_pool=S".  (Otherwise it would use any OST, and a large log file single-striped could fill up an expensive SSD target.  Or if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small write performance!)  And I guess "create_append_stripe_width=N" for completeness.

            Update: Upon re-reading Andreas' comment, maybe the intent was to still use a PFL layout. That brings to mind the possibility of just defining a "create_append_extents=N" option which indicates how many of the default layout's PFL extents to define.  I think in our case we would set it to 2, to get the first part on SSD and have a backstop of a single larger stripe on HDD.  The end of the last default extent (typically "-1") should be applied to the file, OR *_max_mb if that is defined, though I think that would no longer be needed.

            Update 2: Thanks Patrick for catching that I said "instantiate" when I should not have.  My thought was to still create a special layout, with PFL, but base it on the default layout... truncated to N extents.  Then the whole thing can be instantiated.  

            Thanks!


            Mike,
            could you please take a crack at this? It may be fairly easy for you to implement the basic functionality.

            The general consensus is that we should default to creating a 1-stripe file when a file is opened with O_APPEND, because multiple stripes are pure overhead in this case (extra OST object creation, extra locking of all the objects, etc.) with no benefit (multiple threads cannot write to the same file in parallel anyway). There should be a tunable parameter that disables this feature, something like lctl set_param mdt.*.append_onestripe_enable=0 or ...append_logfile_enable=0 or similar (enabled by default since this is the most common case).

            There could be a mdt.*.append_*_max_mb=N tunable to set the maximum size of the log files. Probably the default for *_max_mb should be unlimited (it will run out of space when OST is full), and use a plain file layout, which provides maximum compatibility for clients and applications. If it is not unlimited, then it will need to create a single PFL component with a limited size, and all PFL-capable clients will return an error if they try to write beyond the end of the component. There definitely may be a benefit in many environments to limit the maximum size of a log file to protect from a runaway job spamming the filesystem (I've wanted this even for local filesystems on occasion).
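
            For illustration, the layout such a cap would translate to at create time is simply a single size-limited PFL component; the 1GB value and the path below are hypothetical:

            # roughly what an append_*_max_mb=1024 create would produce:
            # one 1-stripe component covering [0, 1G); writes past 1G would fail
            lfs setstripe -E 1G -c 1 /lustre/testfs/job.out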

            I was thinking that having a DoM + 1-stripe OST PFL layout could be useful, but this doesn't help in the end because O_APPEND will always instantiate the OST object anyway, and the OST would need to be checked for data each time. That would just be overhead and doesn't make any sense compared to a DoM-only file.

            However, I think it would make sense to allow creating a DoM-only PFL file if only small log files are used (which seems to be typical), as the lower write latency could improve application performance if they are blocked on the log writes. It could also hurt performance if the MDS is overloaded, and it could consume too much MDT space if the logs got very large, so this behavior shouldn't be enabled by default. Maybe mdt.*.append_*_enable=mdt to enable creating the component on DoM, and change *_max_mb by default (if currently unlimited) for such files to 32MB?
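
            For reference, a DoM-only layout of this shape can already be created by hand today; the 32MB size is just the value suggested above, and note that the server-side DoM size limit (lod.*.dom_stripesize) may need to be raised to allow a component that large:

            # single Data-on-MDT component holding the whole (capped) log file
            lfs setstripe -E 32M -L mdt /lustre/testfs/logdir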

            If we wanted to get fancy at some point in the future (if this feature gains interest), we could allow ...append_enable=[FID] together with patch https://review.whamcloud.com/28972 "LU-9982 lustre: clients striping from mapped FID in nodemap" to specify a source FID for the layout template for all O_APPEND files, which allows maximum flexibility instead of trying to specify a variety of layouts via tunable parameters, but this would require that LU-9982 is completed first to allow specifying an arbitrary FID as a layout template instead of the parent directory, so it isn't a goal for the initial patch.

            adilger Andreas Dilger added a comment

            This issue appears to be causing problems for us as well, primarily for Slurm stdout/stderr files (which, as Stephane said, can be scattered all over various directories). Using 2.10.5+ servers and 2.10.7+ clients.

            The other trigger that I don't see mentioned is probably less common... a seek to some point in the file without actually writing anything. It may not be directly related, but would be an opportunity for efficiency if any changes could also handle "dd if=/dev/zero of=sparse_file bs=1 count=0 seek=100G".

            dauchy Nathan Dauchy (Inactive) added a comment

            Sorry for the lack of context! Yes, the above units are in bytes. I sometimes use histogram.py (from https://github.com/bitly/data_hacks ) to display quick histograms from data points, hence the weird ranges (they depend on the input data).

            Anyway, yes, these files are all very small (< 128KB). Thanks for the different links, this is very interesting. To be honest, we would already be happy with a simple patch to try the one-stripe option, as long as there is a tunable to enable/disable it on the fly. This would allow us to still use a "large" PFL layout in general but not waste inodes on OSTs for O_APPEND files.

            sthiell Stephane Thiell added a comment

            People

              pfarrell Patrick Farrell (Inactive)
              adilger Andreas Dilger
              Votes: 0
              Watchers: 17
