  Lustre / LU-9341

PFL: append should not instantiate full layout

Details


    Description

      Appending to a PFL file will cause all layout components to be instantiated because it isn't possible to know what the ending offset is at the time the write is started.

      It would be better to avoid this, potentially by locking/instantiating some larger (but not gigantic) range beyond the current EOF, and retrying the layout intent if that fails. The client currently has to be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry.
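
      For illustration, the behaviour can be observed from any client with a few commands; the path and the three-component layout below are only examples, not a recommended configuration:

      # create a file with an explicit three-component PFL layout (example values)
      lfs setstripe -E 64M -c 1 -E 1G -c 4 -E -1 -c 16 /lustre/testfs/append_demo

      # a single appended line ...
      echo "one line" >> /lustre/testfs/append_demo

      # ... instantiates every component, not just the first:
      lfs getstripe /lustre/testfs/append_demo | grep lcme_flags: | grep -c init
      # prints 3, even though the file is only a few bytes long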

        Activity

            [LU-9341] PFL: append should not instantiate full layout

            Hello! I wanted to clarify something regarding the file size distribution of our O_APPEND files. I'm sorry, but I originally only scanned for files <= 128KB. I redid a partial scan last night and these are the new results:

            [root@fir-rbh01 data_hacks]# cat /tmp/sz | ./histogram.py -l -b 20 -p
            # NumSamples = 226682; Min = 0.00; Max = 517320932.00
            # Mean = 21389.221213; Variance = 1342429682961.335938; SD = 1158632.678186; Median 1218.000000
            # each ∎ represents a count of 1457
                0.0000 -   493.3562 [ 50221]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (22.15%)
              493.3562 -  1480.0685 [109348]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (48.24%)
             1480.0685 -  3453.4931 [ 17414]: ∎∎∎∎∎∎∎∎∎∎∎ (7.68%)
             3453.4931 -  7400.3424 [  2219]: ∎ (0.98%)
             7400.3424 - 15294.0409 [ 43661]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (19.26%)
            15294.0409 - 31081.4379 [  1009]:  (0.45%)
            31081.4379 - 62656.2319 [   777]:  (0.34%)
            62656.2319 - 125805.8200 [   614]:  (0.27%)
            125805.8200 - 252104.9961 [   372]:  (0.16%)
            252104.9961 - 504703.3483 [   266]:  (0.12%)
            504703.3483 - 1009900.0527 [   307]:  (0.14%)
            1009900.0527 - 2020293.4616 [   215]:  (0.09%)
            2020293.4616 - 4041080.2794 [    44]:  (0.02%)
            4041080.2794 - 8082653.9150 [    27]:  (0.01%)
            8082653.9150 - 16165801.1862 [   186]:  (0.08%)
            16165801.1862 - 32332095.7286 [     1]:  (0.00%)
            32332095.7286 - 64664684.8134 [     0]:  (0.00%)
            64664684.8134 - 129329862.9829 [     0]:  (0.00%)
            129329862.9829 - 258660219.3219 [     0]:  (0.00%)
            258660219.3219 - 517320932.0000 [     1]:  (0.00%)
            

            Just to clarify: we do still have files > 128KB like that, but the fact remains that most of the files suspected to be opened with O_APPEND are small files. I'm for a simple solution that can land quickly, rather than nothing or a complex one. And I don't mind impacting the performance of our users doing open(O_APPEND) anyway.

             

            @dauchy,

            If you want to check the distribution of such files, this is how I do it: determine the file size in your PFL setting at which all components are instantiated, say it's 100GB. Then determine the maximum number of components (in the example below, 6). We want to scan all files that are smaller than that and yet have all of their components initialized (lcme_flags showing "init"). Those files are either files that were opened with O_APPEND or files that were big and then truncated, but I assume here that the latter is rare.

            Then, I run something like this:

            $ find /lustre -size -100G -type f -exec ./chkstripe.sh 6 {} \;

            chkstripe.sh being:

            #!/bin/bash
            # chkstripe.sh: print "<size> <path>" for files whose number of
            # instantiated PFL components matches the expected count.
            
            initcnt=$1    # expected number of components flagged "init"
            path=$2
            
            # count layout components whose lcme_flags include "init"
            c=$(lfs getstripe "$path" | grep lcme_flags: | grep -c init)
            
            if [[ $c -eq $initcnt ]]; then
                sz=$(stat -c '%s' "$path")
                echo "$sz" "$path"
            fi
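
            (As an aside: on a client whose lfs find handles composite layouts correctly (see the utils patch further down in this ticket), roughly the same scan can be expressed without the helper script. The component count, size limit and path below are examples only, and note that --component-flags matches files with at least one "init" component, so this is an approximation of the script's exact-count check; the sizes can then be gathered with stat as in chkstripe.sh.)

            $ lfs find /lustre -type f --size -100G \
                  --component-count 6 --component-flags init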
            

             

            sthiell Stephane Thiell added a comment

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35617
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0d002cacdf58c13a0d5dfe3681536013cf529da4

            gerrit Gerrit Updater added a comment

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35611
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f322f65a011c9e888b28981b1eee1bd34d0d93ae

            gerrit Gerrit Updater added a comment

            Andreas, no, I don't have a size distribution for O_APPEND files, sorry.  The users scatter and name their Slurm job output files in various ways, so I would have to scan the whole file system and guess at naming conventions.  And even that might not catch other files that happened to be created with O_APPEND.

            If you can suggest a way to scan the filesystem for files which are not using their last instantiated extent, I'm happy to try to provide more data.

            Capping the size on O_APPEND files is potentially useful, but also violates the principle of least surprise on a POSIX-like filesystem, and would lead to very unhappy and confused users if writes fail unexpectedly.  Hence my suggestion of "truncating" the PFL layout to N extents, and keeping the extent end of the last component.  Hopefully it would be fairly easy to take the layout that would otherwise be created on O_APPEND and just set the layout to the first N components, modifying the last component end to be the original layout's last component end.  No additional pool specification needed, no max size limit surprises, the system "just works" for the users.
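
            To make the "truncate the layout to N components" idea concrete, here is a hypothetical sketch (the component sizes and stripe counts are made up, not a recommendation). If the filesystem default were the first layout, an O_APPEND create with N=2 would use the second, keeping only the first two components and moving the last component's end out to the original layout's end:

            # hypothetical default PFL layout on the parent directory
            lfs setstripe -E 64M -c 1 -E 1G -c 4 -E -1 -c 16 <dir>
            
            # layout an O_APPEND create would get with N=2: the first two
            # components kept, the last extent end stretched to -1 (EOF)
            lfs setstripe -E 64M -c 1 -E -1 -c 4 <file>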

            dauchy Nathan Dauchy (Inactive) added a comment

            Rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N" where N=0 is to disable and use the filesystem default? That would allow admins to set whatever count fits best with the general workload. It would also fit better with my other concern about controlling what pool that (single) stripe is on, so have "create_append_stripe_pool=S". (Otherwise it would use any OST, and a large log file single-striped could fill up an expensive SSD target. Or if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small write performance!) And I guess "create_append_stripe_width=N" for completeness.

            My goal here is to keep the interface as simple as possible so that it can be implemented, tested, landed, backported, and deployed in a reasonable time, since the more complex it is the more chance of bugs, longer testing, more conflicts/dependencies, etc.

            I guess I would be OK with append_layout_enable=<stripe_count> instead of only 0, 1 (0 has the added benefit of meaning "use the default stripe count" anyway), though there are performance optimizations that can be done for one-stripe files (e.g. patch https://review.whamcloud.com/31553 "LU-10782 llite: simple layout append tiny write") that would be lost if the files are striped across OSTs. I guess append_layout_pool=<pool> wouldn't be too much more complex. I'd rather avoid the full spectrum of "create complex file layout via tunable parameters" for this issue. I think the stripe_width is so rarely used, and is irrelevant for 1-stripe files, that it doesn't warrant a separate parameter - we can use the filesystem default stripe_width for this.

            If we want to allow a complex file layout at some point in the future, this could be done with append_layout_enable=[FID] and LU-9982, but I don't think that is needed for most cases. We could also leverage patch https://review.whamcloud.com/33126 "LU-11234 lod: add pool selection layout_policy" to specify a different pool for specific file extensions like *.log *.out, etc., and that could also integrate with LU-9982 to specify a complex layout, but neither of those patches is itself finished, they don't necessarily catch all log files (which overwhelmingly use O_APPEND), and they aren't necessarily simple enough to make a trivial backport to 2.10.

            Nathan, did you ever check/post the file size distribution for O_APPEND log files at your site? I see comment-245248 from Stephane, but nothing from you here or in the other DDN tickets in Jira. If the log files are all small (Stephane reported 92% smaller than 13KB and 100% smaller than 128KB), then the OST they are located on doesn't really matter, and storing them on flash might make sense anyway because that is faster and applications will eventually be blocked on their stdout/stderr if it is taking a long time. If there is an upper limit on the size of such log files, that would avoid filling flash OSTs/MDTs, and be a generally useful feature as well (IMHO at least).

            adilger Andreas Dilger added a comment

            dauchy,

            Unfortunately, choosing how many components to instantiate isn't an option - if it were easy to do that, we probably wouldn't be having this conversation.  Not instantiating the whole layout for appends is quite challenging, for subtle reasons around eliminating the possibility of write-splitting in the scenario where more than one client is writing at once and some writes span the boundary between instantiated and uninstantiated components.  The only ways we came up with of doing that which are sure to work are fairly heavy-handed.  (The details and the things we tried are buried in the ~40 previous comments on this bug.)

            So Andreas suggested we sidestep this by creating a special layout, as almost all O_APPEND users probably prefer this anyway, and this is straightforward to implement.

            pfarrell Patrick Farrell (Inactive) added a comment

            "as the lower write latency could improve application performance if they are blocked on the log writes"
            I don't think in general that DOM has lower write latency?  It's got massively better latency for the first write to a newly created file, but after the first write, don't we expect the latency to be the same?  Even if the MDT is flash and the OST is not, I think we expect it to go in cache anyway, right?

            But, then again, I suppose APPEND is being used because multiple threads might be writing to this file, and in that case, we're going to spam syncs from lock contention, so it would be very good to be on flash rather than spinning disk.  Hmm.

            It seems like a "nice to have", but it does seem useful.

            pfarrell Patrick Farrell (Inactive) added a comment
            dauchy Nathan Dauchy (Inactive) added a comment (edited)

            Andreas, Mike,

            Rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N" where N=0 is to disable and use the filesystem default?  That would allow admins to set whatever count fits best with the general workload.  It would also fit better with my other concern about controlling what pool that (single) stripe is on, so have "create_append_stripe_pool=S".  (Otherwise it would use any OST, and a large log file single-striped could fill up an expensive SSD target.  Or if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small write performance!)  And I guess "create_append_stripe_width=N" for completeness.

            Update: Upon re-reading Andreas' comment, maybe the intent was to still use a PFL layout. That brings to mind the possibility of just defining a "create_append_extents=N" option which indicates how many of the default layout's PFL extents to define.  I think in our case we would set it to 2, to get the first part on SSD and have a backstop of a single larger stripe on HDD.  The end of the last default extent (typically "-1") should be applied to the file, OR *_max_mb if that is defined, though I think that would no longer be needed.

            Update 2: Thanks Patrick for catching that I said "instantiate" when I should not have.  My thought was to still create a special layout, with PFL, but base it on the default layout... truncated to N extents.  Then the whole thing can be instantiated.  

            Thanks!


            Mike,
            could you please take a crack at this? It may be fairly easy for you to implement the basic functionality.

            The general consensus is that we should default to creating a 1-stripe file when a file is opened with O_APPEND, because multiple stripes are pure overhead in this case (extra OST object creation, extra locking of all the objects, etc.) with no benefit (multiple threads cannot write to the same file in parallel anyway). There should be a tunable parameter that disables this feature, something like lctl set_param mdt.*.append_onestripe_enable=0 or ...append_logfile_enable=0 or similar (enabled by default since this is the most common case).

            There could be a mdt.*.append_*_max_mb=N tunable to set the maximum size of the log files. Probably the default for *_max_mb should be unlimited (it will run out of space when OST is full), and use a plain file layout, which provides maximum compatibility for clients and applications. If it is not unlimited, then it will need to create a single PFL component with a limited size, and all PFL-capable clients will return an error if they try to write beyond the end of the component. There definitely may be a benefit in many environments to limit the maximum size of a log file to protect from a runaway job spamming the filesystem (I've wanted this even for local filesystems on occasion).
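
            For illustration, the layout such a cap would translate to at create time is simply a single size-limited PFL component; the 1GB value and the path below are hypothetical:

            # roughly what an append_*_max_mb=1024 create would produce:
            # one 1-stripe component covering [0, 1G); writes past 1G would fail
            lfs setstripe -E 1G -c 1 /lustre/testfs/job.out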

            I was thinking that having a DoM + 1-stripe OST PFL layout could be useful, but this doesn't help in the end because O_APPEND will always instantiate the OST object anyway, and the OST would need to be checked for data each time. That would just be overhead and doesn't make any sense compared to a DoM-only file.

            However, I think it would make sense to allow creating a DoM-only PFL file if only small log files are used (which seems to be typical), as the lower write latency could improve application performance if they are blocked on the log writes. It could also hurt performance if the MDS is overloaded, and it could consume too much MDT space if the logs got very large, so this behavior shouldn't be enabled by default. Maybe mdt.*.append_*_enable=mdt to enable creating the component on DoM, and change *_max_mb by default (if currently unlimited) for such files to 32MB?
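
            For reference, a DoM-only layout of this shape can already be created by hand today; the 32MB size is just the value suggested above, and note that the server-side DoM size limit (lod.*.dom_stripesize) may need to be raised to allow a component that large:

            # single Data-on-MDT component holding the whole (capped) log file
            lfs setstripe -E 32M -L mdt /lustre/testfs/logdir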

            If we wanted to get fancy at some point in the future (if this feature gains interest), we could allow ...append_enable=[FID] together with patch https://review.whamcloud.com/28972 "LU-9982 lustre: clients striping from mapped FID in nodemap" to specify a source FID for the layout template for all O_APPEND files, which allows maximum flexibility instead of trying to specify a variety of layouts via tunable parameters, but this would require that LU-9982 is completed first to allow specifying an arbitrary FID as a layout template instead of the parent directory, so it isn't a goal for the initial patch.

            adilger Andreas Dilger added a comment

            This issue appears to be causing problems for us as well, primarily for Slurm stdout/stderr files (which, as Stephane said, can be scattered all over various directories). Using 2.10.5+ servers and 2.10.7+ clients.

            The other trigger that I don't see mentioned is probably less common... a seek to some point in the file without actually writing anything. It may not be directly related, but would be an opportunity for efficiency if any changes could also handle "dd if=/dev/zero of=sparse_file bs=1 count=0 seek=100G".

            dauchy Nathan Dauchy (Inactive) added a comment

            Sorry for the lack of context! Yes, the above units are in bytes. I sometimes use histogram.py (from https://github.com/bitly/data_hacks ) to display quick histograms from data points, hence the weird ranges (they depend on the input data).

            Anyway, yes, these files are all very small (< 128KB). Thanks for the different links, this is very interesting. To be honest, we would already be happy with a simple patch to try the one-stripe option, as long as there is a tunable to enable/disable it on the fly. This would allow us to still use a "large" PFL layout in general but not waste inodes on OSTs for O_APPEND files.

            sthiell Stephane Thiell added a comment

            People

              pfarrell Patrick Farrell (Inactive)
              adilger Andreas Dilger
              Votes: 0
              Watchers: 17
