LU-9341

PFL: append should not instantiate full layout


    Description

      Appending to a PFL file will cause all layout components to be instantiated because it isn't possible to know what the ending offset is at the time the write is started.

      It would be better to avoid this, potentially by locking/instantiating some larger (but not gigantic) range beyond the current EOF, and retrying the layout intent if that fails. The client currently must be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry.
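
      For illustration, a minimal reproduction sketch; the three-component layout and the path are hypothetical, not taken from this ticket:

      lfs setstripe -E 64M -c 1 -E 1G -c 4 -E eof -c 16 /mnt/lustre/appendtest
      echo hi >> /mnt/lustre/appendtest      # the shell opens the file with O_APPEND
      lfs getstripe /mnt/lustre/appendtest | grep -c "lcme_flags.*init"
      # prints 3 on affected versions: all components are instantiated even
      # though the file is only a few bytes long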

          Activity


            sthiell Stephane Thiell added a comment

            Thanks Nathan for the improved commands! I also ran the command using the patched lfs. This time I was able to get a list of such files on the whole filesystem (444M inodes total); however, regarding the size profile, I only have a partial scan at this point.

            I ran:

            # lfs find /fir -type f -comp-count +1 -stripe_count=16 -size -200G >/tmp/lfs_find_list
            

            Results: 15.3M files match (~3.5%).

            Suffix analysis:

            # cat /tmp/lfs_find_list | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
            1495272 rst
            1462325 ppairs
            1428921 csv
            1306429 txt
            1065879 score
            1065836 align
            1031148 out
             743603 tmp
             591077 input
             476363 log
             266451 sam
             261306 err
             179451 html
             173386 stdout
              84526 bash
            

            We've identified that the .csv files are being generated by https://github.com/Sage-Bionetworks/synapser, and our researchers are in touch with the developers to avoid O_APPEND in that case (it's really not needed there). The out/log/err files are mostly Slurm logs. We plan to have a look at the other file types.

            As for the size profile analysis, this is the result for almost 10% of them:

            # cat /tmp/sz_all | ./histogram.py -b 20 -l --no-mvsd -p -f "%10i"
            # NumSamples = 1372517; Min = 0.00; Max = 206874793928.00
            # each ∎ represents a count of 17442
                     0 -     197291 [1308181]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (95.31%)
                197291 -     591874 [ 10154]:  (0.74%)
                591874 -    1381039 [ 24264]: ∎ (1.77%)
               1381039 -    2959370 [  7160]:  (0.52%)
               2959370 -    6116032 [ 13260]:  (0.97%)
               6116032 -   12429356 [  8819]:  (0.64%)
              12429356 -   25056003 [   426]:  (0.03%)
              25056003 -   50309298 [    14]:  (0.00%)
              50309298 -  100815887 [    21]:  (0.00%)
             100815887 -  201829067 [    42]:  (0.00%)
             201829067 -  403855425 [    19]:  (0.00%)
             403855425 -  807908143 [    25]:  (0.00%)
             807908143 - 1616013577 [     2]:  (0.00%)
            1616013577 - 3232224446 [    15]:  (0.00%)
            3232224446 - 6464646184 [    35]:  (0.00%)
            6464646184 - 12929489659 [    29]:  (0.00%)
            12929489659 - 25859176611 [     4]:  (0.00%)
            25859176611 - 51718550513 [     2]:  (0.00%)
            51718550513 - 103437298318 [    31]:  (0.00%)
            103437298318 - 206874793928 [    14]:  (0.00%)
            
            dauchy Nathan Dauchy (Inactive) added a comment (edited)

            Some further analysis of the file types which are likely opened with O_APPEND...

            Most of the files overall are indeed "log" files (some created by Slurm, some by user scripts) or have a suffix that leads me to believe they are batch job stdout/stderr:

            # find results/ -type f -size +0 | xargs cat | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
            1100802 log
             190166 txt
              13387 grib2
               5102 grb2
               4396 0
               2176 1
               1218 2
                937 3
                771 out
                634 err
                478 41
                337 mk
                230 sh
                151 conf
                142 json
            # find results/ -type f -size +0 | xargs cat | egrep -c "\.o[0-9]{5,8}$"
            50619
            # find results/ -type f -size +0 | xargs cat | egrep -c "\.e[0-9]{5,8}$"
            1027
            

            The majority of the files greater than 10 MB are ".grb2" or ".grib2" binary files. I don't know if the app that creates those uses append for multi-writer reasons similar to Slurm.

            # find results/ -type f -size +0 | xargs awk '{if ($1>10000000) print }' | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 4
               5102 grb2
                496 log
                 57 grib2
                 31 out
            

            sthiell Stephane Thiell added a comment

            Thanks Andreas! This is awesome. We might be able to do a full scan with that. I'll let it run and report back!

            dauchy Nathan Dauchy (Inactive) added a comment (edited)

            I finally have some data for file size distribution for the NOAA system.

            Update & Corrections: Reducing lru_max_age and dropping caches a few times on the client prevented the eviction and hang, and 'stat' provides more accurate file sizes. I restriped the 3 largest files and reran the scan to get a clearer picture. The total scan took ~12 hours.
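
            (For reference, an individual over-instantiated file can be restriped in place with lfs migrate; the stripe count and path below are illustrative only.)

            lfs migrate -c 4 /path/to/large_append_file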

            I used a scanning method combining the suggestions from Andreas and Stephane, adding some parallelism by running multiple instances of the following at the project subdirectory level:

            lfs find $proj -type f -size -34359738360 -stripe_count 46 -comp-count 5 |
                xargs -r -P 8 -I {} /bin/bash -c '[ $(lfs getstripe -c {}) -eq 32 ] && stat -c "%s %n" {}'

            Here are the results with the same histogram reporting tool so you can see the differences from what Stephane reported:

            # find results/ -type f -size +0 | xargs awk '{print $1}' | ./histogram.py -b 20 -l --no-mvsd -p -f "%10i"
            # NumSamples = 1377552; Min = 0.00; Max = 356778002.00
            # each ∎ represents a count of 2435
                     0 -        340 [  3802]: ∎ (0.28%)
                   340 -       1020 [ 89830]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (6.52%)
                  1020 -       2381 [164156]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (11.92%)
                  2381 -       5103 [182637]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.26%)
                  5103 -      10547 [133979]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.73%)
                 10547 -      21435 [156886]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (11.39%)
                 21435 -      43211 [127305]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.24%)
                 43211 -      86763 [ 72366]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (5.25%)
                 86763 -     173867 [127675]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.27%)
                173867 -     348076 [ 84564]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (6.14%)
                348076 -     696492 [ 23804]: ∎∎∎∎∎∎∎∎∎ (1.73%)
                696492 -    1393325 [ 20384]: ∎∎∎∎∎∎∎∎ (1.48%)
               1393325 -    2786990 [118037]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.57%)
               2786990 -    5574321 [ 63407]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (4.60%)
               5574321 -   11148982 [  2558]: ∎ (0.19%)
              11148982 -   22298306 [  2012]:  (0.15%)
              22298306 -   44596952 [  3835]: ∎ (0.28%)
              44596952 -   89194245 [   288]:  (0.02%)
              89194245 -  178388830 [     8]:  (0.00%)
             178388830 -  356778001 [    18]:  (0.00%)
            
            adilger Andreas Dilger added a comment (edited)

            I thought I had commented yesterday, but I guess I got distracted and forgot to commit it.

            If you can suggest a way to scan the filesystem for files which are not using their last instantiated extent, I'm happy to try to provide more data.

            I was going to suggest using "lfs find" to find files with multiple PFL components that have a full stripe count but are smaller than the size needed to instantiate the last component. However, it looks like I found a small bug in lfs find when checking the stripe count of PFL files, where the behavior doesn't match the expected implementation. The attached patch fixes this issue; it only changes lfs and does not need any changes to the client or server code.

            With this patch, you can run something similar to "lfs find /lfs1 -type f -comp-count +1 -stripe_count=M -size -N", where M is the stripe count of the last component (i.e. the file is fully instantiated) and N is a size smaller than what would normally be needed to instantiate that component. This will list files that were (very likely) created with O_APPEND.
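
            For example, with M=16 and N=200G (matching the values Stephane used in the scan reported above), and collecting file sizes for a histogram, this could look like:

            lfs find /lfs1 -type f -comp-count +1 -stripe_count=16 -size -200G |
                xargs -r stat -c "%s %n"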


            sthiell Stephane Thiell added a comment

            Hello! I wanted to clarify something regarding the file size distribution of our O_APPEND files. I'm sorry, but I originally only scanned for files <= 128KB. I redid a partial scan last night and these are the new results:

            [root@fir-rbh01 data_hacks]# cat /tmp/sz | ./histogram.py -l -b 20 -p
            # NumSamples = 226682; Min = 0.00; Max = 517320932.00
            # Mean = 21389.221213; Variance = 1342429682961.335938; SD = 1158632.678186; Median 1218.000000
            # each ∎ represents a count of 1457
                0.0000 -   493.3562 [ 50221]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (22.15%)
              493.3562 -  1480.0685 [109348]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (48.24%)
             1480.0685 -  3453.4931 [ 17414]: ∎∎∎∎∎∎∎∎∎∎∎ (7.68%)
             3453.4931 -  7400.3424 [  2219]: ∎ (0.98%)
             7400.3424 - 15294.0409 [ 43661]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (19.26%)
            15294.0409 - 31081.4379 [  1009]:  (0.45%)
            31081.4379 - 62656.2319 [   777]:  (0.34%)
            62656.2319 - 125805.8200 [   614]:  (0.27%)
            125805.8200 - 252104.9961 [   372]:  (0.16%)
            252104.9961 - 504703.3483 [   266]:  (0.12%)
            504703.3483 - 1009900.0527 [   307]:  (0.14%)
            1009900.0527 - 2020293.4616 [   215]:  (0.09%)
            2020293.4616 - 4041080.2794 [    44]:  (0.02%)
            4041080.2794 - 8082653.9150 [    27]:  (0.01%)
            8082653.9150 - 16165801.1862 [   186]:  (0.08%)
            16165801.1862 - 32332095.7286 [     1]:  (0.00%)
            32332095.7286 - 64664684.8134 [     0]:  (0.00%)
            64664684.8134 - 129329862.9829 [     0]:  (0.00%)
            129329862.9829 - 258660219.3219 [     0]:  (0.00%)
            258660219.3219 - 517320932.0000 [     1]:  (0.00%)
            

            Just to clarify: we do still have such files > 128KB, but the fact remains that most of the files suspected to be opened with O_APPEND are small. I'm in favor of a simple solution that can land quickly, rather than nothing or a complex one. And I don't mind impacting the performance of our users who open(O_APPEND) anyway.

             

            @dauchy,

            If you want to check the distribution of such files, this is how I do it: determine the size at which your PFL layout has all components instantiated, say it's 100GB. Then determine the max number of components (in the example below, 6). We want to scan all files that are smaller than that size and that have all of their components initialized (lcme_flags shows "init"). Those files are either files that were opened with O_APPEND or files that were big and then truncated, but I assume here that the latter is rare.

            Then, I run something like this:

            $ find /lustre -size -100G -type f -exec ./chkstripe.sh 6 {} \;

            chkstripe.sh being:

            #!/bin/bash
            # chkstripe.sh <expected_init_count> <path>
            # Print "<size> <path>" if all PFL components of <path> are instantiated.
            
            initcnt=$1
            path=$2
            
            # Count components whose lcme_flags include "init" (i.e. instantiated)
            c=$(lfs getstripe "$path" | grep lcme_flags: | grep -c init)
            
            if [[ $c -eq $initcnt ]]; then
                sz=$(stat -c '%s' "$path")
                echo "$sz $path"
            fi
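
            The resulting "<size> <path>" lines can then be fed into the same histogram tool used above, for example (output path illustrative):

            find /lustre -size -100G -type f -exec ./chkstripe.sh 6 {} \; > /tmp/append_sizes
            awk '{print $1}' /tmp/append_sizes | ./histogram.py -l -b 20 -p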
            

             


            gerrit Gerrit Updater added a comment

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35617
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0d002cacdf58c13a0d5dfe3681536013cf529da4


            gerrit Gerrit Updater added a comment

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35611
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f322f65a011c9e888b28981b1eee1bd34d0d93ae


            dauchy Nathan Dauchy (Inactive) added a comment

            Andreas, no, I don't have a size distribution for O_APPEND files, sorry. The users scatter and name their Slurm job output files in various ways, so I would have to scan the whole filesystem and guess at naming conventions. And even that might not catch other files that happened to be created with O_APPEND.

            If you can suggest a way to scan the filesystem for files which are not using their last instantiated extent, I'm happy to try to provide more data.

            Capping the size of O_APPEND files is potentially useful, but it also violates the principle of least surprise on a POSIX-like filesystem and would lead to very unhappy and confused users if writes fail unexpectedly. Hence my suggestion of "truncating" the PFL layout to N extents and keeping the extent end of the last component. Hopefully it would be fairly easy to take the layout that would otherwise be created on O_APPEND and just set the layout to the first N components, modifying the last component's end to be the original layout's last component end. No additional pool specification needed, no max-size-limit surprises; the system "just works" for the users.
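
            As a rough sketch of that idea (the default layout here is hypothetical): if the default layout were -E 64M -c 1 -E 1G -c 4 -E eof -c 16, then keeping the first N=2 components and extending the last kept component to the original end would be equivalent to creating the file with:

            lfs setstripe -E 64M -c 1 -E eof -c 4 /path/to/append_file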


            adilger Andreas Dilger added a comment

            Rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N", where N=0 disables it and uses the filesystem default? That would allow admins to set whatever count fits best with the general workload. It would also fit better with my other concern about controlling which pool that (single) stripe is on, so have "create_append_stripe_pool=S". (Otherwise it would use any OST, and a large single-striped log file could fill up an expensive SSD target. Or, if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small-write performance!) And I guess "create_append_stripe_width=N" for completeness.

            My goal here is to keep the interface as simple as possible so that it can be implemented, tested, landed, backported, and deployed in a reasonable time, since the more complex it is the more chance of bugs, longer testing, more conflicts/dependencies, etc.

            I guess I would be OK with append_layout_enable=<stripe_count> instead of only 0 or 1 (0 has the added benefit of meaning "use the default stripe count" anyway), though there are performance optimizations that can be done for one-stripe files (e.g. patch https://review.whamcloud.com/31553 "LU-10782 llite: simple layout append tiny write") that would be lost if the files are striped across OSTs. I guess append_layout_pool=<pool> wouldn't be too much more complex. I'd rather avoid the full spectrum of "create a complex file layout via tunable parameters" for this issue. I think the stripe_width is so rarely used, and is irrelevant for 1-stripe files, that it doesn't warrant a separate parameter; we can use the filesystem default stripe_width for this.
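
            To make the proposal concrete, setting such tunables would presumably look something like the following. The parameter names are the ones under discussion here, not a landed interface, and the "lod" prefix is only a guess at where they would live:

            lctl set_param lod.*.append_layout_enable=1      # 0 = default layout, N = N-stripe plain layout
            lctl set_param lod.*.append_layout_pool=flash    # pool name is hypothetical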

            If we want to allow a complex file layout at some point in the future, this could be done with append_layout_enable=[FID] and LU-9982, but I don't think that is needed for most cases. We could also leverage patch https://review.whamcloud.com/33126 "LU-11234 lod: add pool selection layout_policy" to specify a different pool for specific file extensions like *.log *.out, etc., and that could also integrate with LU-9982 to specify a complex layout, but neither of those patches are themselves finished, they don't necessarily catch all log files (which overwhelmingly use O_APPEND), nor are they necessarily simple enough to make a trivial backport to 2.10.

            Nathan, did you ever check/post the file size distribution for O_APPEND log files at your site? I see comment-245248 from Stephane, but nothing from you here or in the other DDN tickets in Jira. If the log files are all small (Stephane reported 92% smaller than 13KB and 100% smaller than 128KB), then the OST they are located on doesn't really matter, and storing them on flash might make sense anyway, because that is faster and applications will eventually block on their stdout/stderr if it takes a long time. If there is an upper limit on the size of such log files, that would avoid filling flash OSTs/MDTs, and would be a generally useful feature as well (IMHO at least).


            pfarrell Patrick Farrell (Inactive) added a comment

            dauchy,

            Unfortunately, choosing how many components to instantiate isn't an option; if it were easy to do that, we probably wouldn't be having this conversation. Not instantiating the whole layout for appends is quite challenging, for subtle reasons around eliminating the possibility of write-splitting when more than one client is writing at once and some writes span the boundary between instantiated and uninstantiated components. The only ways we came up with of doing that which are sure to work are fairly heavy-handed. (The details and the things we tried are buried in the ~40 previous comments on this bug.)

            So Andreas suggested we sidestep this by creating a special layout, as almost all O_APPEND users probably prefer this anyway, and this is straightforward to implement.


            People

              Assignee: pfarrell Patrick Farrell (Inactive)
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 17
