Lustre / LU-9341

PFL: append should not instantiate full layout


    Description

      Appending to a PFL file causes all layout components to be instantiated, because it isn't possible to know the ending offset of the write at the time the write is started.

      It would be better to avoid this, potentially by locking and instantiating some larger (but not gigantic) range beyond the current EOF, and retrying the layout intent if that fails. The client must currently be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry.
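
      For illustration, one way to observe the behavior on a test mount (the path and layout values here are arbitrary):

      # lfs setstripe -E 1M -c 1 -E 1G -c 4 -E -1 -c 16 /mnt/lustre/append_test
      # echo x >> /mnt/lustre/append_test
      # lfs getstripe /mnt/lustre/append_test

      On affected versions, getstripe shows the "init" flag on all three components even though the file holds only a couple of bytes.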


          Activity


            Can this get pulled back into b2_10 and/or b2_12 for LTS?

            dauchy Nathan Dauchy (Inactive) added a comment
            pjones Peter Jones added a comment -

            Landed for 2.13


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35617/
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e2ac6e1eaa108eef3493837e9bd881629582ea1d

            gerrit Gerrit Updater added a comment
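
            For reference, this patch lets the administrator substitute a simpler default layout for files created with O_APPEND. A hedged usage sketch follows; the parameter names are my recollection of the 2.13-era feature and should be confirmed with lctl list_param on the running version:

            # lctl list_param mdd.*.append*
            # lctl set_param -P mdd.*.append_stripe_count=1
            # lctl set_param -P mdd.*.append_pool=flash

            Here "flash" is a placeholder OST pool name, and append_stripe_count=1 would give O_APPEND creates a plain single-stripe layout rather than the full PFL layout.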

            Opened LU-12738 to track the remaining work; this can be closed once https://review.whamcloud.com/#/c/35617/ lands.

            pfarrell Patrick Farrell (Inactive) added a comment

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35611/
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 72479a52be5f77f601d8234d957f5d6176edf6e8

            gerrit Gerrit Updater added a comment

            spitzcor and vitaly_fertman of Cray indicated they're still interested in attacking the problem directly, and asked me to write down the approach I had most recently settled on.  I'm still a fan of the current "special layout for O_APPEND" patch, but this fix could be complementary (and it may not be available for a while; the plan is not too bad, but it's not trivial either).  So I'll do that here.

            There are two basic constraints we are trying to meet.

            Append writes must be atomic, meaning two things:
            1. No "write tearing".  Every byte of a particular O_APPEND write must land contiguously, with no gaps.  (This is true of any write, but it matters more for O_APPEND writes, which go to EOF; if we are not careful, a restarted append write will resume at a new, different EOF.)

            2. They must not be mixed with other O_APPEND writes.  If two O_APPEND writes, A and B, are racing, either AB or BA is valid, but it is not valid for any part of A or B to overwrite the other one.  This is of course not true of regular writes, which are started at a specific offset.  This means that if a regular write is racing with an O_APPEND write, it can overwrite part of the O_APPEND write.  This is acceptable.

            The first problem (write tearing) is solved by getting the file size before starting the write and using the same one throughout the O_APPEND write.  (The current code checks the file size repeatedly, I believe for every iteration of cl_io_loop.)  This means that if another write races with our O_APPEND write we will not 'tear' the O_APPEND write by moving to a new EOF in the middle of the write.  Note that we must also retain the size across i/o restarts, for the case where we have to update the layout in the middle of an O_APPEND write.
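
            A minimal sketch of that idea, with hypothetical names (this is not the actual cl_io code):

            #include <stdbool.h>
            #include <sys/types.h>

            /* Hypothetical per-write context that survives cl_io_loop
             * iterations and i/o restarts for one O_APPEND write. */
            struct append_ctx {
                    off_t eof;        /* file size sampled once, at write start */
                    bool  eof_valid;  /* stays set across restarts */
            };

            /* Sample the size only on the first pass; a restart caused by a
             * mid-write layout update keeps the original offset, so a racing
             * writer that grows the file cannot "tear" this append across
             * two different EOFs. */
            static off_t append_offset(struct append_ctx *ctx, off_t cur_size)
            {
                    if (!ctx->eof_valid) {
                            ctx->eof = cur_size;
                            ctx->eof_valid = true;
                    }
                    return ctx->eof;
            }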

            The second problem requires that we allow only one O_APPEND write at a time.  There are probably a few ways to solve this, but I think the correct way (it is definitely the simplest way) is to add another bit to the MDT IBITS lock, an O_APPEND bit.  All O_APPEND writes must ask for this bit in PW mode before starting to write.  (We cannot use the layout lock for this exclusion because the server revokes our layout lock when we have to update the layout.)  This lock must be held across i/o restarts, so it should be taken before the layout lock.  (Or it could possibly be taken with the layout lock?  We have to be careful about ordering/pairing issues with the layout and append bits, I have not thought about this carefully yet.)
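
            Sketched in code, loudly hypothetical (the existing MDS_INODELOCK_* bits are real, but this bit's name and value are made up here and are not from any landed patch):

            /* Hypothetical new inodebits lock bit for O_APPEND exclusion; a real
             * definition must not collide with the existing MDS_INODELOCK_* bits. */
            #define MDS_INODELOCK_APPEND_SKETCH 0x000400

            /* Every O_APPEND write would enqueue this bit in LCK_PW before (or
             * perhaps together with) the LAYOUT bit, and keep holding it across
             * i/o restarts, because the server can revoke the layout lock to
             * update the layout in the middle of the write. */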

            Note that excluding O_APPEND writes does require excluding multiple O_APPEND writes on the same node as well.  This can be done using the local tree_lock in the write path, locking it from 0 to EOF.
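
            Roughly, in the write path (API names from memory and illustrative only; exact signatures vary by Lustre version, and "lli" stands in for the file's ll_inode_info):

            /* Serialize local O_APPEND writers by taking the llite write-tree
             * range lock over the whole file before writing. */
            struct range_lock range;

            range_lock_init(&range, 0, OBD_OBJECT_EOF);
            range_lock(&lli->lli_write_tree, &range);
            /* ... perform the append write at the EOF sampled at i/o start ... */
            range_unlock(&lli->lli_write_tree, &range);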

            This combination of things should allow not instantiating the full layout and locking every object.  It's a fair bit of work.

            Note of course that this is a split client/server solution, so it will need a compatibility flag so the client knows it can use the new O_APPEND lock bit.  The good news is that this should interoperate safely with older clients: an older client simply instantiates and locks everything for O_APPEND, which gives the correct exclusion versus newer clients.
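
            The check would presumably follow the usual connect-flag pattern; a sketch (the OBD_CONNECT2_* mechanism and exp_connect_flags2() are real, but this particular flag is invented for illustration):

            /* Hypothetical flag advertising server support for the append bit. */
            #define OBD_CONNECT2_APPEND_LOCK_SKETCH 0x8000000000000ULL

            static bool can_use_append_bit(struct obd_export *exp)
            {
                    /* Old clients lacking the flag keep instantiating and
                     * locking the full layout, which still excludes correctly
                     * against newer clients. */
                    return !!(exp_connect_flags2(exp) &
                              OBD_CONNECT2_APPEND_LOCK_SKETCH);
            }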

            pfarrell Patrick Farrell (Inactive) added a comment

            dauchy,

            OK, that seems like a reasonable request.  adilger, let me know if you have any objections, but I'll see about adding it to the current patch.

            pfarrell Patrick Farrell (Inactive) added a comment

            I guess append_layout_pool=<pool> wouldn't be too much more complex. 

            Patrick, Andreas,

            Please do put a parameter to control the pool for O_APPEND files into the patch. Based on the scan data from our system and that Stephane provided, I think we (and others using SSDs for the first component of PFL) could accommodate all of these files on the SSD OST pool, which would handle the small writes of log entries much better.  I'm OK with handling those few "large" ones with periodic scans and "lfs migrate" to HDD to prevent the SSDs from filling up... at least for a while until a more extensive change for this issue is tackled.

            Thanks for your consideration,
            Nathan

            dauchy Nathan Dauchy (Inactive) added a comment

            Thanks Nathan for the improved commands! I also ran the command using the patched lfs. This time I was able to get a list of such files across the whole filesystem (444M inodes total); however, for the size profile I only have a partial scan at this point.

            I ran:

            # lfs find /fir -type f -comp-count +1 -stripe_count=16 -size -200G >/tmp/lfs_find_list
            

            Results: 15.3M files match (~3.5%).

            Suffix analysis:

            # cat /tmp/lfs_find_list | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
            1495272 rst
            1462325 ppairs
            1428921 csv
            1306429 txt
            1065879 score
            1065836 align
            1031148 out
             743603 tmp
             591077 input
             476363 log
             266451 sam
             261306 err
             179451 html
             173386 stdout
              84526 bash
            

            We've identified that the .csv files are being generated by https://github.com/Sage-Bionetworks/synapser, and our researchers are in touch with the developers to avoid O_APPEND in that case (it's really not needed there). The .out/.log/.err files are mostly Slurm logs. We plan to have a look at the other file types.

            As for the size profile analysis, this is the result on almost 10% of them:

            # cat /tmp/sz_all | ./histogram.py -b 20 -l --no-mvsd -p -f "%10i"
            # NumSamples = 1372517; Min = 0.00; Max = 206874793928.00
            # each ∎ represents a count of 17442
                     0 -     197291 [1308181]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (95.31%)
                197291 -     591874 [ 10154]:  (0.74%)
                591874 -    1381039 [ 24264]: ∎ (1.77%)
               1381039 -    2959370 [  7160]:  (0.52%)
               2959370 -    6116032 [ 13260]:  (0.97%)
               6116032 -   12429356 [  8819]:  (0.64%)
              12429356 -   25056003 [   426]:  (0.03%)
              25056003 -   50309298 [    14]:  (0.00%)
              50309298 -  100815887 [    21]:  (0.00%)
             100815887 -  201829067 [    42]:  (0.00%)
             201829067 -  403855425 [    19]:  (0.00%)
             403855425 -  807908143 [    25]:  (0.00%)
             807908143 - 1616013577 [     2]:  (0.00%)
            1616013577 - 3232224446 [    15]:  (0.00%)
            3232224446 - 6464646184 [    35]:  (0.00%)
            6464646184 - 12929489659 [    29]:  (0.00%)
            12929489659 - 25859176611 [     4]:  (0.00%)
            25859176611 - 51718550513 [     2]:  (0.00%)
            51718550513 - 103437298318 [    31]:  (0.00%)
            103437298318 - 206874793928 [    14]:  (0.00%)
            
            sthiell Stephane Thiell added a comment
            dauchy Nathan Dauchy (Inactive) added a comment (edited)

            Some further analysis of the file types which are likely opened with O_APPEND...

            Most of the files overall are indeed "log" files (some created by Slurm, some by user scripts) or have a suffix that leads me to believe they are batch job stdout/stderr:

            # find results/ -type f -size +0 | xargs cat | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
            1100802 log
             190166 txt
              13387 grib2
               5102 grb2
               4396 0
               2176 1
               1218 2
                937 3
                771 out
                634 err
                478 41
                337 mk
                230 sh
                151 conf
                142 json
            # find results/ -type f -size +0 | xargs cat | egrep -c "\.o[0-9]{5,8}$"
            50619
            # find results/ -type f -size +0 | xargs cat | egrep -c "\.e[0-9]{5,8}$"
            1027
            

            The majority of the files greater than 10 MB are ".grb2" or ".grib2" binary files.  I don't know if the app that creates those uses append for multi-writer reasons similar to slurm.

            # find results/ -type f -size +0 | xargs awk '{if ($1>10000000) print }' | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 4
               5102 grb2
                496 log
                 57 grib2
                 31 out
            

            Thanks Andreas! This is awesome. We might be able to do a full scan with that. I'll let it run and report back!

            sthiell Stephane Thiell added a comment

            People

              Assignee: pfarrell Patrick Farrell (Inactive)
              Reporter: adilger Andreas Dilger