[LU-9341] PFL: append should not instantiate full layout Created: 13/Apr/17 Updated: 21/Mar/23 Resolved: 20/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DoM2, pfl | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Appending to a PFL file will cause all layout components to be instantiated, because it isn't possible to know what the ending offset will be at the time the write is started. It would be better to avoid this, potentially by locking/instantiating some large(r), but not gigantic, range beyond the current EOF, and if that fails retrying the layout intent? The client must currently be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry. |
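For illustration, a minimal reproduction sketch of the behaviour described above (the mount point and the two-component layout are only examples; on an affected client the second component is instantiated by the append even though the file stays tiny):
$ lfs setstripe -E 1M -c 1 -E -1 -c 4 /mnt/lustre/append_test    # create a two-component PFL file
$ echo hello >> /mnt/lustre/append_test                          # O_APPEND write of a few bytes
$ lfs getstripe /mnt/lustre/append_test                          # both components now show lcme_flags: init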
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/17 ] |
|
When a component is not instantiated, there are no objects attached to it, so it's impossible to acquire an LDLM lock on it. I don't worry about this much, though, because append writes are almost certainly used for log files. |
| Comment by Andreas Dilger [ 15/Apr/17 ] |
|
I agree that appending write isn't a very common case, but at the same time we don't want every file that has an O_APPEND write to instantiate 100 stripes because that is the stripe count of the last component. That is especially true because such files will typically not get very large. It would be better to have appending writes make some reasonable estimate of how much space they will need (e.g. 10x the current write size beyond EOF) and only instantiate the next component if needed. If the client gets the DLM write lock(s) for the file and it turns out another component needs to be instantiated the client should retry, but this would likely be a very rare case. |
| Comment by Jinshan Xiong (Inactive) [ 17/Apr/17 ] |
|
Yes, that will work but needs some extra effort. Let's track it in this ticket and address it later (hopefully). By the way, what are the exact semantics of an append write with respect to concurrent writes from other clients? While an append write is in progress, should writes from other clients be delayed until the append write is complete? |
| Comment by Andreas Dilger [ 05/May/17 ] |
|
It is already the case that append writes are totally serialized, because we don't know what the new EOF will be until the previous append is complete. Each appending client must hold the extent locks on all stripes during the append. As a result, we can have a good idea at write time of whether we will need to instantiate a new component. |
| Comment by Andreas Dilger [ 05/May/17 ] |
|
One major drawback of always instantiating all components for PFL files is that there will be a bad interaction between the requirement for append to always lock all stripes from [current_size,EOF) to prevent other clients from appending at the same time, and the last PFL component having a large number of stripes. This could hurt performance for appending to PFL files significantly, and isn't really necessary. We typically know at the time of append what the file size is (especially in the common "single client appending" case) and what the write size is, so we'll know whether there is a need to instantiate a new component or not. |
| Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ] |
|
Hmm. Currently, append writes don't work for PFL files whose layout does not extend to infinity. This is problematic (i.e. totally unworkable for such files). I guess I'll mark this one as related to the relevant ticket. |
| Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ] |
|
Perhaps simply have append take a lock until the end of the instantiated components? Is that truly safe, though? What if I/O from another client moved EOF down? I suppose that would be a race without a defined result per POSIX. So - could we do that? Append writes lock to the end of the instantiated components? But what happens if another client modifies the file at some point beyond what we've locked, instantiating a further layout component and changing the file size? If our client does appending writes again after that (still with a lock on the earlier region), would it know the file size had changed and act accordingly? Appending currently depends on locking the file to EOF to have correct size information, I believe. It looks to me like we'd have to rewrite appending to explicitly ask for an updated size (glimpse, etc., similar to vvp_prep_size) if it no longer took a full-file lock. |
| Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ] |
|
Actually, we could presumably do better with a layout related check, only doing that glimpse if the layout had changed (since we lock not to EOF but to end of current instantiated layout). |
| Comment by Patrick Farrell (Inactive) [ 18/Feb/18 ] |
|
Self extending layouts are not compatible with O_APPEND at this time because O_APPEND locks to EOF and instantiates all components. |
| Comment by Patrick Farrell (Inactive) [ 21/Feb/18 ] |
|
Quoting Andreas from the related tickets.
I'm working on a patch - Zhang or Andreas, you could assign this to me if you'd like. It doesn't feel right to just take it, but I'm working on it now. |
| Comment by Gerrit Updater [ 24/Feb/18 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/31411 |
| Comment by Andreas Dilger [ 09/Mar/18 ] |
|
One thought on this is that if the "append lite" implementation in Patrick's patch holds the [0-EOF] lock(s) on the stripe(s) of the initialized component holding the current file_size while doing an append, then it should not be possible for other threads to do an append at the same time. Other threads would also block on the same stripe(s) during their append. It would be possible for another racing thread to do a non-append write() in an uninitialized component at some higher offset, but that shouldn't be considered a defect, since it is racy even without this optimization and would still result in the expected behaviour (append data at the previous file_size, new data at the specified offset). Also, an old client doing a full append would grab all of the [0-EOF] extent locks and would conflict with the "append lite" client, so there should not be any race there.
Conceivably, the appending client only needs to hold locks on the component covering the current file_size and does not need to hold locks on the earlier component(s). In theory, it would be enough to only request the lock on [expected_file_size, EOF] for the append itself, since the appending write (and any other just-finished append) will only extend the file beyond expected_file_size (which may be smaller than the actual file size when the lock is granted). In the rare case that the file was extended into the next component (or truncated shorter) by another thread before the lock covering file_size was granted, the client would need to retry with the new size. This would allow other threads to read earlier parts of the file without contention. That might be useful for producer-consumer workloads or libraries like PLFS that are append-heavy, and not just log files.
I've long thought about this optimization, but append performance was never a priority. If we are already adding the append retry loop in the 31411 patch, we may as well improve the performance even more. Is there some other case of concern that I'm missing? |
| Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ] |
|
Hmmm, interesting, I see what you mean. Jinshan made the point that because we do the i/o on a stripe-by-stripe basis, the i/o restart can come after we've started the i/o, making it non-atomic. For this case (checking file size and restarting), I think that could be fixed pretty easily - we just need to stop re-checking the file size each time we go through vvp_io_write_start. As noted in the patch description, we just need a correct file size from while we were appending, and to keep from colliding with other append writes. This is because, as you noted, racing normal writes to specific higher offsets are allowed as a valid race condition. A more pertinent concern is instantiating new components. If we have to instantiate a new component (or modify an existing one - I'm thinking of self-extending layouts here), the layout can change in the middle of the i/o. All of this is resolved if we hold the layout lock throughout, though - any and all restarts we have to do are done for and by us. (I think - this assumes the layout changes are done with the client holding the lock, which I am not entirely sure is the case?) |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/18 ] |
|
Oof. Holding the lock throughout also poses some challenges, in that we still need to communicate the layout request to the server, and that's part of a lock request. I'm not sure how to disentangle that, but I believe it's necessary for us to get this working correctly. Here's the idea:
- Take the layout lock (and hold it)
- Do i/o, potentially instantiating a new component or components during it
- Release the layout lock
Holding the layout lock throughout (and not releasing it) seems to be necessary to guarantee atomicity of the append write, i.e. that it all ends up in the same place and together, without another append write interfering. My impression is that we cannot do this safely at layout boundaries except by just preventing layout changes. The problem with holding an exclusive layout lock throughout is that instantiating a new component generates a conflicting lock request, i.e. the server takes an EX lock to do the layout instantiation. It's conceivably possible to change the server to allow it to use the client-side EX lock, but... oof. And if we let the client lock get revoked, we're back where we started. So I see no straightforward way to solve this without client- and server-side changes, at least not at the moment. The only partial possibility I see is if the append could hide the size update until it has completed the entire write, but my impression is that's not possible: the size is updated on each stripe as writes to it complete. |
| Comment by Mikhail Pershin [ 14/Mar/18 ] |
|
I also want to note that for DoM files this has much worse consequences: literally any append to a small file makes it look big, because the next component is initialized, and all DoM optimizations stop working for such files. I was also looking at this issue in cl_io_loop():
- cl_io_iter_init() <---- here the IO is initialized and LOV provides the IO range (which is also the lock range)
- cl_io_lock() <--- here the lock is taken on the provided range
- cl_io_start() <--- here VVP gets the real size once the lock is taken and calculates the real append position/count
- cl_io_end()
- cl_io_unlock()
- cl_io_iter_fini()
The problem with instantiation appears in lov_io_slice_init(), where the IO range is first initialized to [0; EOF] because of the append, and is then checked for overlap against the components, which instantiates the uninstantiated ones. |
| Comment by Zhenyu Xu [ 15/Mar/18 ] |
|
Note that the range lock is applied on each osc object with its respective range, so newly instantiated objects also need to take their DLM lock ranges. With uninstantiated osc objects, a [0, EOF) lock cannot be applied effectively. I am considering locking only the range [pos, pos+count-1] and restarting the IO when the write crosses into an uninstantiated component.
The problem I have is that when the append write goes beyond the last instantiated component, we'd restart the IO, and that relinquishes the lock range [pos, pos+count-1]; another write could race in and leave this append write in a dilemma. |
| Comment by Jinshan Xiong [ 15/Mar/18 ] |
|
For normal write, people probably don't care about write atomicity. But people definitely care about it for append write. |
| Comment by Patrick Farrell (Inactive) [ 15/Mar/18 ] |
|
I agree strongly w/Jinshan about append write & atomicity. The problem Bobijam noted about i/o restarts is the problem that I couldn't see any way to solve except by holding an exclusive layout lock throughout, and the problem that I ran in to is that the server takes an exclusive lock to create the new layout. If the server could use the lock the client already has, that would resolve the issue. Is that a viable option, though? Two choices come to mind - Have the client ask for an exclusive lock whenever it wants to do this kind of operation and then have the server always use it (this seems too heavy), or have the server check to see if there is an exclusive lock from a client (which would happen for O_APPEND but not otherwise) and use it if it finds one...? Either choice seems like it might be problematic. Any thoughts? |
| Comment by Andreas Dilger [ 17/Mar/18 ] |
|
My thought here is that the client can make an initial attempt at locking the file based on the current size and layout, and then re-verify once it actually has the locks. If (as is normally the case) the locks and layout still cover the size and no new component has been added, then it goes ahead with the append write, possibly at a new offset. In the rare case that the new size + length crosses into an uninitialized component, or a new component was added that has unlocked objects, the client instantiates the new component and restarts the append. Rinse and repeat as necessary.
Patrick, the major problem with holding the layout lock over the whole append (or any other operation holding locks on two servers) is "cascading aborts". This happens when a client holds a lock and becomes blocked waiting for a lock on a second (failing) server. While it is waiting on the second server (possibly for a long time), it gets a lock cancel on the first (referenced) lock that it can't release, and is evicted from the first server, even though both of those nodes are fully functional. Hence, we prefer not to hold locks on multiple servers at one time. |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/18 ] |
|
Ah, OK, that makes sense. But don't we do that anyway for append, Andreas - take all the OST locks first, since we need that to maintain the atomicity semantic? I keep forgetting to mention this... Can't we just handle all of the layout instantiation early, and at once? Keep the same locking approach as before, where we lock the layout and hold that lock, but release it if required to do any instantiation. Just do all of the layout work before i/o and then do the I/O, restarting each time we actually have to change the layout? I assume that requires some tweaks, but it should work. To be 100% clear, here's the order of operations:
- Take the layout lock and check whether the append requires instantiating any components
- If yes, send the request to the server and restart the i/o (unlocking the layout as part of the restart)
- Take OST LDLM locks
- Do the I/O
Any thoughts? |
| Comment by Andreas Dilger [ 04/Apr/18 ] |
I thought that was pretty much what I wrote in my previous comment? We definitely wouldn't want to instantiate all of the components, but in some cases we might want to instantiate the next component before the append is started. |
| Comment by Joseph Gmitter (Inactive) [ 05/Sep/18 ] |
|
Hi Patrick, Are you still planning to implement a fix for the append issue? We are looking at the open DoM issues and this thread has been inactive since your original patch. Are you planning to have a look at an alternative fix? Thanks. Joe |
| Comment by Patrick Farrell (Inactive) [ 05/Sep/18 ] |
|
Yes, it’s definitely still on the to-do list, but it’s fallen down it a bit. I do intend to get to it - it’s relevant for Cray in support of the self extending layouts work. But I don’t have a timeline for when. |
| Comment by Stephane Thiell [ 28/Feb/19 ] |
|
Today we were tracking down why our inode usage on the OSTs was abnormally high (with 2.12 on Fir) and I discovered this PFL limitation and ticket. We really liked the idea of PFL because of our very diverse workloads on Sherlock, and use it with an aggressive configuration for very large files (some are 20TB+), but we also have TONS of small files on scratch. For instance, Slurm log files, likely often re-opened in append mode, have ALL their PFL components instantiated, which ends up using a lot of inodes on the OSTs. We also use DoM, so PFL now seems very counter-productive for small files that are opened in append mode. We'll probably have to dramatically reduce the use of PFL until this issue is fixed. Any ETA for a fix?
$ ls -l job.err
-rw-r--r-- 1 apatel6 suncat 2533 Feb 15 20:25 job.err
$ lfs getstripe job.err
job.err
lcm_layout_gen: 7
lcm_mirror_count: 1
lcm_entry_count: 6
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 131072
lmm_stripe_count: 0
lmm_stripe_size: 131072
lmm_pattern: mdt
lmm_layout_gen: 0
lmm_stripe_offset: 0
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 131072
lcme_extent.e_end: 16777216
lmm_stripe_count: 1
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 21
lmm_objects:
- 0: { l_ost_idx: 21, l_fid: [0x740000400:0x14a072:0x0] }
lcme_id: 3
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 16777216
lcme_extent.e_end: 1073741824
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 16
lmm_objects:
- 0: { l_ost_idx: 16, l_fid: [0x880000400:0x14a007:0x0] }
- 1: { l_ost_idx: 7, l_fid: [0xb00000402:0x14a20d:0x0] }
lcme_id: 4
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 1073741824
lcme_extent.e_end: 34359738368
lmm_stripe_count: 4
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 45
lmm_objects:
- 0: { l_ost_idx: 45, l_fid: [0xa40000402:0x149f48:0x0] }
- 1: { l_ost_idx: 40, l_fid: [0xa00000402:0x14a03f:0x0] }
- 2: { l_ost_idx: 4, l_fid: [0x8c0000402:0x149f68:0x0] }
- 3: { l_ost_idx: 29, l_fid: [0xcc0000402:0x14a0ed:0x0] }
lcme_id: 5
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 34359738368
lcme_extent.e_end: 274877906944
lmm_stripe_count: 8
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 28
lmm_objects:
- 0: { l_ost_idx: 28, l_fid: [0xd80000402:0x14a18e:0x0] }
- 1: { l_ost_idx: 23, l_fid: [0x940000402:0x14a023:0x0] }
- 2: { l_ost_idx: 20, l_fid: [0x9c0000402:0x14a06f:0x0] }
- 3: { l_ost_idx: 1, l_fid: [0xb40000402:0x14a3c1:0x0] }
- 4: { l_ost_idx: 47, l_fid: [0xa80000402:0x14a07a:0x0] }
- 5: { l_ost_idx: 38, l_fid: [0xe40000402:0x14a0e6:0x0] }
- 6: { l_ost_idx: 6, l_fid: [0xc40000402:0x149fce:0x0] }
- 7: { l_ost_idx: 27, l_fid: [0xd00000402:0x14a246:0x0] }
lcme_id: 6
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 274877906944
lcme_extent.e_end: EOF
lmm_stripe_count: 16
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 30
lmm_objects:
- 0: { l_ost_idx: 30, l_fid: [0xdc0000402:0x14a14b:0x0] }
- 1: { l_ost_idx: 15, l_fid: [0xac0000402:0x14a00e:0x0] }
- 2: { l_ost_idx: 22, l_fid: [0xe00000402:0x14a0ae:0x0] }
- 3: { l_ost_idx: 3, l_fid: [0xbc0000402:0x14a177:0x0] }
- 4: { l_ost_idx: 37, l_fid: [0xc00000402:0x14a0ba:0x0] }
- 5: { l_ost_idx: 36, l_fid: [0xe80000402:0x14a14e:0x0] }
- 6: { l_ost_idx: 8, l_fid: [0xc80000402:0x149f98:0x0] }
- 7: { l_ost_idx: 35, l_fid: [0xd40000402:0x14a110:0x0] }
- 8: { l_ost_idx: 34, l_fid: [0xec0000402:0x14a140:0x0] }
- 9: { l_ost_idx: 13, l_fid: [0x300000400:0x149f2b:0x0] }
- 10: { l_ost_idx: 12, l_fid: [0x3c0000400:0x14a1ac:0x0] }
- 11: { l_ost_idx: 11, l_fid: [0x380000400:0x14a079:0x0] }
- 12: { l_ost_idx: 39, l_fid: [0x340000400:0x14a0dc:0x0] }
- 13: { l_ost_idx: 46, l_fid: [0x4c0000400:0x14a139:0x0] }
- 14: { l_ost_idx: 10, l_fid: [0x580000400:0x14a204:0x0] }
- 15: { l_ost_idx: 31, l_fid: [0x480000400:0x149fab:0x0] }
|
| Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ] |
|
Stephane, we're targeting this for 2.13 (really, we are - it's not just a placeholder). It is very helpful to have your description of the real-world problems this is causing - I hadn't really seen this as a practical issue prior to your comment. Edited to add: this is probably also worth considering for the 2.12 LTS branch. |
| Comment by Stephane Thiell [ 28/Feb/19 ] |
|
Thanks for the quick update! That's good news, and there is hope! Could you please let me know when a patch is ready, as we'd be very happy to test it and report back. And yes, this should definitely be added to 2.12 LTS. |
| Comment by Andreas Dilger [ 01/Mar/19 ] |
|
Stephane, there are really very few HPC workloads that use O_APPEND very much, so this has always been a low priority to fix. For the short term, I'd suggest setting a plain 1-stripe file layout on the SLURM log directory to avoid this issue. You can use "lfs migrate -c1" to migrate the old log files and release their many unused objects. |
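For reference, the workaround Andreas describes could look like this (the directory and file names are just examples):
$ lfs setstripe -c 1 /scratch/slurm_logs                 # new files in this directory get a plain 1-stripe layout
$ lfs migrate -c 1 /scratch/slurm_logs/job-1234.err      # restripe an existing log file, freeing its unused objects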
| Comment by Stephane Thiell [ 01/Mar/19 ] |
|
Hi Andreas, Just to clarify, I meant the job error and output files, which are user-accessible and can be located anywhere, and a lot of them end up in /scratch (I didn't mean the slurm daemon log files). After seeing your message last night, I started scanning small files on our new 2.12 scratch filesystem that have all layout components instantiated, and this morning I checked the results. Basically, you're right: the vast majority of the small and medium-sized files are not fully instantiated, so this is a relief. And Sherlock - our Lustre client cluster here - is doing all kinds of Research Computing workloads - not only traditional HPC workloads, but also AI, Genomics, and many other fields of science; even our Music and English Departments are using it. I couldn't scan everything of course, but with a 1-night partial scan of our scratch, I found a significant number of files that were created with O_APPEND.
|
| Comment by Gerrit Updater [ 14/Mar/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34425 |
| Comment by Andreas Dilger [ 29/Mar/19 ] |
Stephane, would it be possible for you to collect size statistics for these log files that are created with O_APPEND? I'm wondering if it makes sense to always create files opened with O_APPEND with stripe_count=1? There is no parallelism possible with O_APPEND files (all writers are serialized because they always need to write at EOF), so if we think about this at a high level, creating such files on a single OST is the most efficient solution. One potential issue is if the log files become extremely large, in which case having them striped across multiple OSTs would balance the space usage better. Also, size-limited PFL layouts may be useful for log files, to ensure they don't grow too large. Finally, we may still want to honor the OST pool selection from the first component (e.g. flash pool) for the log files. That implies there may need to be some "smarts" for handling the layout for O_APPEND files. On the one hand, I'm not a big fan of hard-coding heuristics into the filesystem, but on the other hand it is desirable to "do the right thing" when we know what that is so that users do not have to "do the wrong thing first" before they figure it out repeatedly themselves. |
| Comment by Andreas Dilger [ 29/Mar/19 ] |
|
I don't see it mentioned in the comments here, but one proposal from Bobijam that is implemented in the https://review.whamcloud.com/34425 patch is to change the client to use the lock on the 0th stripe of the file as the proxy for "lock the whole file for append". This should serialize against all similarly-modified clients, as well as older clients that are locking [0,EOF] on all of the stripes. Unfortunately, Patrick's last comment in 34425 is "I think this approach is fatally flawed", but I'm not sure why that is the case.
One issue that came to my mind with the above approach is that it would no longer be possible to optimize the producer-consumer workload, where the producer(s) must get a PW lock covering only [file_size,EOF] during append, and consumer(s) can get PR and/or PW locks in the range [0,size] without conflict.
This makes me think again of comment-224988, where we speculatively pre-instantiate the next component if it is within current_file_end+N*write_size, then lock the range [current_file_end & PAGE_MASK,EOF] and then verify that (a) the current_file_end under lock is still within the locked range (i.e. no racing truncate()), and (b) current_file_end+write_size does not extend into an uninitialized component before any writes are done; otherwise drop the locks and instantiate the next component. That should ensure we never need to instantiate a component while holding locks during append, which seems to be a significant complication. |
| Comment by Patrick Farrell (Inactive) [ 29/Mar/19 ] |
|
So, here's the email chain. I am going to follow it up with a quick note about my current proposal. (not done implementing it, partly because I'm sort of waiting for feedback before I get too far in): From: Patrick Farrell <pfarrell@whamcloud.com>
Bobijam, Mike,
I have been working on this problem.
Bobi's idea of taking an LDLM write lock on the first object works, but when the layout changes (as part of restarting i/o), all of the cl objects are destroyed (lov_conf_set->lov_layout_change->cl_object_prune). This is OK, since the LDLM resource is not destroyed and we can hold a reference on the LDLM lock.
But this made me think of something bigger:
Locking the first object of the file and holding that lock across i/o requires that the first object not change, so other clients will try to lock the same thing.
But a possible layout change is choosing a new replica for write, isn't it? So now we do not have the same "first object", and another client would not lock it. That client will no longer be excluded like we expect. I think this is fatal to this idea...
I think maybe we need to return to having the client take an exclusive layout lock, and then having the server use that lock to complete the layout change.
I will think about that one.
From: Bobijam Xu
Here's my thought: during a write session, the chosen mirror/replica will not be altered until the other mirrors/replicas have been re-synced.
When a file is being written, the layout's FLR state changes from RDONLY to WRITE_PENDING, and another incoming write will also find the file in the WRITE_PENDING state. lod_declare_layout_change()->lod_declare_update_write_pending() will still choose the same mirror/replica that has been written to as the primary mirror/replica and mark the other mirror components stale, so the "first object" will not change.
Only after the file has been resynced could a write choose another mirror/replica to write to, and that is outside the append write session, I think. – Best regards,
Bobi Jam
From: Patrick Farrell <pfarrell@whamcloud.com>
Hmm, I feel like maybe there is still an issue...? But it relies on details of how resync, etc, works.
Append:
- Get layout
- Get first object EXT lock
- Start i/o
- Get layout
- Do i/o
- Release first object EXT lock
Layout change: WRITE_PENDING -> RDONLY
Couldn't the entire "RDONLY -> WRITE_PENDING -> RDONLY" cycle (a write followed by a resync) happen after "Get first object EXT lock" and before the i/o completes?
Also, is it possible that because we hold the EXT lock while waiting for the layout lock (in the append case) we could deadlock with resync?
From: Bobijam Xu
The resync is not in the write cycle; currently a resync can only be issued by another process, which leverages a lease lock. A write only changes the file state from RDONLY to WRITE_PENDING.
– Best regards,
Bobi Jam From: Patrick Farrell
Right - I know resync is not in the write cycle. Sorry that wasn't clear.
I don't see why we can't have a write, then a resync, after "Get first object EXT lock" and before the i/o completes.
I don't think taking that first extent lock prevents another client from changing the layout before we complete the append. Once the layout is changed, another append write starts, and will ask for a different lock and not be excluded by the earlier append write.
From: Patrick Farrell <pfarrell@whamcloud.com>
Any further thoughts on this one? Am I wrong about the race here? If not:
So:
- take the append lock
- take the local write tree lock
- do the append as a normal write at the current file size
- release the local write tree lock
- release the append lock
From: Bobijam Xu That’s at cost even for non-race append write. I’m thinking that the append write/resync race is a rare case, is it possible that the append write check the layout version to make sure that the layout change is made by another resync thread instead of its own to instantiate new component, so that it can return error reporting the rare append write failure?
– Best regards,
Bobi Jam "That’s at cost even for non-race append write."
Since it's a full file lock, basically, it seems like the place to put it is in the "file", on the MDS side.
"I’m thinking that the append write/resync race is a rare case, is it possible that the append write check the layout version to make sure that the layout change is made by another resync thread instead of its own to instantiate new component, so that it can return error reporting the rare append write failure?" I agree it's probably a rare case... Hmm. But what if it's another (non-append) write that changes the layout generation (by instantiating something, I guess)? How do we recognize that case [which is OK] and not error then too?
I guess I am also a little worried that we hold an EXT lock when we check the layout? I cannot see a specific problem, but it is kind of the opposite of the usual order:
Normally the order is: refresh the layout, then take the extent locks and do the i/o.
With "lock first object", we do the opposite: take the first-object EXT lock first, then check/refresh the layout.
I cannot see any specific problem with that (except the append write/resync race we noted), but it seems possibly dangerous.
Regards,
|
| Comment by Patrick Farrell (Inactive) [ 29/Mar/19 ] |
|
To summarize the problem I see with "lock the first object", the "first object" definition can change if we (for example) switch the preferred write replica after that lock is taken but before the write happens. Then any other clients coming along will not see the same "first object" and will not be excluded. There's more in the email chain, but that's the main issue I see. So my current proposal is to stick an "append bit" in the IBITS lock, and use that for mutual exclusion of append writes. Basically, take that lock before starting an append, then check the file size, and do the append as a "normal" write at that offset. (Note that we do not re-check file size on i/o restart - This is so if EOF changes in the middle of our i/o due to another i/o (non-append, as appends are excluded), we do not split the append write.) For servers that do not support the new "append bit", we would fall back to the old behavior. In terms of exclusion vs clients that do not support the new lock bit, we will still take a lock for our write, which takes us a long way towards that exclusion. Unfortunately, in the "old and new client interop" situation, some of the complex cases are not fully covered. But it's stuff like "if there is an i/o restart for instantiation while two clients (one old, one new, with a new server) are racing appends, it is possible in some scenarios for the append writes to end up conflicting rather than being kept fully atomic". |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/19 ] |
|
Bobijam replied noting that he agreed the overhead is similar in both cases. Unless there are other objections, I'm going to push ahead with the "add an append bit" approach. |
| Comment by Andreas Dilger [ 02/Apr/19 ] |
|
The problem with the append bit, as I see it, is that this is essentially a full-file lock - or does the append lock cover only the last byte? Ideally, if we are changing this code, we could optimize the producer-consumer append workload as well. The other drawback of the append bit is that it puts the locking load entirely on the MDS, rather than scaling with the OSTs. How is the append bit lock different from a lock on [size,EOF] for the stripe that holds the current file size, with pre-instantiation and a retry if further instantiation is needed? |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/19 ] |
|
So, it is basically a full file lock, but it's only for appends. Other i/o will not take this lock. It is necessary to keep appends excluded from one another so they do not end up mixed together - to retain atomicity. It is not necessary to exclude writes or reads (beyond the normal EXT locking just on the i/o bytes), since those are legitimate race conditions. It does put the locking load on the MDS, definitely. A possible optimization is we could request this at open time when we open with O_APPEND, optionally, similar to how DOM works. Working on answering your last question... |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/19 ] |
|
Imagine an append write that spans from a current instantiated component in to the next one, which is uninstantiated. So the append write locks the stripe that holds the current file size, then another (non-append) write comes along and instantiates the next component (doesn't write in to the stripe we've locked). This write goes on to change file size as well. This race with the non-append write is fine. But what if another append write appears now? It will not lock the same stripe our original append did, and the original will not (necessarily) have locked the 'new' stripe we have to write in to it (at least, not yet). We have two append writes happening with different ideas of the file size, and we have no way to keep them from proceeding at the same time, because the second one didn't care about the stripe the first one locked. So they can end up "atop" one another. |
| Comment by Andreas Dilger [ 03/Apr/19 ] |
|
It would be interesting to know what happens on a local filesystem in this case (in particular XFS, since it allows parallel writers on a file). If the first append is at the old EOF while the non-append write extends the file size, it seems IMHO perfectly reasonable that the second append would be allowed to continue. The same behavior would be seen regardless of whether the first append had completed or not. If the two appends were overlapping, then the write locks held by the first append should prevent the second write or append from overlapping it. |
| Comment by Patrick Farrell (Inactive) [ 03/Apr/19 ] |
|
It's not that it's not allowed to continue. The issue is that the first append (which is writing across a component boundary) will lock the stripe where file size is. It then checks file size. Leave this thread here - this is the race... Now start another append write. It will not attempt to lock the same stripe as the previous one, because the file size is no longer there. And the two appends end up on top of each other. That is the race I am describing. |
| Comment by Andreas Dilger [ 03/Apr/19 ] |
|
That would only happen if the first append was writing to a region where there was an uninstantiated component. As previously described, I think the components for the append should always be pre-instantiated before the write starts. At a minimum the full range of [size,size+len], which is known in advance, should be instantiated. Optionally, something like [size,size+N*len], where N is something reasonable like 8-16, num_cpus, etc., so that racing appends are unlikely to need to retry. Even then, it may not be worthwhile to have N > 1 if one considers "len <<< component_size", so the chance of crossing a component is very small on any given write. |
| Comment by Stephane Thiell [ 04/Apr/19 ] |
|
Andreas, please find below the size stats for 177,478 files found like that (partial scan done last night):
# NumSamples = 177478; Min = 0.00; Max = 129292.00
# Mean = 3339.485897; Variance = 60982229.732148; SD = 7809.111968; Median 1221.000000
# each ∎ represents a count of 2179
0.0000 - 12929.2000 [163461]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (92.10%)
12929.2000 - 25858.4000 [ 12062]: ∎∎∎∎∎ (6.80%)
25858.4000 - 38787.6000 [ 630]: (0.35%)
38787.6000 - 51716.8000 [ 396]: (0.22%)
51716.8000 - 64646.0000 [ 260]: (0.15%)
64646.0000 - 77575.2000 [ 235]: (0.13%)
77575.2000 - 90504.4000 [ 157]: (0.09%)
90504.4000 - 103433.6000 [ 116]: (0.07%)
103433.6000 - 116362.8000 [ 85]: (0.05%)
116362.8000 - 129292.0000 [ 76]: (0.04%)
All of them would fit in our "mdt" 128kB DoM stripe... but they have a PFL layout fully instantiated instead. |
| Comment by Andreas Dilger [ 08/Apr/19 ] |
|
It isn't totally clear from the output, but are the above units in bytes (with strange fractional-byte boundaries) or KB? In any case, if all or a large majority of O_APPEND files are smaller than 128KB, it seems to validate my theory that creating a one-stripe file when a file is first opened with O_APPEND would be perfectly fine. If these files are only logs/errors they will very rarely be larger than a few MB in size, and are easily handled by one OST. That would be a very simple heuristic to implement (maybe with a tunable option to disable it if it is not wanted), and would be easily backported if needed. The other proposals will need a considerably larger effort (changes to locking, the CLIO layer, protocol compatibility) and would be much less likely to be available in an older release.
There is the patch https://review.whamcloud.com/33126 "LU-11234 lod: add uid/gid/suffix/jobid/nid pool selection layout_policy" which allows an administrator to specify an OST pool to be selected based on process/user/filename/client attributes. I think it would be much more useful to allow specifying a whole layout like patch https://review.whamcloud.com/28972 "LU-9982 lustre: Clients striping from mapped FID in nodemap", but at least it is a step in the right direction. This could also be improved to detect whether the file is being opened with O_APPEND. |
| Comment by Stephane Thiell [ 08/Apr/19 ] |
|
Sorry for the lack of context! Yes, the above units are in bytes. I sometimes use histogram.py (from https://github.com/bitly/data_hacks ) to display quick histograms from data points, hence the weird ranges (they depend on the input data). Anyway, yes, these files are all very small (< 128KB). Thanks for the different links, this is very interesting. To be honest, we would be happy already with a simple patch to try the one-stripe option, as long as there is a tunable to enable/disable it on the fly. This would allow us to still use a "large" PFL layout in general but not waste inodes on OSTs for O_APPEND files. |
| Comment by Nathan Dauchy (Inactive) [ 29/Jun/19 ] |
|
This issue appears to be causing problems for us as well, primarily for Slurm stdout/stderr files (which, as Stephane said, can be scattered all over various directories). We are using 2.10.5+ servers and 2.10.7+ clients. The other trigger that I don't see mentioned is probably less common... a seek to some point in the file without actually writing anything. It may not be directly related, but it would be an opportunity for efficiency if any changes could also handle "dd if=/dev/zero of=sparse_file bs=1 count=0 seek=100G". |
| Comment by Andreas Dilger [ 24/Jul/19 ] |
|
Mike, the general consensus is that we should default to creating a 1-stripe file when a file is opened with O_APPEND, because having multiple stripes in this case is pure overhead (extra OST object creation, extra locking of all the objects, etc.) with no benefit (multiple threads cannot write to the same file anyway).
There should be a tunable parameter that disables this feature, something like lctl set_param mdt.*.append_onestripe_enable=0 or ...append_logfile_enable=0 or similar (enabled by default since this is the most common case). There could be an mdt.*.append_*_max_mb=N tunable to set the maximum size of the log files. Probably the default for *_max_mb should be unlimited (it will run out of space when the OST is full), and use a plain file layout, which provides maximum compatibility for clients and applications. If it is not unlimited, then it will need to create a single PFL component with a limited size, and all PFL-capable clients will return an error if they try to write beyond the end of the component. There definitely may be a benefit in many environments to limit the maximum size of a log file to protect from a runaway job spamming the filesystem (I've wanted this even for local filesystems on occasion).
I was thinking that having a DoM + 1-stripe OST PFL layout could be useful, but this doesn't help in the end because O_APPEND will always instantiate the OST object anyway, and the OST would need to be checked for data each time. That would just be overhead and doesn't make any sense compared to a DoM-only file. However, I think it would make sense to allow creating a DoM-only PFL file if only small log files are used (which seems to be typical), as the lower write latency could improve application performance if applications are blocked on the log writes. It could also hurt performance if the MDS is overloaded, and it could consume too much MDT space if the logs got very large, so this behavior shouldn't be enabled by default. Maybe mdt.*.append_*_enable=mdt to enable creating the component on DoM, and change *_max_mb by default (if currently unlimited) for such files to 32MB?
If we wanted to get fancy at some point in the future (if this feature gains interest), we could allow ...append_enable=[FID] together with patch https://review.whamcloud.com/28972 "LU-9982 lustre: clients striping from mapped FID in nodemap" to specify a source FID as the layout template for all O_APPEND files, which allows maximum flexibility instead of trying to specify a variety of layouts via tunable parameters. However, this would require that LU-9982 is completed first to allow specifying an arbitrary FID as a layout template instead of the parent directory, so it isn't a goal for the initial patch. |
| Comment by Nathan Dauchy (Inactive) [ 24/Jul/19 ] |
|
Andreas, Mike, rather than a boolean for append_onestripe_enable, how about "create_append_stripe_count=N", where N=0 disables it and uses the filesystem default? That would allow admins to set whatever count fits best with the general workload. It would also fit better with my other concern about controlling which pool that (single) stripe is on, so add "create_append_stripe_pool=S". (Otherwise it would use any OST, and a large, single-striped log file could fill up an expensive SSD target. Or, if the admin knows all log files are small, just go ahead and keep them exclusively on SSDs for small-write performance!) And I guess "create_append_stripe_width=N" for completeness.
Update: Upon re-reading Andreas' comment, maybe the intent was to still use a PFL layout. That brings to mind the possibility of just defining a "create_append_extents=N" option which indicates how many of the default-layout PFL extents to define. I think in our case we would set it to 2, to get the first part on SSD and have a backstop of a single larger stripe on HDD. The end of the last default extent (typically "-1") should be applied to the file, OR *_max_mb if that is defined, but I think that would no longer be needed.
Update 2: Thanks Patrick for catching that I said "instantiate" when I should not have. My thought was to still create a special layout, with PFL, but base it on the default layout... truncated to N extents. Then the whole thing can be instantiated. Thanks! |
| Comment by Patrick Farrell (Inactive) [ 24/Jul/19 ] |
|
"as the lower write latency could improve application performance if they are blocked on the log writes" But, then again, I suppose APPEND is being used because multiple threads might be writing to this file, and in that case, we're going to spam syncs from lock contention, so it would be very good to be on flash rather than spinning disk. Hmm. It seems like a "nice to have", but it does seem useful. |
| Comment by Patrick Farrell (Inactive) [ 24/Jul/19 ] |
|
Unfortunately, choosing how many to instantiate isn't an option - If it were easy to do that, we probably wouldn't be having this conversation. Not instantiating the whole layout for appends is quite challenging for subtle reasons around eliminating the possibility of write-splitting in the scenario where more than one client is writing at once and some writes span the boundary between instantiated and uninstantiated components. The only ways we came up with of doing that which are sure to work are fairly heavy handed. (The details and things we tried are buried in the ~40 previous comments on this bug.) So Andreas suggested we sidestep this by creating a special layout, as almost all O_APPEND users probably prefer this anyway, and this is straightforward to implement. |
| Comment by Andreas Dilger [ 24/Jul/19 ] |
My goal here is to keep the interface as simple as possible so that it can be implemented, tested, landed, backported, and deployed in a reasonable time, since the more complex it is, the more chance of bugs, the longer the testing, and the more conflicts/dependencies. I guess I would be OK with append_layout_enable=<stripe_count> instead of only 0, 1 (0 has the added benefit of meaning "use the default stripe count" anyway), though there are performance optimizations that can be done for one-stripe files (e.g. patch https://review.whamcloud.com/31553 "LU-10782 llite: simple layout append tiny write") that would be lost if the files are striped across OSTs. I guess append_layout_pool=<pool> wouldn't be too much more complex.
I'd rather avoid the full spectrum of "create complex file layout via tunable parameters" for this issue. I think the stripe_width is so rarely used, and is irrelevant for 1-stripe files, that it doesn't warrant a separate parameter - we can use the filesystem default stripe_width for this. If we want to allow a complex file layout at some point in the future, this could be done with append_layout_enable=[FID] and LU-9982, but I don't think that is needed for most cases. We could also leverage patch https://review.whamcloud.com/33126 "LU-11234 lod: add pool selection layout_policy" to specify a different pool for specific file extensions like *.log, *.out, etc., and that could also integrate with LU-9982 to specify a complex layout, but neither of those patches is itself finished, they don't necessarily catch all log files (which overwhelmingly use O_APPEND), nor are they necessarily simple enough to make a trivial backport to 2.10.
Nathan, did you ever check/post the file size distribution for O_APPEND log files at your site? I see comment-245248 from Stephane, but nothing from you here or in the other DDN tickets in Jira. If all the log files are small (Stephane reported 92% smaller than 13KB and 100% smaller than 128KB), then the OST they are located on doesn't really matter, and storing them on flash might make sense anyway because that is faster and applications will eventually be blocked on their stdout/stderr if it is taking a long time. If there is an upper limit on the size of such log files, that would avoid filling flash OSTs/MDTs, and be a generally useful feature as well (IMHO at least). |
| Comment by Nathan Dauchy (Inactive) [ 24/Jul/19 ] |
|
Andreas, no, I don't have a size distribution for O_APPEND files, sorry. The users scatter and name their slurm job output files in various ways so I would have to scan the whole file system and guess at naming conventions. And even that might not catch other files that happened to be created with O_APPEND. If you can suggest a way to scan the filesystem for files which are not using their last instantiated extent, I'm happy to try to provide more data. Capping the size on O_APPEND files is potentially useful, but also violates the principle of least surprise on a POSIX-like filesystem, and would lead to very unhappy and confused users if writes fail unexpectedly. Hence my suggestion of "truncating" the PFL layout to N extents, and keeping the extent end of the last component. Hopefully it would be fairly easy to take the layout that would otherwise be created on O_APPEND and just set the layout to the first N components, modifying the last component end to be the original layout's last component end. No additional pool specification needed, no max size limit surprises, the system "just works" for the users. |
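To make Nathan's idea concrete, a "truncated" two-extent layout of this kind could be set per directory along these lines (the pool names and the 16MB boundary are hypothetical, not taken from the comments above):
$ lfs setstripe -E 16M -c 1 -p flash -E -1 -c 1 -p hdd /scratch/job_output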
| Comment by Gerrit Updater [ 25/Jul/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35611 |
| Comment by Gerrit Updater [ 25/Jul/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35617 |
| Comment by Stephane Thiell [ 25/Jul/19 ] |
|
Hello! I wanted to clarify something regarding the file distribution of our O_APPEND files. I'm sorry but I originally only scanned for files <= 128KB. I redid a partial scan last night and this is the new results: [root@fir-rbh01 data_hacks]# cat /tmp/sz | ./histogram.py -l -b 20 -p
# NumSamples = 226682; Min = 0.00; Max = 517320932.00
# Mean = 21389.221213; Variance = 1342429682961.335938; SD = 1158632.678186; Median 1218.000000
# each ∎ represents a count of 1457
0.0000 - 493.3562 [ 50221]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (22.15%)
493.3562 - 1480.0685 [109348]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (48.24%)
1480.0685 - 3453.4931 [ 17414]: ∎∎∎∎∎∎∎∎∎∎∎ (7.68%)
3453.4931 - 7400.3424 [ 2219]: ∎ (0.98%)
7400.3424 - 15294.0409 [ 43661]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (19.26%)
15294.0409 - 31081.4379 [ 1009]: (0.45%)
31081.4379 - 62656.2319 [ 777]: (0.34%)
62656.2319 - 125805.8200 [ 614]: (0.27%)
125805.8200 - 252104.9961 [ 372]: (0.16%)
252104.9961 - 504703.3483 [ 266]: (0.12%)
504703.3483 - 1009900.0527 [ 307]: (0.14%)
1009900.0527 - 2020293.4616 [ 215]: (0.09%)
2020293.4616 - 4041080.2794 [ 44]: (0.02%)
4041080.2794 - 8082653.9150 [ 27]: (0.01%)
8082653.9150 - 16165801.1862 [ 186]: (0.08%)
16165801.1862 - 32332095.7286 [ 1]: (0.00%)
32332095.7286 - 64664684.8134 [ 0]: (0.00%)
64664684.8134 - 129329862.9829 [ 0]: (0.00%)
129329862.9829 - 258660219.3219 [ 0]: (0.00%)
258660219.3219 - 517320932.0000 [ 1]: (0.00%)
It's just to clarify that we still have files like that > 128KB, but the fact remains that most of the files that are suspected to be opened in O_APPEND are small files. I'm for a simple solution that can land quickly, rather than nothing or a complex one. And I don't mind impacting the performance of our users doing open(O_APPEND) anyway.
@dauchy, If you want to check the distribution of such files, this is how I do it: determine the size of your PFL setting where all components are instantiated, say it's 100GB. Then determine the max number of components (in the example below, 6). We want to scan all files that are smaller than that and that have all of their components initialized (lcme_flags as "init"). Those files are either files that were opened with O_APPEND or files that were big and then truncated - but I assume here that the latter is rare. Then, I run something like this:
$ find /lustre -size -100G -type f -exec ./chkstripe.sh 6 {} \;
chkstripe.sh being:
#!/bin/bash
# Usage: chkstripe.sh <expected_init_count> <path>
# Print size and path of files whose number of initialized ("init") PFL
# components equals the expected count, i.e. fully instantiated files.
initcnt=$1
path=$2
c=$(lfs getstripe "$path" | grep lcme_flags: | grep -c init)
if [[ $c == $initcnt ]]; then
  sz=$(stat -c '%s' "$path")
  echo $sz $path
fi
|
| Comment by Andreas Dilger [ 26/Jul/19 ] |
|
I thought I had commented yesterday, but I guess I got distracted and forgot to commit it.
I was going to suggest using "lfs find" to find files that have multiple PFL components and a full stripe count, but a smaller size than needed to instantiate the last component. However, it looks like I found a small bug in how lfs find checks the stripe count of PFL files, which doesn't match the expected behaviour. The attached patch fixes this issue; it only changes lfs and does not need any changes to the client or server code.
With this patch, you can run something similar to "lfs find /lfs1 -type f -comp-count +1 -stripe_count=M -size -N", where M is the stripe count of the last component (i.e. the file is fully instantiated) and N is a size smaller than what would normally be needed to instantiate that component. This will list files that are (very likely) created with O_APPEND. |
| Comment by Nathan Dauchy (Inactive) [ 27/Jul/19 ] |
|
I finally have some data for file size distribution for the NOAA system. Update & Corrections: Reducing lru_max_age and dropping cache a few times on the client prevented the eviction and hang, and 'stat' provides more accurate file sizes. I restriped the 3 largest files and reran to get a clearer picture. Total scan took ~12 hours. I used a scanning method combining the suggestions from Andreas and Stephane, and adding in some parallelism, running multiple instances of the following at project subdir levels:
lfs find $proj -type f -size -34359738360 -stripe_count 46 -comp-count 5 |
xargs -r -P 8 -I {} /bin/bash -c '[ $(lfs getstripe -c {}) -eq 32 ] && stat -c "%s %n" {}'
Here are the results with the same histogram reporting tool so you can see the differences from what Stephane reported: # find results/ -type f -size +0 | xargs awk '{print $1}' | ./histogram.py -b 20 -l --no-mvsd -p -f "%10i"
# NumSamples = 1377552; Min = 0.00; Max = 356778002.00
# each ∎ represents a count of 2435
0 - 340 [ 3802]: ∎ (0.28%)
340 - 1020 [ 89830]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (6.52%)
1020 - 2381 [164156]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (11.92%)
2381 - 5103 [182637]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (13.26%)
5103 - 10547 [133979]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.73%)
10547 - 21435 [156886]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (11.39%)
21435 - 43211 [127305]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.24%)
43211 - 86763 [ 72366]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (5.25%)
86763 - 173867 [127675]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.27%)
173867 - 348076 [ 84564]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (6.14%)
348076 - 696492 [ 23804]: ∎∎∎∎∎∎∎∎∎ (1.73%)
696492 - 1393325 [ 20384]: ∎∎∎∎∎∎∎∎ (1.48%)
1393325 - 2786990 [118037]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.57%)
2786990 - 5574321 [ 63407]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (4.60%)
5574321 - 11148982 [ 2558]: ∎ (0.19%)
11148982 - 22298306 [ 2012]: (0.15%)
22298306 - 44596952 [ 3835]: ∎ (0.28%)
44596952 - 89194245 [ 288]: (0.02%)
89194245 - 178388830 [ 8]: (0.00%)
178388830 - 356778001 [ 18]: (0.00%)
|
| Comment by Stephane Thiell [ 27/Jul/19 ] |
|
Thanks Andreas! This is awesome. We might be able to do a full scan with that. I'll let it run and report back! |
| Comment by Nathan Dauchy (Inactive) [ 27/Jul/19 ] |
|
Some further analysis of the file types which are likely opened with O_APPEND... Most of the files overall are indeed "log" files (some created by slurm, some by user scripts) or have a suffix that leads me to believe they are batch job stdout/stderr:
# find results/ -type f -size +0 | xargs cat | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
1100802 log
190166 txt
13387 grib2
5102 grb2
4396 0
2176 1
1218 2
937 3
771 out
634 err
478 41
337 mk
230 sh
151 conf
142 json
# find results/ -type f -size +0 | xargs cat | egrep -c "\.o[0-9]{5,8}$"
50619
# find results/ -type f -size +0 | xargs cat | egrep -c "\.e[0-9]{5,8}$"
1027
The majority of the files greater than 10 MB are ".grb2" or ".grib2" binary files. I don't know if the app that creates those uses append for multi-writer reasons similar to slurm.
# find results/ -type f -size +0 | xargs awk '{if ($1>10000000) print }' | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 4
5102 grb2
496 log
57 grib2
31 out
|
| Comment by Stephane Thiell [ 31/Jul/19 ] |
|
Thanks Nathan for the improved commands! I also ran the command using the patched lfs. This time I was able to get a list of such files on the whole filesystem (444M inodes total); however, for the size profile I only have a partial scan at this point. I ran:
# lfs find /fir -type f -comp-count +1 -stripe_count=16 -size -200G >/tmp/lfs_find_list
Results: 15.3M files are like that (~3.5%). Suffix analysis:
# cat /tmp/lfs_find_list | sed 's/.*\.//' | sort | uniq -c | sort -nr | head -n 15
1495272 rst
1462325 ppairs
1428921 csv
1306429 txt
1065879 score
1065836 align
1031148 out
743603 tmp
591077 input
476363 log
266451 sam
261306 err
179451 html
173386 stdout
84526 bash
We've identified that the .csv files are being generated by https://github.com/Sage-Bionetworks/synapser and our researchers are in touch with the developers to avoid the O_APPEND in that case (as it's really not needed). The out/log/err ones are mostly Slurm logs. We plan to have a look at the other file types. As for the size profile analysis, this is the result on almost 10% of them:
# cat /tmp/sz_all | ./histogram.py -b 20 -l --no-mvsd -p -f "%10i"
# NumSamples = 1372517; Min = 0.00; Max = 206874793928.00
# each ∎ represents a count of 17442
0 - 197291 [1308181]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (95.31%)
197291 - 591874 [ 10154]: (0.74%)
591874 - 1381039 [ 24264]: ∎ (1.77%)
1381039 - 2959370 [ 7160]: (0.52%)
2959370 - 6116032 [ 13260]: (0.97%)
6116032 - 12429356 [ 8819]: (0.64%)
12429356 - 25056003 [ 426]: (0.03%)
25056003 - 50309298 [ 14]: (0.00%)
50309298 - 100815887 [ 21]: (0.00%)
100815887 - 201829067 [ 42]: (0.00%)
201829067 - 403855425 [ 19]: (0.00%)
403855425 - 807908143 [ 25]: (0.00%)
807908143 - 1616013577 [ 2]: (0.00%)
1616013577 - 3232224446 [ 15]: (0.00%)
3232224446 - 6464646184 [ 35]: (0.00%)
6464646184 - 12929489659 [ 29]: (0.00%)
12929489659 - 25859176611 [ 4]: (0.00%)
25859176611 - 51718550513 [ 2]: (0.00%)
51718550513 - 103437298318 [ 31]: (0.00%)
103437298318 - 206874793928 [ 14]: (0.00%)
|
| Comment by Nathan Dauchy (Inactive) [ 02/Aug/19 ] |
Patrick, Andreas, Please do put a parameter to control the pool for O_APPEND files into the patch. Based on the scan data from our system and that Stephane provided, I think we (and others using SSDs for the first component of PFL) could accommodate all of these files on the SSD OST pool, which would handle the small writes of log entries much better. I'm OK with handling those few "large" ones with periodic scans and "lfs migrate" to HDD to prevent the SSDs from filling up... at least for a while until a more extensive change for this issue is tackled. Thanks for your consideration, |
| Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ] |
|
OK, that seems like a reasonable request. adilger, let me know if you have any objections, but I'll see about adding it to the current patch. |
| Comment by Patrick Farrell (Inactive) [ 06/Aug/19 ] |
|
spitzcor and vitaly_fertman of Cray indicated they're still interested in taking aim directly at the problem, and asked me to break down the approach I had most recently settled on. I'm still a fan of the current "special layout for O_APPEND" patch, but this fix could be complementary (and it may not be available for a while - the current plan is not too bad, but it's not trivial either). So I'll do that here.
There are two basic problems/constraints we are trying to meet. Append writes must be atomic, meaning two things:
1. They must not be torn, i.e. a single append must not be split across different offsets because EOF moved while it was in progress.
2. They must not be mixed with other O_APPEND writes. If two O_APPEND writes, A and B, are racing, either AB or BA is valid, but it is not valid for any part of A or B to overwrite the other one.
This is of course not true of regular writes, which are started at a specific offset. This means that if a regular write is racing with an O_APPEND write, it can overwrite part of the O_APPEND write. This is acceptable.
The first problem (write tearing) is solved by getting the file size before starting the write and using the same one throughout the O_APPEND write. (The current code checks the file size repeatedly, I believe for every iteration of cl_io_loop.) This means that if another write races with our O_APPEND write we will not 'tear' the O_APPEND write by moving to a new EOF in the middle of the write. Note that we must also retain the size across i/o restarts, for the case where we have to update the layout in the middle of an O_APPEND write.
The second problem requires that we allow only one O_APPEND write at a time. There are probably a few ways to solve this, but I think the correct way (it is definitely the simplest way) is to add another bit to the MDT IBITS lock, an O_APPEND bit. All O_APPEND writes must ask for this bit in PW mode before starting to write. (We cannot use the layout lock for this exclusion because the server revokes our layout lock when we have to update the layout.) This lock must be held across i/o restarts, so it should be taken before the layout lock. (Or it could possibly be taken with the layout lock? We have to be careful about ordering/pairing issues with the layout and append bits; I have not thought about this carefully yet.)
Note that excluding O_APPEND writes does require excluding multiple O_APPEND writes on the same node as well. This can be done using the local tree_lock in the write path, locking it from 0 to EOF.
This combination of things should allow not instantiating the full layout and locking every object. It's a fair bit of work. Note of course that this is a split client/server solution, so it will need a compatibility flag so the client knows it can use the O_APPEND lock bit. The good news is that this should interoperate safely with older clients - the older clients simply instantiate and lock everything for O_APPEND, which gives the correct exclusion vs newer clients. |
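As a toy shell illustration of constraint 2 only (the path is just an example, and this says nothing about the proposed locking itself): two processes racing O_APPEND writes to the same file must never interleave within a single write, so every line in the result should be an intact AAAA or BBBB.
: > /mnt/lustre/append_atomicity
for tag in AAAA BBBB; do
    ( for i in $(seq 1000); do echo "$tag" >> /mnt/lustre/append_atomicity; done ) &
done
wait
# with atomic appends this prints exactly two lines: 1000 AAAA and 1000 BBBB
sort /mnt/lustre/append_atomicity | uniq -c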
| Comment by Gerrit Updater [ 09/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35611/ |
| Comment by Patrick Farrell (Inactive) [ 09/Sep/19 ] |
|
Opened LU-12738 to track remaining work, this can be closed once https://review.whamcloud.com/#/c/35617/ lands. |
| Comment by Gerrit Updater [ 20/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35617/ |
| Comment by Peter Jones [ 20/Sep/19 ] |
|
Landed for 2.13 |
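For admins picking this up in 2.13+, the behaviour is controlled by per-MDT tunables; the parameter names below are my assumption of what landed and should be verified against the running release (e.g. with lctl list_param) before use:
lctl list_param -R mdd.*.append*              # confirm the O_APPEND striping tunables exist
lctl set_param mdd.*.append_stripe_count=1    # stripe count used for files created with O_APPEND
lctl set_param mdd.*.append_pool=flash        # optional OST pool for such files ("flash" is an example)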
| Comment by Nathan Dauchy (Inactive) [ 23/Oct/19 ] |
|
Can this get pulled back into b2_10 and/or b2_12 for LTS? |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36935 |
| Comment by Gerrit Updater [ 12/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36935/ |
| Comment by Gerrit Updater [ 13/Dec/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37007 |
| Comment by Gerrit Updater [ 03/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37007/ |
| Comment by Gerrit Updater [ 13/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49935/ |