[LU-9538] Size on MDT with guarantee of eventual consistency Created: 19/May/17  Updated: 05/Feb/20  Resolved: 09/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.0

Type: New Feature Priority: Minor
Reporter: Li Xi (Inactive) Assignee: Li Xi (Inactive)
Resolution: Fixed Votes: 0
Labels: LSOM, patch

Attachments: PDF File 2018 LSOM Test Plan.pdf    
Issue Links:
Blocker
is blocked by LUDOC-402 Add documentation for lazy size on MDT Resolved
Related
is related to LU-10370 "truncate" does not update blocks cou... Resolved
is related to LU-11466 DoM files should not need LSOM sync f... Resolved
is related to LU-11479 Error replicating xattr for /tmp/targ... Resolved
is related to LU-11696 "lfs getsom" returns "24" (sizeof lus... Resolved
is related to LU-12026 verify that MDS stores atime/mtime/ct... Resolved
is related to LU-11190 LSOM size/age accounting histogram Open
is related to LU-11367 integrate LSOM with lfs find Resolved
is related to LU-10934 integrate statx() API with Lustre Resolved
is related to LU-11473 Add ‘lfs getsom’ to the lfs man page Resolved
Rank (Obsolete): 9223372036854775807

 Description   

I believe that size on MDT has been discussed for a long time, and there were even some implementations of it before. I am creating this ticket to discuss it again, because keeping file size on MDTs seems very important for the new policy engine of Lustre (LiPE) that I am currently working on.

LiPE scans MDTs directly and extracts almost all of the file attributes. LiPE then evaluates a series of mathematical expressions based on these attribute values, and the expression values determine which rules the corresponding file matches. This works perfectly for almost all file metadata except the file size, because the MDT doesn't keep file sizes. That is the reason we want to add file size on MDT.

Given the fact that file size on MDT has been discussed for a long time, I believe a lot of the problems/difficulties of implementing this feature have been recognized by people in the Lustre community. And I think it is obvious that implementing a strict size on MDT with strong guarantees is too hard.

For LiPE, I think file sizes with guarantees of eventual consistency should be enough for most use cases, because 1) smart administrators will leave enough margin for data management. I don't think a smart administrator will define any dangerous rule based on the strict file size without enough margins for timestamps and file size. 2) Most management actions can be withdrawn without any data loss. And 3) data removal is usually double/triple checked before being committed. It is reasonable to ask the administrator to double check the sizes of files to be removed on a Lustre client if the file size on MDT is not precise all the time.

Still, we have a lot of choices about how to implement file size on MDT, even if we choose to implement a relaxed/lazy version. I believe that a lot of related work from the past could be reused. I guess using a new extended attribute for file size on MDT might be better than using the i_size in the inode structure, since Data-on-MDT is coming. And file size on MDT should be synced in a couple of scenarios which provide enough consistency guarantees yet at the same time introduce little performance impact, for example 1) when the last file close finishes, and 2) when a significant time has passed since the last sync.

I'd like to work on this when this is fully discussed and a design is agreed by all people involved. Any advice would be appreciated. Thanks!



 Comments   
Comment by Peter Jones [ 19/May/17 ]

Li Xi

This seems like it would benefit from discussion at the upcoming developer sessions at LUG. Will you be ready to discuss it by then?

Peter

Comment by Jinshan Xiong (Inactive) [ 19/May/17 ]

It's natural to have a SoM attribute on the MDS.

The difficulty of maintaining a precise file size on the MDT is that there can be evicted clients that keep writing data to OST objects, therefore it's hard to ensure that the file size is accurate at any given point.

There was a proposal to solve this problem by global eviction - when a client is evicted by the MDT, it should be evicted by all OSTs at the same time. Well this is no longer a feasible solution as long as DNE is enabled.

I'm trying to solve this problem in a different way in the FLR project, where I use versioned RPCs and OST objects to prevent writes from evicted clients. But this solution needs a sync write at file open.

Comment by Jinshan Xiong (Inactive) [ 19/May/17 ]

With regard to LiPE, would you like to briefly describe how it will solve the existing problem of RobinHood? Or is there any HLD like thingy so that I can take a look?

Comment by Li Xi (Inactive) [ 20/May/17 ]

Peter,

Sure, let's discuss it at the developer meeting.

Comment by Li Xi (Inactive) [ 20/May/17 ]

Jinshan,

Thanks for the information. I might not fully understand the details of precise file size. But it seems that precise file size on MDT is still difficult, given the fact that the file size needs to be synced between OSTs and MDT. I believe there are ways to implement precise file size on MDT by introducing some advanced features like FLR or the versioned RPCs that you've mentioned. But since the exact file size can always be double-checked on the client if needed, I still feel that we might not actually need a precise file size on MDT most of the time. At least that is true for LiPE. Do you know of any use case where the file size on MDT needs to be always precise?

Strict consistency is so hard that we might need to spend a lot of time to finish the implementation. And I guess it will require even more time before the implementation becomes stable. During that time period, a lot of opportunities might have been missed. That is why I am wondering whether it is better to implement a file size on MDT with relaxed guarantees.

Comment by Li Xi (Inactive) [ 20/May/17 ]

With regard to LiPE, I would not say that LiPE will solve all the existing problems of Robinhood. They are simply two different implementations of policy engines, each of them has advantages and disadvantages.

Nevertheless, LiPE has some obvious differences from RobinHood, which might bring some extra advantages and disadvantages. The most obvious difference is that LiPE doesn't use extra storage. Instead, it scans MDTs directly. This implementation simplifies a lot of things. For example, the Lustre Changelog is not a hard requirement of LiPE. However, more effort is still needed to use the Changelog for incremental scanning, which is not supported by LiPE now. Even so, I think LiPE should be at least useful in the circumstance that MDTs only have small numbers of inodes. As far as we have tested, if the inode number on a typical MDT is 4 billion (the upper limit now), then the scanning time would be around one hour. I guess this is still acceptable for some use cases. And after incremental scanning is implemented, I think things will become even better.

Comment by Li Xi (Inactive) [ 20/May/17 ]

Jinshan, there will be a presentation about LiPE at the coming LUG. Will you be there? Let's discuss LiPE and SoM further there if so.

Comment by Jinshan Xiong (Inactive) [ 22/May/17 ]

Do you know of any use case where the file size on MDT needs to be always precise?

The benefit of SoM is that the clients don't need to send glimpse RPCs to the OSTs, which is a huge benefit. If we can't guarantee the accuracy of SoM then it's meaningless to store it on the MDT.

I agree that it would be enough for most of the use cases if the file size could be guaranteed to be precise at some point. So I would like to hear more about your design.

For LiPE, we have a 'similar' proposal to scan the namespace and store the results in Lustre. In this scheme, the reint operation will change the Lustre namespace and scanning result in the same transaction, therefore there is no requirement of ChangeLog at all.

I won't go to LUG but I look forward to reading your slides afterwards.

Comment by Li Xi (Inactive) [ 23/May/17 ]

Hi Jinshan,

I am not against the precise SoM. It will surely improve the metadata performance.

I am wondering whether we can implement non-precise SoM first, and then at the next stage improve it to always-precise SoM when everything else is ready? Before precise SoM is ready, the file size obtained from the client can remain the same as before, i.e. fetched from the OSTs. And extra APIs can be added for non-precise SoM for applications that don't care much about accuracy.

Another use case of SoM is OST usage balancing. For a tool that is trying to balance the disk usage of OSTs, the sizes of files are useful, but it is fine even if the file sizes are inaccurate.

Regards,
Li Xi

Comment by Andreas Dilger [ 24/May/17 ]

Li Xi, we used to have "imprecise" file size on MDT in the 1.8 code, but it caused problems because the size was stored in the MDT inode i_size, so anyone backing up the MDT directly via tar or rsync had problems because every Lustre MDT inode appeared to be a huge sparse file. This should be avoided. Also, with DoM it may be that the size of the data in the MDT inode will not match the total file size. This means we should store the SOM size in an attribute that is not i_size.

Beyond just storing the current "imprecise" size on the MDT inode during client write/close operations, it makes sense to enhance LFSCK to reconstruct the size based on the current OST object sizes.

Having an imprecise file size is definitely useful for some uses, but not for POSIX applications, so this size would need to be kept internal to the MDT.

Comment by Andreas Dilger [ 31/May/17 ]

I think there are two main tasks to be implemented here:

  • store the file size and blocks count in an MDT xattr (e.g. LMA) on file close in a similar manner as atime, so only increasing size and/or blocks, and also changing it on a truncate RPC
  • enhance LFSCK to update the file size and blocks count stored in the MDT xattr during layout scanning if it is inaccurate. This should be skipped if the file is currently open.

Then, tools like LiPE, Lester, and Zester can use the MDT size xattr if it is available without having to stat the file or look up the object sizes on the OSTs.

Beyond the improved scanning speed, the other major benefit of storing this on each MDT inode, rather than doing this in userspace is that it will be able to handle new layout types like PFL and FLR automatically without exporting all the gory details into userspace apps.
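
As a rough illustration of the first task, the MDT-side record could be as small as a size and blocks pair packed into an xattr. The struct below is only a sketch with assumed field names and widths, not an actual on-disk format:

/* Illustrative only: a possible fixed-size record for the lazy
 * size/blocks data stored in (or alongside) the LMA xattr on the MDT.
 * The field names and widths are assumptions, not the real format. */
#include <stdint.h>

struct lsom_xattr {
        uint64_t lx_size;       /* apparent file size in bytes */
        uint64_t lx_blocks;     /* disk usage in 512-byte blocks */
};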

Comment by Jinshan Xiong (Inactive) [ 31/May/17 ]

enhance LFSCK to update the file size and blocks count stored in the MDT xattr during layout scanning if it is inaccurate. This should be skipped if the file is currently open.

Can you tell a little bit more about how LFSCK could ensure the file size is accurate?

Comment by Andreas Dilger [ 31/May/17 ]

Jinshan, this isn't related to keeping the size 100% accurate for "size-on-MDS" to return to the client. This is for storing an "approximate size" for use by low-level filesystem scanning tools like Lester, Zester, RobinHood or others, to avoid the need for additional scans of the OST filesystems, and combining object sizes in userspace (which is complicated by the changes in file layout). For many uses (e.g. file purge, file backup/archive/HSM, space balancing, etc) having an approximate file size is enough to make the decision.

Comment by Li Xi (Inactive) [ 10/Jun/17 ]

This is a design for Size on MDT which tries to keep the implementation as simple as possible yet at the same time provides enough guarantees for applications to use SoM. Since we already had an implementation of SoM which was removed for a certain reason, I am going to call this design/implementation LSOM (Lazy Size on MDT).

Design Version 0.1:

1. In the earlier steps of implementing LSOM, LSOM will not be accessible from client. Instead, LSOM values can only be accessed on MDT. But in the future, Lustre specific APIs can be added so as to enable the applications on client side to access LSOM. And in order to speed up metadata accessing of applications which don’t mind inaccurate file sizes, the LSOMs could be used as client side file sizes directly, if the client is mounted with flag lsom_as_size=true. But these features should be implemented in separate patches in the future, not in the first patch of LSOM.

2. The LSOM will be saved as an EA value on MDT. Extending the existing LMA EA would be a suitable solution.

3. LSOM includes both the apparent size and also the disk usage of the file.
4. When updating LSOM, MDT will send RPCs directly to OSTs to get the file size and disk usage.

5. Each in-memory MDT file object has a field LSOM_NEED_SYNC which indicates whether LSOM needs to be synced or not.

6. Each in-memory MDT file object has a field LSOM_SYNC_TIME which is the last LSOM syncing time.

7. A background thread pool on MDT will sync the LSOMs of the files which:
7.1. have true values for LSOM_NEED_SYNC, and
7.2. have LSOM_SYNC_TIME values which are smaller than (CURRENT_TIME - LSOM_SYNC_INTERVAL). LSOM_SYNC_INTERVAL is a time interval which can be configured through /proc or "lctl conf_param". Usually LSOM_SYNC_INTERVAL is 10 seconds or 60 seconds.
That means all LSOM updates will be asynchronous, thus little latency overhead will be introduced for applications. Also, the load of LSOM syncing would be minimal and can be further reduced by increasing LSOM_SYNC_INTERVAL.

8. When a file is being truncated, the LSOM fields will be set to LSOM_NEED_SYNC = true, LSOM_SYNC_TIME = 0. That means, the LSOM should be synced as soon as possible.

9. When MDT detects the last close of a file, the LSOM fields will be set to LSOM_NEED_SYNC = true, LSOM_SYNC_TIME = 0. Together with design 8, this brings several advantages:
9.1. No lfsck support for LSOM is needed, since open & close the file would sync the LSOM automatically.
9.2. If a file is being closed, and no other writing/truncating after that, then after waiting for a minimum time (LSOM_SYNC_TIME_COST, which should be normally much smaller than one second), the LSOM can be assumed as synced.
9.3. If a file is being truncated, and no other writing/truncating after that, then after waiting for a minimum time (LSOM_SYNC_TIME_COST), the LSOM can be assumed as synced.
9.4. Huge and sharp file size changes caused by truncating can be detected from LSOM with a short delay (LSOM_SYNC_TIME_COST).
9.5. Since both truncate and last-close operations are normally not so frequent, the extra load of syncing the LSOM fields would be minimal.

10. Each in-memory client file object has a field CLIENT_LSOM_NEED_SYNC which indicates whether LSOM needs to be synced or not.

11. Each in-memory client file object has a field CLIENT_LSOM_SYNC_TIME which is the time when the client sends the last RPC of setting LSOM_NEED_SYNC flag on MDT.

12. A background thread pool on client will send a RPC which sets LSOM_NEED_SYNC field of the file object on MDT to true, iff the client file object:
12.1. has true values for CLIENT_LSOM_NEED_SYNC, and
12.2. has CLIENT_LSOM_SYNC_TIME value which is smaller than (CURRENT_TIME - LSOM_SYNC_INTERVAL).

13. When a client is writing data (or in other circumstances which might or might not be changing the file size. I am not sure there are other operations like writing, and I am not sure whether the writing client has any certain way to know exactly whether it is changing the file size or not), the CLIENT_LSOM_NEED_SYNC field of the file will be set to true. Since writing operations could be really frequent, this design has the advantages of:
13.1. Not much overhead will be introduced, because only a single extra RPC will be sent to the MDS every LSOM_SYNC_INTERVAL seconds for each file, and the RPC handling procedure is quick.
13.2. After the last write, if there is no other writing/truncating, even if the file is still open, the LSOM can be assumed to be synced after waiting for (2 * LSOM_SYNC_INTERVAL + LSOM_SYNC_TIME_COST).
13.3. Even if multiple clients are writing to the same file from time to time, the syncing of the LSOM will only happen every LSOM_SYNC_INTERVAL seconds.

Comment by Li Xi (Inactive) [ 10/Jun/17 ]

Andreas & Jinshan,

 

I haven't checked the design of the former implementation of SoM. I might be missing something important. Would you please check my design and give some advice? That would be very appreciated!

 

Thanks

Comment by Andreas Dilger [ 12/Jun/17 ]

1. In the earlier steps of implementing LSOM, LSOM will not be accessible from client. Instead, LSOM values can only be accessed on MDT. But in the future, Lustre specific APIs can be added so as to enable the applications on client side to access LSOM. And in order to speed up metadata accessing of applications which don’t mind inaccurate file sizes, the LSOMs could be used as client side file sizes directly, if the client is mounted with flag lsom_as_size=true. But these features should be implemented in separate patches in the future, not in the first patch of LSOM.

It is worthwhile to note that with the new statx() interface it is possible to request "lazy" attributes via AT_STATX_DONT_SYNC, which we might interpret as returning LSOM.
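
For reference, a minimal sketch of how an application might eventually request such "lazy" attributes via statx(). AT_STATX_DONT_SYNC and the statx() wrapper are existing kernel/glibc interfaces (glibc 2.28+), but interpreting the flag as "return LSOM" is only the idea discussed here, not current Lustre behaviour:

/* Sketch: ask for size/blocks without forcing a sync to the servers. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>

int lazy_stat(const char *path)
{
        struct statx stx;

        if (statx(AT_FDCWD, path, AT_STATX_DONT_SYNC,
                  STATX_SIZE | STATX_BLOCKS, &stx) < 0)
                return -1;
        printf("%s: size=%llu blocks=%llu (possibly stale)\n", path,
               (unsigned long long)stx.stx_size,
               (unsigned long long)stx.stx_blocks);
        return 0;
}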

2. The LSOM will be saved as an EA value on MDT. Extending the existing LMA EA would be a suitable solution.

Agree.

3. LSOM includes both the apparent size and also the disk usage of the file.

Agree.

4. When updating LSOM, MDT will send RPCs directly to OSTs to get the file size and disk usage.

I was thinking that in the common case, clients can send their file size to the MDS on close, and the MDS just saves the largest size, similar to atime. This avoids many extra RPCs from the MDS for each file, and in the common case is 100% accurate. If the inode is being truncated, then the MDS also gets an RPC for this and can update LSOM at that time. The only time that LSOM would not be accurate is if the file is open for a long time (no close RPC), or if the client crashes/evicted without a close. I think only in the crash/evict case, the MDS should fetch the size from the OSTs.
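
As a sketch of this "only grow, like atime" rule (the helper name and types below are made up for illustration, not Lustre code):

/* Illustrative helper: merge size/blocks reported by a client at close
 * time into the stored LSOM values, only ever growing them, the same
 * way atime is only moved forward. */
#include <stdint.h>

struct lsom { uint64_t size; uint64_t blocks; };

static void lsom_merge_on_close(struct lsom *stored,
                                uint64_t client_size, uint64_t client_blocks)
{
        if (client_size > stored->size)
                stored->size = client_size;
        if (client_blocks > stored->blocks)
                stored->blocks = client_blocks;
        /* a truncate RPC, by contrast, would overwrite stored->size
         * with the new (possibly smaller) size directly */
}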

5. Each in-memory MDT file object has a field LSOM_NEED_SYNC which indicates whether LSOM needs to be synced or not.

This is where difficulties arise, and is what made the original SOM complex. If the LSOM_NEED_SYNC flag is to be accurate, it will need to be updated synchronously on disk when the file is opened/closed, otherwise the LSOM_NEED_SYNC flag itself can be stale (set or cleared) and will provide misleading information.

6. Each in-memory MDT file object has a field LSOM_SYNC_TIME which is the last LSOM syncing time.

Not sure we need this? Either LSOM_NEED_SYNC is enough, or we can use btime or ctime of the inode to decide whether the inode needs to be sync'd.

7. A background thread pool on MDT will sync the LSOMs of the files which:
7.1. have true values for LSOM_NEED_SYNC, and
7.2. have LSOM_SYNC_TIME values which are smaller than (CURRENT_TIME - LSOM_SYNC_INTERVAL). LSOM_SYNC_INTERVAL is a time interval which can be configured through /proc or "lctl conf_param". Usually LSOM_SYNC_INTERVAL is 10 seconds or 60 seconds.
That means all LSOM updates will be asynchronous, thus little latency overhead will be introduced for applications. Also, the load of LSOM syncing would be minimal and can be further reduced by increasing LSOM_SYNC_INTERVAL.

If there are large numbers of files being created, then this can introduce a fairly high (and mostly unnecessary) load on the MDS, since it will need to scan the filesystem looking for inodes with LSOM_NEED_SYNC and LSOM_SYNC_TIME, and then do num_stripes OST RPCs to fetch the size for each file. The scanning overhead can be reduced through ChangeLogs, especially if there is already a ChangeLog consumer and the logs are just kept a bit longer, but the extra RPC overhead cannot be avoided in this case.

8. When a file is being truncated, the LSOM fields will be set to LSOM_NEED_SYNC = true, LSOM_SYNC_TIME = 0. That means, the LSOM should be synced as soon as possible.

The MDS always gets an RPC when the client is truncating a file, to ensure that it has permission to do the truncate and to avoid truncating a file that is opened for execute on another node (this happens more often than we ever thought). In this case, the new size is already sent to the MDS, so there is no extra work needed beyond storing the size on the MDT inode.

9. When MDT detects the last close of a file, the LSOM fields will be set to LSOM_NEED_SYNC = true, LSOM_SYNC_TIME = 0. Together with design 8, this brings several advantages:
9.1. No lfsck support for LSOM is needed, since open & close the file would sync the LSOM automatically.

I don't agree. We will always need LFSCK support for this, because there is always a chance that LSOM will be totally wrong for some reason (e.g. corruption of OST objects, corruption of MDT inodes, breakage of LSOM algorithm, crash of both client and server during writes, etc). Also, LFSCK is ALREADY doing most of this work today (fetching OST object attributes to the MDS) to verify quota and file layout, so the extra effort of updating LSOM is minimal.

I think one interesting option would be to have LFSCK do a scan of files in the ChangeLog at startup, so that it can recover LSOM data for recently modified files.

9.2. If a file is being closed, and no other writing/truncating after that, then after waiting for a minimum time (LSOM_SYNC_TIME_COST, which should be normally much smaller than one second), the LSOM can be assumed as synced.

This is not true. There may be tens or hundreds of seconds of delay flushing data to disk on the OSTs under heavy load, and it isn't safe to assume the data is sync'd before a crash. Even on the MDS the sync interval may be 5s or more. If the client or OST has buffered a large amount of data and then closes the file, the LSOM would be updated but the data may not be persistent on disk if it fails after close.

9.3. If a file is being truncated, and no other writing/truncating after that, then after waiting for a minimum time (LSOM_SYNC_TIME_COST), the LSOM can be assumed as synced.

Again not true. Even so, there is no need for this.

9.4. Huge and sharp file size changes caused by truncating can be detected from LSOM with a short delay (LSOM_SYNC_TIME_COST).

There is no need for this, because the MDS already gets the new size from the client doing the truncate. Later writes can increase the size again.

9.5. Since both truncate and last-close operations are normally not so frequent, the extra load of syncing the LSOM fields would be minimal.

I don't think the MDS->OSS RPCs are needed in most cases, only if the client is evicted with files open for write. Every client close should send the size to the MDS. It looks like in ll_prepare_close() the client already sends the size and blocks to the MDS on each close, but it doesn't set ATTR_SIZE and ATTR_BLOCKS, even though it should do this (and let the MDS decide what to do with it).

10. Each in-memory client file object has a field CLIENT_LSOM_NEED_SYNC which indicates whether LSOM needs to be synced or not.

I don't think this is needed. This kind of complexity is what made SOM difficult to get correct. The client should just always send the size that it knows when it is closing the file.

11. Each in-memory client file object has a field CLIENT_LSOM_SYNC_TIME which is the time when the client sends the last RPC of setting LSOM_NEED_SYNC flag on MDT.

I don't think this is needed.

12. A background thread pool on client will send a RPC which sets LSOM_NEED_SYNC field of the file object on MDT to true, iff the client file object:
12.1. has true values for CLIENT_LSOM_NEED_SYNC, and
12.2. has CLIENT_LSOM_SYNC_TIME value which is smaller than (CURRENT_TIME - LSOM_SYNC_INTERVAL).

If the client only sends the LSOM on close, it is no worse than NFS is today.

13. When a client is writing data (or in other circumstances which might or might not be changing the file size. I am not sure there are other operations like writing, and I am not sure whether the writing client has any certain way to know exactly whether it is changing the file size or not), the CLIENT_LSOM_NEED_SYNC field of the file will be set to true. Since writing operations could be really frequent, this design has the advantages of:
13.1. Not much overhead will be introduced, because only a single extra RPC will be sent to the MDS every LSOM_SYNC_INTERVAL seconds for each file, and the RPC handling procedure is quick.
13.2. After the last write, if there is no other writing/truncating, even if the file is still open, the LSOM can be assumed to be synced after waiting for (2 * LSOM_SYNC_INTERVAL + LSOM_SYNC_TIME_COST).
13.3. Even if multiple clients are writing to the same file from time to time, the syncing of the LSOM will only happen every LSOM_SYNC_INTERVAL seconds.

The main problem I see here is that this will cause RPCs to be sent to the MDS from every client for every LSOM_SYNC_INTERVAL. Since most applications only have a file open for a short time, it is enough to send this at close time. It would be better to just send a glimpse call from the MDS in such cases.

I think we need to step back and look at what the goal is here. This proposal has gone from "store the approximate size on the MDS that is updated within minutes/hours" which is relatively simple to implement and has low overhead to "work hard to have almost accurate file size on the MDS within seconds" which starts to add a lot of overhead but still doesn't get us far enough to have 100% accurate file size.

For the purpose of HSM and other policy engines, having a file size that is approximately accurate most of the time is enough, and adding more complexity won't necessarily improve functionality any significant amount, IMHO.

Comment by Jinshan Xiong (Inactive) [ 30/Jun/17 ]

If the inode is being truncated, then the MDS also gets an RPC for this and can update LSOM at that time.

In the current implementation, the client sends the truncate RPC to the MDT first and then does the truncate on the OST objects. When the MDT and the OSTs have different ideas about the file size, which one should be trusted? In theory it should trust the OSTs, because the result of a glimpse is always correct, but in that case it would be hard to get the benefit of LSOM.

The only time that LSOM would not be accurate is if the file is open for a long time (no close RPC), or if the client crashes/evicted without a close. I think only in the crash/evict case, the MDS should fetch the size from the OSTs.

It would be difficult to know which files are currently being opened by the evicted clients, unless we implement persistent open.

Comment by Andreas Dilger [ 30/Jun/17 ]

I think the important thing to remember here is that this is lazy SOM, so it shouldn't be used to make life-or-death decisions, and we shouldn't be making the code complex to try and ensure it is always correct. For LSOM, if the MDS returns the size for the file then the client shouldn't even check the OST size. If the client really cares about the size then it should get the size from the OST(s).

If one in a million files is skipped for purge because LSOM is incorrect, that is not fatal. If one in a million files is put into the "mirror" pool instead of the "raid-6" pool because of the wrong size, that is not fatal. It is also fine if the MDS just replies that it doesn't know the size, and the client has to fetch it from the OSTs. That is no worse than what we do today, so if LSOM can avoid 99% of the OST size lookups then that is already a huge win.

As for knowing open files for each client, we already track the open file handles per client, and we would invalidate LSOM for all of its open files before the client is evicted. Then, the next time the file is opened/closed the LSOM would be updated. Updating LSOM at eviction time instead of invalidating it would be more complex, and may cause cascading failures if the MDS is having network problems and becomes blocked on OST RPCs while trying to evict the client.

Comment by Jinshan Xiong (Inactive) [ 03/Jul/17 ]

I see. If we allow some odds of a wrong decision based on LSOM, that should work. I just wanted to raise this question to make it clear.

As for knowing open files for each client, we already track the open file handles per client, and we would invalidate LSOM for all of its open files before the client is evicted.

Can you elaborate on this? When does the client/server invalidate the LSOM attribute after eviction?

There is a known problem in the current Lustre implementation that when a client with open files is evicted, the open handles on the client are not cleared. Applications are still able to access files via those handles.

Comment by Andreas Dilger [ 04/Jul/17 ]

I don't think the MDS currently invalidates anything when the client is evicted, but it does track the open files per client and we can invalidate the LSOM flag on the MDT if the file was opened for write by that client. A more sophisticated MDS could mark the open handle for LSOM invalidation only when it gets a write intent, but that is an optimization not strictly needed for the first LSOM to work.

Comment by Li Xi (Inactive) [ 06/Jul/17 ]

Andreas, I agree with your comments. And I agree LSOM should be very lazy so as to make it very easy and efficient to implement. I will revise the design soon.

Comment by Li Xi (Inactive) [ 30/Oct/17 ]

This is a design for Lazy Size on MDT which tries to keep the implementation as simple as possible. It is worth noting that no guarantee of LSOM accuracy is implied by this design. Thus, any tool that scans LSOM on the MDT should be aware of its lazy behavior to avoid improper expectations. A file that is kept open for a long time might cause an inaccurate LSOM for a very long time. Also, eviction or a crash of a client might leave the close of a file incomplete, and thus might cause an inaccurate LSOM. A precise LSOM can only be read from the MDT when 1) all possible corruption and inconsistency caused by client eviction or a client/server crash have been fixed by LFSCK and 2) the file is not currently open.

Design Version 0.2:

  1. In the first step of implementing LSOM, LSOM will not be accessible from the client. Instead, LSOM values can only be accessed on the MDT. Thus, no interface or logic code will be added on the client side to enable access to LSOM from the client. But in the future, the statx() interface with the AT_STATX_DONT_SYNC flag or Lustre-specific APIs can be added so as to enable applications on the client side to access LSOM. And in order to speed up metadata access for applications which don't mind inaccurate file sizes, the LSOM values could be used as client-side file sizes directly, if the client is mounted with the flag lsom_as_size=true. But these features should be implemented in separate patches in the future, not in the first patch of LSOM.
  2. The LSOM will be saved as an EA value on MDT. Extending the existing LMA EA would be a suitable solution.
  3. LSOM includes both the apparent size and also the disk usage of the file.
  4. Whenever a file is being truncated, the LSOM of the file on MDT will be updated.
  5. Whenever client is closing a file, ll_prepare_close() will send the size and blocks to the MDS. The MDS will update the LSOM of the file if the sizes are being increased.
  6. An enhancement of LFSCK is needed to update the LSOM stored in the MDT xattr during layout scanning if it is inaccurate. This should be skipped if the file is currently open.
  7. Improvement: LFSCK could do a scan of files in the ChangeLog at startup, so that it can recover LSOM data for recently modified files.

Comment by Andreas Dilger [ 30/Oct/17 ]

I think this new proposal is very reasonable. Adding in proper statx() interface as a separate patch would be desirable for many reasons, but in the meantime, the MDS could return LSOM size+blocks in the MDT body without setting OBD_MD_FLSIZE/FLBLOCKS, since this is what statx(AT_STATX_DONT_SYNC) will be using. I also agree that fixing up LSOM with LFSCK is very desirable, but again as a separate patch.

Comment by Gerrit Updater [ 07/Nov/17 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/29960
Subject: LU-9538 mdt: Lazy size on MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 20aa43be8f0ea178b61e13c4df6abfa20d3d1560

Comment by Qian Yingjin (Inactive) [ 07/Nov/17 ]

I have pushed the first version of the LSOM patch.
There is the following problem with the file block count stored in the LSOM EA:
When closing a file, ll_prepare_close() will pack and send the size and blocks to the MDS. However, the block count information may be inaccurate on the client. When writing the file, the client can get the block count by merging the blocks information returned in the LVB of the LDLM lock. But after appending data to the file, the block usage may change. Moreover, the page cache data may not be fully flushed to the OSTs when closing a file. All of this means that the block count on the client may be inconsistent with the sum of the block counts of the data objects, which would have to be obtained from the OSTs.

Any suggestion to solve this problem is welcome!

Comment by Andreas Dilger [ 08/Nov/17 ]

In my experience, the block count stored with the file is used relatively rarely and is not very important to keep accurate. Some tools like cp or tar use it to determine if a file is sparse (if block count < size / blocksize), but little else. It is important to have a non-zero block count for any files with data, so that those tools do not optimize away reads for entirely sparse files.

I think for Lazy SOM it may be enough to write whatever block count the client sees to disk. We should make sure that the OST RPC replies are getting updated size and block data back to the clients. For small files, the client already ensures the block count is non-zero. For large files the client will start sending OST RPCs to flush the dirty data, and shouldn't be off by more than the cache max_dirty_mb * num_writers for each object. That may still be a lot, but at least it will grow with the file size.

Are there policy engine decisions made based on file blocks instead of file size? If this is important to get more accurate, the client could keep a rough estimate of the blocks count based on the number of pages dirtied (possibly up to, say, 0.9 * size / blocksize) so that the LSOM is closer to reality, but the actual size will still be larger when it is sent to the MDT.
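
For example, the rough client-side estimate suggested above could be as simple as the following sketch (illustrative arithmetic only, assuming 4KiB pages and reporting in 512-byte units like st_blocks):

/* Illustrative only: clamp a dirty-page based blocks estimate to
 * roughly 0.9 * size / blocksize, in 512-byte units. */
#include <stdint.h>

static uint64_t estimate_blocks(uint64_t dirty_pages, uint64_t size)
{
        uint64_t from_pages = dirty_pages * (4096 / 512);
        uint64_t cap = (size / 10 * 9) / 512;   /* ~0.9 of fully dense */

        return from_pages < cap ? from_pages : cap;
}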

The next phase would be to have LFSCK update the size and blocks in the background during layout scanning, since it should already have this data.

Final note, the xattr used for LSOM should be consistent with whatever FLR is using, so we don't store this data twice.

Comment by Olaf Weber [ 09/Nov/17 ]

When deciding whether a file is a candidate for HSM migration based on the amount of space it occupies in the filesystem, the blocks on disk are a better measure than file size.

Comment by Li Xi (Inactive) [ 10/Nov/17 ]

Instead of implementing the LSOM feature with no consistency guarantee at all, I think it would be better to provide a guarantee of "eventual consistency". That means that if a file is not updated any more, its file size and block count in LSOM will eventually be synced with the precise file size and block count on the OSTs, after a synchronization process that takes only a finite time to execute. The time interval before final synchronization could be one minute, one hour or one day, depending on the implementation and run-time events. No matter how long the time interval is, having "eventual consistency" guaranteed is still helpful for scanning tools or policy engines.

For example, if LSOM will be synced eventually, the following rule will be safe to evaluate based on LSOM:

Files that 1) have bigger (or smaller) size (or block count) than a certain value and 2) have not been updated for a certain period of time (e.g. one day).

And even if the first scan of the policy missed some of the files because of unfinished LSOM synchronization processes, the next scanning rounds would find these files eventually.

Instead, if LSOM has no guarantee of eventual consistency, some files might never be found by this rule.

In order to provide a guarantee of eventual consistency for LSOM, we need to implement the synchronization method. Since it is unlikely we can implement the synchronization method inside LSOM without introducing overhead, we might need to implement this either in LFSCK or as an external tool. I am not familiar with LFSCK so I am not sure what LFSCK is capable of. But if implementing the synchronization method as an external tool, the design seems straightforward.

We could implement a tool to monitor the Lustre Changelog. Any event that implies an update of file size or block count will trigger an LSOM synchronization process. Possible events would be 1) truncation of the file, 2) closing of a file handle with the write flag. Such an event won't trigger synchronization immediately. Instead, a time period will be waited to avoid a flood of synchronizations, because obviously, multiple events on a single file can be grouped into a single LSOM synchronization process. And also, if the file is actively being changed by other applications, it is not necessary to sync LSOM at this time.

During an LSOM synchronization process, a user-space tool will do the following things:

1) Acquires a lease and opens the file, making sure no other process is changing the file.

2) Calls getattr() to read the updated size and block count from OSTs

3) Closes the file and thus updates the LSOM to the latest value.
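
A conceptual sketch of steps 2) and 3) with plain POSIX calls (the lease handling in step 1) is omitted here; on Lustre the fstat() triggers a glimpse to the OSTs and the close lets the MDT refresh the LSOM):

/* Sketch: open, read the up-to-date attributes from the OSTs, close. */
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int lsom_resync_one(const char *path)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {       /* glimpse to the OSTs */
                close(fd);
                return -1;
        }
        return close(fd);               /* close updates LSOM on the MDT */
}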

What do you think of this design? Is LFSCK able to implement this? Thanks!

Comment by Andreas Dilger [ 10/Nov/17 ]

In order to provide a guarantee of eventual consistency for LSOM, we need to implement the synchronization method. Since it is unlikely we can implement the synchronization method inside LSOM without introducing overhead, we might need to implement this either in LFSCK or as an external tool. I am not familiar with LFSCK so I am not sure what LFSCK is capable of. But if implementing the synchronization method as an external tool, the design seems straightforward.

LFSCK is already able to scan all of the inodes in the filesystem (in sequential order for efficiency), and verify that the MDT layout matches the allocated OST objects, and OST objects have the right UID/GID (maybe needs work to fix projid?). This would only need some small modification to update Lazy SOM size/blocks on the MDT, since it already has the layout and should have all the attributes from the OST objects. I wouldn't want to add duplicate functionality into Lustre for doing this.

Also, it wouldn't be bad to always be running LFSCK in the background at a slow rate (maybe in a mode that is only reporting problems, but fixing LSOM consistency) to detect other errors that may appear over time. If we are accessing a file and all of its objects anyway, then we may as well do multiple consistency checks via LFSCK rather than only update LSOM.

We could implement a tool to monitor the Lustre Changelog. Any event that implies an update of file size or block count will trigger an LSOM synchronization process. Possible events would be 1) truncation of the file, 2) closing of a file handle with the write flag. Such an event won't trigger synchronization immediately. Instead, a time period will be waited to avoid a flood of synchronizations, because obviously, multiple events on a single file can be grouped into a single LSOM synchronization process. And also, if the file is actively being changed by other applications, it is not necessary to sync LSOM at this time.

LFSCK could also be taught to check recent entries from the ChangeLog rather than scanning the whole filesystem, or both. It currently gets the list of inodes to scan from OI Scrub (sequential list of inodes from OSD device), but it could instead get this list from ChangeLog, possibly reading it in chunks and sorting then merging duplicate FIDs to avoid duplicate scanning.

During an LSOM synchronization process, a user-space tool will do the following things:

1) Acquires a lease and opens the file, making sure no other process is changing the file.
2) Calls getattr() to read the updated size and block count from OSTs
3) Closes the file and thus updates the LSOM to the latest value.

There is no need for a lease in this case, or blocking other file access. According to your "eventual consistency" model, the LSOM update could be "as current as possible" and then it would be rescanned again later when any updates are logged.

Comment by Andreas Dilger [ 11/Nov/17 ]

PS - before we add the requirement for time-limited consistency, it is useful to go ahead with completely lazy SOM. That will allow implementing the mechanisms needed to update MDT inode size/blocks, and we can get an idea how big the errors might get.

I suspect that only very small files (under 1 RPC in size, so 1-4MB today) would not have any information about blocks from the OSTs (they will all have an idea about the size). With DoM, these small files would mostly be stored on the MDT and will always have accurate size/blocks.

It would be possible to keep a running total of blocks based on dirty pages for each OSC on the client during write, and send that to the MDS if the client has not gotten any reply back from the OSTs before close. For large files (over max_dirty_mb, but really any client that dirties more than RPC size), they will always need to send some RPCs to the OSTs, and can get blocks information back in the reply, so the amount of error in LSOM blocks count is limited.

I suspect that in most cases the client can easily send accurate-enough information to the MDT that no scanning would be needed at all. Why plan on adding overhead before we even know whether it is needed?

Comment by Gerrit Updater [ 16/Nov/17 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/30124
Subject: LU-9538 utils: Tool for syncing file LSOM xattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3ea8730a2fb158d3072210bdf25d605e924350a6

Comment by Qian Yingjin (Inactive) [ 22/Nov/17 ]

To improve the LSOM xattr sync tool, we have the following proposal:
Use some kind of storage medium to maintain the information about files that need their LSOM data synced on the MDT, i.e. a hashtable (in memory) or an SQLite database.
1. Poll on the changelog device (there is some problem in the current changelog device implementation: LU-10267). Once a record is received, first check whether the target FID is already stored. If so, just update the timestamp; if not, insert the FID with the timestamp into the storage medium.
2. Then, sync the LSOM data of the files which have been silent for a certain threshold (e.g. 600s, NOW - latest_timestamp > 600).
3. Goto 1.
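
As a rough illustration of the aging check in step 2 (the record type, field widths, and threshold below are assumptions, not the tool's actual data structures):

/* Illustrative only: one entry per FID seen in the changelog, and the
 * "silent long enough" test that decides when to sync its LSOM. */
#include <stdbool.h>
#include <time.h>

#define LSOM_SYNC_THRESHOLD 600                 /* seconds of silence */

struct fid_entry {
        unsigned long long seq, oid, ver;       /* FID parts, widths assumed */
        time_t last_seen;                       /* newest changelog record */
};

static bool needs_sync(const struct fid_entry *e, time_t now)
{
        return now - e->last_seen > LSOM_SYNC_THRESHOLD;
}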

There are two choices for maintaining the files needing their LSOM data synced: a hashtable in memory or an SQLite database.
A hashtable is simple and quick, as we know, but we need to limit the total number of cached FID entries.
An SQLite database may also be a good choice, i.e. we can store a huge amount of data; even if the tool crashes due to a system reboot, we will not lose any data if we do not use memory mode.
Which one is better?

Any advice would be appreciated. Thanks!

Comment by Andreas Dilger [ 22/Nov/17 ]

An in-memory hash should be enough. The records will not be dropped from the ChangeLog until after they are processed, so they cannot be lost. At worst a crash might mean duplicate FID scanning, which is fine in this case.

This also avoids problems with MySQL performance keeping up with the MDS.

Comment by Nathan Rutman [ 02/Feb/18 ]

There remain many questions about when, whether, and how far SoM can be trusted, which have delayed the feature from landing for years. Likewise, there are many users that may care more or less about strictness, POSIX adherence, and staleness. Rather than try to choose one answer, how about we go ahead with some or all versions of SoM, but ensure that we indicate the "quality" of the SoM attributes – e.g. we can have confidence flags stored on the MDS for both the size and block count (independently) on a per-file basis:

  • SOM_FL_ROUGH: Approximate, may never have been strictly correct, but a guess is better than nothing
  • SOM_FL_STALE: Known stale – was right at some point in the past, but may be wrong now (e.g. opened-for-write)
  • SOM_FL_STRICT: Known correct; FLR or DoM file (SoM guaranteed)
  • SOM_FL_UNKNOWN: Unknown / no SoM, must get size from OSTs

We can then potentially add mount options that make the stat() behavior selectable: mount -o no_block_count, -o som_stale_ok, etc., or make a special IOCTL that gets a GUESS value, useful for policy engines or space rebalancing.
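
A sketch of how such per-attribute confidence flags might be declared (only the flag names come from the list above; the numeric values are placeholders):

/* Illustrative only: confidence flags for the SoM size and block count. */
enum som_confidence {
        SOM_FL_UNKNOWN  = 0,    /* no SoM data, must ask the OSTs */
        SOM_FL_STRICT   = 1,    /* known correct (DoM/FLR guaranteed) */
        SOM_FL_STALE    = 2,    /* was correct once, may be wrong now */
        SOM_FL_ROUGH    = 4,    /* approximate, better than nothing */
};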

Comment by Nathan Rutman [ 06/Feb/18 ]

Information that we should also track on the MDS:

  • size
  • block count
  • mtime
  • extents

Tracking a client's max written extent range might be very useful for improving dirty data tracking for broad-brush items like tiering data or re-archiving a partial file. For each open-for-write, each client can report its written extent range, and the MDS just chooses the overall min and max (along with the max reported size, and the latest reported mtime). It seems the block count can't be reported by the client, but if we have flags as above to separately indicate the block-count-on-MDS status, that can then be handled with separate mechanisms if needed.
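
For example, merging the per-client reports into a single dirty range on the MDS could be as simple as the following sketch (types and names are made up for illustration):

/* Illustrative only: fold one client's reported written range into the
 * overall [min, max) dirty extent kept on the MDS for the file. */
#include <stdint.h>

struct dirty_range { uint64_t start; uint64_t end; };

static void merge_written_range(struct dirty_range *all,
                                uint64_t client_start, uint64_t client_end)
{
        if (client_start < all->start)
                all->start = client_start;
        if (client_end > all->end)
                all->end = client_end;
}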

Comment by Andreas Dilger [ 22/Feb/18 ]

Nathan, I'm not quite sure what you mean by "should also track on the MDS ... size, block count", since that is already what this ticket is about. You are correct that tracking the blocks count may not be totally accurate (less so for ZFS, since it does delayed allocation on the OSS), but clients will have some idea of the blocks count if the writes take more than a few seconds. The current block count will be returned to the client in each reply for each object, so it would be outdated by at most a few seconds on ZFS (transaction commit interval) and at most the RPC round-trip for ldiskfs on the last client writing to the file (which is why LSOM will save the maximum block count by default). The mtime should also be sent to the MDS as part of normal close operations, and since this is driven by the client's dirty time (not the OSS local clock) it should be accurate.

As for "extents", I'm not sure how that is different than tracking the file size? The layout components themselves will not be instantiated unless there is a write to the component, and the maximum offset of the write is, by definition, the file size.

Comment by Nathan Rutman [ 22/Feb/18 ]

"also" was a poor choice. Reads better without it.
By extents, I meant that it will probably be useful for every client to report the extent range that it dirtied. Very different from the file size: we could take the union of these reports so that the MDS can get an idea of the range of dirty data, maybe as simply as a min and max. Tools like HSM could then know which parts of a file have been dirtied, for future resyncing.

Comment by Nathan Rutman [ 22/Feb/18 ]

Oh, and for block count, I would be happy if the possibly-slightly-stale count was returned and marked as SOM_FL_STALE. Vitaly also suggested that if we wanted to delay the close RPC internally until we get the write commit callbacks from the OSTs, we could then send a fully correct block count.

Comment by Vitaly Fertman [ 28/Feb/18 ]

I think only in the crash/evict case, the MDS should fetch the size from the OSTs.

In fact, the truncate too; it is not protected by close, thus it could get mixed up with I/O and the MDS will not know which one reaches the OST first, the punch or the write.

The same applies to utime.

As for knowing open files for each client, we already track the open file handles per client, and we would invalidate LSOM for all of its open files before the client is evicted. Then, the next time the file is opened/closed the LSOM would be updated. Updating LSOM at eviction time instead of invalidating it would be more complex, and may cause cascading failures if the MDS is having network problems and becomes blocked on OST RPCs while trying to evict the client.

It seems you completely forgot about MDS failure, after which the open files are no longer accurately tracked and therefore cannot be invalidated.

Comment by Vitaly Fertman [ 28/Feb/18 ]

To improve the LSOM xattr sync tool, we have the following proposal:
Use some kind of storage medium to maintain the information about files that need their LSOM data synced on the MDT, i.e. a hashtable (in memory) or an SQLite database.
1. Poll on the changelog device (there is some problem in the current changelog device implementation: LU-10267). Once a record is received, first check whether the target FID is already stored. If so, just update the timestamp; if not, insert the FID with the timestamp into the storage medium.

If you are trying to make SOM have "eventual consistency" this way, it is not valuable, because the changelog is not reliable by itself; it does not guarantee that it gets all the file modifications.

Comment by Vitaly Fertman [ 28/Feb/18 ]

LFSCK is already able to scan all of the inodes in the filesystem (in sequential order for efficiency), and verify that the MDT layout matches the allocated OST objects, and OST objects have the right UID/GID (maybe needs work to fix projid?). This would only need some small modification to update Lazy SOM size/blocks on the MDT, since it already has the layout and should have all the attributes from the OST objects. I wouldn't want to add duplicate functionality into Lustre for doing this.

@Andreas, how is the input for LFSCK ('what needs to be checked') formed? Is it just going in rounds through the inode table again and again, or is it driven by some event?

Comment by Nathan Rutman [ 21/Mar/18 ]

To get this ticket back on track, let's decide what should be in-scope and what should be follow-on. In my opinion:

1. The Design 0.2 is exactly what we should implement first, call it "Rough SoM" (and not Eventual Consistency).

2. We need to incorporate the "accuracy flags" in my comment at this stage, because there will already be all four classes of SoM as soon as this patch lands:

  • Rough SoM from this patch
  • Strict SoM for DoM and FLR
  • Stale SoM for FLR
  • Unknown SoM for any files not yet opened (legacy files)

3. The "eventual consistency" process should probably be moved to a different ticket - I think it may not be necessary at all if lfsck can do the same job, and there is likely to be more discussion about that. Let's let this ticket focus on the Rough SoM, and get that into a release ASAP.

4. Addressing Vitaly's point, Rough SoM does not need to be invalidated in the face of MDS failure or client eviction. We already accept that it is not completely accurate, and therefore don't need to take the synchronous invalidate step (as in FLR), saving that performance impact.

5. The other junk I asked for (mtime, written extents range) doesn't need to be implemented; just thoughts. Maybe we should include as unpopulated fields in data structures, if agreed.

6. As has been discussed throughout this ticket, block count won't be correct, but that's fine. It should also be marked with its own SOM_FL_ROUGH in case we decide to implement a more accurate mechanism later.

Comment by Vitaly Fertman [ 22/Mar/18 ]

Addressing Vitaly's point

 

I suggested invalidating asynchronously without a perf impact, which will reach the disk in most cases, and which in fact moves rough->stale/unknown. The point is that the rough attrs are what we tried (not so hard though) to keep accurate and therefore have some value, whereas once the file is opened for write they are not adequate anymore and I would not look at them even for special purposes.

Also, I do not see much sense in the 4th state; on disk it means no SOM EA, and in memory it is the same as stale - nobody looks at them. Thus there are 3 types of reliability, which I would put in the SOM EA.

Comment by Andreas Dilger [ 23/Mar/18 ]

The initial implementation proposed by Li Xi was not intended to be directly accessible from the client, only for scanning the MDT filesystem directly. I'd prefer that we minimize the complexity initially, so we can start storing the size on the MDT inodes, and it can be refined to be more accurate over time.

Comment by Alex Zhuravlev [ 23/Mar/18 ]

IIRC, at some point we discussed a model where invalidation happens at MDT restart (where we lose the in-core state), so that every SOM attribute tagged with a boot epoch can easily be recognised as valid or invalid. That would bring the majority of the infrastructure with nearly zero recovery complexity.

 

Comment by Vitaly Fertman [ 23/Mar/18 ]

Andreas, the main point is to tag the attributes with a reliability level, on disk. It does not matter much where they are stored; the simple idea is the SOM EA or an extended LOV EA. If they are stored in another place and, when reading the disk, we can distinguish between these 3 levels - that's fine.

Alex, I am afraid that is not too great, as the whole current fs cache is invalidated. For guessed attributes, this has more harm than value.

Comment by Nathan Rutman [ 23/Mar/18 ]

adilger, so then as a minimal thing, are you on board with my comment? Main difference from initial patch being some indication of the inaccuracy.

Comment by Andreas Dilger [ 19/Apr/18 ]

Nathan, yes I'd be on-board with storing a flag with the SOM data to indicate how accurate it is. That means LSOM size/blocks data would be considered "lower class" than guaranteed-to-be accurate FLR/DoM size/blocks data. I don't think that would add to the implementation complexity significantly, and gives us a path forward as this feature evolves.

Initially, the LSOM patch does not return the size to the client at all, but it might be possible to wire this into the statx() interface in the future (see LU-10934), which has an "I don't care about accurate size" flag.

Comment by Andreas Dilger [ 08/May/18 ]

What would also be useful (in a separate patch) is to return the LSOM size/blocks in the stat info as part of ioctl(IOC_MDC_GETFILEINFO) so that "lfs find" can use it (if non-zero, ignore it otherwise). For many purposes, the LSOM size would be enough (i.e. is this file larger than 1GB), but if the file ctime is very recent (< 10 minutes) or the size is very close to the threshold (within 10%?), then it may still make sense to do a stat() on the file to get the accurate size. If it does an open() + fstat() + close(), then this will also update the LSOM attrs, so that the next time "lfs find" is run the attributes will be accurate.

Comment by Alex Zhuravlev [ 08/May/18 ]

>  Alex, I am afraid that is not too great, as the whole current fs cache is invalidated. For guessed attributes, this has more harm than value.
 
We can't trust it anyway, so any application requiring actual attributes (like ls) has to check with the OSTs.
In this sense the description of the ticket isn't quite correct? It's just that some applications may use the potentially stale attributes on their own.

Comment by Vitaly Fertman [ 14/May/18 ]

Alex, that's right, it is more about a guessed SOM which may still be useful for things like RobinHood, and it would be better not to lose it completely.

Comment by Vitaly Fertman [ 29/May/18 ]

I would like to clarify again when the [cm]time logic will be covered. Originally SOM, despite its name, was intended to handle all of the OSS-side attributes. Having a kind of lazy SOM, it looks like there is no need to go to the OSS at all, which is not true, thus not all of the RobinHood issues are resolved, since it has time-based policies. Yes, we can defer it to the next patch, so let's create a ticket. From my point of view it makes more sense to add it right here.

Comment by Vitaly Fertman [ 29/May/18 ]

By "here" I mean the current patch.

Comment by Li Xi (Inactive) [ 01/Jun/18 ]

Paste my comment on the patch here in case for future discussion:

It is clearly understood that LSOM will be inaccurate in many cases, especially when there are concurrent writes/truncates. However, LSOM is designed to be inaccurate in this way. The accuracy of LSOM cannot be trusted at any time. It is only a way to speed up scanning tools or policy engines. Whenever the MDT scanning tools or policy engines need an accurate size, they should check the real file size/blocks.

For example, the policy engine might want to run "lfs migrate" on all of the files that are bigger than 1GB to balance the OST usage. Even though LSOM might be inaccurate, it is very likely that small files have a small LSOM size and large files have a large LSOM size. In the worst case, all of the LSOM values could be wrong. But because most large files have a large LSOM size, it is very likely the policy engine can find most of the files that are bigger than 1GB. If 90% of the files larger than 1GB have an LSOM size larger than 1GB, then the policy engine can find and migrate 90% of the files. And that is totally good enough for OST usage balancing.

After the policy engine gets the list of "suspected" files that could be larger than 1GB by scanning LSOM, it can double check by getting the real file size. If the file size is smaller than 1GB, the policy engine can skip migrating it.

There might be some files that are larger than 1GB but have an LSOM size smaller than 1GB. The LSOM syncing tool could help sync the LSOM eventually, and the policy engine might find and migrate those files in the next loop.

Even with the LSOM syncing tool, the policy engine could still miss some files that are larger than 1GB. But that is totally fine. It is very likely that only a very small fraction of the files is missed. And please note the file system is being changed all the time. Even if a tool could print the files that are larger than 1GB at the scan time point, conditions could change in the next second because of writing/truncating. In this kind of use case, it is impossible to ensure 100% accuracy, and 100% accuracy is entirely unnecessary for these use cases. To get 100% accuracy, the administrator would need to 1) stop all of the I/O, and 2) scan the whole file system using the real file sizes. I can't think of any use case for that. It is really exclusive and really slow, which I don't think is useful for data management of an actively running file system.

I think the current design is already enough for a lot of use cases. Of course, improving LSOM would always be nice. The accuracy that the policy engine can get from scanning LSOM could perhaps be improved from 99% to 99.9%. But we need to think about whether the effort to decrease the inaccuracy from 1% to 0.1% is really worth it. And we currently don't have any data or experience about the accuracy at all. I think we need to land the LSOM feature and use it for the current use cases. We will be able to know whether the accuracy is enough or not soon. Enough statistics can be collected soon for making that decision or judgement at that time. And if the accuracy is obviously not enough for some use cases or in some corner cases, we can improve it later at any time.

Comment by Andreas Dilger [ 17/Jul/18 ]

For the 2.12 release, it would be great if lfs find could be enhanced to use the LSOM data from the MDS when checking -size or blocks. Maybe an lfs find --lazy option could be added to determine if the LSOM data is used or not. At first, this could use the lgetxattr("trusted.som") interface to get the LSOM attr, but eventually this should be converted to use the statx(AT_STATX_DONT_SYNC) interface on the client. That is an internal implementation detail that the user should not care about when using --lazy and can be done at some later time.

Ideally, the use of lgetxattr() would avoid sending an extra RPC to the MDS to fetch the lazy size, but this is not going to be worse than fetching the size from the OSS nodes, as it would only involve a single MDS_GETXATTR RPC (and may already be prefetched to the client).
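
A minimal sketch of the lgetxattr() approach; the xattr name "trusted.som" is from the comment above, but the value layout assumed here is a guess, so a real consumer would need the actual struct definition from the Lustre user headers:

/* Illustrative only: fetch the raw LSOM xattr from a file.  The layout
 * of the value (size then blocks) is assumed, not the real definition. */
#include <sys/types.h>
#include <sys/xattr.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

struct lsom_guess { uint64_t size; uint64_t blocks; };  /* layout assumed */

int print_lsom(const char *path)
{
        struct lsom_guess l;
        char buf[64];
        ssize_t len = lgetxattr(path, "trusted.som", buf, sizeof(buf));

        if (len < (ssize_t)sizeof(l))
                return -1;                      /* no LSOM data recorded */
        memcpy(&l, buf, sizeof(l));
        printf("%s: lazy size=%llu blocks=%llu\n", path,
               (unsigned long long)l.size, (unsigned long long)l.blocks);
        return 0;
}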

Comment by Gerrit Updater [ 30/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/29960/
Subject: LU-9538 mdt: Lazy size on MDT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f1ebf88aef2101ff9ee30b0ddea107e8f700c07f

Comment by Gerrit Updater [ 01/Aug/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32918
Subject: LU-9538 utils: fix lfs xattr.h header usage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 59921f66904c17b77a69f9bb4bc0b0d8676d32f4

Comment by Gerrit Updater [ 06/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32918/
Subject: LU-9538 utils: fix lfs xattr.h header usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cc234da91b6c00cbe681d7352320df94c09dc288

Comment by Gerrit Updater [ 09/Aug/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30124/
Subject: LU-9538 utils: Tool for syncing file LSOM xattr
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: caba6b9af07567ff4cdae9f6450f399cd3ca445e

Comment by Peter Jones [ 09/Aug/18 ]

Landed for 2.12

Comment by John Bent (Inactive) [ 19/Oct/18 ]

"when a significant time has been past since last sync" . Was this value defined?  Is there a config variable?

Comment by Andreas Dilger [ 19/Oct/18 ]

John, the llsom_sync tool, if running on the MDSes, will monitor the Changelog and update the LSOM data by default 10 minutes after the file was modified. It also aggregates updates so that multiple file modifications in the prior 10 minutes do not result in multiple LSOM updates (it is set to the most current size/blocks value).

If the llsom_sync tool is not running, then the majority of new files will still have the LSOM data updated at close, except when there are strange file write orderings (e.g. many clients doing write/truncate/etc.), or the clients crash before they close the file. That update typically happens as soon as the client closes the file on the MDS.

Files will also have their LSOM data updated to the current size/blocks when opened and closed by any client (if it has changed), so it is naturally correcting itself over time. That is all the llsom_sync tool is doing in the end - open and close the file after (presumably) it has stopped being modified. If it is still being modified, or is modified again later, there will be another Changelog record written, and llsom_sync will open/close the file another time.

Comment by Gerrit Updater [ 03/Nov/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33565
Subject: LU-9538 utils: update description of ldiskfs xattrs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b889b2caf3f791083a3785c3c60eb9b78127eca5

Comment by Gerrit Updater [ 10/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33565/
Subject: LU-9538 utils: update description of ldiskfs xattrs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 533a253ad86cd6bb09c3889110312ef375e9590d
