[LU-1158] nanosecond timestamp support for Lustre Created: 01/Mar/12  Updated: 03/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: Feng Lei
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9019 Migrate lustre to standard 64 bit tim... Resolved
is related to LU-12922 pjdfstest chown_00: POSIX compliance ... Open
is related to LU-4050 NFS reexport issue Resolved
is related to LU-10934 integrate statx() API with Lustre Resolved
is related to LU-11971 Send file creation time to clients Resolved
is related to LUDOC-92 Nanosecond Time Stamps Doc Changes Resolved
Sub-Tasks:
Key      Summary                                    Type            Status    Assignee
LU-1619  Subtle and infrequent failure of test...   Technical task  Resolved  WC Triage
Story Points: 3
Rank (Obsolete): 4515

 Description   

The current Lustre network protocol has support for a 64-bit timestamp of seconds, but does not have a field for passing the nanosecond timestamp from clients to servers and back again.

It would be relatively straightforward to put 3x __u32 nanosecond timestamps in the reserved fields in struct obdo and struct mdt_body. These fields are currently always initialized to 0, so there wouldn't even need to be a protocol change or feature flag to begin using these fields for nanoseconds: just copy them in/out of the RPC structures, and old clients/servers will just store 0 there and ignore any nanosecond timestamps that are sent to them (no differently than they do today).
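
As a rough illustration of the reserved-field idea, here is a minimal C sketch. The struct demo_body and the demo_* helpers are hypothetical stand-ins invented for this example; they are not the real struct obdo or struct mdt_body layouts. The sketch assumes only what is stated above: the reserved fields are always zeroed by existing code, so a new receiver reads 0 nanoseconds from an old sender.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wire struct, NOT the real mdt_body/obdo layout. */
struct demo_body {
    uint64_t mtime;     /* seconds, already on the wire */
    uint64_t atime;
    uint64_t ctime;
    uint32_t mtime_ns;  /* formerly reserved, always sent as 0 */
    uint32_t atime_ns;  /* formerly reserved, always sent as 0 */
    uint32_t ctime_ns;  /* formerly reserved, always sent as 0 */
    uint32_t padding;   /* still reserved */
};

/* Old sender: zeroes the buffer and fills in whole seconds only. */
static void demo_pack_old(struct demo_body *b, uint64_t sec)
{
    memset(b, 0, sizeof(*b));
    b->mtime = b->atime = b->ctime = sec;
}

/* New sender: same struct size, but also fills the nanosecond fields. */
static void demo_pack_new(struct demo_body *b, uint64_t sec, uint32_t nsec)
{
    demo_pack_old(b, sec);
    b->mtime_ns = b->atime_ns = b->ctime_ns = nsec;
}

/* New receiver: reads 0 when the sender was an old peer. */
static uint32_t demo_unpack_mtime_ns(const struct demo_body *b)
{
    return b->mtime_ns;
}
```

Because the struct size is unchanged, an old receiver in this sketch would simply never look at the three *_ns fields.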

It is more complex to add the nanosecond timestamps to struct ost_lvb, which is most commonly used for glimpse locks (stat) on OST objects. This will require a structure change to fit the extra 3x __u32 nanosecond timestamps into ost_lvb, which may require a protocol change. It may be possible if this structure is passed in a separate ptlrpc message buffer that the larger size will be ignored by older clients, which would avoid the need for additional complexity for interoperability.
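
The size-based interoperability idea can be sketched in C as follows. The demo_lvb_v1 and demo_lvb_v2 structs and both unpack helpers are hypothetical and written only for this example; whether the real ost_lvb is carried in a separate ptlrpc message buffer, so that its length can be inspected this way, is exactly the open question raised above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" LVB: whole-second timestamps only. */
struct demo_lvb_v1 {
    uint64_t size;
    int64_t  mtime;
    int64_t  atime;
    int64_t  ctime;
    uint64_t blocks;
};

/* Hypothetical "new" LVB: old layout plus 3x 32-bit nanoseconds. */
struct demo_lvb_v2 {
    struct demo_lvb_v1 v1;
    uint32_t mtime_ns;
    uint32_t atime_ns;
    uint32_t ctime_ns;
    uint32_t padding;
};

/* Old receiver: copies the prefix it knows and ignores any extra bytes. */
static int demo_lvb_unpack_old(const void *buf, size_t len,
                               struct demo_lvb_v1 *out)
{
    if (len < sizeof(*out))
        return -1;
    memcpy(out, buf, sizeof(*out));
    return 0;
}

/* New receiver: accepts either size; nanoseconds default to 0. */
static int demo_lvb_unpack(const void *buf, size_t len,
                           struct demo_lvb_v2 *out)
{
    if (len < sizeof(struct demo_lvb_v1))
        return -1;
    memset(out, 0, sizeof(*out));
    if (len >= sizeof(*out))
        memcpy(out, buf, sizeof(*out));
    else
        memcpy(&out->v1, buf, sizeof(out->v1));
    return 0;
}
```

The design choice here is to append the new fields after the old layout, so that a prefix copy by an older peer still yields valid whole-second timestamps.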



 Comments   
Comment by Andreas Dilger [ 01/Mar/12 ]

Johann, you were just working on changing the LDLM LVB structure to allow different LVB data to be passed for quota. Does it seem practical to include the modifications necessary to increase ost_lvb in your changes, or is your change not so intrusive (i.e. using a smaller struct than ost_lvb) that adding this ability would increase the complexity of your patch unnecessarily?

Comment by Johann Lombardi (Inactive) [ 02/Mar/12 ]

Right, I've already made changes to the LDLM LVB on the orion_quota branch, although I have not increased the size of it yet.
Yes, I think it makes sense to include the modifications to increase the ost_lvb in the same patch.

Comment by Isami Romanowski (Inactive) [ 03/Jul/12 ]

Patch tracking at http://review.whamcloud.com/3266

LU-1158 general: support for nanosecond m/a/ctimes in Lustre

Comment by Andreas Dilger [ 10/Jul/12 ]

I wanted to start a discussion here about the nanosecond times usage of the reserved fields in mdt_rec_reint. I wasn't aware of it when this project started, but I see during patch inspection that the rr_{m,a,c}time_ns fields consume the last 3 fields in that template struct. In most of the instantiations (mdt_rec_link, mdt_rec_setxattr, mdt_rec_rename, mdt_rec_unlink) there are plenty of other padding fields that could be used if we need to add something to the wire protocol. In a few of them (mdt_rec_create, mdt_rec_setattr) there are only 2 or 3 unused fields left, and in the case of mdt_rec_setattr there are no 64-bit fields at all.

The question is what should we do at this point? Should we hope that we don't need to add too many new fields to these structures? Should we not add the atime_ns field, and just stick with one second resolution for atimes, since we don't even write the atime to disk more frequently than every 60s and keep an extra padding field for future use? Should we work to add more space for padding fields to mdt_rec_reint in a compatible manner to 2.3 so that the use of the *time_ns fields in 2.4 will not consume all of the available space?

Comment by Isami Romanowski (Inactive) [ 10/Jul/12 ]

I think an initial mental exercise that would be good to perform in this case is to try and think of any fields that might need to be added in the future. If we can readily think of anything that would be handy or obvious to add (I can't, but given my inexperience with Lustre and high-performance computing in general I don't think I count), then it would imply that work should begin to add more space to mdt_rec_reint. Otherwise, if there are no extensions that may have to be made to mdt_rec_reint in the near future (as I suspect the case may be; prophecies about software are hard to tell), then I posit that at least removal of the atime_ns field is not the correct choice. While it is only updated on-disk every 60s, there is apparently a desire or need by some to have this kind of resolution for their timestamps, and at least having the nanosecond times could tell one of these highly-granular-time-needing users the most recent atime (down to the nanosecond) within that 60s between disk updates. Leaving atime out of the nanosecond club when mtime and ctime have nanoseconds also seems somewhat inconsistent.

Since I tend to favor compatibility with future expansion, I would vote for working to add more space for padding fields into mdt_rec_reint in a manner compatible with 2.3. However, I don't know quite how much work that will entail; if it would be quite an undertaking, then it would seem prudent to instead leave the structures as-is and cross the bridge of expanding their size later when some fancy new feature requires it.

Comment by Peter Jones [ 26/Aug/12 ]

Landed for 2.1.3 and 2.3

Comment by Andreas Dilger [ 26/Aug/12 ]

Peter, just the patch to reserve the flag for this feature was landed, to avoid conflicts with other features being developed. The actual code is not landed yet.

Comment by nasf (Inactive) [ 19/Oct/12 ]

We need to re-make the nanosecond timestamp patch against the variable sized LVB patch.

Comment by nasf (Inactive) [ 12/Apr/13 ]

Do we want to support it in Lustre-2.5 ?

Comment by Artem Blagodarenko (Inactive) [ 19/Mar/18 ]

Are there currently any plans to rewrite the patch against the variable sized LVB patch?

Comment by James A Simmons [ 19/Mar/18 ]

This overlaps with the 64-bit time work I have been doing. Now that we support ktime_t this can easily be handled. Just as a note, DO NOT use 32-bit fields for nanoseconds. This will not work after 2038, and because of that upstream will reject the patch.

Comment by Artem Blagodarenko (Inactive) [ 19/Mar/18 ]

simmonsja thanks a lot for the answer! I am going to look at LU-9019 now.

Comment by Andreas Dilger [ 19/Mar/18 ]

James, I don't understand your comment. Why would we ever want more than a 32-bit field for nanoseconds? Surely there can't be more than 2^32 nanoseconds in a second? I understand that there can be leap seconds and other time adjustments that might result in over 10^9 nanoseconds in a second, but using a full 64-bit field for nanoseconds in the network protocol is just a waste of space.

Comment by James A Simmons [ 20/Mar/18 ]

Oh I see. You want to basically transmit struct timespec64 over the wire. I was thinking in terms of nanoseconds since the epoch being transmitted.

Comment by Artem Blagodarenko (Inactive) [ 20/Mar/18 ]

simmonsja So, do you think LU-9019 helps "transmit struct timespec64 over the wire" somehow?

Comment by Andreas Dilger [ 20/Mar/18 ]

Hmm, when does ns-since-epoch overflow? Maybe 2 extra bits from the 2^30 ns in a 32-bit field... That would simplify the protocol change, and give us 4*140 years extra?

Comment by James A Simmons [ 20/Mar/18 ]

2^32 - 1 nanoseconds gives us 4.294967295 seconds until overflow, so 32-bit nanosecond timestamps on their own are not very useful. As Artem pointed out, LNet already sends 64-bit time in seconds. We can use a 32-bit field to add the nanosecond value alongside the seconds value that is already sent. That is why I compared it to struct timespec64 = { time64_t tv_sec; long tv_nsec }

Artem, all the needed infrastructure to support the Linux kernel 64-bit time handling has been merged into the latest Lustre. Just don't use the cfs time wrappers, since they will be going away.
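
The arithmetic and the seconds-plus-nanoseconds pairing can be checked with a short C sketch. The struct demo_wire_time and the demo_* helpers are hypothetical names used only for illustration; they are not Lustre or kernel definitions.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Hypothetical wire pair, analogous in spirit to struct timespec64. */
struct demo_wire_time {
    int64_t  sec;   /* 64-bit seconds, as already sent today */
    uint32_t nsec;  /* sub-second part, 0..999999999 */
};

/* Whole seconds a bare 32-bit nanosecond counter covers before wrapping. */
static uint64_t demo_u32_ns_range_sec(void)
{
    return UINT32_MAX / DEMO_NSEC_PER_SEC;
}

/* The nanosecond part only ever needs about 30 bits. */
static int demo_wire_time_valid(const struct demo_wire_time *t)
{
    return t->nsec < DEMO_NSEC_PER_SEC;
}
```

A 32-bit field is therefore far too small to hold a full timestamp in nanoseconds, but comfortably holds the fractional part next to a 64-bit seconds field.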

Comment by Andreas Dilger [ 20/Mar/18 ]

Clearly I wasn't thinking of only a __u32 nanoseconds timestamp, since that won't even last until the end of this comment, let alone to Y2038, but rather in addition to the existing 64-bit seconds field, as was discussed in the original description.

On the other hand, passing the entire timestamp as a 64-bit nanosecond timestamp gives us roughly 292 years before signed overflow, and could simplify the wire protocol change (no need to change the size of the fields) with the added complexity that there would need to be a connection flag to indicate if this client is sending timestamps in seconds or nanoseconds. Potentially we could just assume any value larger than 2^32 is going to be nanoseconds (it is unlikely that any current Lustre releases would still be running in 20 years, and conversely 2^60 ns is needed to get to 2006, so they are very unlikely to conflict), but using an OBD_CONNECT_NANOSECONDS flag is not too hard. The main complexity is that this flag would need to be checked in many places, possibly places where the client export is not easily available, so just making a decision based on the size of the timestamp is relatively safe, and the old seconds format could eventually be deprecated with little effort.
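
The size-based decision described above could look roughly like the following C sketch. The names demo_decode_time, demo_time and DEMO_SECONDS_LIMIT are hypothetical, and the 2^32 threshold is taken directly from the comment above rather than from any landed patch. Note that in this sketch negative (pre-1970) values and nanosecond values at or below 2^32 would be read as seconds.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC  1000000000LL
#define DEMO_SECONDS_LIMIT (1LL << 32)  /* threshold assumed from the comment */

struct demo_time {
    int64_t  sec;
    uint32_t nsec;
};

/* Guess the unit of a 64-bit wire timestamp from its magnitude alone. */
static struct demo_time demo_decode_time(int64_t wire)
{
    struct demo_time t;

    if (wire > DEMO_SECONDS_LIMIT) {
        /* Too large to be plausible seconds: treat as nanoseconds. */
        t.sec = wire / DEMO_NSEC_PER_SEC;
        t.nsec = (uint32_t)(wire % DEMO_NSEC_PER_SEC);
    } else {
        /* Old format: whole seconds, no sub-second part. */
        t.sec = wire;
        t.nsec = 0;
    }
    return t;
}

/* Years (365-day) covered by signed 64-bit nanoseconds since the epoch. */
static int64_t demo_ns_range_years(void)
{
    return INT64_MAX / DEMO_NSEC_PER_SEC / (365LL * 24 * 3600);
}
```

With an explicit connection flag such as the proposed OBD_CONNECT_NANOSECONDS, the magnitude test would be replaced by a check of the negotiated flag, at the cost of needing the export wherever a timestamp is decoded.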

Comment by Gerrit Updater [ 04/Dec/23 ]

"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53313
Subject: LU-1158 general: support nanosecond timestamps
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3dc43cd08efc8347cdceed3939be7c33256a081c

Generated at Sat Feb 10 01:14:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.