nanosecond timestamp support for Lustre (LU-1158)

[LU-1619] Subtle and infrequent failure of test 39a in sanityn.sh noticed after ns timestamp implementation Created: 10/Jul/12  Updated: 23/Jun/17  Resolved: 23/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Technical task Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: ReviewNeeded
Environment:

CentOS 6.2/x86_64 on Maloo


Issue Links:
Duplicate
duplicates LU-6366 sanityn test_39a: mtime (1426255087) ... Resolved
Story Points: 3
Rank (Obsolete): 10374

 Description   

The results of an automated Maloo test of the nanosecond timestamp patches (LU-1158, http://review.whamcloud.com/#change,3266 and http://review.whamcloud.com/#change,3271, the latter of which includes a revised sanityn.sh for testing nanosecond times) revealed a slight time desynchronization after file renaming in test 39a of the revised sanityn.sh, on the order of 500,000 nanoseconds (about half a millisecond). This failure occurred 1 time out of 100 trials.

Since the previous maximum timestamp granularity in Lustre was one second, and this bug cropped up only 1% of the time during testing, it is highly unlikely that it would be noticed or reproduced without the LU-1158 patches applied: for it to be visible at second granularity, the half-millisecond difference would also have to straddle a second boundary, i.e. the second-resolution timestamps would have to differ. However, since the patches do not change any logic beyond how one timestamp is compared against another, it is possible that this bug already exists undetected in master.

Links to related Maloo test session and sanityn.sh failure:
https://maloo.whamcloud.com/test_sessions/fd767d96-ca51-11e1-a9ff-52540035b04c
https://maloo.whamcloud.com/test_sets/34a6a812-ca53-11e1-a9ff-52540035b04c
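For illustration, below is a minimal sketch of the kind of cross-client mtime check described above. This is an assumption-laden illustration, not the actual sanityn.sh test_39a; the MOUNT1/MOUNT2 variables and the /mnt/lustre and /mnt/lustre2 defaults are assumed, not taken from the test suite.

  # Hypothetical sketch: create a file on one client, rename it, and verify
  # that a second client mount of the same filesystem reports the identical
  # mtime down to the nanosecond.
  MOUNT1=${MOUNT1:-/mnt/lustre}      # assumed default for client mount 1
  MOUNT2=${MOUNT2:-/mnt/lustre2}     # assumed default for client mount 2

  touch $MOUNT1/f39a                        # create the file on client 1
  mv $MOUNT1/f39a $MOUNT1/f39a.renamed      # rename it; mtime must not change

  # stat -c %y prints the full mtime, including the nanosecond part
  mtime1=$(stat -c %y $MOUNT1/f39a.renamed)
  mtime2=$(stat -c %y $MOUNT2/f39a.renamed)

  [ "$mtime1" = "$mtime2" ] ||
      echo "mtime mismatch after rename: '$mtime1' vs '$mtime2'"

A half-millisecond discrepancy between the two values would only be detectable with nanosecond resolution, which is why the failure surfaced after the LU-1158 patches.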



 Comments   
Comment by Andreas Dilger [ 11/Jul/12 ]

Isami, thanks for filing this bug. It is always useful to track issues like this, even if they do not appear to be causing any significant problems, since I suspect it is a sign that the timestamps aren't being handled correctly somewhere, and that could lead to much larger timestamp skews.

Just a note about the "failure rate" reported by Maloo: the percentage means "1 of the past 100 runs of this test failed" (i.e. the test was just run once and failed), so it isn't at all clear whether this means "every run with this patch will fail from now on" or "this was a fluke and the real failure rate is 1 in 10000". If you want statistically significant numbers on how often this fails, please try something like:

 Test-Parameters: testlist=sanityn,sanityn,sanityn,sanityn,sanityn,sanityn,sanityn envdefinitions=ONLY=39 

or similar, so that sanityn test_39* is run several times.

There typically shouldn't be any "drift" in the timestamps in this manner - everything should be derived only from the client clock. There are a couple of places where this might be introduced:
- different nanosecond times on different CPU cores due to clock drift. I doubt this is still the case, but it used to happen in the past
- skew between userspace and kernel timestamps. This could only happen if the test is somehow using the clock from userspace
- a bug in the code allowing the MDS or OSS timestamps to affect the file. This is the most likely cause.

It would be possible to test the last hypothesis by setting wildly different times on the MDS and OSS nodes relative to the client (either or both in the past or future), and then tracing where the bad timestamp is coming from (which should be easy if the timestamps can clearly be distinguished by which node they came from).
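A rough sketch of that debugging approach is below. It assumes ssh access to placeholder hosts named "mds" and "oss" and a client mount at /mnt/lustre; none of these names come from this ticket, and the offsets are arbitrary.

  # Hypothetical debugging sketch: deliberately skew the server clocks so that
  # any server-derived timestamp is immediately recognizable on the client.
  ssh mds date -s 2002-01-01     # put the MDS clock far in the past
  ssh oss date -s 2030-01-01     # put the OSS clock far in the future

  # Run the suspect operation from the client, whose clock stays correct.
  touch /mnt/lustre/skew-test
  mv /mnt/lustre/skew-test /mnt/lustre/skew-test.renamed

  # If the mtime now shows 2002 or 2030, the MDS or OSS (respectively)
  # overwrote the client-supplied timestamp somewhere in the code path.
  stat -c '%y %n' /mnt/lustre/skew-test.renamed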
