[LU-9397] Inconsistence use of cfs_time_current() and ktime_get_real_seconds() Created: 25/Apr/17  Updated: 02/Mar/18  Resolved: 03/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Amir Shehata (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10707 TCP eth routed LNet traffic broken Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

"LU-6245 libcfs: add ktime_get_real_seconds support" introduced a problem where sometimes we use cfs_time_current() and others ktime_get_real_seconds(). This leads to wrong comparison, leading to strange timing issues.

One behavior observed were routers considering peers dead, due to these inconsistent comparison.

We'll need to make sure that the LNet and LND code follows a consistent and correct approach.



 Comments   
Comment by James A Simmons [ 25/Apr/17 ]

I see what happened. The ksocklnd stuff landed first with the intent of last_alive being moved over to time64_t. Later we change the behavior of LNetEQPoll and lnet ping so it went back to the jiffies based system. The ksocklnd code was left behind.

Comment by Gerrit Updater [ 25/Apr/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/26813
Subject: LU-9397 ksocklnd: make last_alive in units of jiffies
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7006890fcc484a8074a033517a6607e6ace15b8b

Comment by Olaf Weber [ 26/Apr/17 ]

It looks like a similar issue exists for ksock_conn::ksnc_tx_last_post.

Comment by Olaf Weber [ 26/Apr/17 ]

And in ksocknal_destroy_conn() the calculation in the CERROR() involving last_rcv is wrong. last_rcv is in jiffies, is subtracted from ktime_get_real_seconds(), with the result then being converted from jiffies to seconds.

Comment by Amir Shehata (Inactive) [ 26/Apr/17 ]

aside from Olaf's comments, I tested the fix and I don't see the issue any longer

Comment by James A Simmons [ 26/Apr/17 ]

Good we are on the right track. I have to look over the ksocklnd driver and do a proper 64 bit time port.

Comment by James A Simmons [ 01/May/17 ]

I have been looking at the socklnd code and it looks like everything is pretty much needs only second precision. It looks like everything could be changed into time64_t. Olaf do you see cases where we will need higher resolution? When we want to go to jiffies I can use cfs_time_seconds() as the case for the call to lnet_notify() and ksocknal_query(). Other than that I don't see the need for ktime_t.

Comment by Olaf Weber [ 02/May/17 ]

As far as I can tell, the main argument for jiffies is that they're cheaper to get and use. The extra resolution doesn't matter for these cases.

Using ktime_get_real_seconds() in particular should get you (assuming proper time sync in place) time stamp values that can (easily) be compared between nodes. That by itself might justify this.

Comment by James A Simmons [ 03/May/17 ]

Okay just wanted to verify that was the case for jiffies. The issue with jiffies they can also vary between nodes (LU-5443) depending on what the user selected for HZ so I like to migrate away as much as possible from them. Currently I'm working on the patch but I have to go over it very carefully. To easy to break things.

Comment by Gerrit Updater [ 03/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26813/
Subject: LU-9397 ksocklnd: move remaining time handling to 64 bits
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 59c25356208706b4dd0920bc7754f21e2db14a0f

Comment by Peter Jones [ 03/Jun/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:25:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.