[LU-9397] Inconsistence use of cfs_time_current() and ktime_get_real_seconds() Created: 25/Apr/17 Updated: 02/Mar/18 Resolved: 03/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Amir Shehata (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
" One behavior observed were routers considering peers dead, due to these inconsistent comparison. We'll need to make sure that the LNet and LND code follows a consistent and correct approach. |
| Comments |
| Comment by James A Simmons [ 25/Apr/17 ] |
|
I see what happened. The ksocklnd stuff landed first with the intent of last_alive being moved over to time64_t. Later we change the behavior of LNetEQPoll and lnet ping so it went back to the jiffies based system. The ksocklnd code was left behind. |
| Comment by Gerrit Updater [ 25/Apr/17 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/26813 |
| Comment by Olaf Weber [ 26/Apr/17 ] |
|
It looks like a similar issue exists for ksock_conn::ksnc_tx_last_post. |
| Comment by Olaf Weber [ 26/Apr/17 ] |
|
And in ksocknal_destroy_conn() the calculation in the CERROR() involving last_rcv is wrong. last_rcv is in jiffies, is subtracted from ktime_get_real_seconds(), with the result then being converted from jiffies to seconds. |
| Comment by Amir Shehata (Inactive) [ 26/Apr/17 ] |
|
aside from Olaf's comments, I tested the fix and I don't see the issue any longer |
| Comment by James A Simmons [ 26/Apr/17 ] |
|
Good we are on the right track. I have to look over the ksocklnd driver and do a proper 64 bit time port. |
| Comment by James A Simmons [ 01/May/17 ] |
|
I have been looking at the socklnd code and it looks like everything is pretty much needs only second precision. It looks like everything could be changed into time64_t. Olaf do you see cases where we will need higher resolution? When we want to go to jiffies I can use cfs_time_seconds() as the case for the call to lnet_notify() and ksocknal_query(). Other than that I don't see the need for ktime_t. |
| Comment by Olaf Weber [ 02/May/17 ] |
|
As far as I can tell, the main argument for jiffies is that they're cheaper to get and use. The extra resolution doesn't matter for these cases. Using ktime_get_real_seconds() in particular should get you (assuming proper time sync in place) time stamp values that can (easily) be compared between nodes. That by itself might justify this. |
| Comment by James A Simmons [ 03/May/17 ] |
|
Okay just wanted to verify that was the case for jiffies. The issue with jiffies they can also vary between nodes ( |
| Comment by Gerrit Updater [ 03/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26813/ |
| Comment by Peter Jones [ 03/Jun/17 ] |
|
Landed for 2.10 |