[LU-11771] bad output in target_handle_reconnect: Recovery already passed deadline 71578:57
Created: 13/Dec/18  Updated: 08/Oct/19  Resolved: 25/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug
Priority: Minor
Reporter: Sergey Cheremencev
Assignee: James A Simmons
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11762 replay-single test 0d fails with 'po... Resolved
is related to LU-9019 Migrate lustre to standard 64 bit tim... Resolved
is related to LU-12769 replay-dual test 0b hangs in client m... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the functions target_handle_reconnect and target_handle_connect, I found incorrect use of Linux kernel time types:

        now = ktime_get_seconds();
        deadline = jiffies_to_msecs(target->obd_recovery_timer.expires) /
                   MSEC_PER_SEC;

Comparing jiffies converted to seconds against seconds from CLOCK_MONOTONIC is incorrect, because the two values come from different time sources.
Jiffies converted to seconds should be used instead of ktime_get_seconds(), so that both sides of the comparison use the same source.
This avoids wrong timeouts in warnings and incorrect timeout comparisons.
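
The mismatch can be illustrated with a small userspace model. This is only a sketch, not Lustre or kernel code: `struct model_clock`, `model_ticks`, `remaining_mixed` and `remaining_same_domain` are hypothetical names, and the tick counter is assumed to start at an arbitrary non-zero value while the seconds clock starts at 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of a tick counter; not kernel code. */
struct model_clock {
	uint64_t hz;            /* ticks per second */
	uint64_t initial_ticks; /* counter value at boot (assumed non-zero) */
};

/* Tick counter value after uptime_sec seconds of uptime. */
static uint64_t model_ticks(const struct model_clock *c, uint64_t uptime_sec)
{
	assert(c->hz > 0);
	return c->initial_ticks + uptime_sec * c->hz;
}

/*
 * Mixed domains: the deadline is derived from the tick counter, but "now"
 * comes from a clock that starts at 0. The boot-time offset of the tick
 * counter leaks into the result.
 */
static int64_t remaining_mixed(const struct model_clock *c,
			       uint64_t expires_ticks, uint64_t uptime_sec)
{
	assert(c->hz > 0);
	return (int64_t)(expires_ticks / c->hz) - (int64_t)uptime_sec;
}

/* Single domain: both deadline and "now" are tick values, so the offset cancels. */
static int64_t remaining_same_domain(const struct model_clock *c,
				     uint64_t expires_ticks, uint64_t uptime_sec)
{
	int64_t delta = (int64_t)(expires_ticks - model_ticks(c, uptime_sec));

	return delta / (int64_t)c->hz;
}
```

With hz = 1000, an assumed initial counter of 500000 and a timer expiring at 160 s of uptime, the same-domain helper reports 60 s remaining at 100 s of uptime, while the mixed helper reports 560 s.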

2018-07-31 18:51:46 [ 8201.235800] Lustre: fs1-OST0000: Recovery already passed deadline 71578:57. If you do not want to wait more, please abort the recovery by force.
...
2018-07-31 18:51:46 [ 8201.236177] Lustre: fs1-OST0000: Denying connection for new client 71f8ec29-a676-0a96-3d1d-97b43c72e168(at 172.18.1.101@o2ib), waiting for 13 known clients (1 recovered, 11 in progress, and 1 evicted) to recover in 71578:57
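
To see how implausible the reported value is, the field can be converted to seconds. The sketch below assumes the field is formatted as minutes:seconds; `deadline_field_to_seconds` is a hypothetical helper, not part of Lustre.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a "minutes:seconds" field (assumed format) into total seconds.
 * Returns -1 if the input is malformed.
 */
static long long deadline_field_to_seconds(const char *field)
{
	long long min, sec;
	char extra;

	assert(field != NULL);
	/* Exactly two numbers must be converted; trailing characters are rejected. */
	if (sscanf(field, "%lld:%lld%c", &min, &sec, &extra) != 2)
		return -1;
	if (min < 0 || sec < 0 || sec > 59)
		return -1;
	return min * 60 + sec;
}
```

For "71578:57" this gives 4294737 seconds, roughly 49.7 days, while the kernel timestamp on the same log line shows the node had been up for only about 8201 seconds.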


 Comments   
Comment by Gerrit Updater [ 13/Dec/18 ]

Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/33848
Subject: LU-11771 ldlm: fix timeout in tgt_handle_re/connect message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3459d62a3cd8313e564942ba6ebc08204b9e0960

Comment by James A Simmons [ 13/Dec/18 ]

What you are suggesting is that the seconds since boot don't match the jiffies mapped to seconds since boot. Note that if you build Lustre on a system with CONFIG_HZ=1000 and install it on a system with CONFIG_HZ=100, this patch will break.
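
The build-time hazard described above can be sketched as follows. This is a userspace illustration with a hypothetical helper, `interpreted_seconds`; it assumes the conversion divides by an HZ value fixed at compile time, which may or may not be the case for a given kernel helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds that code compiled with build_hz would report for an interval of
 * real_sec seconds measured in ticks on a kernel running at run_hz.
 */
static uint64_t interpreted_seconds(uint64_t real_sec, unsigned int run_hz,
				    unsigned int build_hz)
{
	uint64_t ticks;

	assert(run_hz > 0 && build_hz > 0);
	ticks = real_sec * run_hz;  /* what the running kernel counts */
	return ticks / build_hz;    /* what the compiled-in constant yields */
}
```

Under that assumption, a 60-second interval on a 100 Hz kernel is read as 6 seconds by code built for 1000 Hz.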

Comment by Andreas Dilger [ 13/Dec/18 ]

It would also be useful to improve the error message "If you do not want to wait more, please abort the recovery by force." to be more specific, like "Please run 'lctl --device fs1-OST0000 abort_recovery' to force recovery to finish. This evicts clients and may cause application IO errors."
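
A minimal userspace sketch of how such a message could be built from the target name. The wording is taken from the suggestion above; `format_abort_hint` is a hypothetical helper, and real Lustre code would use its own console logging macros rather than snprintf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the suggested hint for the given target name into buf.
 * Returns the snprintf result: the length the full message needs.
 */
static int format_abort_hint(char *buf, size_t len, const char *target)
{
	assert(buf != NULL && len > 0);
	assert(target != NULL && strlen(target) > 0);
	return snprintf(buf, len,
			"Please run 'lctl --device %s abort_recovery' to force "
			"recovery to finish. This evicts clients and may cause "
			"application IO errors.", target);
}
```

Passing the target name from the log above, fs1-OST0000, produces the exact command an administrator would need to run.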

Comment by Andreas Dilger [ 13/Dec/18 ]

This problem was introduced in patch https://review.whamcloud.com/29295 "LU-9019 ldlm: migrate the rest of the code to 64 bit time"

Comment by James A Simmons [ 14/Dec/18 ]

The reason for this is that jiffies is initialized to 5 minutes before the machine actually boots. So jiffies starts at -300 * HZ while ktime actually starts at 0. I will look carefully at the difference between the two some time tomorrow.
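
The offset can be reproduced in userspace. The sketch below mirrors the kernel's INITIAL_JIFFIES expression, (unsigned int)(-300 * HZ), with a 32-bit counter; the helper names are hypothetical and the model ignores the wider counter of 64-bit kernels.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit value of (unsigned int)(-300 * HZ), as in the kernel's INITIAL_JIFFIES. */
static uint32_t initial_jiffies32(int hz)
{
	assert(hz > 0 && hz <= 1000);
	return (uint32_t)(-300 * hz);
}

/* Modelled 32-bit tick counter after uptime_sec seconds (wraps modulo 2^32). */
static uint32_t jiffies32_at(int hz, uint32_t uptime_sec)
{
	return initial_jiffies32(hz) + uptime_sec * (uint32_t)hz;
}

/* Seconds obtained by dividing the raw counter by HZ, with no offset correction. */
static uint32_t jiffies32_as_seconds(int hz, uint32_t uptime_sec)
{
	return jiffies32_at(hz, uptime_sec) / (uint32_t)hz;
}
```

With HZ = 1000 the modelled counter starts at 4294667296 and wraps to 0 after 300 seconds of uptime. At 8201 seconds of uptime the raw counter divided by HZ gives 7901, not 8201, so it cannot be compared directly with a clock that starts at 0.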

Comment by Gerrit Updater [ 14/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33857
Subject: LU-11771 osd: avoid use of HZ in brw_stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2bb2e975b96651ff271020a0b2efbc7e452c71c3

Comment by Gerrit Updater [ 17/Dec/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33883
Subject: LU-11771 ldlm: use hrtimer for recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fd7a6b308cb1094fa4cf0cd20aebe9ede9f779fc

Comment by Gerrit Updater [ 04/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33857/
Subject: LU-11771 osd: avoid use of HZ in brw_stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8d83e946bc96df6535d9f501db400e2196a45668

Comment by Peter Jones [ 04/Jan/19 ]

Andreas's patch just landed. Do we need Sergey's and/or James's too or can they now be abandoned?

Comment by James A Simmons [ 04/Jan/19 ]

The patch that landed was a fix for something else. We need to land the other patch.

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33883/
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1ba794f6ec9e7ce7ad65fd74f170089fffc31d91

Comment by James A Simmons [ 08/Apr/19 ]

Patch landed that resolves this issue.

Comment by Gerrit Updater [ 09/Apr/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34626
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 64d859077f20971360e221b17e7f1fb6e0b0a148

Comment by Gerrit Updater [ 10/Apr/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34629
Subject: Revert "LU-11771 ldlm: use hrtimer for recovery to fix timeout messages"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 78c69a1d9ea4dfe3f34afeacec820e6d56f24ced

Comment by James A Simmons [ 10/Apr/19 ]

So something changed which now makes this patch fail. Even the version backported to 2.12 doesn't have these kinds of failures.

Comment by Gerrit Updater [ 11/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34629/
Subject: Revert "LU-11771 ldlm: use hrtimer for recovery to fix timeout messages"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 71f156e09259ef1e6e78b83cf68442b49c9ab25e

Comment by Gerrit Updater [ 18/Apr/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34710
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e319cc283e33842f0d62eb41067c3a28840f0cf9

Comment by Gerrit Updater [ 25/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34710/
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9334f1d51249c186e15b42a1717312d03385153a

Comment by James A Simmons [ 25/May/19 ]

Fix landed. We just need to let it soak.

Comment by Gerrit Updater [ 20/Jun/19 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/35276
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6c1b9ce62e507b32b1ad867c34f3e1fd9db1eeb9

Comment by Gerrit Updater [ 11/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35276/
Subject: LU-11771 ldlm: use hrtimer for recovery to fix timeout messages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 632bf95a442931b97c2ea4816fa3e56a7853a2a2
