Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.1.0
- None
- 2.1.0 + with minimal back porting from 2.2
- 3
- 4669
Description
While testing, we hit a situation where recovery never finished and the recovery timer exceeded the hard recovery timer.
00010000:00080000:20.0:1330709620.108824:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:20.0:1330709690.108951:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:20.0:1330709690.120847:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709760.120868:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709760.132776:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709830.131858:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709830.143745:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 70 seconds
00010000:00000400:1.0:1330709900.142871:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709900.154725:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 40 seconds
00010000:00000400:1.0:1330709940.153865:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.165727:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:13.0:1330709940.165827:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:13.0:1330709940.177697:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.178088:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.189941:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:13.0:1330709940.190014:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:13.0:1330709940.201864:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.202082:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
00010000:00080000:1.0:1330709940.213933:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 0 seconds
00010000:00000400:1.0:1330709940.214821:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
...
After analyzing the logs, the hang appears to be in the wait in the target_recovery_overseer function when it is called with check_for_clients() as its argument.
The hang looks like a result of using the following condition:
    if (obd->obd_no_conn == 0 &&
        obd->obd_connected_clients + obd->obd_stale_clients ==
        obd->obd_max_recoverable_clients)
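In the case of the MDT, obd_no_conn is set by post-recovery processing once at least one OST is connected and the config llog has been processed.
However, mdt_postrecov cannot be called because recovery has not finished, so the condition above cannot become true while recovery is still running.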
A second issue in that area is the reset_recovery_timer function.
If there is a race and reset_recovery_timer is called at the same time as recovery should finish, but before the timer fires, we set '0' (and a negative number on the next pass) as the next timer expiry time.
00010000:00080000:1.0:1330709942.007696:0:19858:0:(ldlm_lib.c:1361:reset_recovery_timer()) snxs4-MDT0000: recovery timer will expire in 4294967294 seconds
00010000:00080000:1.1:1330709942.007794:0:9:0:(ldlm_lib.c:1887:target_recovery_expired()) snxs4-MDT0000: recovery timed out; 36 clients are still in recovery after 902s (49 clients connected)
00010000:00000400:13.0:1330709942.007802:0:19858:0:(ldlm_lib.c:1570:target_recovery_overseer()) recovery is timed out, evict stale exports
Issue Links
- is related to LU-1522 ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed (Resolved)