Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/11/
-
3
Description
Error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)
The following effects can be observed:
- After a restart or failover it takes 0.5 - 3 hours for the recovery to complete on all MDSes (this seems to be correlated with the uptime of the MDS)
- Sometimes only 1 MDT finishes recovery
- Often the recovery never completes
- This is true for all MDSes
- A high proportion of clients is evicted, leading to a large number of job crashes (up to ~25%).
- Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.
The failover and restart events for MDS node lola-11 are listed below. The same 'structure' can be found for the other nodes:
Recovery for secondary MDTs on lola-11
mds_failover : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647 lola-10
Mar 3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
Mar 3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112 lola-10
Mar 3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
Mar 3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652 lola-10
Mar 3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
Mar 3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686 lola-10
Mar 4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
Mar 4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.
mds_failover : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927 lola-10
Mar 4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
Mar 4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519 lola-10
Mar 4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
Mar 4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
------------------
Recovery for primary MDTs on lola-11
mds_failover : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316 lola-11
Mar 3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
mds_failover : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044 lola-11
Mar 3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
---> MDT0007 never recovered
mds_failover : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982 lola-11
Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
Mar 4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.
mds_failover : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
mds_failover : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
mds_restart : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
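To compare recovery behaviour across targets, the "Recovery over" kernel messages can be summarised with a small script. The following is only a minimal sketch: the regular expression is derived from the message format quoted in this ticket, the helper name `parse_recovery_lines` is hypothetical, and the duration is assumed to be in minutes:seconds. The sample lines are copied from the lola-11 primary MDT events above.

```python
import re

# Pattern derived from the kernel messages quoted in this ticket
# (assumption: the duration field is minutes:seconds).
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) were evicted"
)


def parse_recovery_lines(text):
    """Return one dict per 'Recovery over' message found in text."""
    records = []
    for m in RECOVERY_RE.finditer(text):
        records.append({
            "target": m["target"],
            "seconds": int(m["min"]) * 60 + int(m["sec"]),
            "total": int(m["total"]),
            "recovered": int(m["recovered"]),
            "evicted": int(m["evicted"]),
        })
    return records


# Sample lines copied from the primary MDT recovery log above.
SAMPLE = """\
Mar 3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
"""

records = parse_recovery_lines(SAMPLE)
for r in records:
    print(f"{r['target']}: {r['seconds']} s, "
          f"{r['evicted']}/{r['total']} clients evicted")
```

Targets that never finish recovery produce no such message, so they have to be identified separately, for example by comparing against the list of mounted MDTs.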
Attached are the message, console and debug log files (with mask '1') of all MDS nodes (lola[8-11]).
The same situation once ended with the oom-killer being started (see LU-7836).
Issue Links
- is related to
LU-7974 Allow failover MDT to connect other MDTs immediately (Resolved)
After ~ 73 hours the recovery process stalled again and led to a continuously increasing allocation of slabs.
The latter effect is handled in LU-7836. NOTE: All kernel debug logs, messages and console logs have been attached to LU-7836.
All events between 2016-03-17 04:38 and 2016-03-18 02:00 were executed with 'mds_restart' and 'mds_failover + wait for recovery'
(wait for recovery means that the recovery process needs to complete on the secondary node before failback).
This is mentioned because the former failback 'mechanism' configured within the soak framework was to fail back immediately after
the server target was mounted successfully on the secondary node. The error actually happens after an 'mds_restart'.
Sequence of events
MDT stalled after time_remaining is 0:
The recovery process can't be interrupted:
For this event the debug log file lola-8-lustre-log-20160318-0240 has been attached.