Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/11/
-
3
Description
An error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)
The following effects can be observed:
- After restart and failover, recovery takes 0.5 - 3 hours to complete on all MDSes (this seems to be correlated with the uptime of the MDS)
- Sometimes only 1 MDT finishes recovery
- Often the recovery never completes
- This is true for all MDSes
- A high proportion of clients is evicted, leading to a large number of job crashes (up to ~25%).
- Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.
The failover and restart events for MDS node lola-11 are listed below. The same 'structure' can be found for the other nodes:
Recovery for secondary MDTs on lola-11
mds_failover : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647 lola-10
Mar 3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
Mar 3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112 lola-10
Mar 3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
Mar 3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652 lola-10
Mar 3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
Mar 3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686 lola-10
Mar 4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
Mar 4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.
mds_failover : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927 lola-10
Mar 4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
Mar 4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519 lola-10
Mar 4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
Mar 4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
------------------
Recovery for primary MDTs on lola-11
mds_failover : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316 lola-11
Mar 3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
mds_failover : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359 lola-11
Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044 lola-11
Mar 3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
---> MDT0007 never recovered
mds_failover : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982 lola-11
Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
Mar 4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.
mds_failover : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
mds_failover : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
mds_restart : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783 lola-11
---> MDT0006 never recovered
---> MDT0007 never recovered
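The recovery durations and eviction counts in the kernel messages above can be tallied with a small script. The following is only a sketch, not part of the soak test framework: the names `Recovery`, `parse_recoveries` and `eviction_rate` are hypothetical, and the duration is assumed to be minutes:seconds, which matches the samples above (e.g. 159:29 for a recovery that ended roughly 2 h 40 min after the failover).

```python
import re
from typing import List, NamedTuple


class Recovery(NamedTuple):
    target: str      # e.g. "soaked-MDT0006"
    seconds: int     # recovery duration in seconds
    clients: int     # clients expected to reconnect
    recovered: int   # clients that recovered
    evicted: int     # clients that were evicted


# Matches the "Recovery over after ..." console message as it appears above.
# Assumption: the duration field is minutes:seconds.
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after "
    r"(?P<minutes>\d+):(?P<secs>\d+), of (?P<clients>\d+) clients "
    r"(?P<recovered>\d+) recovered and (?P<evicted>\d+) were evicted"
)


def parse_recoveries(text: str) -> List[Recovery]:
    """Extract every completed-recovery message found in a log text."""
    return [
        Recovery(
            target=m.group("target"),
            seconds=int(m.group("minutes")) * 60 + int(m.group("secs")),
            clients=int(m.group("clients")),
            recovered=int(m.group("recovered")),
            evicted=int(m.group("evicted")),
        )
        for m in RECOVERY_RE.finditer(text)
    ]


def eviction_rate(records: List[Recovery]) -> float:
    """Fraction of clients evicted across all parsed recoveries."""
    total = sum(r.clients for r in records)
    return sum(r.evicted for r in records) / total if total else 0.0
```

Note that this only counts recoveries that finished; the "never recovered" cases produce no such message and would have to be detected separately, for example by comparing against the list of failover and restart events.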
Attached are the message, console and debug log files (with mask '1') of all MDS nodes (lola[8-11]).
The same situation once ended with the oom-killer starting (see LU-7836).
Issue Links
- is related to LU-7974 Allow failover MDT to connect other MDTs immediately (Resolved)
Soak testing has continued with the b2_8 RC5 build on a reformatted Lustre FS.
There is now only 1 MDT per MDS and 5 OSTs per OSS (unchanged). The MDTs were
formatted with ldiskfs and the OSTs with zfs.
The recovery process no longer stalls, for either MDS restarts or failovers. All
recovery times are now below 2 mins. See the attached file recovery-times-201603-17.