Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/11/
-
3
Description
Error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)
The following effects can be observed:
- After a restart or failover it takes 0.5 - 3 hours for the recovery to complete on all MDSes (this seems to be correlated with the uptime of the MDS)
- Sometimes only 1 MDT finishes recovery
- Often the recovery never completes
- This is true for all MDSes
- A high proportion of clients is evicted, leading to a large number of job crashes (up to ~25%).
- Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.
The failover and restart events for MDS node lola-11 are listed below. The same 'structure' can be found for the other nodes:
Recovery for secondary MDTs on lola-11
mds_failover : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647 lola-10
    Mar 3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
    Mar 3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112 lola-10
    Mar 3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
    Mar 3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652 lola-10
    Mar 3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
    Mar 3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686 lola-10
    Mar 4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
    Mar 4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.
mds_failover : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927 lola-10
    Mar 4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
    Mar 4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.
mds_failover : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519 lola-10
    Mar 4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
    Mar 4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
------------------
Recovery for primary MDTs on lola-11
mds_failover : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316 lola-11
    Mar 3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
    Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
mds_failover : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003 lola-11
    Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
    Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359 lola-11
    Mar 3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
    Mar 3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
mds_restart : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044 lola-11
    Mar 3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
    ---> MDT0007 never recovered
mds_failover : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982 lola-11
    Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
    Mar 4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.
mds_failover : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193 lola-11
    ---> MDT0006 never recovered
    ---> MDT0007 never recovered
mds_failover : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320 lola-11
    ---> MDT0006 never recovered
    ---> MDT0007 never recovered
mds_restart : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783 lola-11
    ---> MDT0006 never recovered
    ---> MDT0007 never recovered
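For anyone who wants to tabulate these numbers across all MDS nodes, the "Recovery over" kernel messages can be extracted with a small script. The following is only a sketch, assuming the message format matches the lines quoted above and that the duration is minutes:seconds; the function name parse_recovery and the returned field names are made up for illustration and are not part of any Lustre tool.

```python
import re

# Pattern for the kernel message quoted in this ticket, e.g.
# "Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients
#  8 recovered and 8 were evicted."
# Assumption: the duration is minutes:seconds.
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>[^\s:]+): Recovery over after "
    r"(?P<minutes>\d+):(?P<seconds>\d{2}), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) were evicted"
)


def parse_recovery(text):
    """Return one dict per 'Recovery over' message found in text."""
    results = []
    for match in RECOVERY_RE.finditer(text):
        total = int(match.group("total"))
        evicted = int(match.group("evicted"))
        results.append({
            "target": match.group("target"),
            "duration_s": int(match.group("minutes")) * 60
                          + int(match.group("seconds")),
            "total": total,
            "recovered": int(match.group("recovered")),
            "evicted": evicted,
            # Fraction of clients evicted; 0.0 if no clients were known.
            "evicted_fraction": evicted / total if total else 0.0,
        })
    return results


if __name__ == "__main__":
    sample = (
        "Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over "
        "after 7:31, of 16 clients 8 recovered and 8 were evicted.\n"
        "Mar 4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over "
        "after 159:29, of 16 clients 14 recovered and 2 were evicted.\n"
    )
    for entry in parse_recovery(sample):
        print(entry)
```

Targets that never finish recovery produce no such message, so they have to be detected separately, for example by checking which MDTs are missing from the result after each failover or restart window.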
Attached are the message, console and debug log files (with debug mask '1') of all MDS nodes (lola[8-11]).
The same situation once ended with the oom-killer being started (see LU-7836).
Attachments
Issue Links
- is related to LU-7974 Allow failover MDT to connect other MDTs immediately (Resolved)