Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.9.0
-
3
Description
The problem occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713).
OSTs are configured with ZFS, MDTs with ldiskfs.
The environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each.
DNE is enabled.
MDS and OSS nodes are configured in active-active HA configuration.
Very long recovery times occurred after two MDS nodes were restarted:
lola-9 : 89:38 min recovery time
lola-10 : 193:55 min recovery time
An OSS failover/failback injected by the framework while the MDTs of the restarted MDS nodes were still in recovery finished successfully in 48 seconds.
Sequence of events:
- 2016-07-19 22:57:52 - lola-10 mds_restart
- 2016-07-19 23:07:21 - lola-10 recovery started
- 2016-07-20 00:41:55 - lola-9 mds_restart
- 2016-07-20 00:51:55 - lola-9 recovery started
- 2016-07-20 00:53:07 - lola-3 oss_failover
- 2016-07-20 00:58:55 - lola-3 recovery started
- 2016-07-20 00:59:43 - lola-3 kernel: Lustre: soaked-OST0007: Recovery over after 0:48, of 23 clients 23 recovered and 0 were evicted.
- 2016-07-20 02:21:35 - lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 89:38, of 22 clients 21 recovered and 1 was evicted.
- 2016-07-20 02:21:17 - lola-10 kernel: Lustre: soaked-MDT0002: Recovery over after 193:55, of 22 clients 20 recovered and 2 were evicted.
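The recovery durations in the kernel messages above can be compared mechanically. The following is a minimal sketch, assuming the "Recovery over after M:SS" field is minutes and seconds (consistent with the 0:48 entry being reported as 48 seconds); the function name `parse_recovery` and the regular expression are illustrative and not part of Lustre.

```python
import re

# Matches the "Recovery over" kernel message quoted in the sequence of events.
# Assumption: the duration field is minutes:seconds.
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) (?:was|were) evicted"
)


def parse_recovery(line):
    """Return a dict describing one recovery message, or None if no match."""
    m = RECOVERY_RE.search(line)
    if m is None:
        return None
    return {
        "target": m.group("target"),
        "seconds": int(m.group("min")) * 60 + int(m.group("sec")),
        "total": int(m.group("total")),
        "recovered": int(m.group("recovered")),
        "evicted": int(m.group("evicted")),
    }


if __name__ == "__main__":
    # The three messages reported in this ticket.
    lines = [
        "lola-3 kernel: Lustre: soaked-OST0007: Recovery over after 0:48, "
        "of 23 clients 23 recovered and 0 were evicted.",
        "lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 89:38, "
        "of 22 clients 21 recovered and 1 was evicted.",
        "lola-10 kernel: Lustre: soaked-MDT0002: Recovery over after 193:55, "
        "of 22 clients 20 recovered and 2 were evicted.",
    ]
    for line in lines:
        info = parse_recovery(line)
        print(info["target"], info["seconds"], "s,", info["evicted"], "evicted")
```

Run against the messages/console logs of the servers, this gives a quick per-target view of which recoveries were outliers.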
Clients were in the following state during recovery:
[root@lola-16 ~]# pdsh -g clients 'lctl get_param *.*.state | grep -A 2 -E "MDT0001|MDT0002" ' | dshbak -c | less -i
----------------
lola-13
----------------
mdc.soaked-MDT0001-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-16
----------------
mdc.soaked-MDT0001-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-19
----------------
mdc.soaked-MDT0001-mdc-ffff88021b0fa800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88021b0fa800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-20
----------------
mdc.soaked-MDT0001-mdc-ffff880c9b13f400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880c9b13f400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-21
----------------
mdc.soaked-MDT0001-mdc-ffff881033e32400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff881033e32400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-22
----------------
mdc.soaked-MDT0001-mdc-ffff88082e455000.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88082e455000.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-23
----------------
mdc.soaked-MDT0001-mdc-ffff8808e6af4400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff8808e6af4400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-24
----------------
mdc.soaked-MDT0001-mdc-ffff88102e910800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88102e910800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-25
----------------
mdc.soaked-MDT0001-mdc-ffff88102f38a400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88102f38a400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-26
----------------
mdc.soaked-MDT0001-mdc-ffff880f7857fc00.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880f7857fc00.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-27
----------------
mdc.soaked-MDT0001-mdc-ffff88081a0eec00.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88081a0eec00.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-29
----------------
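To summarise client import states from output like the above, a small parser can tally the `current_state` values. This is only a sketch: the helper names `import_states` and `tally_states` are hypothetical, and the pattern assumes each `<device>.state=` entry is followed directly by its `current_state:` field, as in the output shown in this ticket.

```python
import re
from collections import Counter

# Pairs each "<device>.state=" entry with the current_state that follows it.
# Whitespace between the two may be a space or a newline.
STATE_RE = re.compile(
    r"(?P<device>\S+)\.state=\s*current_state:\s*(?P<state>\S+)"
)


def import_states(text):
    """Map each import device name to its current_state."""
    return {m.group("device"): m.group("state") for m in STATE_RE.finditer(text)}


def tally_states(text):
    """Count how many imports are in each state."""
    return Counter(import_states(text).values())


if __name__ == "__main__":
    # Two entries copied from the lola-13 output in this ticket.
    sample = (
        "mdc.soaked-MDT0001-mdc-ffff88028add7800.state=\n"
        "current_state: REPLAY_WAIT\n"
        "state_history:\n"
        "--\n"
        "mdc.soaked-MDT0002-mdc-ffff88028add7800.state=\n"
        "current_state: REPLAY_WAIT\n"
        "state_history:\n"
    )
    print(tally_states(sample))
```

Feeding it the saved output of the `pdsh` command makes it easy to confirm that every client import for the affected MDTs is stuck in the same state.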
Attached files:
Console and messages logs of MDS nodes lola-[9,10]
Debug kernel logs of MDS nodes lola-[9,10] and of a single Lustre client node, lola-19
Attachments
Issue Links
- duplicates LU-8250 MDT recovery stalled on secondary node (Resolved)