Details
- Bug
- Resolution: Duplicate
- Major
- None
- Lustre 2.9.0
- 3
- 9223372036854775807
Description
The problem occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713).
OSTs are configured using zfs, MDTs using ldiskfs.
The environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each.
DNE is enabled.
MDS and OSS nodes are configured in active-active HA configuration.
Very long recovery times occurred after two MDS nodes were restarted:
lola-9 : 89:38 min recovery time
lola-10 : 193:55 min recovery time
An OSS failover/failback injected by the framework while the MDTs of the restarted MDS nodes were still in recovery finished successfully in 48 seconds.
Sequence of events:
- 2016-07-19 22:57:52 - lola-10 mds_restart
- 2016-07-19 23:07:21 - lola-10 , recovery started
- 2016-07-20 00:41:55 - lola-9 mds_restart
- 2016-07-20 00:51:55 - lola-9 , recovery started
- 2016-07-20 00:53:07 - lola-3 oss_failover
- 2016-07-20 00:58:55 - lola-3 recovery started
- 2016-07-20 00:59:43 - lola-3 kernel: Lustre: soaked-OST0007: Recovery over after 0:48, of 23 clients 23 recovered and 0 were evicted.
- 2016-07-20 02:21:35 - lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 89:38, of 22 clients 21 recovered and 1 was evicted.
- 2016-07-20 02:21:17 - lola-10 kernel: Lustre: soaked-MDT0002: Recovery over after 193:55, of 22 clients 20 recovered and 2 were evicted.
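The recovery times reported by the kernel can be cross-checked against the timestamps in the list above. The following is a minimal sketch, assuming the log lines have exactly the shape shown in this ticket and that the "M:SS" value in "Recovery over after" is minutes and seconds (as the 0:48 / 48 seconds case suggests). The names `EVENTS`, `START_RE`, `DONE_RE` and `summarize` are hypothetical helpers for illustration, not part of Lustre or the soak framework.

```python
import re
from datetime import datetime

# Lines copied from the sequence of events in this ticket.
EVENTS = """\
2016-07-19 23:07:21 - lola-10 , recovery started
2016-07-20 00:51:55 - lola-9 , recovery started
2016-07-20 00:58:55 - lola-3 recovery started
2016-07-20 00:59:43 - lola-3 kernel: Lustre: soaked-OST0007: Recovery over after 0:48, of 23 clients 23 recovered and 0 were evicted.
2016-07-20 02:21:35 - lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 89:38, of 22 clients 21 recovered and 1 was evicted.
2016-07-20 02:21:17 - lola-10 kernel: Lustre: soaked-MDT0002: Recovery over after 193:55, of 22 clients 20 recovered and 2 were evicted.
"""

# Assumed line shapes; adjust the patterns if the real logs differ.
START_RE = re.compile(r"^(\S+ \S+) - (\S+)\s*,? recovery started$")
DONE_RE = re.compile(
    r"^(\S+ \S+) - (\S+) kernel: Lustre: (\S+): Recovery over after (\d+):(\d+), "
    r"of (\d+) clients (\d+) recovered and (\d+) (?:was|were) evicted\.$"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def summarize(text):
    """Pair each 'recovery started' line with the matching 'Recovery over' line."""
    started = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        m = START_RE.match(line)
        if m:
            started[m.group(2)] = datetime.strptime(m.group(1), TS_FORMAT)
            continue
        m = DONE_RE.match(line)
        if not m:
            continue
        node = m.group(2)
        finished = datetime.strptime(m.group(1), TS_FORMAT)
        wall = None
        if node in started:
            wall = int((finished - started[node]).total_seconds())
        rows.append({
            "node": node,
            "target": m.group(3),
            # Duration as reported by the kernel message, in seconds.
            "reported_seconds": int(m.group(4)) * 60 + int(m.group(5)),
            # Duration derived from the two timestamps, in seconds.
            "wall_seconds": wall,
            "clients": int(m.group(6)),
            "recovered": int(m.group(7)),
            "evicted": int(m.group(8)),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(EVENTS):
        print(row)
```

On the data in this ticket the two durations agree to within a couple of seconds for all three targets, which suggests the "recovery started" timestamps and the kernel's own accounting are consistent.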
Clients were in the following state during recovery:
[root@lola-16 ~]# pdsh -g clients 'lctl get_param *.*.state | grep -A 2 -E "MDT0001|MDT0002" ' | dshbak -c | less -i
----------------
lola-13
----------------
mdc.soaked-MDT0001-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-16
----------------
mdc.soaked-MDT0001-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-19
----------------
mdc.soaked-MDT0001-mdc-ffff88021b0fa800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88021b0fa800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-20
----------------
mdc.soaked-MDT0001-mdc-ffff880c9b13f400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880c9b13f400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-21
----------------
mdc.soaked-MDT0001-mdc-ffff881033e32400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff881033e32400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-22
----------------
mdc.soaked-MDT0001-mdc-ffff88082e455000.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88082e455000.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-23
----------------
mdc.soaked-MDT0001-mdc-ffff8808e6af4400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff8808e6af4400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-24
----------------
mdc.soaked-MDT0001-mdc-ffff88102e910800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88102e910800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-25
----------------
mdc.soaked-MDT0001-mdc-ffff88102f38a400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88102f38a400.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-26
----------------
mdc.soaked-MDT0001-mdc-ffff880f7857fc00.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880f7857fc00.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-27
----------------
mdc.soaked-MDT0001-mdc-ffff88081a0eec00.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88081a0eec00.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-29
----------------
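To check many clients at once, the `lctl get_param` output for one client can be reduced to a target-to-state mapping. The following is a rough sketch, assuming each parameter line is followed by a `current_state:` line as in the output above. `SAMPLE`, `import_states` and `not_recovered` are hypothetical names used only for illustration, and treating `FULL` as the "fully connected" import state is an assumption stated here rather than something shown in this ticket.

```python
import re

# Per-client output for lola-13, laid out one field per line.
SAMPLE = """\
mdc.soaked-MDT0001-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
"""

# Assumed parameter-name shape: mdc.<target>-mdc-<hex address>.state=
PARAM_RE = re.compile(r"^mdc\.(?P<target>[\w-]+?)-mdc-[0-9a-f]+\.state=$")
STATE_RE = re.compile(r"^current_state:\s*(?P<state>\w+)$")


def import_states(text):
    """Return {target: current_state} for every mdc import found in text."""
    states = {}
    target = None
    for line in text.splitlines():
        line = line.strip()
        m = PARAM_RE.match(line)
        if m:
            target = m.group("target")
            continue
        m = STATE_RE.match(line)
        if m and target is not None:
            states[target] = m.group("state")
            target = None
    return states


def not_recovered(states, healthy="FULL"):
    """List targets whose import is not in the assumed healthy state."""
    return sorted(t for t, s in states.items() if s != healthy)


if __name__ == "__main__":
    states = import_states(SAMPLE)
    print(states)
    print(not_recovered(states))
```

Run against each client's output, this would flag both soaked-MDT0001 and soaked-MDT0002 as still waiting in REPLAY_WAIT, matching what the listing shows.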
Attached files:
Console and messages logs of MDS nodes lola-[9,10]
Debug kernel logs of MDS nodes lola-[9,10] and of a single Lustre client node, lola-19
Attachments
Issue Links
- duplicates LU-8250 MDT recovery stalled on secondary node (Resolved)
Activity
Fix Version/s | Original: Lustre 2.9.0 [ 11891 ] | |
Resolution | New: Duplicate [ 3 ] | |
Status | Original: In Progress [ 3 ] | New: Resolved [ 5 ] |
Assignee | Original: Lai Siyao [ laisiyao ] | New: nasf [ yong.fan ] |
Priority | Original: Critical [ 2 ] | New: Major [ 3 ] |
Status | Original: Open [ 1 ] | New: In Progress [ 3 ] |
Fix Version/s | New: Lustre 2.9.0 [ 11891 ] |
Assignee | Original: WC Triage [ wc-triage ] | New: Lai Siyao [ laisiyao ] |
Attachment | New: messages-lola-9.log.bz2 [ 22288 ] | |
Attachment | New: messages-lola-10.log.bz2 [ 22289 ] | |
Attachment | New: console-lola-9.log.bz2 [ 22290 ] | |
Attachment | New: console-lola-10.log.bz2 [ 22291 ] | |
Attachment | New: lustre-log-lola-9-20160720_0136.bz2 [ 22292 ] | |
Attachment | New: lustre-log-lola-10-20160720_0136.bz2 [ 22293 ] | |
Attachment | New: lustre-log-lola-19-20160720_0136.bz2 [ 22294 ] | |
Attachment | New: lustre-log-lola-19-20160720_0225.bz2 [ 22295 ] | |
Attachment | New: lustre-log-lola-10-20160720_0225.bz2 [ 22296 ] | |
Attachment | New: lustre-log-lola-9-20160720_0225.bz2 [ 22297 ] |
Description | Original: same text as the Description section, without the sentence "DNE is enabled." | New: same text as the Description section |