[LU-7938] Recovery on secondary OSS node stalled Created: 29/Mar/16  Updated: 27/Jul/16  Resolved: 27/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: nasf (Inactive)
Resolution: Not a Bug Votes: 0
Labels: soak
Environment:

lola
build: 2.8 GA + patches


Attachments: File console-lola-5.log-20160722.bz2     File console-lola-5.log.bz2     File lola-4-console-20160415.bz2     Text File lola-4-lustre-log-20160415-0420.bz2     File lola-4-lustre-log-20160415-0430.bz2     File lola-4-messages-20160415.bz2     Text File lustre-log-20160329-0759-recovery-stalled.bz2     File lustre-log-20160329-0803-recovery-aborted.bz2     Text File lustre-log-lola-3-2016-07-25_0855-ost-recovery-stalled.bz2     Text File lustre-log-lola-3-2016-07-25_0859-after-ost-recovery-aborted.bz2     Text File lustre-log-lola-30-2016-07-25_0855-ost-recovery-stalled.bz2     Text File lustre-log-lola-30-2016-07-25_0859-after-ost-recovery-aborted.bz2     Text File lustre-log-lola-5-20160722_0434_ost_recovery_stalled.bz2     File lustre-log-lola-5-20160722_0438_after_ost_recovery_aborted.bz2     Text File lustre-log-lola-8-2016-07-25_0855-ost-recovery-stalled.bz2     Text File lustre-log-lola-8-2016-07-25_0859-after-ost-recovery-aborted.bz2     File messages-lola-5.log-20160722.bz2     File messages-lola-5.log.bz2     HTML File ost-failover-setttings-20160726_0702    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happens during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. MDS and OSS nodes are configured in an HA active-active failover configuration. MDS nodes operate with 1 MDT per MDS, while OSSes run 4 OSTs per node.
Nodes lola-4 and lola-5 form a HA cluster.

Event history

  • 2016-03-29 07:48:04,307:fsmgmt.fsmgmt:INFO triggering fault oss_failover of node lola-4
    power-cycle node
  • 2016-03-29 07:52:27,876:fsmgmt.fsmgmt:INFO lola-4 is up
  • 2016-03-29 07:53:09,557:fsmgmt.fsmgmt:INFO zpool import and mount of lola-4's OSTs complete
  • Recovery doesn't complete even after recovery_time has reached zero
  • 2016-03-29 08:03 (approximately) Aborted recovery manually (lctl --device ... abort_recovery; see the command sketch after this list)
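
For reference, a minimal sketch of the manual abort, assuming soaked-OST0003 as an example device name (the actual device names used on lola-4 are not recorded in this entry):

# list the local Lustre devices to find the recovering targets
lctl dl
# abort recovery on one target, addressed by device name (or number)
lctl --device soaked-OST0003 abort_recovery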

Attached files:
messages, console and debug logs taken before the recovery was aborted (lustre-log-20160329-0759-recovery-stalled) and after it was aborted (lustre-log-20160329-0803-recovery-aborted)



 Comments   
Comment by Oleg Drokin [ 29/Mar/16 ]

lustre-log-20160329-0759-recovery-stalled is referenced, but not really provided?

Comment by Frank Heckes (Inactive) [ 30/Mar/16 ]

Sorry, it took many attempts to upload the file.

Comment by Frank Heckes (Inactive) [ 30/Mar/16 ]

The error can be reproduced easily. Please let me know if debug logs with a mask different from the default are needed.
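
For example, a wider-than-default mask could be enabled and dumped roughly as follows (a sketch; which flags are actually wanted would be up to whoever requests the logs):

# on the recovering OSS: widen the debug mask (example flags) and enlarge the buffer
lctl set_param debug="+rpctrace +dlmtrace +ha"
lctl set_param debug_mb=512
# ... reproduce the stalled recovery ...
# dump the kernel debug buffer to a file for upload
lctl dk /tmp/lustre-log-$(hostname)-$(date +%Y%m%d-%H%M).txt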

Comment by Frank Heckes (Inactive) [ 15/Apr/16 ]

The error also occurred in the soak test for build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). The test set-up is the same as the one described above, except
that the Lustre FS has been extended by 2 OSS nodes hosting 4 OSTs each.

The recovery is stalled, but can be manually aborted, too.
Sequence of events:

  • 2016-04-15 03:57:14,905:fsmgmt.fsmgmt:INFO triggering fault oss_failover (lola-5 OSTs ----> lola-4)
  • 2016-04-15 04:02:13,325:fsmgmt.fsmgmt:INFO Wait for recovery to complete (--> all OSTs mounted on secondary node lola-4)
  • 2016-04-15 04:20 created debug log
  • 2016-04-15 04:28:29 Recovery aborted for OSTs {3, 9, 15, 21} (decimal indices)
  • 2016-04-15 04:30 created debug log

Attached files:
console and messages files of lola-4, lola-4-lustre-log-20160415-0420 (before abort), lola-4-lustre-log-20160415-0430 (after abort)
Comment by Cliff White (Inactive) [ 06/Jun/16 ]

The error is occurring again on the tip of master plus the two patches.

# cat /proc/fs/lustre/mdt/soaked-MDT0002/recovery_status 
status: RECOVERING
recovery_start: 1465232959
time_remaining: 0
connected_clients: 20/20
req_replay_clients: 0
lock_repay_clients: 1
completed_clients: 19
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 442381632838
Comment by Cliff White (Inactive) [ 06/Jun/16 ]

Log errors

Lustre: Skipped 299 previous similar messages
Lustre: soaked-MDT0002: Recovery already passed deadline 137:09, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Comment by Cliff White (Inactive) [ 06/Jun/16 ]

It does not appear possible to abort_recovery; the attempt results in timeouts (see the stack trace and the command sketch below).

INFO: task lctl:41183 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000019     0 41183  41083 0x00000080
 ffff8806bdda7b68 0000000000000086 0000000000000000 ffff88082fe86ab0
 ffff88082fe86ae8 ffff880439371150 0000ece58069441d 0000000000000000
 ffff8806bdd8eae8 000000010f83d3bd ffff8806bdd8f068 ffff8806bdda7fd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0992e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa112ac02>] mdt_iocontrol+0x6b2/0x920 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa0783edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
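
For reference, the abort was presumably attempted with something along these lines (a sketch, using the MDT device name from the recovery_status output above):

# this ioctl is what ends up blocking in target_stop_recovery_thread() in the trace above
lctl --device soaked-MDT0002 abort_recovery
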
Comment by Frank Heckes (Inactive) [ 22/Jul/16 ]

The comments from Cliff above seem to be related to LU-8089.

The error for OSTs reported in the description above also occurred for build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713)
and is reproducible for every failover initiated.
The test environment configuration is:

  • 4 MDSes with 1 MDT each, formatted with ldiskfs, in active-active failover configuration
  • 6 OSSes with 4 OSTs each, formatted with zfs, in active-active failover configuration
  • 16 clients

The stalled recovery process can be interrupted (aborted) at any time.

I attached new messages, console and debug log files (taken during recovery and after the recovery abort) of the secondary node lola-5.
Debug files of all other nodes, taken at the same time, are available and can be uploaded on demand.

Sequence of events:

  • 2016-07-22 04:21:54,665:fsmgmt.fsmgmt:INFO triggering fault oss_failover lola-4
  • 2016-07-22 04:26:37,971:fsmgmt.fsmgmt:INFO lola-4 is up
  • 2016-07-22 04:27:01,278:fsmgmt.fsmgmt:INFO ... soaked-OST0008 mounted successfully on lola-5 (NOTE: decimal id for OST; see the index conversion sketch after this list)
  • 2016-07-22 04:27:12,348:fsmgmt.fsmgmt:INFO ... soaked-OST0020 mounted successfully on lola-5
  • 2016-07-22 04:27:23,515:fsmgmt.fsmgmt:INFO ... soaked-OST0002 mounted successfully on lola-5
  • 2016-07-22 04:27:36,047:fsmgmt.fsmgmt:INFO ... soaked-OST0014 mounted successfully on lola-5
  • 2016-07-22 04:27:36 Recovery started
  • 2016-07-22 04:37 Recovery aborted
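
The soak harness logs OST indices in decimal while the Lustre target names use hexadecimal; for example, decimal index 20 above corresponds to target soaked-OST0014 in the recovery status below. A one-line conversion (a sketch):

# decimal index 20 corresponds to target name soaked-OST0014
printf 'soaked-OST%04x\n' 20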

The recovery status on OSS node lola-5 and the client status were:

[root@lola-5 ~]# lctl get_param *.*.recovery_status
obdfilter.soaked-OST0002.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719725397
obdfilter.soaked-OST0003.recovery_status=
status: COMPLETE 
recovery_start: 1468949759
recovery_duration: 30
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368384861
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0008.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719723367
obdfilter.soaked-OST0009.recovery_status=
status: COMPLETE
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368183131
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST000e.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719751551
obdfilter.soaked-OST000f.recovery_status=
status: COMPLETE 
recovery_start: 1468949763
recovery_duration: 26
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368269722
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0014.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719756594
obdfilter.soaked-OST0015.recovery_status=
status: COMPLETE 
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368229218
VBR: DISABLED
IR: ENABLED

In a previous session the recovery had stalled overnight for several hours without completing. That's why I aborted the recovery after ten minutes this time; I wasn't sure the debug logs would contain relevant information, but I was confident that the recovery would never finish.

Comment by nasf (Inactive) [ 24/Jul/16 ]

According to the recovery status, no clients connected to the newly mounted OSTs on lola-5, and there are no client connection records in the lola-5 debug logs (I am sure this is NOT because the logs were overwritten). So it seems the clients were not aware of the OST load switching from lola-4 to lola-5; with no connections to lola-5, there is no recovery. We need the full debug logs from the MGS and from a client (one active client is enough). Thanks!
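
A minimal sketch of how such full debug logs could be collected on the MGS and on one active client (the flag and buffer-size choices here are assumptions, not soak's actual procedure):

# on the MGS and on one client: enable all debug flags and enlarge the buffer
lctl set_param debug=-1
lctl set_param debug_mb=1024
lctl clear
# ... trigger the oss_failover and wait for the recovery to stall ...
lctl dk /tmp/lustre-log-$(hostname)-full.txt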

Comment by Frank Heckes (Inactive) [ 25/Jul/16 ]

Resources of lola-2 (OST0012 OST0000 OST0006 OST000c) failed over to lola-3

  • 2016-07-25 08:49:12,572:fsmgmt.fsmgmt:INFO Wait for recovery to complete
  • 2016-07-25 08:55 created debug logs
  • 2016-07-25 08:57:51 aborted recovery
  • 2016-07-25 08:59 created debug logs

Upload of debug logs has been started, but it might take some time.

Comment by Frank Heckes (Inactive) [ 26/Jul/16 ]

Debug files have been uploaded. Roles:
lola-8: MGS
lola-3: secondary OSS
lola-30: Lustre client

Comment by nasf (Inactive) [ 26/Jul/16 ]

Thanks, Frank, for the detailed logs. According to the client-side log, it always tried to connect to lola-2 for the failed-over OSTs (OST0012 OST0000 OST0006 OST000c):

00000100:00000001:30.0:1469462043.991030:0:125723:0:(import.c:508:import_select_connection()) Process entered
00000100:00080000:30.0:1469462043.991033:0:125723:0:(import.c:523:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: connect to NID 192.168.1.102@o2ib10 last attempt 5189761512
00000100:00080000:30.0:1469462043.991036:0:125723:0:(import.c:567:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: tried all connections, increasing latency to 50s
00000100:00080000:30.0:1469462043.991099:0:125723:0:(import.c:601:import_select_connection()) soaked-OST0000-osc-ffff8810717d2000: import ffff880ff6706000 using connection 192.168.1.102@o2ib10/192.168.1.102@o2ib10
00000100:00000001:30.0:1469462043.991104:0:125723:0:(import.c:605:import_select_connection()) Process leaving (rc=0 : 0 : 0)
...

That means there is only one candidate that can be selected to connect to OST0000, but according to your description there should be another candidate, 192.168.1.103@o2ib, for the connection. So I am wondering: how did you configure lola-2 and lola-3 as a failover pair?
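
One way to check which failover NIDs the client actually knows about, and what is stored on the ZFS target, is sketched below (the OSC instance and dataset names are examples):

# on a client: the import file lists the candidate connections (failover_nids) for the OSC
lctl get_param osc.soaked-OST0000-osc-*.import
# on the OSS: show the failover NIDs recorded in the ZFS dataset user properties
zfs get lustre:failover.node soaked-ost0/ost0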

Comment by Frank Heckes (Inactive) [ 26/Jul/16 ]

Indeed it's a configuration error. It looks like the NID of the secondary node (192.168.1.103@o2ib) wasn't written during FS format.
I have set it explicitly again using zfs set lustre:failover.node (see the sketch below), and will check whether this resolves the issue.
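
For reference, a sketch of the property change on one OST dataset, using the pool/dataset and NID values that appear in the reformat output below (presumably the new value is only picked up at the next mount of the target):

# on the OSS that currently owns the pool
zfs set lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10 soaked-ost18/ost18
# verify the stored value
zfs get lustre:failover.node soaked-ost18/ost18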

Comment by Frank Heckes (Inactive) [ 26/Jul/16 ]

I set the parameter 'lustre:failover.node' on all OSTs to the appropriate values for the failover nodes (see attached file ost-failover-setttings-20160726_0702).
This didn't solve the problem; recovery still stalls on the failover node.

I also reformatted the soak FS. The parameter settings are now populated (e.g.) with the following content:

192.168.1.102: mkfs_cmd = zfs create -o canmount=off -o xattr=sa soaked-ost18/ost18
192.168.1.102: Writing soaked-ost18/ost18 properties
192.168.1.102:   lustre:version=1
192.168.1.102:   lustre:flags=4194
192.168.1.102:   lustre:index=18
192.168.1.102:   lustre:fsname=soaked
192.168.1.102:   lustre:svname=soaked:OST0012
192.168.1.102:   lustre:mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10
192.168.1.102:   lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10

Additionally I used the option '--force-nohostid' for the mkfs.lustre command. I think the patch enabling hostid support hasn't landed yet, and spl fails during reformat if /etc/hostid isn't available.

Comment by Frank Heckes (Inactive) [ 26/Jul/16 ]

Reformat is ongoing. I'm going to start soak again once it has completed.

Comment by Frank Heckes (Inactive) [ 26/Jul/16 ]

failover.node is specified for the mkfs.lustre command via the option --servicenode (see the sketch below).
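
A sketch of what the reformat command for one OST might look like with --servicenode (plus the --force-nohostid workaround mentioned above); fsname, index, pool name and NIDs follow the soaked-ost18 example from the earlier comment, the zpool is assumed to already exist, and the exact invocation used by soak is an assumption:

mkfs.lustre --ost --backfstype=zfs --fsname=soaked --index=18 \
    --mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10 \
    --servicenode=192.168.1.102@o2ib10 \
    --servicenode=192.168.1.103@o2ib10 \
    --force-nohostid \
    soaked-ost18/ost18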

Comment by Frank Heckes (Inactive) [ 27/Jul/16 ]

Failover works after the reformat of the Lustre FS. The ticket can be closed.
