[LU-7938] Recovery on secondary OSS node stalled Created: 29/Mar/16 Updated: 27/Jul/16 Resolved: 27/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | soak |
| Environment: |
lola |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Description |
|
Error happens during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. MDS and OSS nodes are configured in an HA active-active failover configuration. MDS nodes operate with 1 MDT per MDS, while each OSS runs 4 OSTs. Event history
Attached files: |
| Comments |
| Comment by Oleg Drokin [ 29/Mar/16 ] |
|
lustre-log-20160329-0759-recovery-stalled is referenced, but not really provided? |
| Comment by Frank Heckes (Inactive) [ 30/Mar/16 ] |
|
Sorry, it took many attempts to upload the file. |
| Comment by Frank Heckes (Inactive) [ 30/Mar/16 ] |
|
The error can be reproduced easily. Please let me know if debug logs with a mask different from the default are needed. |
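For reference, a wider debug mask can be enabled and the kernel debug buffer dumped with lctl around a reproduction; the mask, buffer size, and output path below are only examples, not the settings used on lola:
lctl set_param debug=-1                       # enable all debug flags (the default mask is much narrower)
lctl set_param debug_mb=512                   # enlarge the kernel debug buffer (MB)
# ... reproduce the stalled recovery ...
lctl dk /tmp/lustre-debug.$(hostname).log     # dump and clear the debug buffer |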
| Comment by Frank Heckes (Inactive) [ 15/Apr/16 ] |
|
The error also occurred in the soak test for build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). The test set-up is the same as the one described above. The recovery is stalled, but can be manually aborted, too.
|
| Comment by Cliff White (Inactive) [ 06/Jun/16 ] |
|
The error is occurring again on the tip of master plus the two patches.
# cat /proc/fs/lustre/mdt/soaked-MDT0002/recovery_status
status: RECOVERING
recovery_start: 1465232959
time_remaining: 0
connected_clients: 20/20
req_replay_clients: 0
lock_repay_clients: 1
completed_clients: 19
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 442381632838 |
| Comment by Cliff White (Inactive) [ 06/Jun/16 ] |
|
Log errors:
Lustre: Skipped 299 previous similar messages
Lustre: soaked-MDT0002: Recovery already passed deadline 137:09, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery. |
| Comment by Cliff White (Inactive) [ 06/Jun/16 ] |
|
It does not appear possible to abort_recovery; the attempt results in timeouts.
INFO: task lctl:41183 blocked for more than 120 seconds.
Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 0000000000000019 0 41183 41083 0x00000080
ffff8806bdda7b68 0000000000000086 0000000000000000 ffff88082fe86ab0
ffff88082fe86ae8 ffff880439371150 0000ece58069441d 0000000000000000
ffff8806bdd8eae8 000000010f83d3bd ffff8806bdd8f068 ffff8806bdda7fd8
Call Trace:
[<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
[<ffffffff8153a693>] wait_for_common+0x123/0x180
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
[<ffffffffa0992e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
[<ffffffffa112ac02>] mdt_iocontrol+0x6b2/0x920 [mdt]
[<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
[<ffffffffa0783edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
[<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
[<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
[<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
[<ffffffff811a03f6>] ? final_putname+0x26/0x50
[<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
[<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b |
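For reference, aborting recovery on a target is normally done through lctl; the device name below matches the MDT from the status output above:
lctl --device soaked-MDT0002 abort_recovery   # here the ioctl hangs in target_stop_recovery_thread(), per the trace above |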
| Comment by Frank Heckes (Inactive) [ 22/Jul/16 ] |
|
The comments from Cliff above seem to be related to a separate issue. The error for the OST reported in the description above also occurred for build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713). I attached new message, console, and debug log files (taken during recovery and after the recovery abort) of secondary node lola-5. Sequence of events:
Status of recovery on OSS node lola-5 and client status were:
[root@lola-5 ~]# lctl get_param *.*.recovery_status
obdfilter.soaked-OST0002.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719725397
obdfilter.soaked-OST0003.recovery_status=
status: COMPLETE
recovery_start: 1468949759
recovery_duration: 30
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368384861
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0008.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719723367
obdfilter.soaked-OST0009.recovery_status=
status: COMPLETE
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368183131
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST000e.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719751551
obdfilter.soaked-OST000f.recovery_status=
status: COMPLETE
recovery_start: 1468949763
recovery_duration: 26
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368269722
VBR: DISABLED
IR: ENABLED
obdfilter.soaked-OST0014.recovery_status=
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/23
req_replay_clients: 0
lock_repay_clients: 0
completed_clients: 0
evicted_clients: 0
replayed_requests: 0
queued_requests: 0
next_transno: 68719756594
obdfilter.soaked-OST0015.recovery_status=
status: COMPLETE
recovery_start: 1468949751
recovery_duration: 38
completed_clients: 23/23
replayed_requests: 0
last_transno: 34368229218
VBR: DISABLED
IR: ENABLED
The recovery had stalled in a previous session that ran overnight for several hours without completing. That's why I aborted the recovery after ten minutes; I wasn't sure the debug logs would contain relevant information, but I was confident that the recovery would never end. |
| Comment by nasf (Inactive) [ 24/Jul/16 ] |
|
According to the recovery status, no clients connected to the newly mounted OSTs on lola-5, and there are no client connection records in the lola-5 debug logs (I am sure this is NOT because the logs were overwritten). So it seems that the clients were not aware of the OST load switching from lola-4 to lola-5; with no connection to lola-5, there is no recovery. We need the full debug logs on the MGS and on a client (one active client is enough). Thanks! |
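For reference, the client's view of a stalled OST connection (current connection and known failover NIDs) can be checked through the import parameter; the target name below is one of the stalled OSTs from the status output above:
lctl get_param osc.soaked-OST0002-osc-*.import   # check the current_connection: and failover_nids: fields |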
| Comment by Frank Heckes (Inactive) [ 25/Jul/16 ] |
|
Resources of lola-2 (OST0012 OST0000 OST0006 OST000c) failed over to lola-3
Upload of debug logs has been started, but it might take some time. |
| Comment by Frank Heckes (Inactive) [ 26/Jul/16 ] |
|
Debug files have been uploaded. Roles: |
| Comment by nasf (Inactive) [ 26/Jul/16 ] |
|
Thanks, Frank, for the detailed logs. According to the client-side log, the client always tried to connect to lola2 for the failed OSTs (OST0012, OST0000, OST0006, OST000c):
That means only one candidate can be selected to connect to OST0000, but according to your description there should be another candidate, 192.168.1.103@o2ib, for the connection. So I am wondering: how did you configure lola2 and lola3 as a failover pair? |
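For reference, the failover NIDs recorded at format time can be read back from the ZFS dataset properties of an affected OST; the pool/dataset name below is an assumption that follows the naming shown later in this ticket:
zfs get lustre:svname,lustre:failover.node soaked-ost0/ost0   # should list both the primary and the secondary NID |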
| Comment by Frank Heckes (Inactive) [ 26/Jul/16 ] |
|
Indeed, it's a configuration error. It looks like the NID of the secondary node (192.168.1.103@o2ib) wasn't written during the FS format. |
| Comment by Frank Heckes (Inactive) [ 26/Jul/16 ] |
|
I set the parameter 'lustre:failover.node' on all OSTs to the appropriate values for the failover nodes (see attached file ost-failover-setttings-20160726_0702). I also reformatted the soak FS. The parameter setting is filled (e.g.) with the following content:
192.168.1.102: mkfs_cmd = zfs create -o canmount=off -o xattr=sa soaked-ost18/ost18
192.168.1.102: Writing soaked-ost18/ost18 properties
192.168.1.102: lustre:version=1
192.168.1.102: lustre:flags=4194
192.168.1.102: lustre:index=18
192.168.1.102: lustre:fsname=soaked
192.168.1.102: lustre:svname=soaked:OST0012
192.168.1.102: lustre:mgsnode=192.168.1.108@o2ib10:192.168.1.109@o2ib10
192.168.1.102: lustre:failover.node=192.168.1.102@o2ib10:192.168.1.103@o2ib10
Additionally, I used the option '--force-nohostid' for the mkfs.lustre command. I think the patch enabling hostid hasn't landed yet, and spl fails during reformat if /etc/hostid isn't available. |
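A sketch of a mkfs.lustre invocation that would produce the properties above; the exact command line used on lola is not shown in this ticket, so treat the options below as an assumption (index 18 corresponds to OST0012, and passing --servicenode for both NIDs lets either server mount the target):
mkfs.lustre --ost --backfstype=zfs --fsname=soaked --index=18 \
    --mgsnode=192.168.1.108@o2ib10 --mgsnode=192.168.1.109@o2ib10 \
    --servicenode=192.168.1.102@o2ib10 --servicenode=192.168.1.103@o2ib10 \
    --force-nohostid \
    soaked-ost18/ost18 |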
| Comment by Frank Heckes (Inactive) [ 26/Jul/16 ] |
|
Reformat is ongoing. I'm going to start soak again once it has completed. |
| Comment by Frank Heckes (Inactive) [ 26/Jul/16 ] |
|
failover.node is specified for the mkfs.lustre command via the option '--servicenode'. |
| Comment by Frank Heckes (Inactive) [ 27/Jul/16 ] |
|
Failover works after reformat of Lustre FS. Ticket can be closed. |