[LU-2104] conf-sanity test 47 never completes, negative time to recovery Created: 07/Oct/12 Updated: 18/Nov/16 Resolved: 19/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, sequoia |
| Issue Links: |
|
| Sub-Tasks: |
|
| Severity: | 4 |
| Rank (Obsolete): | 4392 |
| Description |
|
Running conf-sanity test 47 I hit this:
Oct 7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-sanity test 47: server restart does not make client loss lru_resize settings == 20:52:02 (1349657522)
Oct 7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts:
Oct 7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing
Oct 7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000
Oct 7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete
Oct 7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts:
Oct 7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log
Oct 7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized
Oct 7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
Oct 7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages
Oct 7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Oct 7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296
Oct 7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Oct 7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Oct 7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34
Oct 7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT0000
Oct 7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MDT0000 complete
Oct 7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
Oct 7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts:
Oct 7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import
Oct 7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log
Oct 7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Oct 7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages
Oct 7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading
Oct 7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Oct 7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53
...
Oct 7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
Oct 7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message
Oct 7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct 7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27
Oct 7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct 7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
Oct 7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages
Oct 7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct 7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
Oct 7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages
Oct 7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52
...
Oct 7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
Oct 7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov D 0000000000000002 6256 1991 2 0x00000080
Oct 7 20:58:47 centos6-0 kernel: [ 3960.466011] ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
Oct 7 20:58:47 centos6-0 kernel: [ 3960.466835] 000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
Oct 7 20:58:47 centos6-0 kernel: [ 3960.467658] ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
Oct 7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace:
Oct 7 20:58:47 centos6-0 kernel: [ 3960.468913] [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.469451] [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.470270] [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.470765] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
Oct 7 20:58:47 centos6-0 kernel: [ 3960.471287] [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.472065] [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
Oct 7 20:58:47 centos6-0 kernel: [ 3960.472594] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.473336] [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct 7 20:58:47 centos6-0 kernel: [ 3960.473550] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.473889] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 20:58:47 centos6-0 kernel: [ 3960.474239] [<ffffffff8100c140>] ? child_rip+0x0/0x20
Oct 7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
Oct 7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov D 0000000000000002 6256 1991 2 0x00000080
Oct 7 21:00:47 centos6-0 kernel: [ 4080.473577] ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
Oct 7 21:00:47 centos6-0 kernel: [ 4080.473931] 000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
Oct 7 21:00:47 centos6-0 kernel: [ 4080.474288] ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
Oct 7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace:
Oct 7 21:00:47 centos6-0 kernel: [ 4080.474820] [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.475049] [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.475403] [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.475620] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
Oct 7 21:00:47 centos6-0 kernel: [ 4080.475845] [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.476186] [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
Oct 7 21:00:47 centos6-0 kernel: [ 4080.476399] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.476759] [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct 7 21:00:47 centos6-0 kernel: [ 4080.477017] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.477446] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct 7 21:00:47 centos6-0 kernel: [ 4080.477801] [<ffffffff8100c140>] ? child_rip+0x0/0x20
Oct 7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12
...
Oct 7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27
So the recovery timer expires multiple times and happily counts into negative territory. |
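For context, the "-1:-52" style countdown is easy to reproduce outside Lustre. The following stand-alone C sketch is purely illustrative (it is not Lustre code; print_remaining is a made-up name): if a signed "seconds remaining" value is printed as minutes:seconds without being clamped at zero, an expired-but-unfinished recovery naturally produces output like the log above.

#include <stdio.h>
#include <time.h>

/* Hypothetical illustration (not Lustre code): print a "to recover in M:SS"
 * countdown the way the log message above does.  Nothing clamps 'left' at
 * zero, so once the deadline has passed the same format keeps printing
 * ever more negative values. */
static void print_remaining(time_t deadline, time_t now)
{
        long left = (long)(deadline - now);     /* may go negative */
        printf("to recover in %ld:%.02ld\n", left / 60, left % 60);
}

int main(void)
{
        time_t now = time(NULL);
        print_remaining(now + 53, now);         /* "to recover in 0:53"   */
        print_remaining(now - 112, now);        /* "to recover in -1:-52" */
        return 0;
}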
| Comments |
| Comment by Oleg Drokin [ 27/Oct/12 ] |
|
Just had this happen again, this time on replay-single test 60:
[706146.971474] Lustre: lustre-OST0000: Received new MDS connection from 0@lo, removing former export from same NID
[706146.976893] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:30
[706146.983784] Lustre: Skipped 12 previous similar messages
[706152.169633] Lustre: lustre-MDT0000-osp-OST0001: Connection restored to lustre-MDT0000 (at 0@lo)
[706152.170833] Lustre: Skipped 49 previous similar messages
[706157.162094] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:20
[706157.164244] Lustre: Skipped 1 previous similar message
[706157.187058] Lustre: 14127:0:(ost_handler.c:1635:ost_filter_recovery_request()) @@@ not permitted during recovery req@ffff880061b4bbf0 x1416960657112833/t0(0) o13->eed49115-0cb1-7092-2b55-9e6bd5e44f7f@0@lo:0/0 lens 224/0 e 0 to 0 dl 1351322349 ref 1 fl Interpret:/0/ffffffff rc 0/-1
[706177.161796] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:00
[706177.165299] Lustre: Skipped 3 previous similar messages
[706177.296185] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706177.298714] Lustre: lustre-OST0000: disconnecting 1 stale clients
[706177.300153] LustreError: 8290:0:(ofd_grant.c:158:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[706180.190726] Lustre: DEBUG MARKER: == replay-single test 60: test llog post recovery init vs llog unlink == 03:19:25 (1351322365)
[706207.304114] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706212.156405] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:25
[706212.158494] Lustre: Skipped 12 previous similar messages
[706237.304163] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706267.305532] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706277.168303] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:20
[706277.173843] Lustre: Skipped 25 previous similar messages
[706297.304216] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706407.166143] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in -1:-49
[706407.168414] Lustre: Skipped 49 previous similar messages
...
[774147.173943] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in -1130:-49 |
| Comment by Prakash Surya (Inactive) [ 07/Nov/12 ] |
|
Hit this again on a Grove OSS. |
| Comment by Prakash Surya (Inactive) [ 07/Nov/12 ] |
|
Debugging a little further. Stacks on the console look the same as what's in the description:
2012-11-06 15:01:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2012-11-06 15:01:03 tgt_recov D 0000000000000008 0 6569 2 0x00000000
2012-11-06 15:01:03 ffff8807e4a29e10 0000000000000046 0000000000000000 00000001001d5070
2012-11-06 15:01:03 ffff881029ce6c00 0000000000000000 ffff88100d2de148 ffff881029ce6c4c
2012-11-06 15:01:03 ffff8807e48e4638 ffff8807e4a29fd8 000000000000f4e8 ffff8807e48e4638
2012-11-06 15:01:03 Call Trace:
2012-11-06 15:01:03 [<ffffffffa08a1330>] ? check_for_clients+0x0/0x90 [ptlrpc]
2012-11-06 15:01:03 [<ffffffffa08a2d25>] target_recovery_overseer+0x95/0x250 [ptlrpc]
2012-11-06 15:01:03 [<ffffffffa08a1130>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
2012-11-06 15:01:03 [<ffffffff81091090>] ? autoremove_wake_function+0x0/0x40
2012-11-06 15:01:03 [<ffffffffa08a9f8e>] target_recovery_thread+0x58e/0x19d0 [ptlrpc]
2012-11-06 15:01:03 [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-06 15:01:03 [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03 [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Crash shows the thread still stuck many hours later:
PID: 6569 TASK: ffff8807e48e4080 CPU: 8 COMMAND: "tgt_recov"
#0 [ffff8807e4a29d50] schedule at ffffffff814ef152
#1 [ffff8807e4a29e18] target_recovery_overseer at ffffffffa08a2d25 [ptlrpc]
#2 [ffff8807e4a29ea8] target_recovery_thread at ffffffffa08a9f8e [ptlrpc]
#3 [ffff8807e4a29f48] kernel_thread at ffffffff8100c14a
The line it's stuck on is:
(gdb) l *target_recovery_overseer+0x95
0xed55 is in target_recovery_overseer (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c:1808).
1803 /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: No such file or directory.
in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c
1803 static int target_recovery_overseer(struct obd_device *obd,
1804 int (*check_routine)(struct obd_device *),
1805 int (*health_check)(struct obd_export *))
1806 {
1807 repeat:
1808 cfs_wait_event(obd->obd_next_transno_waitq, check_routine(obd));
1809 if (obd->obd_abort_recovery) {
1810 CDEBUG(D_HA, "recovery aborted, evicting stale exports\n");
1811 /** evict exports which didn't finish recovery yet */
1812 class_disconnect_stale_exports(obd, exp_finished);
1813 return 1;
And it's getting there from here:
(gdb) list *target_recovery_thread+0x58e
0x15fbe is in target_recovery_thread (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c:2026).
2021 in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c
2022 cfs_spin_unlock(&obd->obd_dev_lock);
2023 cfs_complete(&trd->trd_starting);
2024
2025 /* first of all, we have to know the first transno to replay */
2026 if (target_recovery_overseer(obd, check_for_clients,
2027 exp_connect_healthy)) {
2028 abort_req_replay_queue(obd);
2029 abort_lock_replay_queue(obd);
2030 }
So, for whatever reason, it looks like this thread is never being woken up. Either that, or check_routine (i.e. check_for_clients here) never returns true. |
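For reference, the check_routine in this call path is check_for_clients(); the sketch below is a paraphrase of what that era's ldlm_lib.c roughly checks (not a verbatim copy), and it shows why an export that gets evicted without ever being counted as stale leaves the waiter asleep indefinitely:

/* Paraphrased sketch of the wait condition, not verbatim code: the recovery
 * thread can only leave cfs_wait_event() once every client known before the
 * restart has either reconnected or been counted as stale.  If an evicted
 * export never bumps obd_stale_clients, this never becomes true and
 * tgt_recov sleeps forever, which matches the stacks above. */
static int check_for_clients(struct obd_device *obd)
{
        unsigned int connected = cfs_atomic_read(&obd->obd_connected_clients);

        if (obd->obd_abort_recovery || obd->obd_no_conn)
                return 1;
        return connected + obd->obd_stale_clients ==
               obd->obd_max_recoverable_clients;
}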
| Comment by Mikhail Pershin [ 20/Nov/12 ] |
|
The first issue is that the number of unseen exports is not decreased after the eviction of stale exports, which prevents recovery from moving forward. I believe the negative-time problem is caused by that. We have the following code in extend_recovery_timer():

if (to > obd->obd_recovery_time_hard)
        to = obd->obd_recovery_time_hard;
if (obd->obd_recovery_timeout < to) {
        obd->obd_recovery_timeout = to;
        cfs_timer_arm(&obd->obd_recovery_timer, cfs_time_shift(drt));
}

Each call to extend_recovery_timer() increases obd_recovery_timeout. Therefore, at some point both 'to' and obd_recovery_timeout reach obd->obd_recovery_time_hard, and the condition to arm the timer is no longer true because of the strict '<'. When that happens the timer stops working and recovery gets stuck for a while. Now I am trying to find out why stale exports are not evicted. |
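One way to picture the timer half of the problem is to keep re-arming the timer as long as the timeout has not exceeded the hard limit, i.e. relax the strict '<'. This is only an illustrative sketch of that condition change, not necessarily what the landed patch does:

/* Illustrative sketch only (the landed patch may differ): clamp the
 * extension at obd_recovery_time_hard, but still re-arm the timer when the
 * timeout has already reached the hard limit, so the countdown keeps firing
 * instead of silently stopping once obd_recovery_timeout == time_hard. */
if (to > obd->obd_recovery_time_hard)
        to = obd->obd_recovery_time_hard;
if (obd->obd_recovery_timeout <= to) {          /* was '<' */
        obd->obd_recovery_timeout = to;
        cfs_timer_arm(&obd->obd_recovery_timer, cfs_time_shift(drt));
}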
| Comment by Mikhail Pershin [ 20/Nov/12 ] |
|
I've made a patch with some recovery-related changes which may help. First of all, it checks exp_failed in class_disconnect_stale_exports() so that already failed/evicted clients are not added to the evict list again. It also fixes the recovery_timeout == hard_timeout case mentioned above. I am not sure that is what has to be done there; probably we should abort recovery when the timeout reaches the HARD limit. |
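The exp_failed check can be pictured as a guard in the export walk inside class_disconnect_stale_exports(). The fragment below is an illustrative sketch under the assumption that the walk and the test_export callback look roughly as shown; it is not the exact patch:

/* Illustrative sketch, not the exact patch: while walking the export list,
 * skip exports that have already been failed/evicted so they are not put on
 * the eviction list (and accounted for) a second time; test_export is the
 * exp_finished-style callback passed in by the caller. */
cfs_list_for_each_entry_safe(exp, n, &obd->obd_exports, exp_obd_chain) {
        if (exp->exp_failed || test_export(exp))
                continue;       /* already evicted, or finished recovery */
        /* ... move exp onto the eviction list as before ... */
}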
| Comment by Mikhail Pershin [ 22/Nov/12 ] |
|
Another change in the patch is related to the class_fail_export() function: if it is called during recovery, it should update the obd_stale_clients counter. The patch is updated. |
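That change can be sketched as a small fragment inside class_fail_export(); this is illustrative only, and the lock shown is an assumption rather than a quote from the patch:

/* Illustrative fragment, not the verbatim patch (the locking here is an
 * assumption): when an export is failed while the target is still
 * recovering, count it as a stale client so the "connected + stale ==
 * known clients" recovery-completion condition can still be reached. */
cfs_spin_lock(&obd->obd_recovery_task_lock);
if (obd->obd_recovering)
        obd->obd_stale_clients++;
cfs_spin_unlock(&obd->obd_recovery_task_lock);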
| Comment by Mikhail Pershin [ 23/Nov/12 ] |
|
Prakash, if this is easy to reproduce on your side, can you check the latest patch? |
| Comment by Prakash Surya (Inactive) [ 26/Nov/12 ] |
|
Mike, sure, I'll apply the updated patch. Do you expect it to fix the issue completely? I ask because we don't have a solid reproducer; basically we've just rebooted the OSTs many times and "eventually" we see the negative time. |
| Comment by Mikhail Pershin [ 26/Nov/12 ] |
|
Yes, I expect the fix to help. The bug report mentions conf-sanity test 47, which I believe can reproduce it. Also, I suppose you've used MDS failover, and that is the key: the problem was an MDS reconnection from a different NID during OST recovery, so you can try to simulate that. |
| Comment by Prakash Surya (Inactive) [ 26/Nov/12 ] |
What do you mean by "MDS failover"? Unless I'm mistaken, we do not use MDS failover. When upgrading or during testing we may reboot the MDS, but it never fails over to a partner. I would not expect the NID to change, since the MDS comes back up on the same node in our configuration. |
| Comment by Christopher Morrone [ 26/Nov/12 ] |
|
We are planning to use failover eventually on this filesystem. This will be the first filesystem at LLNL where we will use MDS failover. So it may be configured with a failover nid already. But I too am skeptical that anyone really did MDS failover on this system. We're using the other MDS node for another purpose at the moment. |
| Comment by Mikhail Pershin [ 27/Nov/12 ] |
|
A reboot is such a case too; it changes the MDS connection as well. The case we need is an MDS restart during OST recovery. |
| Comment by Mikhail Pershin [ 08/Jan/13 ] |
|
patch landed |
| Comment by Prakash Surya (Inactive) [ 08/Jan/13 ] |
|
I'm fine resolving this since the patch landed; we don't really have a reproducer and haven't seen it in the wild since applying the fix. We can reopen if needed. |
| Comment by Mikhail Pershin [ 10/Jan/13 ] |
|
patch was landed |