[LU-2104] conf-sanity test 47 never completes, negative time to recovery Created: 07/Oct/12  Updated: 18/Nov/16  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: MB, sequoia

Issue Links:
Duplicate
is duplicated by LU-2206 OSS stuck in recovery Resolved
Related
Sub-Tasks:
Key      Summary                                    Type            Status  Assignee
LU-2575  Develop a reproducer to test the fix ...   Technical task  Closed  Prakash Surya
Severity: 4
Rank (Obsolete): 4392

 Description   

Running conf-sanity test 47, I hit this:

Oct  7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-sanity test 47: server restart does not make client loss lru_resize settings == 20:52:02 (1349657522)
Oct  7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
Oct  7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing
Oct  7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000
Oct  7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete
Oct  7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
Oct  7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log
Oct  7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized
Oct  7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
Oct  7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages
Oct  7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Oct  7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296
Oct  7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
Oct  7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Oct  7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34
Oct  7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT0000
Oct  7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MDT0000 complete
Oct  7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
Oct  7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
Oct  7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct  7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import
Oct  7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log
Oct  7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Oct  7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages
Oct  7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading
Oct  7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
Oct  7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53
...
Oct  7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
Oct  7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message
Oct  7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct  7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27
Oct  7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct  7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
Oct  7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages
Oct  7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Oct  7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
Oct  7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages
Oct  7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52
...
Oct  7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
Oct  7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
Oct  7 20:58:47 centos6-0 kernel: [ 3960.466011]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
Oct  7 20:58:47 centos6-0 kernel: [ 3960.466835]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
Oct  7 20:58:47 centos6-0 kernel: [ 3960.467658]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
Oct  7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace:
Oct  7 20:58:47 centos6-0 kernel: [ 3960.468913]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.469451]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.470270]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.470765]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
Oct  7 20:58:47 centos6-0 kernel: [ 3960.471287]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.472065]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
Oct  7 20:58:47 centos6-0 kernel: [ 3960.472594]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.473336]  [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct  7 20:58:47 centos6-0 kernel: [ 3960.473550]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.473889]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 20:58:47 centos6-0 kernel: [ 3960.474239]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
Oct  7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
Oct  7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
Oct  7 21:00:47 centos6-0 kernel: [ 4080.473577]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
Oct  7 21:00:47 centos6-0 kernel: [ 4080.473931]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
Oct  7 21:00:47 centos6-0 kernel: [ 4080.474288]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
Oct  7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace:
Oct  7 21:00:47 centos6-0 kernel: [ 4080.474820]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.475049]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.475403]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.475620]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
Oct  7 21:00:47 centos6-0 kernel: [ 4080.475845]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.476186]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
Oct  7 21:00:47 centos6-0 kernel: [ 4080.476399]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.476759]  [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct  7 21:00:47 centos6-0 kernel: [ 4080.477017]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.477446]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
Oct  7 21:00:47 centos6-0 kernel: [ 4080.477801]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
Oct  7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12
...
Oct  7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27

So the recovery timer is expiring multiple times and happily counting into negative territory.
I also find it strange that the hung-task watchdog fires multiple times for the same task in quick succession before the longer pauses.



 Comments   
Comment by Oleg Drokin [ 27/Oct/12 ]

Just had this happen again, this time on replay-single test 60:

[706146.971474] Lustre: lustre-OST0000: Received new MDS connection from 0@lo, removing former export from same NID
[706146.976893] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:30
[706146.983784] Lustre: Skipped 12 previous similar messages
[706152.169633] Lustre: lustre-MDT0000-osp-OST0001: Connection restored to lustre-MDT0000 (at 0@lo)
[706152.170833] Lustre: Skipped 49 previous similar messages
[706157.162094] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:20
[706157.164244] Lustre: Skipped 1 previous similar message
[706157.187058] Lustre: 14127:0:(ost_handler.c:1635:ost_filter_recovery_request()) @@@ not permitted during recovery  req@ffff880061b4bbf0 x1416960657112833/t0(0) o13->eed49115-0cb1-7092-2b55-9e6bd5e44f7f@0@lo:0/0 lens 224/0 e 0 to 0 dl 1351322349 ref 1 fl Interpret:/0/ffffffff rc 0/-1
[706177.161796] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:00
[706177.165299] Lustre: Skipped 3 previous similar messages
[706177.296185] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706177.298714] Lustre: lustre-OST0000: disconnecting 1 stale clients
[706177.300153] LustreError: 8290:0:(ofd_grant.c:158:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
[706180.190726] Lustre: DEBUG MARKER: == replay-single test 60: test llog post recovery init vs llog unlink == 03:19:25 (1351322365)
[706207.304114] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706212.156405] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:25
[706212.158494] Lustre: Skipped 12 previous similar messages
[706237.304163] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706267.305532] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706277.168303] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in 0:20
[706277.173843] Lustre: Skipped 25 previous similar messages
[706297.304216] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[706407.166143] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in -1:-49
[706407.168414] Lustre: Skipped 49 previous similar messages
...
[774147.173943] Lustre: lustre-OST0000: Denying connection for new client eed49115-0cb1-7092-2b55-9e6bd5e44f7f (at 0@lo), waiting for all 2 known clients (0 recovered, 0 in progress, and 2 unseen) to recover in -1130:-49
Comment by Prakash Surya (Inactive) [ 07/Nov/12 ]

Hit this again on a Grove OSS.

Comment by Prakash Surya (Inactive) [ 07/Nov/12 ]

Debugging a little further... The stacks on the console look the same as those in the description:

2012-11-06 15:01:03 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2012-11-06 15:01:03 tgt_recov     D 0000000000000008     0  6569      2 0x00000000
2012-11-06 15:01:03  ffff8807e4a29e10 0000000000000046 0000000000000000 00000001001d5070
2012-11-06 15:01:03  ffff881029ce6c00 0000000000000000 ffff88100d2de148 ffff881029ce6c4c
2012-11-06 15:01:03  ffff8807e48e4638 ffff8807e4a29fd8 000000000000f4e8 ffff8807e48e4638
2012-11-06 15:01:03 Call Trace:
2012-11-06 15:01:03  [<ffffffffa08a1330>] ? check_for_clients+0x0/0x90 [ptlrpc]
2012-11-06 15:01:03  [<ffffffffa08a2d25>] target_recovery_overseer+0x95/0x250 [ptlrpc]
2012-11-06 15:01:03  [<ffffffffa08a1130>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
2012-11-06 15:01:03  [<ffffffff81091090>] ? autoremove_wake_function+0x0/0x40
2012-11-06 15:01:03  [<ffffffffa08a9f8e>] target_recovery_thread+0x58e/0x19d0 [ptlrpc]
2012-11-06 15:01:03  [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-06 15:01:03  [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03  [<ffffffffa08a9a00>] ? target_recovery_thread+0x0/0x19d0 [ptlrpc]
2012-11-06 15:01:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20

Crash shows the thread still stuck many hours later:

PID: 6569   TASK: ffff8807e48e4080  CPU: 8   COMMAND: "tgt_recov"
 #0 [ffff8807e4a29d50] schedule at ffffffff814ef152
 #1 [ffff8807e4a29e18] target_recovery_overseer at ffffffffa08a2d25 [ptlrpc]
 #2 [ffff8807e4a29ea8] target_recovery_thread at ffffffffa08a9f8e [ptlrpc]
 #3 [ffff8807e4a29f48] kernel_thread at ffffffff8100c14a

The line it's stuck on is:

(gdb) l *target_recovery_overseer+0x95
0xed55 is in target_recovery_overseer (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c:1808).
1803    /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c
1803 static int target_recovery_overseer(struct obd_device *obd,                     
1804                                     int (*check_routine)(struct obd_device *),  
1805                                     int (*health_check)(struct obd_export *))   
1806 {                                                                               
1807 repeat:                                                                         
1808         cfs_wait_event(obd->obd_next_transno_waitq, check_routine(obd));        
1809         if (obd->obd_abort_recovery) {                                          
1810                 CDEBUG(D_HA, "recovery aborted, evicting stale exports\n");     
1811                 /** evict exports which didn't finish recovery yet */           
1812                 class_disconnect_stale_exports(obd, exp_finished);              
1813                 return 1;                                                       

And it's getting there from here:

(gdb) list *target_recovery_thread+0x58e
0x15fbe is in target_recovery_thread (/builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c:2026).
2021    in /builddir/build/BUILD/lustre-2.3.54/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c
2022         cfs_spin_unlock(&obd->obd_dev_lock);                                    
2023         cfs_complete(&trd->trd_starting);                                       
2024                                                                                 
2025         /* first of all, we have to know the first transno to replay */         
2026         if (target_recovery_overseer(obd, check_for_clients,                    
2027                                      exp_connect_healthy)) {                    
2028                 abort_req_replay_queue(obd);                                    
2029                 abort_lock_replay_queue(obd);                                   
2030         }

So, for whatever reason, it looks like this thread isn't receiving a signal to wake it up. Either that, or check_routine (i.e. check_for_clients) is never returning true.
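
For reference, the wait/wake idiom behind cfs_wait_event() here is essentially the standard kernel wait-queue pattern. Below is a minimal, hedged sketch using the plain Linux wait-queue API with made-up names (not Lustre's actual code) showing why the thread stays blocked if the condition never becomes true, or if nobody wakes the queue after the condition changes:

/*
 * Hypothetical illustration only: plain Linux wait-queue API, made-up names.
 * The sleeper re-evaluates its condition only when the queue is woken, so a
 * missed wake-up -- or a condition that never becomes true -- leaves the
 * task in D state indefinitely, as in the stacks above.
 */
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(recovery_waitq);
static int clients_ready;               /* stands in for check_for_clients(obd) */

static void recovery_overseer_wait(void)
{
        /* Sleep until clients_ready != 0; re-checked only on wake-ups. */
        wait_event(recovery_waitq, clients_ready);
}

static void client_recovered(void)
{
        clients_ready = 1;              /* update the condition first...    */
        wake_up(&recovery_waitq);       /* ...then wake the overseer thread */
}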

Comment by Mikhail Pershin [ 20/Nov/12 ]

The first issue is that the number of unseen exports is not decreased after the eviction of stale exports, which prevents recovery from going forward. I believe the negative-time problem is caused by that. We have the following code in extend_recovery_timer():

        if (to > obd->obd_recovery_time_hard)
                to = obd->obd_recovery_time_hard;
        if (obd->obd_recovery_timeout < to) {
                obd->obd_recovery_timeout = to;
                cfs_timer_arm(&obd->obd_recovery_timer,
                              cfs_time_shift(drt));
        }

Each time we call extend_recovery_timer() it increases obd_recovery_timeout. Therefore, at some point both 'to' and obd_recovery_timeout become equal to obd->obd_recovery_time_hard, and the condition to arm the timer is no longer true because of the '<'. When that happens the timer stops working and recovery gets stuck for a while. Now I am trying to find out why stale exports are not evicted.
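
To make that failure mode concrete, here is a minimal standalone sketch of the logic quoted above, using a made-up fake_obd struct instead of the real obd_device: once the timeout has been clamped to the hard limit, the strict '<' comparison can never be true again, so the timer is never re-armed while the deadline keeps receding into the past.

#include <stdbool.h>

/* Simplified, hypothetical stand-in for the relevant obd_device fields. */
struct fake_obd {
        int  recovery_timeout;      /* current recovery window, seconds */
        int  recovery_time_hard;    /* hard cap on the recovery window  */
        bool timer_armed;           /* was the recovery timer re-armed? */
};

/* Mirrors the clamp-and-compare logic from extend_recovery_timer() above. */
static void extend_recovery_timer_sketch(struct fake_obd *obd, int to)
{
        if (to > obd->recovery_time_hard)
                to = obd->recovery_time_hard;

        obd->timer_armed = false;
        if (obd->recovery_timeout < to) {   /* strict '<': once the timeout  */
                obd->recovery_timeout = to; /* equals the hard limit, this   */
                obd->timer_armed = true;    /* branch is never taken again   */
        }                                   /* and the timer stays disarmed. */
}

Calling this repeatedly with to == recovery_time_hard leaves timer_armed false every time, which is consistent with the recovery countdown drifting into negative values in the console messages.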

Comment by Mikhail Pershin [ 20/Nov/12 ]

I've made a patch with some recovery-related changes which may help. First of all, it checks exp_failed in class_disconnect_stale_exports() and does not include already failed/evicted clients in the evict list again. It also fixes the recovery_timeout == hard_timeout case mentioned above. I am not sure that is exactly what has to be done there; probably we should abort recovery when the timeout reaches the HARD limit.

http://review.whamcloud.com/4636
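
A rough sketch of what that first change implies, with hypothetical names and a simplified export list rather than the actual code in the Gerrit change above: stale-export eviction skips exports that are already marked failed, so an evicted client is not put on the evict list (and counted) a second time.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down export representation for illustration only. */
struct fake_export {
        struct fake_export *next;
        bool                failed;    /* already failed/evicted          */
        bool                finished;  /* completed recovery successfully */
};

/* Evict stale exports, skipping ones that have already been failed so the
 * same client is never counted or evicted twice. */
static int disconnect_stale_exports_sketch(struct fake_export *exports)
{
        struct fake_export *exp;
        int evicted = 0;

        for (exp = exports; exp != NULL; exp = exp->next) {
                if (exp->failed || exp->finished)
                        continue;
                exp->failed = true;
                evicted++;
        }
        return evicted;
}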

Comment by Mikhail Pershin [ 22/Nov/12 ]

Another change in the patch is related to the class_fail_export() function: if it is called during recovery, it should update the obd_stale_clients counter. The patch has been updated.
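
A hedged sketch of that second change as described, again with made-up names rather than the real obd_device layout: failing an export while recovery is still running is accounted as a stale client, so recovery bookkeeping does not keep waiting for a client that will never return.

#include <stdbool.h>

/* Hypothetical, simplified device state for illustration only. */
struct fake_device {
        bool recovering;      /* target recovery in progress         */
        int  stale_clients;   /* clients given up on during recovery */
};

/* Sketch of the described class_fail_export() behaviour. */
static void fail_export_sketch(struct fake_device *dev)
{
        if (dev->recovering)
                dev->stale_clients++;
}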

Comment by Mikhail Pershin [ 23/Nov/12 ]

Prakash, can you check the latest patch, if the issue is easy for you to reproduce?

Comment by Prakash Surya (Inactive) [ 26/Nov/12 ]

Mike, sure, I'll apply the updated patch. Do you expect it to fix the issue completely? I ask because we don't have a solid reproducer; basically we've just rebooted the OSTs many times and "eventually" we see the negative time.

Comment by Mikhail Pershin [ 26/Nov/12 ]

Yes, I expect the fix to help. The bug report mentions conf-sanity test 47, which I believe can reproduce it. Also, I suppose you've used MDS failover, and that is the key. The problem was an MDS reconnection from a different NID during OST recovery, so you can try to simulate that.

Comment by Prakash Surya (Inactive) [ 26/Nov/12 ]

Also, I suppose you've used MDS failover, and that is the key.

What do you mean by "MDS failover"? Unless I'm mistaken, we do not use MDS failover. When upgrading or during testing we may reboot the MDS, but it never fails over to a partner. I would not expect the NID to change, since the MDS comes back up on the same node in our configuration.

Comment by Christopher Morrone [ 26/Nov/12 ]

We are planning to use failover eventually on this filesystem. This will be the first filesystem at LLNL where we will use MDS failover. So it may be configured with a failover nid already. But I too am skeptical that anyone really did MDS failover on this system. We're using the other MDS node for another purpose at the moment.

Comment by Mikhail Pershin [ 27/Nov/12 ]

A reboot is such a case too; it changes the MDS connection as well. The case we need is an MDS restart during OST recovery.

Comment by Mikhail Pershin [ 08/Jan/13 ]

patch landed

Comment by Prakash Surya (Inactive) [ 08/Jan/13 ]

I'm fine resolving this since the patch landed; we don't really have a reproducer and haven't seen it in the wild since applying the fix. We can reopen if needed.

Comment by Mikhail Pershin [ 10/Jan/13 ]

patch was landed
