Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
4
-
4392
Description
Running conf-sanity test 47 I met this:
Oct 7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-s anity test 47: server restart does not make client loss lru_resize settings == 2 0:52:02 (1349657522) Oct 7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: Oct 7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing Oct 7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000 Oct 7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete Oct 7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: Oct 7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log Oct 7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized Oct 7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107 Oct 7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages Oct 7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete Oct 7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296 Oct 7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operatio ns using this service will fail.Oct 7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0 000: Connection restored to lustre-OST0000 (at 0@lo)Oct 7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067: ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34Oct 7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT 0000Oct 7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MD T0000 completeOct 7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107 Oct 7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: Oct 7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail Oct 7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import Oct 7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log Oct 7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000 Oct 7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages Oct 7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading Oct 7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context Oct 7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53 ... Oct 7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17 Oct 7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message Oct 7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports Oct 7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27 Oct 7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports Oct 7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17 Oct 7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages Oct 7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports Oct 7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16 Oct 7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages Oct 7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52 ... Oct 7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds. Oct 7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov D 0000000000000002 6256 1991 2 0x00000080 Oct 7 20:58:47 centos6-0 kernel: [ 3960.466011] ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e Oct 7 20:58:47 centos6-0 kernel: [ 3960.466835] 000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282 Oct 7 20:58:47 centos6-0 kernel: [ 3960.467658] ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778 Oct 7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace: Oct 7 20:58:47 centos6-0 kernel: [ 3960.468913] [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.469451] [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.470270] [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.470765] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 Oct 7 20:58:47 centos6-0 kernel: [ 3960.471287] [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.472065] [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20 Oct 7 20:58:47 centos6-0 kernel: [ 3960.472594] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.473336] [<ffffffff8100c14a>] child_rip+0xa/0x20 Oct 7 20:58:47 centos6-0 kernel: [ 3960.473550] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.473889] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 20:58:47 centos6-0 kernel: [ 3960.474239] [<ffffffff8100c140>] ? child_rip+0x0/0x20 Oct 7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds. Oct 7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov D 0000000000000002 6256 1991 2 0x00000080 Oct 7 21:00:47 centos6-0 kernel: [ 4080.473577] ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e Oct 7 21:00:47 centos6-0 kernel: [ 4080.473931] 000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282 Oct 7 21:00:47 centos6-0 kernel: [ 4080.474288] ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778 Oct 7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace: Oct 7 21:00:47 centos6-0 kernel: [ 4080.474820] [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.475049] [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.475403] [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.475620] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 Oct 7 21:00:47 centos6-0 kernel: [ 4080.475845] [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.476186] [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20 Oct 7 21:00:47 centos6-0 kernel: [ 4080.476399] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.476759] [<ffffffff8100c14a>] child_rip+0xa/0x20 Oct 7 21:00:47 centos6-0 kernel: [ 4080.477017] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.477446] [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc] Oct 7 21:00:47 centos6-0 kernel: [ 4080.477801] [<ffffffff8100c140>] ? child_rip+0x0/0x20 Oct 7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12 ... Oct 7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27
So recovery timer epiring multiple times and happily counting into negative territory.
Also I find it strange that the hangcheck timer hits multiple times for the same task so quickly before bigger pauses.
Attachments
Issue Links
- is duplicated by
-
LU-2206 OSS stuck in recovery
- Resolved