  Lustre / LU-2104

conf-sanity test 47 never completes, negative time to recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 4
    • 4392

    Description

      Running conf-sanity test 47, I hit this:

      Oct  7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-sanity test 47: server restart does not make client loss lru_resize settings == 20:52:02 (1349657522)
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT0000
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MDT0000 complete
      Oct  7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
      Oct  7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53
      ...
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message
      Oct  7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27
      Oct  7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages
      Oct  7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages
      Oct  7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52
      ...
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466011]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466835]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.467658]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace:
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468913]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.469451]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470270]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470765]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.471287]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472065]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472594]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473336]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473550]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473889]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.474239]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473577]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473931]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474288]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace:
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474820]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475049]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475403]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475620]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475845]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476186]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476399]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476759]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477017]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477446]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477801]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12
      ...
      Oct  7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27
      

      So the recovery timer is expiring multiple times and happily counting into negative territory.
      Also I find it strange that the hangcheck timer hits multiple times for the same task so quickly before bigger pauses.
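
      The "-1:-52" and "-95:-27" strings above are what you get when a remaining-time value is formatted as minutes and seconds without being clamped at zero once the recovery deadline has passed: with C's truncating division, both the quotient and the remainder go negative. A minimal userspace sketch of that formatting (assumed for illustration; this is not the actual Lustre code):

      #include <stdio.h>
      #include <time.h>

      /* Format a recovery countdown the way the console messages above do,
       * deliberately without an "if (left < 0) left = 0;" clamp, to show how
       * the negative "-1:-52" style strings come about. */
      static void print_remaining(time_t deadline, time_t now)
      {
              long left = (long)(deadline - now);

              printf("to recover in %ld:%.02ld\n", left / 60, left % 60);
      }

      int main(void)
      {
              time_t deadline = 1000;

              print_remaining(deadline, deadline - 53);           /* "0:53"    */
              print_remaining(deadline, deadline + 112);          /* "-1:-52"  */
              print_remaining(deadline, deadline + 95 * 60 + 27); /* "-95:-27" */
              return 0;
      }

      Clamping the remaining time at zero would only hide the symptom, though; the real problem reported here is that recovery never completes in the first place.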


          Activity

            [LU-2104] conf-sanity test 47 never completes, negative time to recovery

            tappro Mikhail Pershin added a comment - patch was landed

            prakash Prakash Surya (Inactive) added a comment - I'm fine resolving this since the patch landed; we don't really have a reproducer and haven't seen it in the wild since applying the fix. We can reopen if needed.

            tappro Mikhail Pershin added a comment - patch landed

            tappro Mikhail Pershin added a comment - A reboot is such a case too, since it changes the MDS connection as well; the case we need is an MDS restart during OST recovery.

            morrone Christopher Morrone (Inactive) added a comment - We are planning to use failover eventually on this filesystem. This will be the first filesystem at LLNL where we will use MDS failover. So it may be configured with a failover nid already. But I too am skeptical that anyone really did MDS failover on this system. We're using the other MDS node for another purpose at the moment.

            prakash Prakash Surya (Inactive) added a comment - "Also I suppose you've used MDS failover and that is the key." What do you mean by "MDS failover"? Unless I'm mistaken, we do not use MDS failover. When upgrading or during testing we may reboot the MDS, but it never fails over to a partner. I would not expect the NID to change since the MDS comes back up on the same node in our configuration.

            tappro Mikhail Pershin added a comment - Yes, I expect the fix to help. The bug report mentions conf-sanity test 47, which I believe can reproduce it. Also, I suppose you've used MDS failover and that is the key. The problem was an MDS reconnection from a different NID during OST recovery, so you can try to simulate that.

            prakash Prakash Surya (Inactive) added a comment - Mike, sure, I'll apply the updated patch. Do you expect it to fix the issue completely? I ask because we don't have a solid reproducer; basically we've just rebooted the OSTs many times and "eventually" we'll see the negative time.

            tappro Mikhail Pershin added a comment - Prakash, can you check the latest patch, if the issue is easy to reproduce?

            tappro Mikhail Pershin added a comment - Another change in the patch is related to the class_fail_export() function: if it is called during recovery, it should update the obd_stale_clients counter. The patch is updated.
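
            As a rough illustration of why that counter matters (a simplified userspace sketch of the bookkeeping described in this comment, not the actual patch; the struct and function names below are made up and only mimic the Lustre counters mentioned above): recovery on a target can only finish once every known client has either reconnected or been declared stale, so failing an export during recovery without bumping the stale counter leaves the target stuck in the "0 recovered, 1 in progress, and 1 unseen" state seen in the log.

            #include <stdbool.h>
            #include <stdio.h>

            /* Simplified model of target recovery bookkeeping (illustrative only). */
            struct target {
                    int max_recoverable_clients;   /* the "2 recoverable clients" in the log */
                    int recovered_clients;
                    int stale_clients;
            };

            /* Recovery may complete once all known clients are accounted for. */
            static bool recovery_can_finish(const struct target *t)
            {
                    return t->recovered_clients + t->stale_clients >=
                           t->max_recoverable_clients;
            }

            /* The change described above: an export failed during recovery must be
             * counted as stale, otherwise the check above can never become true. */
            static void fail_export_during_recovery(struct target *t)
            {
                    t->stale_clients++;
            }

            int main(void)
            {
                    struct target ost = { .max_recoverable_clients = 2 };

                    ost.recovered_clients = 1;                  /* one client reconnected */
                    printf("%d\n", recovery_can_finish(&ost));  /* 0: stuck waiting       */

                    fail_export_during_recovery(&ost);          /* evicted client counted */
                    printf("%d\n", recovery_can_finish(&ost));  /* 1: can finish          */
                    return 0;
            }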

            People

              tappro Mikhail Pershin
              green Oleg Drokin
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: