[LU-10251] MDS hangs in recovery, cannot abort; recovery timer is bogus Created: 16/Nov/17  Updated: 17/Nov/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: soak
Environment:

soak performance cluster


Attachments: soak-8.crash.txt, soak-8.recovery.wedge.txt
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDS is rebooted (single MDS, no DNE).
The MDS goes into recovery with bogus values for the recovery timer.

soak-8 login: [ 1393.056450] Lustre: soaked-MDT0000: Denying connection for new client 7af6eae0-3527-5481-d01d-161d271e4510(at 192.168.1.142@o2ib), waiting for 29 known clients (6 recovered, 21 in progress, and 2 evicted) to recover in 71565:2
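For context, the trailing "71565:2" is the remaining recovery window formatted as minutes:seconds, i.e. roughly 49.7 days. A minimal userspace sketch of how a garbage deadline can yield such a value, assuming the remaining time is derived by subtracting the current time from a recovery deadline (hypothetical names, not the actual Lustre source):

#include <stdio.h>
#include <time.h>

/*
 * Illustrative sketch only: the denial message formats the remaining
 * recovery window as minutes:seconds.  If the deadline is computed
 * from a stale or uninitialized start time, the "remaining" value is
 * huge -- e.g. 71565:02 is about 49.7 days.
 */
static void print_denial(time_t deadline, time_t now)
{
	time_t remaining = deadline - now;   /* bogus if deadline is garbage */

	printf("waiting for clients to recover in %lld:%.02lld\n",
	       (long long)(remaining / 60), (long long)(remaining % 60));
}

int main(void)
{
	time_t now = time(NULL);

	/* Sane case: 5 minutes left. */
	print_denial(now + 300, now);

	/* Bogus case: deadline ~49.7 days out, as seen in the log above. */
	print_denial(now + 71565 * 60 + 2, now);
	return 0;
}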

The MDS never exits recovery, and clients get -EBUSY.
Attempting abort_recovery causes timeouts; the system is still wedged.

[ 1681.193209] INFO: task lctl:2555 blocked for more than 120 seconds.
[ 1681.271617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1681.368730] lctl            D ffff8803f1ecd400     0  2555   2526 0x00000084
[ 1681.456456]  ffff880413a5bc10 0000000000000082 ffff8803f1826eb0 ffff880413a5bfd8
[ 1681.548186]  ffff880413a5bfd8 ffff880413a5bfd8 ffff8803f1826eb0 ffff8808195014d0
[ 1681.639847]  7fffffffffffffff ffff8808195014c8 ffff8803f1826eb0 ffff8803f1ecd400
[ 1681.731520] Call Trace:
[ 1681.763370]  [<ffffffff816a9589>] schedule+0x29/0x70
[ 1681.826052]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[ 1681.899089]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
[ 1681.974192]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 1682.044159]  [<ffffffffc10f5a5d>] target_stop_recovery_thread.part.16+0x3d/0xd0 [ptlrpc]
[ 1682.144235]  [<ffffffffc10f5b08>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
[ 1682.235915]  [<ffffffffc15935d0>] mdt_iocontrol+0x550/0xaf0 [mdt]
[ 1682.312024]  [<ffffffffc0ef3bd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 1682.400553]  [<ffffffffc0edebb3>] class_handle_ioctl+0x1913/0x1da0 [obdclass]
[ 1682.488997]  [<ffffffff812b1a98>] ? security_capable+0x18/0x20
[ 1682.561806]  [<ffffffffc0ec4602>] obd_class_ioctl+0xd2/0x170 [obdclass]
[ 1682.643909]  [<ffffffff812151bd>] do_vfs_ioctl+0x33d/0x540
[ 1682.712431]  [<ffffffff816b0091>] ? __do_page_fault+0x171/0x450
[ 1682.786103]  [<ffffffff81215461>] SyS_ioctl+0xa1/0xc0
[ 1682.849308]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
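The trace shows lctl blocked in wait_for_completion() under target_stop_recovery_thread(): the abort path asks the recovery thread to stop and then waits for it to acknowledge, so if the recovery thread is itself stuck, the waiter never returns. A minimal userspace sketch of that stop-and-wait pattern, with pthreads standing in for the kernel completion (hypothetical names, not the Lustre code):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool stop_requested, thread_done;

/* Stands in for the recovery thread.  In the failure above, the thread
 * is stuck elsewhere and never reaches the "thread_done" step. */
static void *recovery_thread(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!stop_requested)
		pthread_cond_wait(&cv, &lock);
	thread_done = true;               /* the acknowledgement... */
	pthread_cond_broadcast(&cv);      /* ...that the hung MDS never sends */
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Stands in for target_stop_recovery_thread(): request stop, then block
 * until the thread acknowledges -- indefinitely if it never does. */
static void stop_recovery_thread(void)
{
	pthread_mutex_lock(&lock);
	stop_requested = true;
	pthread_cond_broadcast(&cv);
	while (!thread_done)              /* analogue of wait_for_completion() */
		pthread_cond_wait(&cv, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, recovery_thread, NULL);
	sleep(1);
	stop_recovery_thread();
	pthread_join(tid, NULL);
	printf("recovery thread stopped\n");
	return 0;
}

In the hang reported here, the equivalent of the "thread_done" acknowledgement never happens, so the ioctl caller (lctl) sleeps in D state until the hung-task watchdog fires.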

Lustre log and stack traces are attached; we are currently forcing a kernel dump.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 17/Nov/17 ]

Hi Lai,

Can you please investigate this issue?

Thanks.
Joe
