[LU-12949] extend_recovery_timer assertion - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.14.0
Affects Version/s: None
Labels: patch

Severity: 3

Description

The sequence of events is as follows:
1. MDT0 started recovery and was waiting for the first client to connect

[18123.755404] Lustre: testfs-MDT0000: in recovery but waiting for the first client to connect

2. It was also trying to communicate with MDT1 to get the update logs
3. Failover of MDT0 was started

[18291.574217] Lustre: Failing over testfs-MDT0000

4. The lod thread (which communicates with MDT1) saw obd_stopping and stopped with EIO (-5)

[18291.594477] LustreError: 3215:0:(lod_dev.c:434:lod_sub_recovery_thread()) testfs-MDT0001-osp-MDT0000 get update log failed: rc = -5

5. The recovery thread called check_for_recovery_ready(), which called extend_recovery_timer() and hit the assertion: it assumed that a client was connected, but no recovery start timestamp (obd->obd_recovery_start) had been set for the obd.

[ 985.709865] Lustre: Failing over lustre-MDT0000
[ 990.467906] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) ASSERTION( obd->obd_recovery_start != 0 ) failed:
[ 990.469985] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) LBUG
[ 990.471056] Pid: 9090, comm: tgt_recover_0 3.10.0-693.21.1.x3.1.9.x86_64 #1 SMP Tue Jun 26 09:38:31 PDT 2018
[ 990.471059] Call Trace:
[ 990.471105] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[ 990.471115] [<ffffffffc062c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 990.471130] [<ffffffffc062c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 990.471143] [<ffffffffc0aa70d9>] extend_recovery_timer+0x2a9/0x2c0 [ptlrpc]
[ 990.471205] [<ffffffffc0aa7c54>] check_for_recovery_ready+0xa4/0x1f0 [ptlrpc]
[ 990.471266] [<ffffffffc0aa969b>] target_recovery_overseer+0x26b/0x6f0 [ptlrpc]
[ 990.471326] [<ffffffffc0ab1f8c>] target_recovery_thread+0x68c/0x11d0 [ptlrpc]
[ 990.471385] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 990.471391] [<ffffffff816c1577>] ret_from_fork+0x77/0xb0
[ 990.471398] [<ffffffffffffffff>] 0xffffffffffffffff

While working on a reproducer, I found the conditions required to trigger this bug.

1) The MDT should have only LWP clients during the recovery phase.

2) The MDT needs to get an error while fetching the update log from another MDT.

3) The 60-second timer should wake up the recovery thread so that it calls check_for_recovery_ready(), and just before that, umount should invalidate the import to produce stale clients.
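
The three conditions above can be sketched as a small user-space model. This is not Lustre code: the struct, the enum, and both functions are hypothetical names invented for illustration, and the decision logic is only one reading of the report (stale clients satisfy the client count check, the missing update log triggers a timer extension, and the start timestamp is still 0 because no regular client ever connected).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the target's recovery state. */
struct toy_target {
	long recovery_start;     /* 0 until the recovery timer is first armed */
	int  connected_clients;  /* clients that reconnected during recovery */
	int  stale_clients;      /* clients marked stale, e.g. by umount */
	int  max_recoverable;    /* clients expected to take part in recovery */
	bool update_logs_ready;  /* false if fetching the update log failed */
};

enum toy_result { TOY_WAIT, TOY_EXTEND_TIMER, TOY_READY };

/* Assumed decision order; the real check_for_recovery_ready() may differ. */
static enum toy_result toy_check_ready(const struct toy_target *t)
{
	assert(t != NULL);

	/* Still waiting for clients that are neither connected nor stale. */
	if (t->connected_clients + t->stale_clients < t->max_recoverable)
		return TOY_WAIT;

	/* All clients accounted for, but update logs are missing. */
	if (!t->update_logs_ready)
		return TOY_EXTEND_TIMER;

	return TOY_READY;
}

/* True when extending the timer would trip "recovery_start != 0". */
static bool toy_hits_assertion(const struct toy_target *t)
{
	return toy_check_ready(t) == TOY_EXTEND_TIMER &&
	       t->recovery_start == 0;
}
```

In this model, a target with no connected clients, one stale client, max_recoverable of 1, no update logs, and recovery_start of 0 reaches the timer extension path with an unset timestamp, which corresponds to the state the assertion rejects.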

People

Assignee: Alexander Boyko

Reporter: Alexander Boyko

Votes: 0

Watchers: 3

Dates

Created: 07/Nov/19 8:57 AM

Updated: 22/Aug/22 2:57 PM

Resolved: 14/Dec/19 1:43 PM