[LU-12949] extend_recovery_timer assertion Created: 07/Nov/19  Updated: 22/Aug/22  Resolved: 14/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The situation is as follows:
1. MDT0 started recovery and was waiting for the first client to connect

[18123.755404] Lustre: testfs-MDT0000: in recovery but waiting for the first client to connect

2. It was also trying to communicate with MDT1 to get the update logs
3. Failover of MDT0 was started

[18291.574217] Lustre: Failing over testfs-MDT0000

4. The lod thread (which communicates with MDT1) saw obd_stopping and stopped with -EIO (-5)

[18291.594477] LustreError: 3215:0:(lod_dev.c:434:lod_sub_recovery_thread()) testfs-MDT0001-osp-MDT0000 get update log failed: rc = -5

5. The recovery thread called check_for_recovery_ready() and hit the assertion, because it believed a client was connected while obd_recovery_start (the recovery start time stamp) was still 0. A sketch of this path follows the reproducer conditions below.

[ 985.709865] Lustre: Failing over lustre-MDT0000
[ 990.467906] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) ASSERTION( obd->obd_recovery_start != 0 ) failed:
[ 990.469985] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) LBUG
[ 990.471056] Pid: 9090, comm: tgt_recover_0 3.10.0-693.21.1.x3.1.9.x86_64 #1 SMP Tue Jun 26 09:38:31 PDT 2018
[ 990.471059] Call Trace:
[ 990.471105] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[ 990.471115] [<ffffffffc062c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 990.471130] [<ffffffffc062c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 990.471143] [<ffffffffc0aa70d9>] extend_recovery_timer+0x2a9/0x2c0 [ptlrpc]
[ 990.471205] [<ffffffffc0aa7c54>] check_for_recovery_ready+0xa4/0x1f0 [ptlrpc]
[ 990.471266] [<ffffffffc0aa969b>] target_recovery_overseer+0x26b/0x6f0 [ptlrpc]
[ 990.471326] [<ffffffffc0ab1f8c>] target_recovery_thread+0x68c/0x11d0 [ptlrpc]
[ 990.471385] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 990.471391] [<ffffffff816c1577>] ret_from_fork+0x77/0xb0
[ 990.471398] [<ffffffffffffffff>] 0xffffffffffffffff
 

While working on a reproducer I found the conditions required to hit this bug:

1) The MDT must have only lwp clients during the recovery phase

2) The MDT must get an error from the MDT-MDT update log retrieval

3) The 60-second timer must wake up the recovery thread to call check_for_recovery_ready(), and a moment before that, umount must invalidate the import to produce stale clients
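
To make the failing path concrete, here is a small userspace model of the race described above. This is only a sketch: the struct fields, counters, and control flow are simplified assumptions based on this ticket's description, not the actual Lustre code.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct obd_device; field names mirror the report. */
struct obd_model {
        bool obd_stopping;        /* set once failover/umount of the MDT begins     */
        long obd_recovery_start;  /* stays 0 until recovery really starts           */
        int  connected;           /* clients that completed a connection            */
        int  stale;               /* clients whose import was invalidated by umount */
        int  expected;            /* clients expected to reconnect                  */
};

/* Models extend_recovery_timer(): it assumes recovery has already started,
 * which is what the LASSERT at ldlm_lib.c:1754 enforces in the report. */
static void extend_recovery_timer(struct obd_model *obd)
{
        assert(obd->obd_recovery_start != 0);   /* the LBUG from the stack trace */
        printf("recovery timer extended\n");
}

/* Models check_for_recovery_ready(): stale (invalidated) imports count toward
 * the expected total, so the thread believes the clients are accounted for and
 * extends the timer although no client ever connected. */
static void check_for_recovery_ready(struct obd_model *obd)
{
        if (obd->connected + obd->stale >= obd->expected)
                extend_recovery_timer(obd);
}

int main(void)
{
        struct obd_model obd = { .expected = 1 };

        obd.obd_stopping = true;        /* step 3: failover of MDT0 starts   */
        obd.stale = 1;                  /* umount invalidates the lwp import */
        check_for_recovery_ready(&obd); /* step 5: the assertion fires       */
        return 0;
}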



 Comments   
Comment by Alexander Boyko [ 07/Nov/19 ]

I've pushed a fix https://review.whamcloud.com/#/c/36703

Comment by Gerrit Updater [ 14/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36703/
Subject: LU-12949 obdclass: don't extend timer if obd stops
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bc871f8ff53068bfe69ad7653479b42e6a6d2d93
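
Based only on the patch subject above ("don't extend timer if obd stops"), the change presumably makes the timer-extension path bail out instead of asserting while the device is stopping. Reusing the model struct from the sketch in the description, such a guard would look roughly like this (hypothetical, not the literal diff):

/* Hypothetical guard inferred from the patch subject, not the actual diff:
 * skip extending the recovery timer once the obd is stopping, or if recovery
 * never actually started, instead of tripping the LASSERT. */
static void extend_recovery_timer(struct obd_model *obd)
{
        if (obd->obd_stopping || obd->obd_recovery_start == 0)
                return;   /* nothing to extend; the device is going away */
        printf("recovery timer extended\n");
}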

Comment by Peter Jones [ 14/Dec/19 ]

Landed for 2.14

Comment by Gerrit Updater [ 03/Mar/21 ]

Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41870
Subject: LU-12949 obdclass: run recovery-small test_138 on branch 2_12
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b64c63c248e7c518ff9d49cd0b458dd0584e523b

Comment by Gerrit Updater [ 22/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48283
Subject: LU-12949 obdclass: don't extend timer if obd stops
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d7619bd1049c5891632f1f532fd21869efb2f39b
