[LU-12949] extend_recovery_timer assertion Created: 07/Nov/19 Updated: 22/Aug/22 Resolved: 14/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The situation is as follows:
2. It also was trying to communicate with MDT1 to get logs
4. lod thread (which communicates with MDT1) saw obd_stopping, stopped with EIO (-5)
5. The recovery thread called check_for_recovery_ready() and hit an assertion, because it believed a client was connected but found no recovery start timestamp for the obd.

[ 985.709865] Lustre: Failing over lustre-MDT0000
[ 990.467906] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) ASSERTION( obd->obd_recovery_start != 0 ) failed:
[ 990.469985] LustreError: 9090:0:(ldlm_lib.c:1754:extend_recovery_timer()) LBUG
[ 990.471056] Pid: 9090, comm: tgt_recover_0 3.10.0-693.21.1.x3.1.9.x86_64 #1 SMP Tue Jun 26 09:38:31 PDT 2018
[ 990.471059] Call Trace:
[ 990.471105] [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[ 990.471115] [<ffffffffc062c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 990.471130] [<ffffffffc062c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 990.471143] [<ffffffffc0aa70d9>] extend_recovery_timer+0x2a9/0x2c0 [ptlrpc]
[ 990.471205] [<ffffffffc0aa7c54>] check_for_recovery_ready+0xa4/0x1f0 [ptlrpc]
[ 990.471266] [<ffffffffc0aa969b>] target_recovery_overseer+0x26b/0x6f0 [ptlrpc]
[ 990.471326] [<ffffffffc0ab1f8c>] target_recovery_thread+0x68c/0x11d0 [ptlrpc]
[ 990.471385] [<ffffffff810b4031>] kthread+0xd1/0xe0
[ 990.471391] [<ffffffff816c1577>] ret_from_fork+0x77/0xb0
[ 990.471398] [<ffffffffffffffff>] 0xffffffffffffffff

While working on a reproducer, I found the conditions required for this bug:
1) The MDT should have only lwp clients during the recovery phase.
2) The MDT needs to get an error from the MDT-MDT log update.
3) The 60-second timer should wake up the recovery thread to call check_for_recovery_ready(), and just before that, umount should invalidate the import to produce stale clients. |
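The failing invariant can be illustrated with a toy model. The C sketch below is not Lustre code: struct toy_target, toy_extend_recovery_timer(), toy_timer_tick(), toy_demo() and the guard flag are hypothetical names introduced only for illustration. The guard shows one conceivable way to avoid the assertion and is not a description of the merged patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, heavily simplified stand-in for the target state.
 * Field names echo the ticket, but this is NOT struct obd_device. */
struct toy_target {
    time_t recovery_start; /* 0 until a recovery window has been opened */
    bool   stopping;       /* set once umount has started */
    int    stale_clients;  /* clients invalidated by umount */
};

/* Models the invariant from the log: the timer may only be extended
 * once recovery_start is set. Returns false where the real code would
 * hit the LBUG. The guard is an assumption, not the actual fix. */
static bool toy_extend_recovery_timer(const struct toy_target *t, bool guard)
{
    if (guard && t->stopping)
        return true;               /* skip the extension while stopping */
    return t->recovery_start != 0; /* the asserted condition */
}

/* Models the periodic wakeup: in this toy model, stale clients make the
 * recovery thread try to extend the timer (an assumption of the sketch). */
static bool toy_timer_tick(const struct toy_target *t, bool guard)
{
    if (t->stale_clients > 0)
        return toy_extend_recovery_timer(t, guard);
    return true;
}

/* Usage example: the bad state described in the ticket. */
static void toy_demo(void)
{
    struct toy_target t = {
        .recovery_start = 0, .stopping = true, .stale_clients = 1
    };

    assert(!toy_timer_tick(&t, false)); /* unguarded: invariant violated */
    assert(toy_timer_tick(&t, true));   /* guarded: extension is skipped */
}
```

In this model the violation needs all three ingredients at once: no recovery start timestamp, a target that is already stopping, and at least one stale client at the moment the timer fires.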
| Comments |
| Comment by Alexander Boyko [ 07/Nov/19 ] |
|
I've pushed a fix https://review.whamcloud.com/#/c/36703 |
| Comment by Gerrit Updater [ 14/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36703/ |
| Comment by Peter Jones [ 14/Dec/19 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 03/Mar/21 ] |
|
Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41870 |
| Comment by Gerrit Updater [ 22/Aug/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48283 |