[LU-15724] MDT failover hang Created: 06/Apr/22  Updated: 12/Apr/23  Resolved: 06/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-16425 Interop recovery-small test_144a: MDT... Resolved
Severity: 3

 Description   

With the LU-8367 deadlock between osp_precreate_reserve() and osp_precreate_cleanup_orphans() present, I found a problem with MDT failover:

00000020:02000400:31.0:1644539398.776433:0:454249:0:(obd_config.c:854:class_cleanup()) Failing over kjcf05-MDT0001
...
00010000:02020000:20.0:1644539461.204784:0:454249:0:(ldlm_resource.c:1188:__ldlm_namespace_free()) 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID namespace with 46 resources in use, (rc=-110)
00010000:02020000:8.0:1644539699.332763:0:454249:0:(ldlm_resource.c:1188:__ldlm_namespace_free()) 0-0: Forced cleanup waiting for mdt-kjcf05-MDT0001_UUID namespace with 46 resources in use, (rc=-110)

The situation is as follows: MDT failover does not produce a disconnect event, so osp_precreate_cleanup_orphans() cannot be woken up. Failover also does not clear opd_pre_recovering, so the wait in osp_precreate_reserve() ignores the wakeup signal. The hang ends only after roughly obd_timeout.
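
The missed wakeup can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not Lustre code: the class PrecreateState, its recovering flag (standing in for opd_pre_recovering), and the functions reserve(), failover() and run() are hypothetical names invented for the illustration, and the landed patch may well differ in detail. It shows the general pattern: a waiter that re-checks a flag on every wakeup is only released if the teardown path clears that flag before signalling.

```python
import threading


class PrecreateState:
    """Toy stand-in for the OSP precreate state (hypothetical, not Lustre code)."""

    def __init__(self):
        self.cond = threading.Condition()
        self.recovering = True  # plays the role of opd_pre_recovering

    def reserve(self, timeout):
        """Waiter: return True if released before the timeout, else False."""
        with self.cond:
            # The predicate is re-checked on every wakeup. A signal that
            # leaves it False sends the waiter straight back to sleep.
            return self.cond.wait_for(lambda: not self.recovering, timeout)

    def failover(self, fixed):
        """Teardown path: always signals, but only the 'fixed' variant
        clears the flag that the waiter is checking."""
        with self.cond:
            if fixed:
                self.recovering = False
            self.cond.notify_all()


def run(fixed, timeout=0.2):
    """Start one waiter, trigger failover, report whether the waiter was released."""
    state = PrecreateState()
    result = []
    waiter = threading.Thread(
        target=lambda: result.append(state.reserve(timeout)))
    waiter.start()
    state.failover(fixed)
    waiter.join()
    return result[0]
```

In this model, run(fixed=False) returns only after the timeout expires, which mirrors the hang lasting about obd_timeout, while run(fixed=True) releases the waiter as soon as failover() signals.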



 Comments   
Comment by Gerrit Updater [ 06/Apr/22 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47005
Subject: LU-15724 osp: wakeup all precreate threads
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6080ef0513b4832c76b0dfa04efc185987f2e61b

Comment by Gerrit Updater [ 06/Apr/22 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47006
Subject: LU-15724 tests: MDT failover hang reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 300cd635638acf195124a12c4a5228dbdc85c116

Comment by Gerrit Updater [ 06/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47005/
Subject: LU-15724 osp: wakeup all precreate threads
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e55fc043679cdfadfff6874ef78e2e0128ec37ac

Comment by Gerrit Updater [ 06/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47006/
Subject: LU-15724 tests: MDT failover hang reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aa6250b7412e7baf6760fe4010a81f4f22187127

Comment by Peter Jones [ 06/Jun/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 14/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48548
Subject: LU-15724 osp: wakeup all precreate threads
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: c274853358c793555fb1a20741f72d9254a0147d

Comment by Gerrit Updater [ 14/Sep/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48549
Subject: LU-15724 tests: MDT failover hang reproducer
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 2844bbe7a5c9915c5f2bf376a6b4554e5683081c

Comment by Gerrit Updater [ 26/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48548/
Subject: LU-15724 osp: wakeup all precreate threads
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 4eede4aab35296ed9417b77b955cf43a83827fdb

Comment by Gerrit Updater [ 26/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48549/
Subject: LU-15724 tests: MDT failover hang reproducer
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 9d1805c8b9cc1067b9b3ba186e5e3531112e08a3
