[LU-17089] Bug in the barrier code could cause barrier freeze to fail every time Created: 05/Sep/23  Updated: 28/Sep/23  Resolved: 28/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Tim Day
Assignee: Tim Day
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The barrier code has a bug that could cause the freeze to fail every time. A barrier freeze is issued before attempting a file system backup, but it would fail repeatedly because of an issue in the mdd_trans_create() function.

https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;f=lustre/mdd/mdd_trans.c;hb=2b0a71081d9c2465cb4b6368fede266fcde91b82#l49

The barrier entry increments the global barrier_writer counter (called bi_writers in the patch subject), but the counter is not decremented if mdd_child_ops() returns an error. Once that happens the counter never returns to 0, so the freeze cannot happen.
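
The leak described above can be illustrated with a small standalone model. This is only a sketch, not the Lustre code: every name ending in _model, the trans_create_buggy()/trans_create_fixed() pair, and the plain integer counter are hypothetical stand-ins, and the assumption that the counter is released when a successful transaction is stopped is ours, not stated in this ticket. The real freeze logic also does more than compare a counter against 0.

```c
/* Hypothetical stand-in for the barrier's writer reference counter. */
static long writers;

/* Model of barrier entry/exit: take and drop one writer reference. */
void barrier_entry_model(void) { writers++; }
void barrier_exit_model(void)  { writers--; }

/* Model of the lower-layer transaction create; fails when asked to. */
int child_trans_create_model(int fail) { return fail ? -1 : 0; }

/* Buggy shape: the reference taken on entry is leaked on the error path. */
int trans_create_buggy(int fail)
{
	barrier_entry_model();
	return child_trans_create_model(fail);
}

/* Fixed shape: drop the reference again if the create step failed. */
int trans_create_fixed(int fail)
{
	int rc;

	barrier_entry_model();
	rc = child_trans_create_model(fail);
	if (rc != 0)
		barrier_exit_model();
	return rc;
}

/* Assumed success path: stopping the transaction drops the reference. */
void trans_stop_model(void) { barrier_exit_model(); }

/* Model of the freeze check: it can only succeed with no writers left. */
int barrier_freeze_model(void) { return writers == 0 ? 0 : -1; }
```

In this model, a single failed trans_create_buggy(1) leaves the counter at 1, so every later barrier_freeze_model() call returns -1. With trans_create_fixed(1) the counter is back at 0 after the failure and the freeze check passes.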



 Comments   
Comment by Gerrit Updater [ 05/Sep/23 ]

"Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52275
Subject: LU-17089 mdd: fix for bi_writers ref counter in case of error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0dcd83758b45357e9d59ba10e2bb23d430b05ce

Comment by Gerrit Updater [ 28/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52275/
Subject: LU-17089 mdd: fix for bi_writers ref counter in case of error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f667cc6a477e0e17b5263669b4592668fbf005bb

Comment by Peter Jones [ 28/Sep/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:32:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.