[LU-7231] ENOSPC on remote MDT might create an inconsistent striped directory Created: 30/Sep/15  Updated: 26/Oct/15  Resolved: 26/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Di Wang Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7230 memory leak in sanityn.sh 90 & 91 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In the current DNE implementation, when a striped directory is created, the master MDT only packs the remote updates inside the RPC during the execution phase; these updates are not executed until top_trans_stop() sends them to the remote MDT. If the remote updates fail (for example, with ENOSPC while writing the update log) but the local updates succeed, then we need to roll back those local updates (or do a better job during declaration?); otherwise the namespace might become inconsistent.

This can be reproduced by the test case in http://review.whamcloud.com/16677
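
The failure sequence described above can be pictured with a toy model. This is only a sketch, not Lustre code: `ToyMDT`, `create_striped_dir`, and the `log_space` counter are hypothetical names invented for illustration. Only the ordering (local updates applied first, remote updates sent when the transaction stops) is taken from the description.

```python
import errno


class ToyMDT:
    """Hypothetical stand-in for an MDT; not a Lustre data structure."""

    def __init__(self, name, log_space):
        self.name = name
        self.log_space = log_space  # update-log records this target can still write
        self.entries = set()        # names or stripe objects stored on this target

    def apply(self, entry):
        # Writing the update log is the step that fails with ENOSPC in the report.
        if self.log_space <= 0:
            raise OSError(errno.ENOSPC, f"no space for update log on {self.name}")
        self.log_space -= 1
        self.entries.add(entry)


def create_striped_dir(master, remotes, name):
    """Model of the reported behaviour: local updates are never rolled back."""
    # Execution phase: the local update runs now, remote updates are only packed.
    master.apply(name)
    packed = [(mdt, f"{name}:stripe{i}") for i, mdt in enumerate(remotes, start=1)]
    # Transaction stop: the packed updates are finally sent to the remote MDTs.
    for mdt, stripe in packed:
        mdt.apply(stripe)  # may raise after the local update already succeeded


master = ToyMDT("MDT0000", log_space=10)
remote = ToyMDT("MDT0001", log_space=0)  # the remote target is already full
try:
    create_striped_dir(master, [remote], "dir1")
    result = 0
except OSError as exc:
    result = exc.errno

# The caller sees ENOSPC, yet the master still holds the name entry.
print(result == errno.ENOSPC, "dir1" in master.entries, remote.entries)
```

In this model the caller gets the error back, but the master keeps a name entry with no stripe behind it, which is the inconsistent state the report is about.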



 Comments   
Comment by Alex Zhuravlev [ 30/Sep/15 ]

Instead we should reserve space, IMO, i.e. have grants for metadata.

Comment by Di Wang [ 30/Sep/15 ]

I agree that caching the status of the remote target and having it checked in the declare phase might be the right way to go. But for 2.8, is there a better temporary way to fix it, instead of using "lfs rm_entry" to delete the corrupted striped dir afterwards? (Btw: please check the patch http://review.whamcloud.com/16677, thanks!)
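
A declare-phase check along these lines could look roughly like the toy sketch below. It is an illustration under stated assumptions, not Lustre code: `ToyTarget`, `declare_create`, and the cached `free_records` field are hypothetical. The only point is that a failure detected during declaration happens before any local update is made.

```python
import errno


class ToyTarget:
    """Hypothetical target with a cached view of its free update-log space."""

    def __init__(self, name, free_records):
        self.name = name
        self.free_records = free_records  # cached status; could be stale in practice
        self.reserved = 0
        self.entries = set()


def declare_create(targets, needed=1):
    """Reserve space on every target, or fail before anything is modified."""
    for t in targets:
        if t.free_records - t.reserved < needed:
            raise OSError(errno.ENOSPC, f"{t.name} cannot reserve space")
    for t in targets:
        t.reserved += needed


def create_striped_dir(master, remotes, name):
    targets = [master, *remotes]
    declare_create(targets)                 # declaration: may fail, nothing changed yet
    master.entries.add(name)                # execution: local update
    for i, t in enumerate(remotes, start=1):
        t.entries.add(f"{name}:stripe{i}")  # remote updates use the reserved space
    for t in targets:
        t.reserved -= 1                     # consume the reservation
        t.free_records -= 1


master = ToyTarget("MDT0000", free_records=10)
full = ToyTarget("MDT0001", free_records=0)
try:
    create_striped_dir(master, [full], "dir1")
    result = 0
except OSError as exc:
    result = exc.errno

ok = ToyTarget("MDT0002", free_records=5)
create_striped_dir(master, [ok], "dir2")
```

Here the first call fails with ENOSPC and leaves the master untouched, while the second call succeeds because both targets could reserve space up front.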

Comment by Alex Zhuravlev [ 30/Sep/15 ]

It's probably not that trivial to roll back everything: say, a record in the changelog. Also, a failure on one target doesn't mean all the targets failed, right? Then we'd need to roll back those too.

Comment by Andreas Dilger [ 30/Sep/15 ]

How hard would it be in the error handler (for -ENOSPC, or whatever else) to unlink the local name entry on the master and do a best effort to remove the remote directories? No huge loss if the remote entries are leaked, but it doesn't make sense to return success creating a striped directory that isn't actually usable.

Comment by Di Wang [ 30/Sep/15 ]

"How hard would it be in the error handler (for -ENOSPC, or whatever else) to unlink the local name entry on the master and do a best effort to remove the remote directories?"

Oh, lfs mkdir will return an error in this case, because the remote transaction will fail, and MDD can still track this error and return it to the client. But it will leave a corrupt striped directory on the server side if we do not do anything there. Right now, the solution is that the user can delete this striped directory manually with lfs rm_entry. See patch http://review.whamcloud.com/16677.

Comment by Andreas Dilger [ 01/Oct/15 ]

Why not try to unlink the name on the master MDT and the remote slaves if there is an error? Surely that is better than waiting for the client to do it? Trying to clean up and failing is no worse than not trying at all and leaving a broken directory behind.
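
The best-effort cleanup suggested here can be sketched as a toy model. This is not Lustre code: `ToyMDT`, `create_striped_dir`, and `remove` are hypothetical names, and the real error path would have to deal with changelog records and update logs that this sketch ignores.

```python
import errno


class ToyMDT:
    """Hypothetical stand-in for an MDT; not a Lustre data structure."""

    def __init__(self, name, log_space):
        self.name = name
        self.log_space = log_space
        self.entries = set()

    def apply(self, entry):
        if self.log_space <= 0:
            raise OSError(errno.ENOSPC, f"no space for update log on {self.name}")
        self.log_space -= 1
        self.entries.add(entry)

    def remove(self, entry):
        # Never fails in this toy; a real removal could fail as well.
        self.entries.discard(entry)


def create_striped_dir(master, remotes, name):
    master.apply(name)
    created = []  # remote stripes that were created before the failure
    try:
        for i, mdt in enumerate(remotes, start=1):
            stripe = f"{name}:stripe{i}"
            mdt.apply(stripe)
            created.append((mdt, stripe))
    except OSError:
        # Best effort: undo the remote stripes that did succeed.
        for mdt, stripe in created:
            try:
                mdt.remove(stripe)
            except OSError:
                pass  # a leaked remote object is tolerated
        master.remove(name)  # unlink the local name so no broken directory is visible
        raise                # the caller still gets the original error


master = ToyMDT("MDT0000", log_space=10)
good = ToyMDT("MDT0001", log_space=5)
full = ToyMDT("MDT0002", log_space=0)
try:
    create_striped_dir(master, [good, full], "dir1")
    result = 0
except OSError as exc:
    result = exc.errno
```

Tracking which remote stripes were actually created covers the case raised earlier in the thread, where a failure on one target does not mean the others failed.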

Comment by Andreas Dilger [ 26/Oct/15 ]

This will be fixed by LU-7230.
