HSM _not only_ small fixes and to do list goes here
(LU-3647)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Technical task | Priority: | Blocker |
| Reporter: | John Hammond | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Issue Links: |
|
| Rank (Obsolete): | 10714 |
| Description |
|
In 23c197908902183d5f88d3f431da6cde9c290e07
We have a similar deadlock with rename-onto. I think the simplest way out of this mess would be to lock fewer bits in the unlink handler. Can anyone say why unlink should invalidate the cached layout? An open unlinked file is still valid for IO. |
| Comments |
| Comment by Oleg Drokin [ 25/Sep/13 ] |
|
I would agree that unlink does not need to invalidate the layout. |
| Comment by Jinshan Xiong (Inactive) [ 25/Sep/13 ] |
|
Is it possible for unlink to grab LOOKUP only? The only side effect I can think of now is that some cached locks on the client side won't be revoked. But this can be easily fixed. |
| Comment by John Hammond [ 25/Sep/13 ] |
|
In general it won't be enough just to remove LAYOUT from unlink. Assume restore takes EX LAYOUT, and some other operation tries to take PR LAYOUT | LOOKUP | ... and waits. Then unlink tries to take EX LOOKUP | .... Then stat (from the CT) tries to take LOOKUP | UPDATE | ... and deadlocks. As I keep saying, any operation that requires two locks is dangerous. Note that unlink (and rename-onto) should take more than just LOOKUP, since it modifies the link count and timestamps. Could we do something sane-seeming, like having the MDT send the attributes that the CT is getting from stat? |
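
The wait chain described above can be sketched with a toy model. This is not LDLM code: the bit values, the `Req` class, and the FIFO rule (a request waits behind any earlier conflicting request, whether granted or still waiting) are simplifying assumptions made for illustration, as is the PR mode used for the stat request. The final edge, restore waiting on stat, stands for the assumption that the restore cannot release EX LAYOUT until the copytool's stat returns.

```python
from dataclasses import dataclass

# Illustrative bit values only; these are not Lustre's actual constants.
LOOKUP, UPDATE, LAYOUT = 1, 2, 4


@dataclass
class Req:
    owner: str
    mode: str  # "EX" or "PR"
    bits: int


def conflicts(a, b):
    """Two requests conflict if they share a bit and at least one is EX."""
    return bool(a.bits & b.bits) and (a.mode == "EX" or b.mode == "EX")


def blockers(queue):
    """Map each owner to the earlier owners it must wait behind (FIFO assumption)."""
    return {
        req.owner: [q.owner for q in queue[:i] if conflicts(req, q)]
        for i, req in enumerate(queue)
    }


def find_cycle(waits, start):
    """Follow the first blocker of each node; return the cycle through start, if any."""
    path, node, seen = [start], start, {start}
    while waits[node]:
        node = waits[node][0]
        if node == start:
            return path + [start]
        if node in seen:
            return None
        seen.add(node)
        path.append(node)
    return None


# Requests in arrival order, following the scenario in the comment above.
queue = [
    Req("restore", "EX", LAYOUT),
    Req("other", "PR", LAYOUT | LOOKUP),
    Req("unlink", "EX", LOOKUP | UPDATE),
    Req("stat", "PR", LOOKUP | UPDATE),  # PR mode is an assumption
]
waits = blockers(queue)
# Assumption: restore holds its lock until the copytool's stat completes.
waits["restore"].append("stat")

print(find_cycle(waits, "restore"))
```

In this model, dropping LAYOUT from the unlink request does not break the cycle, because unlink still conflicts with the waiting PR request on LOOKUP.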
| Comment by John Hammond [ 28/Sep/13 ] |
|
Please see http://review.whamcloud.com/7792. |
| Comment by John Hammond [ 01/Oct/13 ] |
|
I suggest that this be considered a blocker for 2.5.0. It is easy to imagine situations where users will trigger this deadlock. Faced with a long-running restore on a file (which to the user may just seem like an unresponsive console or FS), the user may log out, log in, and unlink the released file (since perhaps it is easy to regenerate anyway). |
| Comment by Jodi Levi (Inactive) [ 03/Oct/13 ] |
|
Patch landed to Master. If more work is needed in this ticket, please let me know and I will reopen this ticket. |
| Comment by Andreas Dilger [ 09/Oct/13 ] |
|
Per recent comments in Is there any way to know in advance if HSM is processing this file and not try to revoke the layout lock in this case? |
| Comment by Andreas Dilger [ 09/Oct/13 ] |
|
Might have jumped the gun on this. Closing it again until we know it is the culprit. |
| Comment by Jinshan Xiong (Inactive) [ 09/Oct/13 ] |
|
Hi Andreas, this is a known issue, but we still decided to land the patch because the deadlock issue is more severe. The problem is that when the LOOKUP lock is revoked, we don't know whether this is because the file is being unlinked or renamed. However, leaving some locks in cache may not be a problem because, if the system is active, the locks will be discarded by LRU sooner or later. |
| Comment by John Hammond [ 09/Oct/13 ] |
|
I believe that we could revert this patch if the HSM coordinator would send the ownership and timestamps to use on the volatile file along with the restore action. Then the copytool will not need to stat() the original file. The restoring layout swap will check that the ownerships agree, protecting us from a TOCTTOU issue. |
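
A minimal sketch of the idea above, using ordinary POSIX calls rather than any Lustre or copytool API. `RestoreAction`, `apply_attrs`, and `ownership_agrees` are hypothetical names introduced here for illustration; the real restore action format and the real layout swap check are not shown in this ticket.

```python
import os
from dataclasses import dataclass


@dataclass
class RestoreAction:
    """Hypothetical payload: attributes the coordinator would send with the restore."""
    uid: int
    gid: int
    atime: float
    mtime: float


def apply_attrs(volatile_path, action):
    """Set attributes on the volatile file from the action, without stat()ing the original."""
    st = os.stat(volatile_path)
    if (st.st_uid, st.st_gid) != (action.uid, action.gid):
        # Changing to another owner generally requires privilege.
        os.chown(volatile_path, action.uid, action.gid)
    os.utime(volatile_path, (action.atime, action.mtime))


def ownership_agrees(volatile_path, action):
    """Stand-in for the ownership comparison the layout swap would perform."""
    st = os.stat(volatile_path)
    return (st.st_uid, st.st_gid) == (action.uid, action.gid)
```

The point of the design is that the copytool only touches the volatile file, so it never needs a LOOKUP | UPDATE lock on the file being restored.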
| Comment by John Hammond [ 11/Oct/13 ] |
|
Please see http://review.whamcloud.com/7927 for a sketch of this approach. |
| Comment by John Hammond [ 25/Jan/22 ] |
|
Note to self. After http://review.whamcloud.com/13750 ( |