[LU-4002] HSM restore vs unlink deadlock - Whamcloud Community JIRA

Details

Type: Technical task
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.5.0
Affects Version/s: Lustre 2.5.0
Labels:
- HSM

Rank (Obsolete):
10714

Description

In commit 23c197908902183d5f88d3f431da6cde9c290e07 (LU-3811 hsm: handle file ownership and timestamps), I added a stat() of the file being restored to the copytool's (CT's) restore path. This ensures that the volatile file is given the correct ownership and timestamps before the restore, which is required for the layout swap to succeed. However, it introduces a potential deadlock with unlink() and other operations. Consider the following sequence of operations on a single file:

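1. A client sends a restore request; the coordinator (CDT) takes and holds an EX LAYOUT lock.
2. A client sends unlink; the handler sleeps waiting for an EX FULL lock.
3. The CDT sends the restore action to the CT.
4. The CT begins the restore and sends a getattr (from stat()); the handler sleeps waiting for a PR LOOKUP,UPDATE,PERM lock.

The getattr's PR request does not overlap the CDT's LAYOUT lock, but it conflicts with the unlink's queued EX FULL request and waits behind it. Since the CDT releases its LAYOUT lock only when the restore completes, none of the three can make progress.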
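The wait cycle can be reproduced with a toy model of inodebits locking. This is a minimal Python sketch under stated assumptions: two requests conflict only when their modes are incompatible (only EX and PR are modelled) and their bit sets overlap, and a new request also waits behind any conflicting request already queued (FIFO fairness), which is what leaves the CT's PR request stuck behind unlink. The bit values, `Resource`, `conflicts()`, `has_cycle()` and the `waits_for` graph are illustrative, not Lustre's LDLM code.

```python
from collections import namedtuple

# Illustrative inodebits values (modelled on Lustre's MDS_INODELOCK_* bits).
LOOKUP, UPDATE, OPEN, LAYOUT, PERM = 0x01, 0x02, 0x04, 0x08, 0x10
FULL = LOOKUP | UPDATE | OPEN | LAYOUT | PERM

Lock = namedtuple("Lock", "owner mode bits")


def conflicts(req, other):
    """Toy inodebits rule: incompatible modes AND overlapping bits."""
    # Only EX and PR are modelled: PR/PR is compatible, anything with EX is not.
    modes_clash = req.mode == "EX" or other.mode == "EX"
    return modes_clash and bool(req.bits & other.bits)


class Resource:
    """One inode's lock queues (granted + waiting), FIFO for fairness."""

    def __init__(self):
        self.granted = []
        self.waiting = []

    def enqueue(self, req):
        # A new request waits if it conflicts with a granted lock or with an
        # earlier waiting request; returns the owners it is blocked on.
        blockers = [held.owner for held in self.granted + self.waiting
                    if conflicts(req, held)]
        (self.waiting if blockers else self.granted).append(req)
        return blockers


def has_cycle(graph):
    """Depth-first search for a cycle in a wait-for graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(visit(n) for n in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in graph)


res = Resource()
res.enqueue(Lock("CDT", "EX", LAYOUT))                               # step 1: granted
unlink_blockers = res.enqueue(Lock("unlink", "EX", FULL))            # step 2
ct_blockers = res.enqueue(Lock("CT", "PR", LOOKUP | UPDATE | PERM))  # step 4

# The CDT keeps its LAYOUT lock until the CT finishes the restore.
waits_for = {"unlink": unlink_blockers, "CT": ct_blockers, "CDT": ["CT"]}
print(unlink_blockers, ct_blockers, has_cycle(waits_for))
# prints: ['CDT'] ['unlink'] True
```

Note that the CT's request never conflicts with the CDT's lock directly in this model; the cycle only closes because of the queued unlink request.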

A similar deadlock occurs with rename-onto (a rename whose target is the file being restored).

I think the simplest way out of this mess would be to lock fewer bits in the unlink handler. Can anyone say why unlink should invalidate the cached layout? An open, unlinked file is still valid for I/O.
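The proposal can be checked with a simple inodebits conflict rule (modes incompatible and bits overlapping). This is a minimal sketch, assuming a hypothetical `UNLINK_BITS` that drops LAYOUT from the unlink handler's request; it only shows that such a request no longer conflicts with the CDT's EX LAYOUT lock, and says nothing about which bits the real handler ended up taking. The bit values and helper names are illustrative.

```python
# Illustrative inodebits values (modelled on Lustre's MDS_INODELOCK_* bits).
LOOKUP, UPDATE, OPEN, LAYOUT, PERM = 0x01, 0x02, 0x04, 0x08, 0x10
FULL = LOOKUP | UPDATE | OPEN | LAYOUT | PERM

# Hypothetical bit set for the unlink handler: everything except LAYOUT.
UNLINK_BITS = FULL & ~LAYOUT


def conflicts(req_mode, req_bits, held_mode, held_bits):
    """Toy rule: incompatible modes (only EX/PR modelled) AND overlapping bits."""
    modes_clash = req_mode == "EX" or held_mode == "EX"
    return modes_clash and bool(req_bits & held_bits)


cdt_lock = ("EX", LAYOUT)  # held by the CDT for the whole restore

print(conflicts("EX", FULL, *cdt_lock))         # True: unlink must wait
print(conflicts("EX", UNLINK_BITS, *cdt_lock))  # False: unlink is granted at once
```

In this model the CT's PR getattr may still wait for the granted unlink lock, but that wait ends when unlink completes and releases it, so no cycle forms.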

Issue Links

is related to
- LU-4053 client leaking objects/locks during IO (Resolved)
- LU-4727 Lhsmtool_posix process stuck in ll_layout_refresh() when restoring (Resolved)

People

Assignee: John Hammond

Reporter: John Hammond

Votes: 0

Watchers: 6

Dates

Created: 24/Sep/13 6:41 PM

Updated: 25/Jan/22 8:56 PM

Resolved: 09/Oct/13 8:22 AM