[LU-10782] Enable tiny write append for singly striped non-composite file Created: 06/Mar/18 Updated: 08/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Append and tiny writes are incompatible in general (see [...]). Singly striped files with simple (non-composite) layouts can use them together safely.

Tiny writes depend on finding an already dirty page and using its existence as a guarantee that various conditions have been met. Unfortunately, a page can be created without the locks needed to protect file size as append requires, so we cannot append safely in general.

However, a singly striped file presents a special case. When appending to a singly striped file, we must take an LDLM lock on the whole file. If a partial page is created as part of this, we could mark such a page as "created by append". This page is protected by an append-appropriate LDLM lock, and as long as it exists, we know the LDLM lock still exists, so size is "owned" by this client.

This requires knowing that the file meets the criteria. For this purpose, a flag can be set in the inode indicating that it meets the striping criteria and that the most recent i/o should have created an appropriate page. This flag is only advisory: if it is present, we will attempt to find an appropriate page; if it is not, we won't bother. Correctness depends on checking the inode flag to confirm the layout is simple, and then checking the page-level flag to verify the page was indeed created by an append operation.

I believe it is not possible to convert a simple layout to a composite one without removing the inode, but if that's not the case, then it should be a simple matter of clearing the inode flag. This will require some locking that is not done in the current version of the patch. |
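
The two-level check described above (advisory inode flag first, then the page-level flag) can be modeled roughly as follows. This is only a toy sketch in Python, not the patch itself: `Inode`, `Page`, `append_safe`, `created_by_append`, `PAGE_SIZE`, and the rule that a tiny write must fit in the existing tail page are all assumptions made for illustration, and `slow_append` merely stands in for the real path that takes the whole-file LDLM lock.

```python
from dataclasses import dataclass, field
from typing import Dict

PAGE_SIZE = 4096  # assumed page size for the model


@dataclass
class Page:
    dirty: bool = False
    created_by_append: bool = False  # hypothetical page-level flag


@dataclass
class Inode:
    stripe_count: int = 1
    composite: bool = False
    size: int = 0
    append_safe: bool = False  # hypothetical advisory inode flag
    pages: Dict[int, Page] = field(default_factory=dict)


def layout_is_simple(inode: Inode) -> bool:
    return inode.stripe_count == 1 and not inode.composite


def slow_append(inode: Inode, nbytes: int) -> str:
    """Stand-in for the normal append path (whole-file lock held)."""
    inode.size += nbytes
    inode.append_safe = False
    if layout_is_simple(inode) and inode.size % PAGE_SIZE != 0:
        # A partial tail page was created under the append lock: mark it.
        page = inode.pages.setdefault(inode.size // PAGE_SIZE, Page())
        page.dirty = True
        page.created_by_append = True
        inode.append_safe = True
    return "slow"


def try_tiny_append(inode: Inode, nbytes: int) -> bool:
    """Fast path: succeeds only if both flags check out."""
    if not inode.append_safe or not layout_is_simple(inode):
        return False  # advisory flag absent: do not even look for a page
    page = inode.pages.get(inode.size // PAGE_SIZE)
    if page is None or not page.dirty or not page.created_by_append:
        return False  # the page is gone, so the lock guarantee is gone too
    if inode.size % PAGE_SIZE + nbytes > PAGE_SIZE:
        return False  # assumed: a tiny write must fit in the tail page
    inode.size += nbytes
    return True


def append(inode: Inode, nbytes: int) -> str:
    return "tiny" if try_tiny_append(inode, nbytes) else slow_append(inode, nbytes)
```

In this model the inode flag alone never authorizes the fast path; if the marked page has been flushed, the append falls back to the slow path even though the flag is still set.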
| Comments |
| Comment by Gerrit Updater [ 06/Mar/18 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/31553 |
| Comment by Andreas Dilger [ 09/Mar/18 ] |
The "lfs migrate" code will do an atomic layout swap with a "victim" inode that got a copy of the file data, so the file layout may change without the inode being removed. That said, the layout lock would be revoked in that case, so you could just clear the "append safe" flag when the layout bit is cancelled, and re-set the "append safe" flag when the client gets the layout lock again and the layout has not changed. |
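
A rough model of the suggestion above, tying the "append safe" flag to the layout lock: clear it on cancellation, and re-set it on re-grant only if the layout is unchanged. This is a sketch under assumptions; `ClientInode`, `layout_gen`, `flag_gen`, and the callback names are hypothetical and do not correspond to actual Lustre client code.

```python
from dataclasses import dataclass


@dataclass
class ClientInode:
    layout_gen: int            # assumed: a generation number identifying the layout
    simple_layout: bool
    has_layout_lock: bool = True
    append_safe: bool = False  # hypothetical "append safe" flag
    flag_gen: int = -1         # layout generation at which the flag was last set


def mark_append_safe(inode: ClientInode) -> None:
    """Set the flag after an append, if the layout qualifies."""
    if inode.has_layout_lock and inode.simple_layout:
        inode.append_safe = True
        inode.flag_gen = inode.layout_gen


def on_layout_lock_cancel(inode: ClientInode) -> None:
    """The layout may change while the lock is not held: drop the flag."""
    inode.has_layout_lock = False
    inode.append_safe = False


def on_layout_lock_granted(inode: ClientInode, new_gen: int, new_simple: bool) -> None:
    """Re-set the flag only if the layout is the one the flag was set under."""
    unchanged = new_gen == inode.flag_gen
    inode.layout_gen = new_gen
    inode.simple_layout = new_simple
    inode.has_layout_lock = True
    inode.append_safe = unchanged and new_simple
```

Comparing a generation number is one possible way to decide that "the layout has not changed"; the comment does not say how that comparison would actually be made.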
| Comment by Patrick Farrell (Inactive) [ 09/Mar/18 ] |
|
Ah, I see, thanks for pointing that out... So we'd tie the append bit to layout lock cancellation. All right. Uh, what, if anything, would happen to dirty data on the original layout when the new one is brought in? I believe migrate uses group locks, with the intention of excluding other i/o. So I'm thinking that if there were such data, it would just be lost - but there's no way for another process to get dirty data there unless I wrote my own thing that did migrate without a group lock. Is that correct? |
| Comment by Andreas Dilger [ 30/Aug/19 ] |
|
With the advent of patch: https://review.whamcloud.com/35617 "

As for the question of append vs. migrate, I think that should be a non-issue. The same problem of a write racing with a layout change exists for non-appending writes, and migrate handles it by checking whether the file data version (== hash of the last transaction in which each object in the layout was modified) has changed between the start and the end of the migration. If any object was changed, the transno stored with that object will be updated, the data version will change, and the migration will fail during the layout swap. |
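
The data-version check described above can be illustrated with a small model. This is a sketch only: `File`, `data_version`, and `migrate` are hypothetical names, and SHA-256 over the per-object transaction numbers is just a stand-in for however the real data version is computed.

```python
import hashlib
from typing import Callable, List, Optional


def data_version(transnos: List[int]) -> str:
    """Combine the last-modifying transno of every object into one value."""
    h = hashlib.sha256()
    for t in transnos:
        h.update(t.to_bytes(8, "little"))
    return h.hexdigest()


class File:
    def __init__(self, nobjects: int) -> None:
        self.transnos = [0] * nobjects  # last transno that modified each object
        self.next_transno = 1
        self.layout = "original"

    def write(self, obj_index: int) -> None:
        # Any write updates the transno stored with the object it touches.
        self.transnos[obj_index] = self.next_transno
        self.next_transno += 1


def migrate(f: File, new_layout: str,
            racing_write: Optional[Callable[[File], None]] = None) -> bool:
    start = data_version(f.transnos)
    # ... data would be copied to the "victim" file here ...
    if racing_write is not None:
        racing_write(f)  # simulate a write (appending or not) during the copy
    if data_version(f.transnos) != start:
        return False     # data changed under us: refuse the layout swap
    f.layout = new_layout
    return True
```

A racing append is caught the same way as any other racing write, since both update the transno of the object they modify.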
| Comment by Andreas Dilger [ 08/Jan/24 ] |
|
Another possibility is to start using "MDS_INODELOCK_DOM" to protect the append state (even if there is no DoM component on the file). That would avoid ping-pong of the OST DLM locks across OSTs, but still serialize the writers. It would need agreement across all clients writing the file, so there might need to be a connect flag check: if an old client writing to the file doesn't understand this feature, the MDS would revoke MDS_INODELOCK_DOM from the holding client on behalf of the old client (which would need to get all of the OST locks if it was doing an append). Just an idea I'm throwing into the ring; it is not fully baked, but it might give us a path to something reasonable here. |
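
To make the idea above a little more concrete, here is a toy model of an MDS handing out a single append-state lock and revoking it on behalf of a client that does not understand the feature. Everything here is hypothetical: `Mds`, `Client`, `supports_append_lock` (standing in for the connect flag check), and the returned strings are illustration only and say nothing about how MDS_INODELOCK_DOM is actually handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Client:
    name: str
    supports_append_lock: bool  # hypothetical stand-in for the connect flag
    holds_append_lock: bool = False


class Mds:
    def __init__(self) -> None:
        self.holder: Optional[Client] = None  # at most one holder: writers serialize

    def _revoke(self) -> None:
        if self.holder is not None:
            self.holder.holds_append_lock = False
            self.holder = None

    def request_append(self, client: Client) -> str:
        if self.holder is client:
            return "cached"      # lock already held, no new round trip needed
        self._revoke()           # revoke from the current holder, if any
        if client.supports_append_lock:
            self.holder = client
            client.holds_append_lock = True
            return "mds-lock"
        # Old client: the MDS revoked on its behalf; it takes all OST locks itself.
        return "ost-locks"
```

In this model an old client never holds the MDS lock, but its append still forces the current holder to give it up, which is the serialization the comment asks for.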