Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10782

Enable tiny write append for singly striped non-composite file

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Append and tiny writes are incompatible in general (see LU-10681) without doing a lot of expensive LDLM lock requests, which defeat the point of tiny writes, but there is one exception.

      Singly striped files with simple (non-composite) layouts can use it safely.

      Tiny writes depend on finding an already dirty page and using its existence as a guarantee that various conditions have been met. Unfortunately, a page can be created without locks appropriate for protecting file size as required by append, so we cannot append safely in general.

      However, a singly striped file presents a special case. When appending to a singly striped file, we must take an LDLM lock on the whole file. If a partial page is created as part of this, we could mark such a page as "created by append". This page is protected by an append-appropriate LDLM lock, and as long as it exists, we know the LDLM lock still exists, so size is "owned" by this client.

      This requires knowing that the file meets the criteria. For this purpose, a flag can be set in the inode indicating that it meets the striping criteria and that the most recent i/o should have created an appropriate page. This flag is only advisory - If it is present, we will attempt to find an appropriate page, if it is not, we won't bother. Correctness depends on checking the inode flag to confirm the layout is simple, and then checking the page level flag to verify the page was indeed created by an append operation.

      I believe it is not possible to convert a simple layout to a composite one without removing the inode, but if that's not the case, then it should be a simple matter of clearing the inode flag. This will require some locking that is not done in the current version of the patch.

      Attachments

        Issue Links

          Activity

            [LU-10782] Enable tiny write append for singly striped non-composite file

            Another possibility is to start using "MDS_INODELOCK_DOM" to protect the append state (even if there is no DoM component on the file)? That would avoid ping-pong of the OST DLM locks across OSTs, but still serialize the writers. It would need agreement across all clients writing the file, so there might need to be a connect flag check if an old client writing to the file doesn't understand this feature, and then the MDS would revoke MDS_INODELOCK_DOM from the client on behalf of the old client (which would need to get all of the OST locks if it was doing an append).

            Just an idea I'm throwing into the ring, it is not fully baked, but might give us a path to something reasonable here.

            adilger Andreas Dilger added a comment - Another possibility is to start using " MDS_INODELOCK_DOM " to protect the append state (even if there is no DoM component on the file)? That would avoid ping-pong of the OST DLM locks across OSTs, but still serialize the writers. It would need agreement across all clients writing the file, so there might need to be a connect flag check if an old client writing to the file doesn't understand this feature, and then the MDS would revoke MDS_INODELOCK_DOM from the client on behalf of the old client (which would need to get all of the OST locks if it was doing an append). Just an idea I'm throwing into the ring, it is not fully baked, but might give us a path to something reasonable here.

            With the advent of patch: https://review.whamcloud.com/35617 "LU-9341 lod: Add special O_APPEND striping" it is likely that most files using O_APPEND will now be 1-stripe plain files, so this optimization may be worthwhile to look at again.

            As for the question of append vs. migrate, I think that should be a non-issue. The same problem of racing write vs. layout change exists for non-appending writes and is handled by migrate by checking whether the file data version (== hash of last transaction each object in the layout was modified in) has changed from the start of the migration to the end. If there was any change to an object, the transno stored with the object will be updated, and the data version will be changed, and the migration will fail during layout swap.

            adilger Andreas Dilger added a comment - With the advent of patch: https://review.whamcloud.com/35617 " LU-9341 lod: Add special O_APPEND striping " it is likely that most files using O_APPEND will now be 1-stripe plain files, so this optimization may be worthwhile to look at again. As for the question of append vs. migrate, I think that should be a non-issue. The same problem of racing write vs. layout change exists for non-appending writes and is handled by migrate by checking whether the file data version (== hash of last transaction each object in the layout was modified in) has changed from the start of the migration to the end. If there was any change to an object, the transno stored with the object will be updated, and the data version will be changed, and the migration will fail during layout swap.

            Ah, I see, thanks for pointing that out... So we'd tie the append bit to layout lock cancellation. all right.

            Uh, what, if anything, would happen to dirty data on the original layout when the new one is brought in? I believe migrate uses group locks, with the intention of excluding other i/o. So I'm thinking that if there were such data, it would just be lost - but there's no way for another process to get dirty data there unless I wrote my own thing that did migrate without a group lock. Is that correct?

            paf Patrick Farrell (Inactive) added a comment - Ah, I see, thanks for pointing that out... So we'd tie the append bit to layout lock cancellation. all right. Uh, what, if anything, would happen to dirty data on the original layout when the new one is brought in? I believe migrate uses group locks, with the intention of excluding other i/o. So I'm thinking that if there were such data, it would just be lost - but there's no way for another process to get dirty data there unless I wrote my own thing that did migrate without a group lock. Is that correct?

            I believe it is not possible to convert a simple layout to a composite one without removing the inode, but if that's not the case, then it should be a simple matter of clearing the inode flag. This will require some locking that is not done in the current version of the patch.

            The "lfs migrate" code will do an atomic layout swap with a "victim" inode that got a copy of the file data, so the file layout may change without the inode being removed. That said, the layout lock would be revoked in that case, so you could just clear the "append safe" flag when the layout bit is cancelled, and re-set the "append safe" flag when the client gets the layout lock again and the layout has not changed.

            adilger Andreas Dilger added a comment - I believe it is not possible to convert a simple layout to a composite one without removing the inode, but if that's not the case, then it should be a simple matter of clearing the inode flag. This will require some locking that is not done in the current version of the patch. The " lfs migrate " code will do an atomic layout swap with a "victim" inode that got a copy of the file data, so the file layout may change without the inode being removed. That said, the layout lock would be revoked in that case, so you could just clear the "append safe" flag when the layout bit is cancelled, and re-set the "append safe" flag when the client gets the layout lock again and the layout has not changed.

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/31553
            Subject: LU-10782 llite: Simple layout append tiny write
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ebd325799c96b783b737b57fcafb3250c2ac94fc

            gerrit Gerrit Updater added a comment - Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/31553 Subject: LU-10782 llite: Simple layout append tiny write Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ebd325799c96b783b737b57fcafb3250c2ac94fc

            People

              wc-triage WC Triage
              paf Patrick Farrell (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: