[LU-9341] PFL: append should not instantiate full layout

    Description

      Appending to a PFL file will cause all layout components to be instantiated because it isn't possible to know what the ending offset is at the time the write is started.

      It would be better to avoid this, potentially by locking and instantiating a larger (but not gigantic) range beyond the current EOF and, if that fails, retrying the layout intent. The client must currently be in charge of locking the file during append, so it should know at write time how much of the file to instantiate, and it could retry.
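As a rough illustration of the idea (a toy model, not Lustre code), the sketch below treats a PFL layout as a list of extents and instantiates only the components that an append of known length would touch. The `Component` class, `instantiate_for_append`, the `slack` parameter, and the example extents are all assumptions made for this sketch; the retry on failure is not modeled.

```python
from dataclasses import dataclass

EOF_EXTENT = 2**64 - 1  # illustrative stand-in for "extends to end of file"
MiB = 1 << 20


@dataclass
class Component:
    start: int                  # inclusive start offset, in bytes
    end: int                    # exclusive end offset, in bytes
    instantiated: bool = False


def components_for_range(layout, start, length):
    """Return the components that overlap [start, start + length)."""
    end = start + length
    return [c for c in layout if c.start < end and start < c.end]


def instantiate_for_append(layout, eof, length, slack=0):
    """Instantiate only what an append of `length` bytes at `eof` touches,
    plus an optional `slack` of extra bytes beyond the write (hypothetical)."""
    needed = components_for_range(layout, eof, length + slack)
    for c in needed:
        c.instantiated = True
    return needed


# Hypothetical three-component layout: [0, 4M), [4M, 64M), [64M, EOF).
layout = [
    Component(0, 4 * MiB, instantiated=True),
    Component(4 * MiB, 64 * MiB),
    Component(64 * MiB, EOF_EXTENT),
]

# A 2 MiB append at EOF = 3 MiB crosses into the second component only.
touched = instantiate_for_append(layout, eof=3 * MiB, length=2 * MiB)
```

In this model the last component stays uninstantiated, which is the outcome the description asks for. A real client would still have to handle the case where EOF moved before the locks were granted.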

          Activity

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37007/
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: d0767ae660e7662215a07df83dbf784bb5fa6eb6

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37007
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f377c3a34e73acfe3f74ee262f58086e0c4a2992

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36935/
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 6341522cc8088a367cd156a6f284823c69b92f7b

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36935
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7939fd638606837b2552eb73cfb49d9ccf0fc7d8

            dauchy Nathan Dauchy (Inactive) added a comment - Can this get pulled back into b2_10 and/or b2_12 for LTS?

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35617/
            Subject: LU-9341 lod: Add special O_APPEND striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e2ac6e1eaa108eef3493837e9bd881629582ea1d

            pfarrell Patrick Farrell (Inactive) added a comment - Opened LU-12738 to track remaining work; this can be closed once https://review.whamcloud.com/#/c/35617/ lands.

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35611/
            Subject: LU-9341 utils: fix lfs find for composite files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 72479a52be5f77f601d8234d957f5d6176edf6e8

            spitzcor and vitaly_fertman of Cray indicated they're still interested in taking aim directly at the problem, and asked me to break down the approach I had most recently settled on.  I'm still a fan of the current "special layout for O_APPEND" patch, but this fix could be complementary (and it may not be available for a while; the current plan is not too bad, but it's not trivial either).  So I'll do that here.

            There are two basic problems/constraints we are trying to meet.

            Append writes must be atomic, meaning two things:
            1. No "write tearing".  Every byte of a particular O_APPEND write must be adjacent, no gaps.  (This is true of any write, but it is more relevant for O_APPEND writes, which try to go to EOF, so if we are not careful, a restarted append write will restart at a new EOF.)

            2. They must not be mixed with other O_APPEND writes.  If two O_APPEND writes, A and B, are racing, either AB or BA is valid, but it is not valid for any part of A or B to overwrite the other one.  This is of course not true of regular writes, which are started at a specific offset.  This means that if a regular write is racing with an O_APPEND write, it can overwrite part of the O_APPEND write.  This is acceptable.

            The first problem (write tearing) is solved by getting the file size before starting the write and using the same one throughout the O_APPEND write.  (The current code checks the file size repeatedly, I believe for every iteration of cl_io_loop.)  This means that if another write races with our O_APPEND write we will not 'tear' the O_APPEND write by moving to a new EOF in the middle of the write.  Note that we must also retain the size across i/o restarts, for the case where we have to update the layout in the middle of an O_APPEND write.
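The difference between re-sampling EOF on every loop iteration and fixing it once can be shown with a toy model (plain Python, not Lustre code). Here a `bytearray` stands in for the file, the chunked loop stands in for the iterations of the I/O loop, and `racing_write` is a hypothetical writer that extends the file after the first chunk; all names are assumptions made for this sketch.

```python
def append_in_chunks(file, data, chunk, resample_eof, interloper=None):
    """Append `data` to `file` (a bytearray) in `chunk`-sized pieces.

    resample_eof=True  re-reads EOF before every piece.
    resample_eof=False samples EOF once and reuses it for the whole write.
    `interloper`, if given, runs once after the first piece to model a race.
    """
    start = len(file)          # EOF sampled once, before the write begins
    done = 0
    first = True
    while done < len(data):
        offset = len(file) if resample_eof else start + done
        piece = data[done:done + chunk]
        end = offset + len(piece)
        if end > len(file):    # grow the toy file so the slice fits
            file.extend(b"\0" * (end - len(file)))
        file[offset:end] = piece
        done += len(piece)
        if first and interloper is not None:
            interloper(file)
        first = False
    return start


def racing_write(file):
    """Hypothetical racing writer that lands two bytes at the current EOF."""
    file.extend(b"xx")


torn = bytearray(b"base")
append_in_chunks(torn, b"AAAABBBB", 4, resample_eof=True, interloper=racing_write)

intact = bytearray(b"base")
append_in_chunks(intact, b"AAAABBBB", 4, resample_eof=False, interloper=racing_write)

print(bytes(torn))    # b'baseAAAAxxBBBB'
print(bytes(intact))  # b'baseAAAABBBB'
```

With the fixed offset the append stays contiguous, and the racing bytes are overwritten, which matches the case described above as acceptable for a regular write racing with an append. Keeping `start` for the whole call models retaining the size across I/O restarts.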

            The second problem requires that we allow only one O_APPEND write at a time.  There are probably a few ways to solve this, but I think the correct way (it is definitely the simplest way) is to add another bit to the MDT IBITS lock, an O_APPEND bit.  All O_APPEND writes must ask for this bit in PW mode before starting to write.  (We cannot use the layout lock for this exclusion because the server revokes our layout lock when we have to update the layout.)  This lock must be held across i/o restarts, so it should be taken before the layout lock.  (Or it could possibly be taken with the layout lock?  We have to be careful about ordering/pairing issues with the layout and append bits; I have not thought about this carefully yet.)

            Note that excluding O_APPEND writes does require excluding multiple O_APPEND writes on the same node as well.  This can be done using the local tree_lock in the write path, locking it from 0 to EOF.
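The exclusion requirement (AB or BA, never interleaved) can be sketched with an ordinary mutex. This is only an analogy: `threading.Lock` stands in for the proposed append exclusion, covering both the EOF sample and every chunk of the write, and the `AppendFile` class and record contents are assumptions made for this sketch.

```python
import threading
from collections import Counter

RECORD_LEN = 6


class AppendFile:
    """Toy in-memory file; `_append_lock` stands in for the append exclusion."""

    def __init__(self):
        self.data = bytearray()
        self._append_lock = threading.Lock()

    def append(self, record, chunk=2):
        # Hold the lock from the EOF sample until the last chunk is written,
        # so no other appender can land bytes inside this record.
        with self._append_lock:
            start = len(self.data)
            for done in range(0, len(record), chunk):
                piece = record[done:done + chunk]
                offset = start + done
                self.data[offset:offset + len(piece)] = piece
        return start


def worker(f, byte, count):
    for _ in range(count):
        f.append(bytes([byte]) * RECORD_LEN)


f = AppendFile()
threads = [threading.Thread(target=worker, args=(f, b, 50)) for b in b"ABCD"]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every RECORD_LEN-byte slot should hold exactly one writer's bytes.
records = [bytes(f.data[i:i + RECORD_LEN]) for i in range(0, len(f.data), RECORD_LEN)]
counts = Counter(records)
```

The order of records depends on thread scheduling, but each record is contiguous and unmixed, which is the property the append bit and the local tree_lock are meant to provide between nodes and within a node respectively.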

            This combination of things should make it possible to avoid instantiating the full layout and locking every object.  It's a fair bit of work.

            Note of course that this is a split client/server solution, so it will need a compatibility flag so the client knows it can use the new O_APPEND lock bit.  The good news is that this should interop safely with older clients: the older clients simply instantiate and lock everything for O_APPEND, which will give the correct exclusion vs newer clients.
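A minimal sketch of the compatibility check, assuming the feature is advertised as a bit in a flags word exchanged between client and server. `FLAG_APPEND_LOCK` and the strategy names are hypothetical and do not correspond to real Lustre connect flags.

```python
# Hypothetical feature bit; not a real Lustre connect flag.
FLAG_APPEND_LOCK = 1 << 0


def choose_append_strategy(client_flags, server_flags):
    """Use the append bit only when both sides advertise support for it;
    otherwise fall back to instantiating and locking everything."""
    if client_flags & server_flags & FLAG_APPEND_LOCK:
        return "append-bit"
    return "lock-everything"


new_pair = choose_append_strategy(FLAG_APPEND_LOCK, FLAG_APPEND_LOCK)
old_server = choose_append_strategy(FLAG_APPEND_LOCK, 0)
old_client = choose_append_strategy(0, FLAG_APPEND_LOCK)
```

Only the case where both sides set the bit selects the new path; either side lacking it falls back to the existing behavior.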

            pfarrell Patrick Farrell (Inactive) added the comment above.

            pfarrell Patrick Farrell (Inactive) added a comment - dauchy, OK, that seems like a reasonable request.  adilger, let me know if you have any objections, but I'll see about adding it to the current patch.

            People

              pfarrell Patrick Farrell (Inactive)
              adilger Andreas Dilger