Lustre / LU-2613

opening and closing file can generate 'unreclaimable slab' space

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.1.3, Lustre 2.1.4

    Description

      We have a lot of nodes with a large amount of unreclaimable memory (over 4 GB). Whatever we try (manually shrinking the caches, clearing LRU locks, ...), the memory cannot be recovered. The only way to get the memory back is to unmount the Lustre filesystem.

      After some troubleshooting, I was able to write a small reproducer that simply open(2)s and then close(2)s files with O_RDWR (the reproducer opens thousands of files to emphasize the issue).

      Two programs are attached:

      • gentree.c (cc -o gentree gentree.c -lpthread) generates a tree of known files (so reproducer.c does not need readdir)
      • reproducer.c (cc -o reproducer reproducer.c -lpthread) reproduces the issue.
        The BASE_DIR macro has to be adjusted to the local cluster configuration (it should name a directory located on a Lustre filesystem).

      There is no link between the two phases, since rebooting the client between gentree and reproducer does not avoid the problem. Running gentree alone (which opens as many files as reproducer) does not show the issue.

      Attachments

        1. gentree.c
          3 kB
        2. logs_01.tar.gz
          7 kB
        3. reproducer.c
          2 kB

        Issue Links

          Activity

            [LU-2613] opening and closing file can generate 'unreclaimable slab' space

            I'm not sure there is any good way to handle duplicated transnos in replay:

            • Client A open_creates a file with transno T1, and the child version is set to T1 (T1 is not committed);
            • Client B open_rdonly's the same file, so the returned transno should be max(versions, exp_last_committed) = T1.

            How is the server then supposed to replay the two open requests with the same transno during recovery?

            niu Niu Yawei (Inactive) added a comment

            See my earlier comment in this bug. We used to have duplicate transnos for open/close at one time in the past, or at least it was something we strongly thought about, and the client code is ready for this to happen.

            adilger Andreas Dilger added a comment

            I see, but I'm not sure whether duplicated transnos will cause trouble. I'll try to cook up a patch to find out later. Thanks, Andreas.

            niu Niu Yawei (Inactive) added a comment
            tappro Mikhail Pershin added a comment - - edited

            Andreas, that may work, yes. Besides it will be really good to get rid of mdt_empty_transno() code.

            Alexey, with your example the imp_peer_committed_transno will still be updated, because the last_committed value is returned from the server as:

            last_committed value = min(max(inode_version accessed by this export),
                                       last_committed)

            'accessed' means not just updates but open/close too.


            Andreas,

            that won't work, because if a client mounts Lustre with noatime and executes just open(O_RDONLY)+close, it will have zero in peer_last_committed, since no updates will be sent to it.

            shadow Alexey Lyashkov added a comment

            Niu, I think my proposal should work even if no changes are being made to the filesystem at all. The open transno would be inode_version, and the close transno would be max(inode_version, export last_committed), so it does not expose any newer transno to the client. This also ensures that once the files are closed, the open/close RPC transnos are < last_committed, and the RPCs will be dropped from the replay list on the client.

            adilger Andreas Dilger added a comment

            Niu,

            the window is open only when no updates exist anywhere in the cluster. In any other case the problem solves itself, but that is unlikely in a cluster where any client is active. And if the affected client goes idle after the open/close series, it will receive last_committed updates via ping.

            shadow Alexey Lyashkov added a comment

            Andreas, I think your proposal is based on the assumption that there are other clients doing disk updates in the cluster, right? What if no client is doing disk updates, or the update frequency isn't high enough? I think that's the situation described in this ticket.

            niu Niu Yawei (Inactive) added a comment

            Thinking about this further, I suspect the following will work correctly for new and old clients.

            The actual RPC open transno could be inode_version, but will have a later XID from the client, so it will sort correctly in the client replay list. The close RPC transno can be max(inode_version of inode, max inode_version accessed by this export - 1). The client RPC replay ordering should be in [transno, XID] order, so if this client also created the file, then the later open/close RPCs would still be replayed after it. When the close gets a reply, this will also naturally ensure that both the open/close RPC transnos are < last_committed, if the object create itself was committed.

            For both open and close, it should be enough to return:

                    last_committed value = min(max(inode_version accessed by this export),
                                               last_committed)
            

            so that the client never "sees" a last_committed value newer than the inode_version (i.e. the last change transaction) of any object it has accessed, but gets the actual last_committed value when the inode was changed since the most recently committed transaction.

            I don't think there is any problem with the client or server having duplicate transaction numbers, per comment in ptlrpc_retain_replayable_request():

                            /* We may have duplicate transnos if we create and then
                             * open a file, or for closes retained if to match creating
                             * opens, so use req->rq_xid as a secondary key.
                             * (See bugs 684, 685, and 428.)
                             * XXX no longer needed, but all opens need transnos!
            

            This avoids the ever-growing transaction number for open+close RPCs that do not actually affect the on-disk state (except for keeping objects open for open-but-unlinked files). This holds so long as the client replays the open after the object was first created, and the close before the current commit (which can't affect any open-unlinked state if some other client still has not sent a close RPC).

            adilger Andreas Dilger added a comment

            I've reduced the severity of this bug from Blocker to Major. Clearly, if the problem has existed since before 1.8.0 it cannot be affecting a huge number of users, though it definitely appears to be a problem with using RobinHood (or other tool) to open and close a large number of files. Also, there is a simple workaround - periodically touch any file on the client to force a transaction so that the last_committed value is updated, and the saved RPCs will be flushed.

            Presumably, Robin Hood cannot be modified to get the information it needs without the open/close (e.g. stat() instead of fstat() and similar)? That would be even less work on the part of the client. In the short term, to work around the client bug, it could also be made to modify some temporary file in the filesystem (e.g. mknod(), then periodic utimes() to generate transactions) until this problem is resolved.

            The correct long-term solution to this problem is as Mike suggested early on in this bug - to decouple the open-replay handling from the RPC replay mechanism, since it isn't really the RPC layer's job to re-open files that are no longer involved in the transactions. The RPC replay is of course correct for open(O_CREAT) that did not yet commit and/or close, but it doesn't make sense to keep non-creating open/close RPCs around after that time. We had previously discussed moving the file open-replay handling up to the llite layer in the context of "Simplified Interoperability" (http://wiki.lustre.org/images/0/0b/Simplified_InteropRecovery.pdf and https://projectlava.xyratex.com/show_bug.cgi?id=18496).

            In the short term (2.1 and 2.4) there are a few compromise solutions possible. If the server is doing other transactions, then it might make sense to return a last_committed value below the most recent "fake" transaction number. One possibility is to return last_committed = min(last_committed, inode_version - 1) in the RPC replies, so that clients don't get any state they do not need, but still depend on the recovery state of any file they have accessed.

            Alternately, at one time we discussed returning duplicate (old) transaction numbers for opens that do not create the file. This allows the files to be replayed in the proper order after recovery, but they do not change the state on disk.

            adilger Andreas Dilger added a comment

            Tappro, I'm not saying it's wrong to track this on a per-client basis; I'm just saying the client should learn about any committed transno, since that costs nothing to implement and was done in previous versions.

            I agree that this leaves a window open when there is a single client, but that is a very rare case, and we could use a cluster-wide transno as a short/medium-term solution to fix 2.1 and 2.4.

            But I will be happy if you are able to deliver a better fix in 2.5 without losing compatibility with older clients.

            shadow Alexey Lyashkov added a comment

            People

              hongchao.zhang Hongchao Zhang
              louveta Alexandre Louvet (Inactive)
              Votes: 1
              Watchers: 31
