
opening and closing file can generate 'unreclaimable slab' space

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.1.3, Lustre 2.1.4
    • 3
    • 6116

    Description

      We have a lot of nodes with a large amount of unreclaimable memory (over 4GB). Whatever we try to do (manually shrinking the cache, clearing lru locks, ...) the memory can't be recovered. The only way to get the memory back is to umount the lustre filesystem.

      After some troubleshooting, I was able to write a small reproducer that just open(2)s and then close(2)s files with O_RDWR (my reproducer opens thousands of files to emphasize the issue).

      Two programs are attached:

      • gentree.c (cc -o gentree gentree.c -lpthread) to generate a tree of known files (no need to use readdir in reproducer.c)
      • reproducer.c (cc -o reproducer reproducer.c -lpthread) to reproduce the issue.
        The macro BASE_DIR has to be adjusted according to the local cluster configuration (you should provide the name of a directory located on a Lustre filesystem).

      There is no link between the 2 phases, as rebooting the client between gentree & reproducer doesn't avoid the problem. Running gentree (which opens as many files as reproducer) doesn't show the issue.
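
      For reference, here is a minimal sketch of the kind of open/close loop the reproducer performs. It is illustrative only: the BASE_DIR value and the file-name pattern are assumptions, and the authoritative code is in the attached reproducer.c.

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      #define BASE_DIR "/mnt/lustre/testdir"   /* directory on a Lustre mount */
      #define NFILES   100000                  /* thousands of files to emphasize the issue */

      int main(void)
      {
          char path[4096];
          int i, fd;

          for (i = 0; i < NFILES; i++) {
              /* files are assumed to have been pre-created by gentree */
              snprintf(path, sizeof(path), "%s/file.%d", BASE_DIR, i);
              fd = open(path, O_RDWR);
              if (fd < 0) {
                  perror(path);
                  exit(EXIT_FAILURE);
              }
              close(fd);
          }
          return 0;
      }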

      Attachments

        1. gentree.c
          3 kB
        2. logs_01.tar.gz
          7 kB
        3. reproducer.c
          2 kB

        Issue Links

          Activity

            [LU-2613] opening and closing file can generate 'unreclaimable slab' space

            Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed?

             That's true for close requests but not for open requests; an open request has to be retained regardless of transno.

            Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed?

            No, we didn't do that, but it's easy to fix if we change the client code.

            Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed.

             One thing that can't be handled well in this manner is: if a client never does an update operation, it'll always have a zero exp_last_committed, which means any replied transno can't be larger than 0... We may fix it by generating an update operation upon each client connection (connect becomes an update operation? that looks quite dirty to me), or we could just ignore this rare case (no updates at all from a client) for the moment? Any suggestion? Thanks.

             niu Niu Yawei (Inactive) added a comment

            Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed? Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed? Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed.

            In any case, I don't mind also fixing this on the client as long as it doesn't break the interoperability. The client should be smart enough to cancel matching pairs of open+close if they are <= last_committed. That won't solve the problem by itself, but in conjunction with the server changes not to invent fake transnos > last_committed it should work.

            I think this is a combination of two bugs - server giving out transnos that are > last_committed and never committing them, and also that the client does not drop RPCs if the last_committed doesn't change. This should be possible to handle without scanning the list each time I think.

             adilger Andreas Dilger added a comment
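
             As a side note, a simplified user-space model of the client-side change discussed above (hypothetical structures and names, not the real ptlrpc code): whenever a reply carries a new last_committed, matched open+close pairs whose transnos are both at or below it can be unlinked from the replay list and freed, which is what would make the slab reclaimable again.

             #include <stdint.h>
             #include <stdlib.h>

             /* Toy model of one replayable request kept by the client. */
             struct replay_req {
                 uint64_t transno;        /* transno assigned by the server */
                 int is_open;             /* opens are pinned until the file is closed */
                 int close_committed;     /* matching close replied with transno <= last_committed */
                 struct replay_req *next;
             };

             /* Drop every request that can no longer be needed for replay. */
             static void prune_replay_list(struct replay_req **list, uint64_t last_committed)
             {
                 struct replay_req **p = list;

                 while (*p != NULL) {
                     struct replay_req *req = *p;
                     int droppable = req->transno <= last_committed &&
                                     (!req->is_open || req->close_committed);

                     if (droppable) {
                         *p = req->next;  /* unlink and free */
                         free(req);
                     } else {
                         p = &req->next;
                     }
                 }
             }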

            Thank you, Andreas.

             It looks like the first option needs client (or even protocol) changes. If we have to change the client & protocol, I think we should choose some cleaner & easier way: reply back two last_committed transnos (last_committed & on-disk last_committed, similar to my patchset 1), or totally change the open replay mechanism as you mentioned.

             I tried the second way today and discovered another problem: for performance reasons, the client only scans the replay list to free committed/closed requests when last_committed is bumped, which means that without client code changes the open requests won't be freed promptly, even if they have a transno smaller than last_committed. Even if we change the client code, we'd need to think of a better way than scanning the replay list on every reply.

             So, as a short term solution, I think we should go back to my patchset 2 (update the disk on the server periodically), or ask users to do this in userspace themselves? What do you think?

             niu Niu Yawei (Inactive) added a comment

            Three options would be:

             • client B gets back the same data to do replay as client A, so it doesn't matter which one is replayed. However, this might get more complex to handle, but has the benefit of increasing the robustness of the replay.
             • client B gets back a fake transno if openA is uncommitted, like it does today. This means there is still a commit to be done, so last_committed will be increased to either cover this fake transno, or at worst the number of fake transnos is limited to the few seconds until commit.
             • commit when the open file is shared. I don't like this much, because it slows down the common code path.
             adilger Andreas Dilger added a comment

            I'm not sure if there is any good way to handle duplicated transnos in replay:

             • Client A does open_create of a file with transno T1, and the child version is set to T1 (T1 is not committed);
             • Client B does open_rdonly of the same file, so the returned transno should be max(versions, exp_last_committed) = T1;

             Then how would the server replay the two open requests with the same transno during recovery?

             niu Niu Yawei (Inactive) added a comment

            See my earlier comment in this bug. We used to have duplicate transnos for open/close at one time in the past, or at least it was something we strongly thought about, and the client code is ready for this to happen.

             adilger Andreas Dilger added a comment

             I see, but I'm not sure whether duplicated transnos will bring us trouble. I'll try to cook up a patch to find out later. Thanks, Andreas.

             niu Niu Yawei (Inactive) added a comment
             tappro Mikhail Pershin added a comment - edited

            Andreas, that may work, yes. Besides it will be really good to get rid of mdt_empty_transno() code.

             Alexey, with your example the imp_peer_committed_transno will still be updated, because last_committed is to be returned from the server as:

             last_committed value = min(max(inode_version accessed by this export),
                                        last_committed)
            

            'accessed' means not just updates but open/close too.
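
             In rough C terms, that formula could look like the following (illustrative names only, not the actual MDT code):

             #include <stdint.h>

             struct export_info {
                 uint64_t max_version_accessed;  /* highest inode_version touched by this
                                                  * export, including plain open/close */
                 uint64_t last_committed;        /* server-wide committed transno */
             };

             /* min(max(inode_version accessed by this export), last_committed) */
             static uint64_t reply_last_committed(const struct export_info *exp)
             {
                 uint64_t v = exp->max_version_accessed;

                 return v < exp->last_committed ? v : exp->last_committed;
             }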


            Andreas,

             that will not work, because if a client mounts Lustre with noatime and executes just an open(O_RDONLY)+close, it will have zero in peer_last_committed, because no updates will be sent to it.

             shadow Alexey Lyashkov added a comment

            Niu, I think my proposal should work even if there are no changes being made to the filesystem at all. The open transno would be inode_version and the close transno would be max(inode_version, export last_committed), so it doesn't expose any newer transno to the client. This also ensures that when the files are closed, the open/close RPCs are < last_committed, and will be dropped from the replay list on the client.

             adilger Andreas Dilger added a comment
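
             For clarity, the transno assignment proposed above, as a sketch with made-up function names (not the existing mdt_empty_transno() code):

             #include <stdint.h>

             /* Open: reuse the version of the inode being opened, so no transno newer
              * than what is already committed is exposed to the client. */
             static uint64_t open_reply_transno(uint64_t inode_version)
             {
                 return inode_version;
             }

             /* Close: max(inode_version, export last_committed), so once the close is
              * replied both RPCs sit at or below last_committed and the client can
              * drop them from its replay list. */
             static uint64_t close_reply_transno(uint64_t inode_version,
                                                 uint64_t exp_last_committed)
             {
                 return inode_version > exp_last_committed ? inode_version
                                                           : exp_last_committed;
             }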

            Niu,

             the window is only open in the case where no updates exist in the cluster at all. In any other case the problem will be solved, but a cluster with no updates from any client is unlikely. If the affected client goes idle after the open/close series, it will get last_committed updates via pings.

             shadow Alexey Lyashkov added a comment

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: louveta Alexandre Louvet (Inactive)
              Votes: 1
              Watchers: 31
