[LU-2613] opening and closing file can generate 'unreclaimable slab' space - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
Affects Version/s: Lustre 2.1.3, Lustre 2.1.4
Labels:
- JL
- mn4

Severity:
3
Rank (Obsolete):
6116

Description

We have a lot of nodes with a large amount of unreclaimable memory (over 4GB). Whatever we try to do (manually shrinking the cache, clearing lru locks, ...) the memory can't be recovered. The only way to get the memory back is to umount the lustre filesystem.

After some troubleshooting, I was able to wrote a small reproducer where I just open(2) then close(2) files in O_RDWR (my reproducer use to open thousand of files to emphasize the issue).

Attached 2 programs :

gentree.c (cc -o gentree gentree.c -lpthread) to generate a tree of known files (no need to use readdir in reproducer.c)
reproducer.c (cc -o reproducer reproduver.c -lpthread) to reproduce the issue.
The macro BASE_DIR has to be adjust according the local cluster configuration (you should provide the name of a directory located on a lustre filesystem).

There is no link between the 2 phases as rebooting the client between gentree & reproducer does't avoid the problem. Running gentree (which open as much files as reproducer) doesn't show the issue.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

gentree.c
3 kB
14/Jan/13 8:38 AM
logs_01.tar.gz
7 kB
15/Jan/13 6:08 AM
reproducer.c
2 kB
14/Jan/13 8:38 AM

Issue Links

is duplicated by

LU-3399 MDT don't update client last commited correctly so produce OOM on client

Open

LU-3381 lock enqueue fails for open(".lustre/fid", ...)

Closed

is related to

LU-4272 lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed from lovsub_device_free

Resolved

LU-4270 Interop 2.5.0<->2.6 failure on test suite sanity test_209: open/close requests are not freed

Resolved

Activity

[LU-2613] opening and closing file can generate 'unreclaimable slab' space

Alexey Lyashkov added a comment - 12/Jun/13 10:54 AM

looks i found one more issue in same area.
mdt save a locks in difficult replies via mdt_object_unlock function and ost via oti_to_request function, but ptlrpc will freed a difficult replies only if last committed larger then export last_commited.

void ptlrpc_commit_replies(struct obd_export *exp)
{
        struct ptlrpc_reply_state *rs, *nxt;
        DECLARE_RS_BATCH(batch);
        ENTRY;

        rs_batch_init(&batch);
        /* Find any replies that have been committed and get their service
         * to attend to complete them. */
        
        /* CAVEAT EMPTOR: spinlock ordering!!! */
        spin_lock(&exp->exp_uncommitted_replies_lock);
        cfs_list_for_each_entry_safe(rs, nxt, &exp->exp_uncommitted_replies,
                                     rs_obd_list) {
                LASSERT (rs->rs_difficult);
                /* VBR: per-export last_committed */
                LASSERT(rs->rs_export);
                if (rs->rs_transno <= exp->exp_last_committed) {
                        cfs_list_del_init(&rs->rs_obd_list);
                        rs_batch_add(&batch, rs);
                }
        }
        spin_unlock(&exp->exp_uncommitted_replies_lock);
        rs_batch_fini(&batch);
        EXIT;
}

but exp_last_commited is zero until modification request is hit but we have send a locks via getattr request without any transaction opened.

Alexey Lyashkov added a comment - 12/Jun/13 10:54 AM looks i found one more issue in same area. mdt save a locks in difficult replies via mdt_object_unlock function and ost via oti_to_request function, but ptlrpc will freed a difficult replies only if last committed larger then export last_commited. void ptlrpc_commit_replies(struct obd_export *exp) { struct ptlrpc_reply_state *rs, *nxt; DECLARE_RS_BATCH(batch); ENTRY; rs_batch_init(&batch); /* Find any replies that have been committed and get their service * to attend to complete them. */ /* CAVEAT EMPTOR: spinlock ordering!!! */ spin_lock(&exp->exp_uncommitted_replies_lock); cfs_list_for_each_entry_safe(rs, nxt, &exp->exp_uncommitted_replies, rs_obd_list) { LASSERT (rs->rs_difficult); /* VBR: per-export last_committed */ LASSERT(rs->rs_export); if (rs->rs_transno <= exp->exp_last_committed) { cfs_list_del_init(&rs->rs_obd_list); rs_batch_add(&batch, rs); } } spin_unlock(&exp->exp_uncommitted_replies_lock); rs_batch_fini(&batch); EXIT; } but exp_last_commited is zero until modification request is hit but we have send a locks via getattr request without any transaction opened.

Andreas Dilger added a comment - 09/Jun/13 6:37 PM

Per previous comments, the client exp_last_committed should be min(last_committed, max(inode_version of all inodes accessed). That ensures that I'd the client is accessing committed inodes that they can be discarded on close (assuming it isn't the very last file created).

Andreas Dilger added a comment - 09/Jun/13 6:37 PM Per previous comments, the client exp_last_committed should be min(last_committed, max(inode_version of all inodes accessed). That ensures that I'd the client is accessing committed inodes that they can be discarded on close (assuming it isn't the very last file created).

Niu Yawei (Inactive) added a comment - 09/Jun/13 8:20 AM

Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed?

That's ture for close request but not ture for open request, open request has to be retained regardless of transno.

Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed?

No, we didn't do that, but it's easy to fix if we change the client code.

Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed.

One thing can't be handled well by this manner is: If a client never do update operation, it'll always have zero exp_last_committed, which means any replied trasno can't be larger than 0... We may fix it by generating an update operation upon each client connection (connect becomes an update operation? that looks quite dirty to me), or we should just ignore this rare case (no any update from a client) for this moment? Any suggestion? Thanks.

Niu Yawei (Inactive) added a comment - 09/Jun/13 8:20 AM Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed? That's ture for close request but not ture for open request, open request has to be retained regardless of transno. Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed? No, we didn't do that, but it's easy to fix if we change the client code. Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed. One thing can't be handled well by this manner is: If a client never do update operation, it'll always have zero exp_last_committed, which means any replied trasno can't be larger than 0... We may fix it by generating an update operation upon each client connection (connect becomes an update operation? that looks quite dirty to me), or we should just ignore this rare case (no any update from a client) for this moment? Any suggestion? Thanks.

Andreas Dilger added a comment - 08/Jun/13 4:27 PM

Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed? Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed? Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed.

In any case, I don't mind also fixing this on the client as long as it doesn't break the interoperability. The client should be smart enough to cancel matching pairs of open+close if they are <= last_committed. That won't solve the problem by itself, but in conjunction with the server changes not to invent fake transnos > last_committed it should work.

I think this is a combination of two bugs - server giving out transnos that are > last_committed and never committing them, and also that the client does not drop RPCs if the last_committed doesn't change. This should be possible to handle without scanning the list each time I think.

Andreas Dilger added a comment - 08/Jun/13 4:27 PM Niu, for second option I thought the client would not even try to put the RPC into the replay list if transno <= last_committed? Is it not checked in the close RPC callback to drop both requests from this list if both transnos are <= last_committed? Iff a fake transno is given to the client to avoid a duplicate open+create transno, it should be inode_version+1, and upon close if the real last_committed is >= inode_version+1 then it should send inode_version+1 back to the client for last_committed. In any case, I don't mind also fixing this on the client as long as it doesn't break the interoperability. The client should be smart enough to cancel matching pairs of open+close if they are <= last_committed. That won't solve the problem by itself, but in conjunction with the server changes not to invent fake transnos > last_committed it should work. I think this is a combination of two bugs - server giving out transnos that are > last_committed and never committing them, and also that the client does not drop RPCs if the last_committed doesn't change. This should be possible to handle without scanning the list each time I think.

Niu Yawei (Inactive) added a comment - 07/Jun/13 4:04 AM

Thank you, Andreas.

Looks the first option needs to change client (or even protocol), I think if we have to change client & protocol, we'd choose some cleaner & easier way: reply back two last_committed transnos (last_committed & on-disk last_committed, similar to my patchset 1) or totally change the open replay mechanism as you mentioned.

I tried the second way today and discovered another problem: for performance reason, client only tries to scan the replay list to free the committed/closed requests whenever the last_committed is bumped, which means without client code changes, the open requests won't be freed promptly, even they have transno smaller than last_committed. Even if we change the client code, we'd think out some better way rather than scanning the replay list on every reply.

So, as a short term solution, I think we'd go back to keep using my patchset 2 (update disk on server periodically), or ask user to do this in userspace by themselves? What do you think about?

Niu Yawei (Inactive) added a comment - 07/Jun/13 4:04 AM Thank you, Andreas. Looks the first option needs to change client (or even protocol), I think if we have to change client & protocol, we'd choose some cleaner & easier way: reply back two last_committed transnos (last_committed & on-disk last_committed, similar to my patchset 1) or totally change the open replay mechanism as you mentioned. I tried the second way today and discovered another problem: for performance reason, client only tries to scan the replay list to free the committed/closed requests whenever the last_committed is bumped, which means without client code changes, the open requests won't be freed promptly, even they have transno smaller than last_committed. Even if we change the client code, we'd think out some better way rather than scanning the replay list on every reply. So, as a short term solution, I think we'd go back to keep using my patchset 2 (update disk on server periodically), or ask user to do this in userspace by themselves? What do you think about?

Andreas Dilger added a comment - 03/Jun/13 2:57 PM

Three options would be:

client B gets back same data to do replay as client A, so it doesn't matter which one is replayed. However, this might get more complex to handle, but has the benefit of increasing robustness of the replay.
client B gets back a fake transno if openA is uncommitted, like it does today. This means there is still a commit to be done, do last_committed will be increased to either cover this fake transno, or at worst the number of fake transno is limited to the few seconds until commit.
commit on share of the open file. I don't like this much, because it slows down the common code path.

Andreas Dilger added a comment - 03/Jun/13 2:57 PM Three options would be: client B gets back same data to do replay as client A, so it doesn't matter which one is replayed. However, this might get more complex to handle, but has the benefit of increasing robustness of the replay. client B gets back a fake transno if openA is uncommitted, like it does today. This means there is still a commit to be done, do last_committed will be increased to either cover this fake transno, or at worst the number of fake transno is limited to the few seconds until commit. commit on share of the open file. I don't like this much, because it slows down the common code path.

Niu Yawei (Inactive) added a comment - 03/Jun/13 7:35 AM

I'm not sure if there is any good way to handle duplicated transnos in replay:

Client A open_create file with transno T1, and the child version set as T1; (T1 is not committed)
Client B open_rdonly the same file, so the returned should be max(versions, exp_last_committed) = T1;

Then server how to replay the two open requests with same transno during recovery?

Niu Yawei (Inactive) added a comment - 03/Jun/13 7:35 AM I'm not sure if there is any good way to handle duplicated transnos in replay: Client A open_create file with transno T1, and the child version set as T1; (T1 is not committed) Client B open_rdonly the same file, so the returned should be max(versions, exp_last_committed) = T1; Then server how to replay the two open requests with same transno during recovery?

Andreas Dilger added a comment - 31/May/13 9:00 AM

See my earlier comment in this bug. We used to have duplicate transnos for open/close at one time in the past, or at least it was something we strongly thought about, and the client code is ready for this to happen.

Andreas Dilger added a comment - 31/May/13 9:00 AM See my earlier comment in this bug. We used to have duplicate transnos for open/close at one time in the past, or at least it was something we strongly thought about, and the client code is ready for this to happen.

Niu Yawei (Inactive) added a comment - 31/May/13 2:23 AM

I see, but I'm not sure if duplicated transno will bring us trouble. I'll try to cooke a patch to find it out later. Thanks, Andreas.

Niu Yawei (Inactive) added a comment - 31/May/13 2:23 AM I see, but I'm not sure if duplicated transno will bring us trouble. I'll try to cooke a patch to find it out later. Thanks, Andreas.

Mikhail Pershin added a comment - 30/May/13 4:56 PM - edited

Andreas, that may work, yes. Besides it will be really good to get rid of mdt_empty_transno() code.

Alexey, with your example the imp_peer_committed_transno will be still updated because last_committed is to be returned from server as:

last_committed value = min(max(inode_version accessed by this export),
                                   last_committed)

'accessed' means not just updates but open/close too.

Mikhail Pershin added a comment - 30/May/13 4:56 PM - edited Andreas, that may work, yes. Besides it will be really good to get rid of mdt_empty_transno() code. Alexey, with your example the imp_peer_committed_transno will be still updated because last_committed is to be returned from server as: last_committed value = min(max(inode_version accessed by this export), last_committed) 'accessed' means not just updates but open/close too.

Alexey Lyashkov added a comment - 30/May/13 11:59 AM

Andreas,

that is will be don't work. because if client mounted lustre with noatime, and executed just a open(O_RDONLY)+close they have a zero in peer_last_commited, because none updates will be send to him.

Alexey Lyashkov added a comment - 30/May/13 11:59 AM Andreas, that is will be don't work. because if client mounted lustre with noatime, and executed just a open(O_RDONLY)+close they have a zero in peer_last_commited, because none updates will be send to him.

People

Assignee:: Hongchao Zhang

Reporter:: Alexandre Louvet (Inactive)

Votes:: 1 Vote for this issue

Watchers:: 31 Start watching this issue

Dates

Created:: 14/Jan/13 8:38 AM

Updated:: 12/Feb/14 11:10 PM

Resolved:: 12/Feb/14 11:10 PM