[LU-1151] 1.8 client using excessive slab when mounting 2.1 server Created: 29/Feb/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

lustre 1.8.5.0-6chaos


Attachments: File momus-mds1-lustre.log.gz     File momus-mds1.lustre.log.1331829143.gz     File sierra1150_lustre.log.txt.bz2    
Issue Links:
Duplicate
is duplicated by LU-1150 1.8 client using excessive slab when ... Closed
Severity: 3
Rank (Obsolete): 9846

 Description   

We have lustre 1.8 clients that are mounting 2.1 servers. These clients have slab usage that is growing to excessive amounts, for instance 20GB in slab on a 24GB node. On a node with 20GB of slab, about 13GB was accounted for in /proc/sys/lustre/memused.

Upon umount'ing the 2.1 filesystem, nearly all of the 20GB slab was freed very quickly.

The vast majority of the slab usage is in the generic slabs, mostly 16k and 8k.

I am attaching the log "sierra1150_lustre.log.txt.bz2", which was gathered by enabling full debugging and running something like this:

lctl debug_daemon enable; umount /p/lscratchd; lctl debug_daemon disable

The log isn't quite as large as I would have liked, so I probably need to let the daemon run longer...or maybe try to use a larger in-memory buffer, if that is possible with slab usage as it is...
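
If we need another attempt, something like this might give a bigger capture, assuming the debug_mb tunable is available in our 1.8 release (the size and the parameter name are guesses on my part):

lctl set_param debug_mb=1024   # enlarge the in-memory debug buffer (size in MB); tunable name assumed
lctl debug_daemon enable; umount /p/lscratchd; lctl debug_daemon disable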

Looking at the rest of the cluster after clearing some of the worst slab problems, we still have slab usage that varies pretty widely from 1GB to 9GB across the rest of the nodes, and I suspect that it only increases with time.

This slab does not free itself with normal memory pressure, and is causing applications to OOM, so this is a pretty big problem for our production systems at the moment.

I don't really need a fix for 1.8, but if the problem is common with the 2.1 code, we'll definitely need a 2.1 fix.

Our 1.8 is getting old, so perhaps there is a fix that I missed. But I didn't turn up anything obvious when I did a few searches.



 Comments   
Comment by Christopher Morrone [ 29/Feb/12 ]

I forgot to set the "Priority" higher. It should be at least "Major", but I don't have the power to change it after initial submission time.

Comment by Peter Jones [ 29/Feb/12 ]

Don't worry Chris - I fixed the priority on your behalf!

Bobi,

could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 29/Feb/12 ]

At first glance this looks like saved RPC open requests (1882 of them, one example below, earlier parts in the log don't look interesting), so possibly there is a problem with deciding when the file is closed and sending the RPC to the MDS? Each RPC (intent open + close) takes about 17kB, so 1880 * 17kB = 32MB, which isn't enough to account for 20GB of memory.

It is possible to find the amount of memory that Lustre has allocated directly via:

$ lctl get_param memused*
memused=971035
memused=23744355

memused_max=24133830

The first number is the LNet usage, while the second is Lustre usage, and the memused_max is the maximum allocation by Lustre.
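
To see whether this accounting tracks the slab growth over time, a loop like the following could be left running on an affected client (just a sketch; the slab cache names are the generic ones reported as growing):

while sleep 60; do
    date
    lctl get_param "memused*"                        # LNet and Lustre allocation counters
    grep -E 'size-16384|size-8192' /proc/slabinfo    # the generic slabs reported as growing
done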

00000100:00100000:5:1330476725.738402:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen  req@ffff810609702800 x1385125338817546/t203791993381 o101->lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738408:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000002:00000010:5:1330476725.738409:0:22985:0:(mdc_request.c:611:mdc_commit_open()) kfreed 'mod': 24 at ffff810538807520.
00000100:00000001:5:1330476725.738410:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738411:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff810609702800 x1385125338817546/t203791993381 o101->lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738414:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738414:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 5128 at ffff81042d2ac000.
00000100:00000040:5:1330476725.738416:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1882
00000020:00000001:5:1330476725.738417:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738417:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1883 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738418:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738419:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 1024 at ffff810609702400.
00000100:00000010:5:1330476725.738420:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed 'request': 584 at ffff810609702800.
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00000100:00100000:5:1330476725.738422:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen  req@ffff8103c2d05000 x1385125338817547/t203791993385 o35->lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738425:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000100:00000001:5:1330476725.738426:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738426:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff8103c2d05000 x1385125338817547/t203791993385 o35->lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738429:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738430:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 9864 at ffff8103f4fa0000.
00000100:00000040:5:1330476725.738433:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1881
00000020:00000001:5:1330476725.738433:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738434:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1882 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738435:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738435:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 360 at ffff8104914b5200.
00000100:00000010:5:1330476725.738436:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed 'request': 584 at ffff8103c2d05000.
00000100:00000001:5:1330476725.738437:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738438:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
Comment by Christopher Morrone [ 29/Feb/12 ]

Yes, I've been looking at the memused. I think my confusion was just that Lustre is tracking REQUESTED memory, whereas slabinfo reports total slab memory, including the waste. In other words, if Lustre allocates 9K, it will come from the 16K slab and waste 7K. Lustre's memused accounting doesn't record the wasted 7K. Not that there is really anything wrong with that; I just failed to account for it.

So I think that explains the discrepancy of slabinfo showing more usage than the combined lnet/lustre memused fields report.
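
As a back-of-the-envelope illustration of that rounding (purely illustrative shell, not Lustre code):

req_kb=9; slab_kb=1
while [ $slab_kb -lt $req_kb ]; do slab_kb=$((slab_kb * 2)); done
echo "requested ${req_kb}K, charged ${slab_kb}K, wasted $((slab_kb - req_kb))K"
# requested 9K, charged 16K, wasted 7K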

Do you have enough information on this for now? Is there something else that I can get you? There are other production fires to attend to, but this bug is my top priority at the moment.

Also be aware that there are all manner of Lustre problems happening now or recently. One of the myriad problems is causing clients to reconnect to the server pretty frequently, and there are problems with both 1.8 and 2.1 servers that can prevent those reconnects from succeeding (with 2.1, it's the shrinker bug; I don't know what's wrong with the 1.8 server yet, but it might be something else). So it may be that some of the 1.8 memory problems are due to exercising various cleanup and recovery paths that are out of the ordinary.

But that is vague speculation. I just don't want you guys to think that things are otherwise quiet and slab usage is just growing in a vacuum.

Comment by Zhenyu Xu [ 01/Mar/12 ]

00000100:02000000:9:1328810057.695332:0:20270:0:(import.c:1346:ptlrpc_import_recovery_state_machine()) lsd-MDT0000-mdc-ffff81033774b800: Connection restored to lsd-MDT0000 (at 172.16.64.141@tcp)
...
Before the umount, the logs show the client repeatedly trying to recover the MDS connection, with lots of expired requests:
...
00000100:00000100:9:1328837650.839813:0:20270:0:(client.c:1484:ptlrpc_expire_one_request()) @@@ timeout (sent at 1328837593, 57s ago) req@ffff81040e05cc00 x1385125273128097/t0 o38->lsd-MDT0000_UUID@172.16.64.141@tcp:12/10
...
00000020:00000040:1:1330476656.479243:0:20271:0:(genops.c:809:class_import_get()) import ffff8101bfae1800 refcount=1456110 obd=lsd-MDT0000-mdc-ffff81033774b800
This is the first time import ffff8101bfae1800 appears in the log
...
00000020:00000040:0:1330476681.565948:0:20269:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1456109 obd=lsd-MDT0000-mdc-ffff81033774b800
...
no further log entries about import ffff8101bfae1800 until:
00000020:00000040:5:1330476725.738417:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1883 obd=lsd-MDT0000-mdc-ffff81033774b800
=======> 44 seconds later, the import's refcount dramatically dropped to 1883
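
A quick way to see the get/put imbalance for this import across the whole attached log (a sketch against the attachment name; adjust if the file has been decompressed):

bzgrep -c 'class_import_get.*ffff8101bfae1800' sierra1150_lustre.log.txt.bz2   # number of gets logged
bzgrep -c 'class_import_put.*ffff8101bfae1800' sierra1150_lustre.log.txt.bz2   # number of puts logged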

Comment by Christopher Morrone [ 01/Mar/12 ]

Ah, great, the ~1.5 million references certainly point the way to the memory usage.

We have other nodes with lots of memory stuck in slab right now, so if you have a suggestion for getting more information about those references just let me know.

Comment by Christopher Morrone [ 01/Mar/12 ]

The plans are still fluid, but it is starting to look like our users are not comfortable with a rapid upgrade to CHAOS5 and Lustre 2.1 on our compute clusters. It may be that the roll out is slow enough that I will need to patch and upgrade 1.8 on our compute clusters.

In the meantime, we may start doing umount/mount at the end of jobs as a possible temporary workaround. I fear the new bugs we'll hit by doing that constantly, but since things are so bad it is worth a try.

Comment by Zhenyu Xu [ 01/Mar/12 ]

Would it be possible to check the contents of the heavily referenced obd_import structure? I want to check its

__u64 imp_last_replay_transno;
__u64 imp_peer_committed_transno;

values, to find out whether a huge number of replay requests are still linked onto it or not.

Comment by Christopher Morrone [ 02/Mar/12 ]

I'll look into it when I get another node I can work on. There are two candidates, but I need to wait until the user's job finishes before I go walking memory.

Comment by Ned Bass [ 08/Mar/12 ]

From first candidate node:

crash> struct obd_import.imp_refcount ffff81063c7f5000
  imp_refcount = {
    counter = 859444
  }, 
crash> struct obd_import.imp_last_replay_transno ffff81063c7f5000
  imp_last_replay_transno = 204432392643, 
crash> struct obd_import.imp_peer_committed_transno ffff81063c7f5000
  imp_peer_committed_transno = 204430868099, 

and from the other one:

crash> struct obd_import.imp_refcount ffff81063d1d6800
  imp_refcount = {
    counter = 161494
  },
crash> struct obd_import.imp_last_replay_transno ffff81063d1d6800
  imp_last_replay_transno = 203717138487, 
crash> struct obd_import.imp_peer_committed_transno ffff81063d1d6800
  imp_peer_committed_transno = 203714086637, 

Let us know if you'd like to see any other data.
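
For reference, the gap between the two transnos on each node shows how far the committed transno reported by the server lags the client's last replayable transaction (simple arithmetic on the values above):

echo $((204432392643 - 204430868099))   # 1524544 on the first node
echo $((203717138487 - 203714086637))   # 3051850 on the second node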

Comment by Zhenyu Xu [ 08/Mar/12 ]

Those clients do have a lot of pending requests which are not yet committed on the MDT. Is the MDS busy?

Do any 2.1 clients connecting to this 2.1 server suffer the same memory pressure?

Comment by Zhenyu Xu [ 09/Mar/12 ]

Would you please upload the MDS logs? I'd like to check why the MDS server has so many unhandled requests. Thanks.

Comment by Ned Bass [ 09/Mar/12 ]

Those clients do have a lot of pending requests which are not yet committed on the MDT. Is the MDS busy?

No, the server is not particularly busy at the moment. Also please note that these clients have been sitting idle for a few days, so they should have had plenty of time to commit their pending requests.

Do any 2.1 clients connecting to this 2.1 server suffer the same memory pressure?

I am not aware of any 2.1 clients having this issue.

Would you please upload the MDS logs? I'd like to check why the MDS server has so many unhandled requests. Thanks.

Uploading momus-mds1-lustre.log.gz which only has default debug levels. It does have entries going back to Mar 1, and our clients hit the issue on Mar 2.

Comment by Zhenyu Xu [ 11/Mar/12 ]

From the MDS log, I found the MGS took at least (1331314772 - 1330619506) = 695266 seconds (~8 days) to handle an llog request.

20000000:01000000:7.0:1330619506.677853:0:11940:0:(mgs_handler.c:748:mgs_handle()) @@@ llog_init  req@ffff88042b105850 x1385125352729595/t0(0) o-1->bec17d60-303b-9c69-d653-8a476440786a@NET_0x50005c0a8748c_UUID:0/0 lens 264/0 e 0 to 0 dl 1330619567 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
...

20000000:01000000:14.0:1331314772.631315:0:11563:0:(mgs_handler.c:759:mgs_handle()) @@@ llog read header  req@ffff880428832050 x1395901311135885/t0(0) o-1->b8fd6e82-d2d8-a6fb-a5ca-80fffdf4a765@NET_0x50005c0a8700d_UUID:0/0 lens 240/0 e 0 to 0 dl 1331314833 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
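
For readability, the epoch timestamps in those two entries can be converted with date, e.g.:

date -u -d @1330619506   # timestamp of the llog_init request above
date -u -d @1331314772   # timestamp of the llog read header request above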

There are also tons of network errors (-113: EHOSTUNREACH) connecting to 172.16.66.28. What is the 172.16.66.28 node?

Comment by Ned Bass [ 12/Mar/12 ]

There are also tons of network errors (-113: EHOSTUNREACH) connecting to 172.16.66.28. What is the 172.16.66.28 node?

That is an LNet router for one of our smaller clusters (not the cluster involved in this issue).

The uuid in those log entries resolves to node sierra1139, but this node does not appear to ever have been affected by the excessive slab issue. I'm not sure why it took so long to handle the llog request.

Comment by Zhenyu Xu [ 14/Mar/12 ]

Would it be possible to grab some MGS/MDS logs while it is serving the affected clients? The key point here is to understand what is keeping the server's committed transno from moving forward; it could be a problem with the server not committing the clients' requests, or a communication problem between them.

Comment by Ned Bass [ 14/Mar/12 ]

Would it be possible to grab some MGS/MDS logs while it is serving the affected clients?

Can you please clarify what you want me to capture? Do you want me to unmount the clients? What debug levels?

Comment by Zhenyu Xu [ 14/Mar/12 ]

It would be great to capture a -1 level debug log (lctl dk > [MDS log file]) on the MDS/MGS while it serves the affected clients, i.e. don't umount the clients that are under memory pressure. I want to know why so many requests from the affected clients were not committed by the MDS/MGS. Thank you.
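
Roughly, something like the following on the MDS/MGS (a sketch of the usual lctl debug workflow; the output path is just an example):

lctl set_param debug=-1         # enable all debug flags
lctl clear                      # start from an empty debug buffer
# ... let the server handle traffic from the affected clients for a while ...
lctl dk > /tmp/mds-debug.log    # dump the kernel debug buffer to a file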

Comment by Ned Bass [ 15/Mar/12 ]

Uploading new MDS log.

Comment by Ned Bass [ 15/Mar/12 ]

I enabled -1 debugging on the MDS then ran the following in Lustre on an affected client:

dd if=/dev/zero of=foo bs=1M count=1
ls -al foo
rm foo

This activity seemed to flush the memory from the client slab.

Before:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1474592 1474375  99%    0.03K  13166      112     52664K size-32
1331044 1330817  99%    1.00K 332761        4   1331044K size-1024
1249384 1248004  99%    0.06K  21176       59     84704K size-64
467268 467268 100%    8.00K 467268        1   3738144K size-8192
438360 436835  99%    0.50K  54795        8    219180K size-512
429758 429758 100%   16.00K 429758        1   6876128K size-16384

After:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1080688 1044603  96%    0.03K   9649      112     38596K size-32
105433  66449  63%    0.06K   1787       59      7148K size-64
 60060  59420  98%    0.25K   4004       15     16016K size-256
 51180  15743  30%    0.12K   1706       30      6824K size-128
 42348  41751  98%    1.00K  10587        4     42348K size-1024
 37549  37549 100%    8.00K  37549        1    300392K size-8192
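
For reference, snapshots in this format can be grabbed on a similar node with something like the following (assuming slabtop is available; exact options per its man page):

slabtop -o -s c | head -n 15   # one-shot dump of the largest slab caches
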
Comment by Zhenyu Xu [ 18/Mar/12 ]

I haven't found any abnormality in the MDS log. According to the previous comment, a normal dd operation alleviated a client's slab pressure. So do the 1.8 clients start to grow slab usage after they mount the 2.1 servers, even without further operations on the clients? Or, even after the alleviating dd ops, will the slab consumption on the client still grow over time?

Comment by Ned Bass [ 19/Mar/12 ]

We don't yet know how to make the slab consumption grow. The instances have either been rare or under-reported, but we're definitely not seeing this on all 1.8 clients mounting 2.1 servers. All instances have been on 1.8 clients running user jobs that were presumably doing I/O to Lustre. So the problem may be associated with some specific usage pattern that we haven't yet pinpointed. I was not expecting the dd/ls/rm operations to alleviate the problem. I believe when we first found clients in this state we had to unmount Lustre to free the slab memory, but we may not have tried just doing I/O. We have one more node in this state available for debugging. Let me know if you'd like any specific test run with it.

Comment by Mahmoud Hanafi [ 01/Jun/12 ]

Looks like we are hitting this same bug. Our 1.8.6 clients connecting to 2.1.1 servers are seeing their slab grow. Unmounting Lustre frees up the slab.

Comment by Zhenyu Xu [ 04/Jun/12 ]

Mahmoud,

Is the issue easy to reproduce? Would you mind collecting logs from an affected client and the MDS? We'd like to know the client ops pattern which causes the memory pressure. From the previous analysis, the clients accumulated a huge number of pending requests which do not seem to have been committed on the MDS (or the MDS failed to pass the committed transaction number back to the client).

Comment by Andreas Dilger [ 13/Jul/12 ]

In such a situation (large slab usage on client) it should be possible to enable "rpctrace,info" debugging on the client, wait 20s, send a "touch", and grab the debug log afterward. This should be enough time to get some RPCs, and then check in ptlrpc_free_committed() how many RPCs are being kept, what the last_committed value is, etc. It would be useful to know if the RPCs are MDS or OSS related, and what the opcode (if any) is not being cleaned up. The most complex RPC handling is for opens, since they need to be saved until after close in order to reopen the file if the MDS crashes.
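
On an affected client that could look roughly like this (a sketch only; the debug flag syntax and the touch path are examples, not verified against this release):

lctl set_param debug="rpctrace info"   # limit debugging to rpctrace and info
lctl clear
sleep 20
touch /p/lscratchd/lu1151-probe        # any small metadata operation
lctl dk > /tmp/client-rpctrace.log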

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
