[LU-1151] 1.8 client using excessive slab when mounting 2.1 server Created: 29/Feb/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Environment: | lustre 1.8.5.0-6chaos |
| Severity: | 3 |
| Rank (Obsolete): | 9846 |
| Description |
|
We have Lustre 1.8 clients that are mounting 2.1 servers. These clients have slab usage that grows to excessive amounts, for instance 20GB of slab on a 24GB node. On a node with 20GB of slab, about 13GB was accounted for in /proc/sys/lustre/memused. Upon umount'ing the 2.1 filesystem, nearly all of the 20GB of slab was freed very quickly. The vast majority of the slab usage is in the generic slabs, mostly 16k and 8k.

I am attaching the log "sierra1150_lustre.log.txt.bz2", which was gathered by enabling full debugging and running something like this:

lctl debug_daemon enable; umount /p/lscratchd; lctl debug_daemon disable

The log isn't quite as large as I would have liked, so I probably need to let the daemon run longer... or maybe try to use a larger in-memory buffer, if that is possible with slab usage as it is...

Looking at the rest of the cluster after clearing some of the worst slab problems, we still have slab usage that varies pretty widely from 1GB to 9GB across the rest of the nodes, and I suspect that it only increases with time. This slab does not free itself with normal memory pressure, and is causing applications to OOM, so this is a pretty big problem for our production systems at the moment.

I don't really need a fix for 1.8, but if the problem is common with the 2.1 code, we'll definitely need a 2.1 fix. Our 1.8 is getting old, so perhaps there is a fix that I missed, but I didn't turn up anything obvious when I did a few searches. |
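For the larger in-memory buffer mentioned above, one possible approach is to raise the trace buffer size and let the debug daemon stream to a size-capped file during the unmount. A rough sketch; the /proc/sys/lnet/debug_mb path, the output file name, and both sizes are illustrative, and the exact syntax may differ on a 1.8 client:

# enable full debugging and enlarge the in-memory trace buffer (size in MB)
lctl set_param debug=-1
echo 400 > /proc/sys/lnet/debug_mb

# stream debug messages to a file; the second argument caps the file size in MB
lctl debug_daemon start /tmp/lustre-umount-debug.log 1024
umount /p/lscratchd
lctl debug_daemon stop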
| Comments |
| Comment by Christopher Morrone [ 29/Feb/12 ] |
|
I forgot to set the "Priority" higher. It should be at least "Major", but I don't have the power to change it after initial submission time. |
| Comment by Peter Jones [ 29/Feb/12 ] |
|
Don't worry Chris - I fixed the priority on your behalf! Bobi, could you please look into this one? Thanks Peter |
| Comment by Andreas Dilger [ 29/Feb/12 ] |
|
At first glance this looks like saved RPC open requests (1882 of them, one example below, earlier parts in the log don't look interesting), so possibly there is a problem with deciding when the file is closed and sending the RPC to the MDS? Each RPC (intent open + close) takes about 17kB, so 1880 * 17kB = 32MB, which isn't enough to account for 20GB of memory. It is possible to find the amount of memory that Lustre has allocated directly via:

$ lctl get_param memused*
memused=971035
memused=23744355
memused_max=24133830

The first number is the LNet usage, while the second is Lustre usage, and the memused_max is the maximum allocation by Lustre.

00000100:00100000:5:1330476725.738402:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen req@ffff810609702800 x1385125338817546/t203791993381 o101->lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738408:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000002:00000010:5:1330476725.738409:0:22985:0:(mdc_request.c:611:mdc_commit_open()) kfreed 'mod': 24 at ffff810538807520.
00000100:00000001:5:1330476725.738410:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738411:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff810609702800 x1385125338817546/t203791993381 o101->lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738414:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738414:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 5128 at ffff81042d2ac000.
00000100:00000040:5:1330476725.738416:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1882
00000020:00000001:5:1330476725.738417:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738417:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1883 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738418:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738419:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 1024 at ffff810609702400.
00000100:00000010:5:1330476725.738420:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed 'request': 584 at ffff810609702800.
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00000100:00100000:5:1330476725.738422:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen req@ffff8103c2d05000 x1385125338817547/t203791993385 o35->lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738425:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000100:00000001:5:1330476725.738426:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738426:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff8103c2d05000 x1385125338817547/t203791993385 o35->lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738429:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738430:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 9864 at ffff8103f4fa0000.
00000100:00000040:5:1330476725.738433:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1881
00000020:00000001:5:1330476725.738433:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738434:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1882 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738435:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738435:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 360 at ffff8104914b5200.
00000100:00000010:5:1330476725.738436:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed 'request': 584 at ffff8103c2d05000.
00000100:00000001:5:1330476725.738437:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738438:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1) |
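To see whether these allocations keep growing over time, the two counters above can be sampled alongside the generic slabs called out in the description. A rough sketch of such a sampling loop; the 60-second interval, the log path, and the choice of the size-8192/size-16384 caches are illustrative:

# sample the Lustre/LNet "requested bytes" counters and the dominant generic slabs
while true; do
    date
    lctl get_param memused*
    grep -E '^size-(8192|16384) ' /proc/slabinfo
    sleep 60
done >> /tmp/lustre-memused-trace.log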
| Comment by Christopher Morrone [ 29/Feb/12 ] |
|
Yes, I've been looking at the memused. I think my confusion was just that Lustre tracks REQUESTED memory, whereas slabinfo reports total slab memory, including the waste. In other words, if Lustre allocates 9K, it will come from the 16K slab and waste 7K. Lustre's meminfo doesn't record the wasted 7K. Not that there is really anything wrong with that; I just failed to account for it. So I think that explains the discrepancy of slabinfo showing more usage than the combined lnet/lustre memused fields report.

Do you have enough information on this for now? Is there something else that I can get you? There are other production fires to attend to, but this bug is my top priority at the moment.

Also be aware that there are all manner of lustre problems happening now or recently. One of the myriad problems is causing clients to reconnect to servers pretty frequently, and there are problems with both 1.8 and 2.1 servers that can prevent that from being successful (with 2.1, it's the shrinker bug; I don't know what's wrong with the 1.8 server yet, but it might be something else). So it may be that some of the 1.8 memory problems are due to exercising various cleanup and recovery paths that are out of the ordinary. But that is vague speculation. I just don't want you guys to think that things are otherwise quiet and slab usage is just growing in a vacuum. |
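To put numbers on that rounding effect, the bytes actually held by the kernel slab caches can be compared with the bytes Lustre believes it requested. A rough sketch, assuming the two header lines and column order of /proc/slabinfo version 2.1; summing num_objs * objsize is an illustrative approximation:

# total bytes sitting in slab objects (object count x per-object slab size), in MB
awk 'NR > 2 { total += $3 * $4 } END { printf "slab objects: %.1f MB\n", total / 1048576 }' /proc/slabinfo

# bytes Lustre/LNet actually asked for (the counters quoted earlier in this ticket)
lctl get_param memused*

The gap between the two figures is roughly the per-allocation rounding waste described above.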
| Comment by Zhenyu Xu [ 01/Mar/12 ] |
|
00000100:02000000:9:1328810057.695332:0:20270:0:(import.c:1346:ptlrpc_import_recovery_state_machine()) lsd-MDT0000-mdc-ffff81033774b800: Connection restored to lsd-MDT0000 (at 172.16.64.141@tcp) |
| Comment by Christopher Morrone [ 01/Mar/12 ] |
|
Ah, great, the ~1.5 million references certainly point the way to the memory usage. We have other nodes with lots of memory stuck in slab right now, so if you have a suggestion for getting more information about those references just let me know. |
| Comment by Christopher Morrone [ 01/Mar/12 ] |
|
The plans are still fluid, but it is starting to look like our users are not comfortable with a rapid upgrade to CHAOS5 and Lustre 2.1 on our compute clusters. It may be that the rollout is slow enough that I will need to patch and upgrade 1.8 on our compute clusters. In the meantime, we may start doing umount/mount at the end of jobs as a possible temporary workaround. I fear the new bugs we'll hit by doing that constantly, but since things are so bad it is worth a try. |
| Comment by Zhenyu Xu [ 01/Mar/12 ] |
|
Would it be possible to check the contents of the obd_import structure with the huge reference count? I want to look at its __u64 imp_last_replay_transno value to learn whether a huge number of replay requests are linked onto it or not. |
| Comment by Christopher Morrone [ 02/Mar/12 ] |
|
I'll look into it when I get another node I can work on. There are two candidates, but I need to wait until the user's job finishes before I go walking memory. |
| Comment by Ned Bass [ 08/Mar/12 ] |
|
From the first candidate node:
crash> struct obd_import.imp_refcount ffff81063c7f5000
imp_refcount = {
counter = 859444
},
crash> struct obd_import.imp_last_replay_transno ffff81063c7f5000
imp_last_replay_transno = 204432392643,
crash> struct obd_import.imp_peer_committed_transno ffff81063c7f5000
imp_peer_committed_transno = 204430868099,
and from the other one:
crash> struct obd_import.imp_refcount ffff81063d1d6800
imp_refcount = {
counter = 161494
},
crash> struct obd_import.imp_last_replay_transno ffff81063d1d6800
imp_last_replay_transno = 203717138487,
crash> struct obd_import.imp_peer_committed_transno ffff81063d1d6800
imp_peer_committed_transno = 203714086637,
Let us know if you'd like to see any other data. |
| Comment by Zhenyu Xu [ 08/Mar/12 ] |
|
Those clients do have a lot of pending requests which are not yet committed on the MDT. Is the MDS busy? Do any 2.1 clients connecting to this 2.1 server suffer the same memory pressure? |
| Comment by Zhenyu Xu [ 09/Mar/12 ] |
|
Would you please upload the MDS logs? I'd like to check why the MDS has left so many requests unhandled. Thanks. |
| Comment by Ned Bass [ 09/Mar/12 ] |
No, the server is not particularly busy at the moment. Also, please note that these clients have been sitting idle for a few days, so they should have had plenty of time to commit their pending requests.
I am not aware of any 2.1 clients having this issue.
Uploading momus-mds1-lustre.log.gz which only has default debug levels. It does have entries going back to Mar 1, and our clients hit the issue on Mar 2. |
| Comment by Zhenyu Xu [ 11/Mar/12 ] |
|
From the MDS log, I found the MGS took at least (1331314772 - 1330619506) = 695266 seconds (~8 days) to handle an llog request:

20000000:01000000:7.0:1330619506.677853:0:11940:0:(mgs_handler.c:748:mgs_handle()) @@@ llog_init req@ffff88042b105850 x1385125352729595/t0(0) o-1->bec17d60-303b-9c69-d653-8a476440786a@NET_0x50005c0a8748c_UUID:0/0 lens 264/0 e 0 to 0 dl 1330619567 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
...
20000000:01000000:14.0:1331314772.631315:0:11563:0:(mgs_handler.c:759:mgs_handle()) @@@ llog read header req@ffff880428832050 x1395901311135885/t0(0) o-1->b8fd6e82-d2d8-a6fb-a5ca-80fffdf4a765@NET_0x50005c0a8700d_UUID:0/0 lens 240/0 e 0 to 0 dl 1331314833 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1

There are also tons of network errors (-113: EHOSTUNREACH) connecting to 172.16.66.28. What is the 172.16.66.28 node? |
| Comment by Ned Bass [ 12/Mar/12 ] |
That is an lnet router for one of our smaller clusters (not the cluster involved in this issue). The uuid in those log entries resolves to node sierra1139, but that node does not appear to ever have been affected by the excessive slab issue. I'm not sure why it took so long to handle the llog request. |
| Comment by Zhenyu Xu [ 14/Mar/12 ] |
|
Would it be possible to grab some MGS/MDS logs while it is serving the affected clients? The key point here is to find out why the server's committed transno is not being pushed forward; it could be a server-side problem of not committing the clients' requests, or a communication problem between them. |
| Comment by Ned Bass [ 14/Mar/12 ] |
Can you please clarify what you want me to capture? Do you want me to unmount the clients? What debug levels? |
| Comment by Zhenyu Xu [ 14/Mar/12 ] |
|
It would be great to capture a -1 level debug log (lctl dk > [MDS log file]) of the MDS/MGS while it is serving the affected clients, i.e. don't umount the clients that are under memory pressure. I want to know why the MDS/MGS has left so many requests from the affected clients uncommitted. Thank you. |
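A minimal sketch of that capture on the MDS/MGS node, assuming lctl set_param accepts debug and debug_mb on this release, and using an illustrative output path and buffer size:

# enable full debugging and enlarge the trace buffer so it wraps less quickly
lctl set_param debug=-1
lctl set_param debug_mb=256      # assumed tunable; /proc/sys/lnet/debug_mb on older releases

# let the affected clients talk to the server for a while, then dump the buffer
lctl dk > /tmp/mds-full-debug.log

# restore a quieter mask afterwards (the mask shown is the assumed default)
lctl set_param debug="ioctl neterror warning error emerg ha config console"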
| Comment by Ned Bass [ 15/Mar/12 ] |
|
Uploading new MDS log. |
| Comment by Ned Bass [ 15/Mar/12 ] |
|
I enabled -1 debugging on the MDS then ran the following in Lustre on an affected client:

dd if=/dev/zero of=foo bs=1M count=1

This activity seemed to flush the memory from the client slab.

Before:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1474592 1474375 99% 0.03K 13166 112 52664K size-32
1331044 1330817 99% 1.00K 332761 4 1331044K size-1024
1249384 1248004 99% 0.06K 21176 59 84704K size-64
467268 467268 100% 8.00K 467268 1 3738144K size-8192
438360 436835 99% 0.50K 54795 8 219180K size-512
429758 429758 100% 16.00K 429758 1 6876128K size-16384

After:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1080688 1044603 96% 0.03K 9649 112 38596K size-32
105433 66449 63% 0.06K 1787 59 7148K size-64
60060 59420 98% 0.25K 4004 15 16016K size-256
51180 15743 30% 0.12K 1706 30 6824K size-128
42348 41751 98% 1.00K 10587 4 42348K size-1024
37549 37549 100% 8.00K 37549 1 300392K size-8192 |
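For reference, a snapshot in that shape can be produced with slabtop from procps; this is a sketch of one way to gather such before/after numbers, not necessarily how they were collected here:

# one-shot dump of the slab caches, sorted by cache size (largest first)
slabtop -o -s c | head -n 20

# or the raw counters behind it
head -n 20 /proc/slabinfo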
| Comment by Zhenyu Xu [ 18/Mar/12 ] |
|
I haven't found any abnormality in the MDS log. According to the previous comment, a normal dd operation alleviated a client's slab pressure. So do the 1.8 clients start to grow slab usage after they are mounted on 2.1 servers even without further operations on the clients? Or, even after the alleviating dd ops, will the slab consumption on the client still grow high as time passes? |
| Comment by Ned Bass [ 19/Mar/12 ] |
|
We don't yet know how to make the slab consumption grow. The instances have either been rare or under-reported, but we're definitely not seeing this on all 1.8 clients mounting 2.1 servers. All instances have been on 1.8 clients running user jobs that were presumably doing I/O to lustre. So the problem may be associated with some specific usage pattern that we haven't yet pinpointed.

I was not expecting the dd/ls/rm operation to alleviate the problem. I believe when we first found clients in this state we had to unmount lustre to free the slab memory, but we may not have tried just doing I/O. We have one more node in this state available for debugging. Let me know if you'd like any specific test run with it. |
| Comment by Mahmoud Hanafi [ 01/Jun/12 ] |
|
Looks like we are hitting this same bug. Our 1.8.6 clients connecting to 2.1.1 servers are seeing their slab grow. Unmounting lustre frees up the slab. |
| Comment by Zhenyu Xu [ 04/Jun/12 ] |
|
Mahmoud, is it easy to reproduce the issue? Would you mind collecting logs from an affected client and the MDS? We'd like to know the client ops pattern which causes the memory pressure. From the previous analysis, the clients hold a huge number of pending requests which appear not to be committed on the MDS (or the MDS failed to pass the committed transaction number back to the client). |
| Comment by Andreas Dilger [ 13/Jul/12 ] |
|
In such a situation (large slab usage on a client) it should be possible to enable "rpctrace,info" debugging on the client, wait 20s, send a "touch", and grab the debug log afterward. This should be enough time to get some RPCs, and then check in ptlrpc_free_committed() how many RPCs are being kept, what the last_committed value is, etc. It would be useful to know whether the RPCs are MDS or OSS related, and which opcode (if any) is not being cleaned up. The most complex RPC handling is for opens, since they need to be saved until after close in order to reopen the file if the MDS crashes. |
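A minimal sketch of that client-side capture, assuming lctl set_param debug accepts the +flag form, that the filesystem is mounted at /p/lscratchd, and with illustrative file names:

# add rpctrace and info to the current debug mask on the affected client
lctl set_param debug=+rpctrace
lctl set_param debug=+info

# give ptlrpc_free_committed() a chance to run, then generate a small MDS RPC
sleep 20
touch /p/lscratchd/lu1151-probe

# dump the debug buffer; look for kept RPCs, their opcodes, and the last_committed value
lctl dk > /tmp/client-rpctrace.log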
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |