<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:13:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1151] 1.8 client using excessive slab when mounting 2.1 server</title>
                <link>https://jira.whamcloud.com/browse/LU-1151</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have lustre 1.8 clients that are mounting 2.1 servers.  These clients have slab usage that is growing to excessive amounts, for instance 20GB in slab on a 24GB node.  On a node with 20GB of slab, about 13GB was accounted for in /proc/sys/lustre/memused.&lt;/p&gt;

&lt;p&gt;Upon umount&apos;ing the 2.1 filesystem, nearly all of the 20GB slab was freed very quickly.&lt;/p&gt;

&lt;p&gt;The vast majority of the slab usage is in the generic slabs, mostly 16k and 8k.&lt;/p&gt;

&lt;p&gt;I am attaching the log &quot;sierra1150_lustre.log.txt.bz2&quot;, which was gathered by enabling full debugging and running something like this:&lt;/p&gt;

&lt;p&gt;  lctl debug_daemon enable; umount /p/lscratchd; lctl debug_daemon disable&lt;/p&gt;

&lt;p&gt;The log isn&apos;t quite as large as I would have liked, so I probably need to let the daemon run longer...or maybe try to use a larger in-memory buffer, if that is possible with slab usage as it is...&lt;/p&gt;

&lt;p&gt;Looking at the rest of the cluster after clearing some of the worst slab problems, we still have slab usage that varies pretty widely from 1GB to 9GB across the rest of the nodes, and I suspect that it only increases with time.&lt;/p&gt;

&lt;p&gt;This slab does not free itself with normal memory pressure, and is causing applications to OOM, so this is a pretty big problem for our production systems at the moment.&lt;/p&gt;

&lt;p&gt;I don&apos;t really need a fix for 1.8, but if the problem is common with the 2.1 code, we&apos;ll definitely need a 2.1 fix.&lt;/p&gt;

&lt;p&gt;Our 1.8 is getting old, so perhaps there is a fix that I missed.  But I didn&apos;t turn up anything obvious when I did a few searches.&lt;/p&gt;</description>
                <environment>lustre 1.8.5.0-6chaos</environment>
        <key id="13374">LU-1151</key>
            <summary>1.8 client using excessive slab when mounting 2.1 server</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Wed, 29 Feb 2012 13:20:40 +0000</created>
                <updated>Mon, 29 May 2017 03:20:40 +0000</updated>
                            <resolved>Mon, 29 May 2017 03:20:40 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="29975" author="morrone" created="Wed, 29 Feb 2012 13:23:19 +0000"  >&lt;p&gt;I forgot to set the &quot;Priority&quot; higher.  It should be at least &quot;Major&quot;, but I don&apos;t have the power to change it after initial submission time.&lt;/p&gt;</comment>
                            <comment id="29977" author="pjones" created="Wed, 29 Feb 2012 13:48:31 +0000"  >&lt;p&gt;Don&apos;t worry Chris - I fixed the priority on your behalf!&lt;/p&gt;

&lt;p&gt;Bobi,&lt;/p&gt;

&lt;p&gt;could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="30007" author="adilger" created="Wed, 29 Feb 2012 17:18:03 +0000"  >&lt;p&gt;At first glance this looks like saved RPC open requests (1882 of them, one example below, earlier parts in the log don&apos;t look interesting), so possibly there is a problem with deciding when the file is closed and sending the RPC to the MDS?  Each RPC (intent open + close) takes about 17kB, so 1880 * 17kB = 32MB, which isn&apos;t enough to account for 20GB of memory.&lt;/p&gt;

&lt;p&gt;It is possible to find the amount of memory that Lustre has allocated directly via:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ lctl get_param memused*
memused=971035
memused=23744355

memused_max=24133830
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first number is the LNet usage, while the second is Lustre usage, and the memused_max is the maximum allocation by Lustre.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:5:1330476725.738402:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen  req@ffff810609702800 x1385125338817546/t203791993381 o101-&amp;gt;lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738408:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000002:00000010:5:1330476725.738409:0:22985:0:(mdc_request.c:611:mdc_commit_open()) kfreed &apos;mod&apos;: 24 at ffff810538807520.
00000100:00000001:5:1330476725.738410:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738411:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff810609702800 x1385125338817546/t203791993381 o101-&amp;gt;lsd-MDT0000_UUID@172.16.64.141@tcp:12/10 lens 1024/5128 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/301
00000100:00000001:5:1330476725.738414:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738414:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed &apos;request-&amp;gt;rq_repbuf&apos;: 5128 at ffff81042d2ac000.
00000100:00000040:5:1330476725.738416:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1882
00000020:00000001:5:1330476725.738417:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738417:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1883 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738418:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738419:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed &apos;request-&amp;gt;rq_reqmsg&apos;: 1024 at ffff810609702400.
00000100:00000010:5:1330476725.738420:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed &apos;request&apos;: 584 at ffff810609702800.
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738421:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00000100:00100000:5:1330476725.738422:0:22985:0:(client.c:1946:ptlrpc_free_committed()) @@@ free request with old gen  req@ffff8103c2d05000 x1385125338817547/t203791993385 o35-&amp;gt;lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738425:0:22985:0:(client.c:1947:ptlrpc_free_committed()) Process leaving via free_req (rc=0 : 0 : 0)
00000100:00000001:5:1330476725.738426:0:22985:0:(client.c:1813:__ptlrpc_req_finished()) Process entered
00000100:00000040:5:1330476725.738426:0:22985:0:(client.c:1825:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff8103c2d05000 x1385125338817547/t203791993385 o35-&amp;gt;lsd-MDT0000_UUID@172.16.64.141@tcp:23/10 lens 360/9864 e 0 to 1 dl 1330303590 ref 1 fl Interpret:R/4/0 rc 0/0
00000100:00000001:5:1330476725.738429:0:22985:0:(client.c:1746:__ptlrpc_free_req()) Process entered
00000100:00000010:5:1330476725.738430:0:22985:0:(client.c:1777:__ptlrpc_free_req()) kfreed &apos;request-&amp;gt;rq_repbuf&apos;: 9864 at ffff8103f4fa0000.
00000100:00000040:5:1330476725.738433:0:22985:0:(client.c:1782:__ptlrpc_free_req()) PUTting export ffff81018e306200 : new refcount 1881
00000020:00000001:5:1330476725.738433:0:22985:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:5:1330476725.738434:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1882 obd=lsd-MDT0000-mdc-ffff81033774b800
00000020:00000001:5:1330476725.738435:0:22985:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:5:1330476725.738435:0:22985:0:(client.c:1796:__ptlrpc_free_req()) kfreed &apos;request-&amp;gt;rq_reqmsg&apos;: 360 at ffff8104914b5200.
00000100:00000010:5:1330476725.738436:0:22985:0:(client.c:1799:__ptlrpc_free_req()) kfreed &apos;request&apos;: 584 at ffff8103c2d05000.
00000100:00000001:5:1330476725.738437:0:22985:0:(client.c:1801:__ptlrpc_free_req()) Process leaving
00000100:00000001:5:1330476725.738438:0:22985:0:(client.c:1829:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
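The arithmetic in the comment above can be double-checked quickly; a minimal sketch using only the figures quoted in that comment (the 1880 saved request count and the ~17 kB per open+close RPC estimate):

```python
# Sanity-check the estimate above: ~1880 saved open RPCs at ~17 KiB each
# account for only a few tens of MiB, nowhere near the 20 GB of slab observed.
KIB = 1024

saved_rpcs = 1880            # open requests held for replay (count from the log)
bytes_per_rpc = 17 * KIB     # approx. intent-open + close size (from the comment)

total_bytes = saved_rpcs * bytes_per_rpc
print(total_bytes // (KIB * KIB), "MiB")  # ~31 MiB
```

So the saved open requests alone are two to three orders of magnitude short of explaining the slab growth, which is why the investigation moved on to the import refcount.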
                            <comment id="30069" author="morrone" created="Wed, 29 Feb 2012 19:56:18 +0000"  >&lt;p&gt;Yes, I&apos;ve been looking at the memused.  I think my confusion was just that Lustre is tracking REQUESTED memory, whereas slabinfo reports total slab memory, including the waste.  In other words, if Lustre allocates 9K, it will come from the 16K slab and waste 7K.  Lustre&apos;s meminfo doesn&apos;t record the wasted 7K.  Not that there is really anything wrong with that; I just failed to account for it.&lt;/p&gt;

&lt;p&gt;So I think that explains the discrepancy of slabinfo showing more usage than the combined lnet/lustre memused fields report.&lt;/p&gt;

&lt;p&gt;Do you have enough information on this for now?  Is there something else that I can get you?  There are other production fires to attend to, but this bug is my top priority at the moment.&lt;/p&gt;

&lt;p&gt;Also be aware that there are all manner of lustre problems happening now or recently.  One of the myriad problems is causing clients to reconnect to the server pretty frequently, and there are problems with both 1.8 and 2.1 servers that can prevent that from being successful (with 2.1, it&apos;s the shrinker bug.  I don&apos;t know what&apos;s wrong with the 1.8 server yet, but it might be something else).  So it may be that some of the 1.8 memory problems are due to exercising various cleanup and recovery paths that are out of the ordinary.&lt;/p&gt;

&lt;p&gt;But that is vague speculation.  I just don&apos;t want you guys to think that things are otherwise quiet and slab usage is just growing in a vacuum.&lt;/p&gt;</comment>
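The rounding described in the comment above can be illustrated with a small sketch. The cache sizes below are the generic power-of-two size-N slabs that appear in the slabinfo dumps elsewhere in this ticket; the helper `charged_size` is hypothetical, for illustration only:

```python
# Illustrate why Lustre's memused (requested bytes) undercounts slabinfo:
# the generic kmalloc caches round every allocation up to the next size-N slab.
SLAB_SIZES = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]  # bytes

def charged_size(requested: int) -> int:
    """Size of the generic slab object actually charged for this request."""
    return next(size for size in SLAB_SIZES if size >= requested)

requested = 9 * 1024                  # Lustre asks for 9K...
charged = charged_size(requested)     # ...which is served from the size-16384 slab
print(charged, charged - requested)   # 16384 bytes charged, 7168 bytes wasted
```

In this example more than 40% of the slab object is rounding waste that never appears in Lustre's memused counter, consistent with slabinfo reporting more than memused.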
                            <comment id="30090" author="bobijam" created="Thu, 1 Mar 2012 09:02:51 +0000"  >&lt;div class=&quot;panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;panelContent&quot;&gt;
&lt;p&gt;00000100:02000000:9:1328810057.695332:0:20270:0:(import.c:1346:ptlrpc_import_recovery_state_machine()) lsd-MDT0000-mdc-ffff81033774b800: Connection restored to lsd-MDT0000 (at 172.16.64.141@tcp)&lt;br/&gt;
...&lt;br/&gt;
&lt;font color=&quot;red&quot;&gt;before umount, logs showed client just try to recover MDS connections, and lots of expired requests happened&lt;/font&gt;&lt;br/&gt;
...&lt;br/&gt;
00000100:00000100:9:1328837650.839813:0:20270:0:(client.c:1484:ptlrpc_expire_one_request()) @@@ timeout (sent at 1328837593, 57s ago)  req@ffff81040e05cc00 x1385125273128097/t0 o38-&amp;gt;lsd-MDT0000_UUID@172.16.64.141@tcp:12/10&lt;br/&gt;
...&lt;br/&gt;
00000020:00000040:1:1330476656.479243:0:20271:0:(genops.c:809:class_import_get()) import ffff8101bfae1800 refcount=1456110 obd=lsd-MDT0000-mdc-ffff81033774b800&lt;br/&gt;
&lt;font color=&quot;blue&quot;&gt;The first time import ffff8101bfae1800 showed in the log&lt;/font&gt;&lt;br/&gt;
...&lt;br/&gt;
00000020:00000040:0:1330476681.565948:0:20269:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1456109 obd=lsd-MDT0000-mdc-ffff81033774b800&lt;br/&gt;
...&lt;br/&gt;
&lt;font color=&quot;blue&quot;&gt;no logs about this import ffff8101bfae1800&lt;/font&gt;&lt;br/&gt;
00000020:00000040:5:1330476725.738417:0:22985:0:(genops.c:824:class_import_put()) import ffff8101bfae1800 refcount=1883 obd=lsd-MDT0000-mdc-ffff81033774b800                            &lt;br/&gt;
&lt;font color=&quot;blue&quot;&gt; =======&amp;gt; 44 seconds later, the import&apos;s refcount dramatically dropped to 1883 &lt;/font&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
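The ~1.45 million import references found above also line up with the observed slab usage, if one assumes each reference corresponds to one saved request at the ~17 kB per-RPC figure estimated earlier in the ticket. That correspondence is a rough assumption, not a measurement; a sketch:

```python
# Rough consistency check: if most of the ~1.45M import references were held
# by saved requests of ~17 KiB each (the per-RPC estimate from earlier in this
# ticket), the total would be the same order as the 20 GB of slab observed.
KIB = 1024
GIB = 1024 ** 3

refcount = 1456110            # import refcount from the debug log above
bytes_per_request = 17 * KIB  # assumed per saved open+close RPC

estimate = refcount * bytes_per_request
print(round(estimate / GIB, 1), "GiB")  # ~23.6 GiB
```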
                            <comment id="30167" author="morrone" created="Thu, 1 Mar 2012 14:32:16 +0000"  >&lt;p&gt;Ah, great, the ~1.5 million references certainly point the way to the memory usage.&lt;/p&gt;

&lt;p&gt;We have other nodes with lots of memory stuck in slab right now, so if you have a suggestion for getting more information about those references just let me know.&lt;/p&gt;</comment>
                            <comment id="30250" author="morrone" created="Thu, 1 Mar 2012 18:34:01 +0000"  >&lt;p&gt;The plans are still fluid, but it is starting to look like our users are not comfortable with a rapid upgrade to CHAOS5 and Lustre 2.1 on our compute clusters.  It may be that the roll out is slow enough that I will need to patch and upgrade 1.8 on our compute clusters.&lt;/p&gt;

&lt;p&gt;In the meantime, we may start doing umount/mount at the end of jobs as a possible temporary workaround.  I fear the new bugs we&apos;ll hit by doing that constantly, but since things are so bad it is worth a try.&lt;/p&gt;</comment>
                            <comment id="30269" author="bobijam" created="Thu, 1 Mar 2012 23:47:13 +0000"  >&lt;p&gt;Would it be possible to check out the contents of the heavily referenced obd_import structure? I want to check its&lt;/p&gt;


&lt;p&gt;        __u64                     imp_last_replay_transno;&lt;br/&gt;
        __u64                     imp_peer_committed_transno;&lt;/p&gt;

&lt;p&gt;values, to find out whether a huge number of replay requests are linked onto it or not.&lt;/p&gt;</comment>
                            <comment id="30360" author="morrone" created="Fri, 2 Mar 2012 17:46:10 +0000"  >&lt;p&gt;I&apos;ll look into it when I get another node I can work on.  There are two candidates, but I need to wait until the user&apos;s job finishes before I go walking memory.&lt;/p&gt;</comment>
                            <comment id="30735" author="nedbass" created="Thu, 8 Mar 2012 18:15:53 +0000"  >&lt;p&gt;From first candidate node:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct obd_import.imp_refcount ffff81063c7f5000
  imp_refcount = {
    counter = 859444
  }, 
crash&amp;gt; struct obd_import.imp_last_replay_transno ffff81063c7f5000
  imp_last_replay_transno = 204432392643, 
crash&amp;gt; struct obd_import.imp_peer_committed_transno ffff81063c7f5000
  imp_peer_committed_transno = 204430868099, 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and from the other one:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct obd_import.imp_refcount ffff81063d1d6800
  imp_refcount = {
    counter = 161494
  },
crash&amp;gt; struct obd_import.imp_last_replay_transno ffff81063d1d6800
  imp_last_replay_transno = 203717138487, 
crash&amp;gt; struct obd_import.imp_peer_committed_transno ffff81063d1d6800
  imp_peer_committed_transno = 203714086637, 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;Let us know if you&apos;d like to see any other data.&lt;/p&gt;</comment>
                            <comment id="30744" author="bobijam" created="Thu, 8 Mar 2012 22:49:51 +0000"  >&lt;p&gt;Those clients do have a lot of pending requests which are not yet committed on the MDT; is the MDS busy?&lt;/p&gt;

&lt;p&gt;Do any 2.1 clients connecting to this 2.1 server suffer the same memory pressure?&lt;/p&gt;</comment>
                            <comment id="30754" author="bobijam" created="Fri, 9 Mar 2012 04:00:45 +0000"  >&lt;p&gt;Would you please upload the MDS logs? I&apos;d like to check why the MDS got so many unhandled requests. Thanks.&lt;/p&gt;</comment>
                            <comment id="30766" author="nedbass" created="Fri, 9 Mar 2012 12:51:30 +0000"  >&lt;blockquote&gt;&lt;p&gt;Those clients do have a lot of pending requests which are not yet committed on the MDT; is the MDS busy?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;No, the server is not particularly busy at the moment.  Also please note that these clients have been sitting idle for a few days, so they should have had plenty of time to commit their pending requests.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Do any 2.1 clients connecting to this 2.1 server suffer the same memory pressure?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I am not aware of any 2.1 clients having this issue.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Would you please upload the MDS logs? I&apos;d like to check why the MDS got so many unhandled requests. Thanks.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Uploading momus-mds1-lustre.log.gz which only has default debug levels.  It does have entries going back to Mar 1, and our clients hit the issue on Mar 2.&lt;/p&gt;</comment>
                            <comment id="30792" author="bobijam" created="Sun, 11 Mar 2012 00:26:31 +0000"  >&lt;p&gt;From the MDS log, I found the MGS took at least (1331314772 - 1330619506) = 695266 seconds (~8 days) to handle an llog request.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
20000000:01000000:7.0:1330619506.677853:0:11940:0:(mgs_handler.c:748:mgs_handle()) @@@ llog_init  req@ffff88042b105850 x1385125352729595/t0(0) o-1-&amp;gt;bec17d60-303b-9c69-d653-8a476440786a@NET_0x50005c0a8748c_UUID:0/0 lens 264/0 e 0 to 0 dl 1330619567 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
...

20000000:01000000:14.0:1331314772.631315:0:11563:0:(mgs_handler.c:759:mgs_handle()) @@@ llog read header  req@ffff880428832050 x1395901311135885/t0(0) o-1-&amp;gt;b8fd6e82-d2d8-a6fb-a5ca-80fffdf4a765@NET_0x50005c0a8700d_UUID:0/0 lens 240/0 e 0 to 0 dl 1331314833 ref 1 fl Interpret:/ffffffff/ffffffff rc 0/-1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are also tons of network errors (-113:EHOSTUNREACH) connecting to 172.16.66.28; what is the 172.16.66.28 node?&lt;/p&gt;</comment>
                            <comment id="30842" author="nedbass" created="Mon, 12 Mar 2012 13:10:35 +0000"  >&lt;blockquote&gt;&lt;p&gt;There are also tons of network errors (-113:EHOSTUNREACH) connecting to 172.16.66.28; what is the 172.16.66.28 node?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;That is an lnet router for one of the smaller clusters (not the cluster involved in this issue).&lt;/p&gt;

&lt;p&gt;The uuid in those log entries resolves to node sierra1139, but this node does not appear to ever have been affected by the excessive slab issue.  I&apos;m not sure why it took so long to handle the llog request.&lt;/p&gt;</comment>
                            <comment id="31072" author="bobijam" created="Wed, 14 Mar 2012 00:24:54 +0000"  >&lt;p&gt;Would it be possible to grab some MGS/MDS logs of it serving the affected clients? The key point here is to learn what is keeping the server&apos;s commit transno from moving forward; it could be the server failing to commit the clients&apos; requests, or a communication problem between them.&lt;/p&gt;</comment>
                            <comment id="31182" author="nedbass" created="Wed, 14 Mar 2012 14:36:27 +0000"  >&lt;blockquote&gt;&lt;p&gt;Would it be possible to grab some MGS/MDS logs of it serving the affected clients?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Can you please clarify what you want me to capture?  Do you want me to unmount the clients?  What debug levels?&lt;/p&gt;</comment>
                            <comment id="31255" author="bobijam" created="Wed, 14 Mar 2012 22:21:15 +0000"  >&lt;p&gt;It would be great to capture a -1 level debug log (lctl dk &amp;gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;MDS log file&amp;#93;&lt;/span&gt;) of the MDS/MGS while it serves the affected clients, i.e. don&apos;t umount the clients under memory pressure. I want to know why the MDS/MGS missed so many commit requests from the affected clients. Thank you.&lt;/p&gt;
</comment>
                            <comment id="31286" author="nedbass" created="Thu, 15 Mar 2012 12:47:02 +0000"  >&lt;p&gt;Uploading new MDS log.&lt;/p&gt;</comment>
                            <comment id="31287" author="nedbass" created="Thu, 15 Mar 2012 12:49:29 +0000"  >&lt;p&gt;I enabled -1 debugging on the MDS then ran the following in Lustre on an affected client:&lt;/p&gt;

&lt;p&gt;dd if=/dev/zero of=foo bs=1M count=1&lt;br/&gt;
ls -al foo&lt;br/&gt;
rm foo&lt;/p&gt;


&lt;p&gt;This activity seemed to flush the memory from the client slab.&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1474592 1474375  99%    0.03K  13166      112     52664K size-32
1331044 1330817  99%    1.00K 332761        4   1331044K size-1024
1249384 1248004  99%    0.06K  21176       59     84704K size-64
467268 467268 100%    8.00K 467268        1   3738144K size-8192
438360 436835  99%    0.50K  54795        8    219180K size-512
429758 429758 100%   16.00K 429758        1   6876128K size-16384
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1080688 1044603  96%    0.03K   9649      112     38596K size-32
105433  66449  63%    0.06K   1787       59      7148K size-64
 60060  59420  98%    0.25K   4004       15     16016K size-256
 51180  15743  30%    0.12K   1706       30      6824K size-128
 42348  41751  98%    1.00K  10587        4     42348K size-1024
 37549  37549 100%    8.00K  37549        1    300392K size-8192
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="31468" author="bobijam" created="Sun, 18 Mar 2012 21:37:32 +0000"  >&lt;p&gt;Haven&apos;t found any abnormality in the MDS log. According to the previous comment, a normal dd operation alleviated a client&apos;s slab pressure. So do the 1.8 clients start to grow slab usage after they are mounted on 2.1 servers, without further operations on the clients? Or, even after the alleviating dd ops, will the slab consumption on the client still grow as time passes?&lt;/p&gt;
</comment>
                            <comment id="31500" author="nedbass" created="Mon, 19 Mar 2012 14:22:38 +0000"  >&lt;p&gt;We don&apos;t yet know how to make the slab consumption grow.  The instances have either been rare or under-reported, but we&apos;re definitely not seeing this on all 1.8 clients mounting 2.1 servers.  All instances have been on 1.8 clients running user jobs that were presumably doing I/O to lustre.  So the problem may be associated with some specific usage pattern that we haven&apos;t yet pinpointed.  I was not expecting the dd/ls/rm operation to alleviate the problem.  I believe when we first found clients in this state we had to unmount lustre to free the slab memory, but we may not have tried just doing I/O.  We have one more node in this state available for debugging.  Let me know if you&apos;d like any specific test run with it.&lt;/p&gt;</comment>
                            <comment id="39804" author="mhanafi" created="Fri, 1 Jun 2012 11:13:56 +0000"  >&lt;p&gt;Looks like we are hitting this same bug. Our 1.8.6 clients connecting to 2.1.1 servers are seeing their slab grow. Unmounting lustre frees up the slab.&lt;/p&gt;
</comment>
                            <comment id="39890" author="bobijam" created="Mon, 4 Jun 2012 01:59:13 +0000"  >&lt;p&gt;Mahmoud,&lt;/p&gt;

&lt;p&gt;Is it easy to reproduce the issue? Would you mind collecting some affected client and MDS logs? We&apos;d like to know the client ops pattern which causes the memory pressure. From previous analysis, clients got a huge amount of pending requests which seem not to be committed on the MDS (or the MDS failed to pass the committed transaction number back to the client).&lt;/p&gt;</comment>
                            <comment id="41823" author="adilger" created="Fri, 13 Jul 2012 15:10:04 +0000"  >&lt;p&gt;In such a situation (large slab usage on client) it should be possible to enable &quot;rpctrace,info&quot; debugging on the client, wait 20s, send a &quot;touch&quot;, and grab the debug log afterward.  This should be enough time to get some RPCs, and then check in ptlrpc_free_committed() how many RPCs are being kept, what the last_committed value is, etc.  It would be useful to know if the RPCs are MDS or OSS related, and what the opcode (if any) is not being cleaned up.  The most complex RPC handling is for opens, since they need to be saved until after close in order to reopen the file if the MDS crashes.&lt;/p&gt;</comment>
                            <comment id="197363" author="adilger" created="Mon, 29 May 2017 03:20:40 +0000"  >&lt;p&gt;Close old ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="13373">LU-1150</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="10942" name="momus-mds1-lustre.log.gz" size="6226945" author="nedbass" created="Fri, 9 Mar 2012 12:52:53 +0000"/>
                            <attachment id="10955" name="momus-mds1.lustre.log.1331829143.gz" size="8826778" author="nedbass" created="Thu, 15 Mar 2012 12:47:02 +0000"/>
                            <attachment id="10908" name="sierra1150_lustre.log.txt.bz2" size="2914949" author="morrone" created="Wed, 29 Feb 2012 13:20:40 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 27 Jun 2014 13:20:40 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvytr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9846</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 29 Feb 2012 13:20:40 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>