<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:00:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6485] sanity-hsm test 500 memory leak</title>
                <link>https://jira.whamcloud.com/browse/LU-6485</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;TEI-3148 brought out a problem with sanity-hsm test 500 causing a memory leak. sanity-hsm test 500 calls the tests in llapi_hsm_test.c; in particular, test #2 of the llapi_hsm_test.c tests is causing or bringing out the memory leak. Interestingly, for ZFS the memory leak prevents Lustre from shutting down and the test times out after an hour, while ldiskfs runs shut down cleanly. In all cases, no error is detected and the test status is PASS. This ticket is to report the memory leak caused by llapi_hsm_test test #2.&lt;/p&gt;

&lt;p&gt;I ran a series of tests to pinpoint which test was causing the leak. For the last test, I hard-coded test 500 to fail in the hope that dumping the logs would tell us something, but a quick look over the logs didn&#8217;t show me anything obviously wrong. The test patch is at &lt;a href=&quot;http://review.whamcloud.com/#/c/14337/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14337/&lt;/a&gt; and the latest results that this patch produced when running only sanity-hsm test 500 subtest #2 are at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sessions/cff1ffc8-e91e-11e4-9e6e-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sessions/cff1ffc8-e91e-11e4-9e6e-5254006e85c2&lt;/a&gt;. From the end of the suite_stdout log, we see:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;17:09:46:Stopping /mnt/ost2 (opts:-f) on shadow-25vm3
17:09:46:CMD: shadow-25vm3 umount -d -f /mnt/ost2
17:09:57:CMD: shadow-25vm3 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
17:09:57:CMD: shadow-25vm3 ! zpool list -H lustre-ost2 &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 ||
17:09:57:			grep -q ^lustre-ost2/ /proc/mounts ||
17:09:57:			zpool export  lustre-ost2
17:09:57:0022
17:09:57:CMD: shadow-25vm2.shadow.whamcloud.com lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
17:09:57:LustreError: 10994:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 3236953, leaked: 48
17:09:57:
17:09:57:Memory leaks detected
18:10:18:********** Timeout by autotest system **********
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a review-dne-part-2 master failure, results are at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/b89ca232-e8d4-11e4-9e6e-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/b89ca232-e8d4-11e4-9e6e-5254006e85c2&lt;/a&gt; . From the suite_stdout_log:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;07:17:53:Stopping /mnt/ost8 (opts:-f) on shadow-48vm4
07:17:53:CMD: shadow-48vm4 umount -d -f /mnt/ost8
07:17:53:CMD: shadow-48vm4 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:17:53:0022
07:17:53:CMD: shadow-48vm6.shadow.whamcloud.com lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:18:05:LustreError: 17219:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 54130094, leaked: 48
07:18:05:
07:18:05:Memory leaks detected
07:18:05:sanity-hsm returned 0
07:18:05:running: ost-pools 
07:18:05:run_suite ost-pools /usr/lib64/lustre/tests/ost-pools.sh
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a b2_7 sanity-hsm run, the memory leak is reported twice at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/0fca08aa-ca31-11e4-b9cd-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/0fca08aa-ca31-11e4-b9cd-5254006e85c2&lt;/a&gt; in the suite_stdout logs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;07:29:53:CMD: shadow-5vm11 umount -d -f /mnt/ost2
07:30:04:CMD: shadow-5vm11 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:04:CMD: shadow-5vm11 ! zpool list -H lustre-ost2 &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 ||
07:30:04:			grep -q ^lustre-ost2/ /proc/mounts ||
07:30:04:			zpool export  lustre-ost2
07:30:04:0022
07:30:04:CMD: shadow-5vm1.shadow.whamcloud.com lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:16:LustreError: 32001:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 23480928, leaked: 48
07:30:16:
07:30:16:Memory leaks detected
07:30:16:sanity-hsm returned 0
07:30:16:Stopping clients: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com /mnt/lustre (opts:)
07:30:16:CMD: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com running=\$(grep -c /mnt/lustre&apos; &apos; /proc/mounts);
07:30:16:if [ \$running -ne 0 ] ; then
07:30:16:echo Stopping client \$(hostname) /mnt/lustre opts:;
07:30:16:lsof /mnt/lustre || need_kill=no;
07:30:16:if [ x != x -a x\$need_kill != xno ]; then
07:30:16:    pids=\$(lsof -t /mnt/lustre | sort -u);
07:30:16:    if [ -n \&quot;\$pids\&quot; ]; then
07:30:16:             kill -9 \$pids;
07:30:16:    fi
07:30:16:fi;
07:30:16:while umount  /mnt/lustre 2&amp;gt;&amp;amp;1 | grep -q busy; do
07:30:16:    echo /mnt/lustre is still busy, wait one second &amp;amp;&amp;amp; sleep 1;
07:30:16:done;
07:30:16:fi
07:30:16:Stopping clients: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com /mnt/lustre2 (opts:)
07:30:16:CMD: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com running=\$(grep -c /mnt/lustre2&apos; &apos; /proc/mounts);
07:30:16:if [ \$running -ne 0 ] ; then
07:30:16:echo Stopping client \$(hostname) /mnt/lustre2 opts:;
07:30:16:lsof /mnt/lustre2 || need_kill=no;
07:30:16:if [ x != x -a x\$need_kill != xno ]; then
07:30:16:    pids=\$(lsof -t /mnt/lustre2 | sort -u);
07:30:16:    if [ -n \&quot;\$pids\&quot; ]; then
07:30:16:             kill -9 \$pids;
07:30:16:    fi
07:30:16:fi;
07:30:16:while umount  /mnt/lustre2 2&amp;gt;&amp;amp;1 | grep -q busy; do
07:30:16:    echo /mnt/lustre2 is still busy, wait one second &amp;amp;&amp;amp; sleep 1;
07:30:16:done;
07:30:16:fi
07:30:16:CMD: shadow-5vm12 grep -c /mnt/mds1&apos; &apos; /proc/mounts
07:30:16:CMD: shadow-5vm12 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:16:CMD: shadow-5vm12 ! zpool list -H lustre-mdt1 &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 ||
07:30:16:			grep -q ^lustre-mdt1/ /proc/mounts ||
07:30:16:			zpool export  lustre-mdt1
07:30:16:CMD: shadow-5vm11 grep -c /mnt/ost1&apos; &apos; /proc/mounts
07:30:16:CMD: shadow-5vm11 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:16:CMD: shadow-5vm11 ! zpool list -H lustre-ost1 &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 ||
07:30:16:			grep -q ^lustre-ost1/ /proc/mounts ||
07:30:16:			zpool export  lustre-ost1
07:30:16:CMD: shadow-5vm11 grep -c /mnt/ost2&apos; &apos; /proc/mounts
07:30:16:CMD: shadow-5vm11 lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:16:CMD: shadow-5vm11 ! zpool list -H lustre-ost2 &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 ||
07:30:16:			grep -q ^lustre-ost2/ /proc/mounts ||
07:30:16:			zpool export  lustre-ost2
07:30:16:0022
07:30:16:error: dl: No such file or directory opening /proc/fs/lustre/devices
07:30:16:opening /dev/obd failed: No such device
07:30:16:hint: the kernel modules may not be loaded
07:30:16:Error getting device list: No such device: check dmesg.
07:30:16:CMD: shadow-5vm1.shadow.whamcloud.com lsmod | grep lnet &amp;gt; /dev/null &amp;amp;&amp;amp; lctl dl | grep &apos; ST &apos;
07:30:16:LustreError: 32001:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 23480928, leaked: 48
07:30:16:
07:30:16:mv: cannot stat `/tmp/debug&apos;: No such file or directory
07:30:16:Memory leaks detected
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This problem appears to date from the day test 500 was added to sanity-hsm. The first run of test 500 in Maloo with logs, on 2014-11-27, is at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/d1cdb39a-7610-11e4-ad19-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/d1cdb39a-7610-11e4-ad19-5254006e85c2&lt;/a&gt;. This run exhibits the memory leak problem.&lt;/p&gt;</description>
                <environment></environment>
        <key id="29616">LU-6485</key>
            <summary>sanity-hsm test 500 memory leak</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                            <label>hsm</label>
                    </labels>
                <created>Wed, 22 Apr 2015 21:46:47 +0000</created>
                <updated>Tue, 10 May 2016 06:24:56 +0000</updated>
                            <resolved>Sat, 19 Sep 2015 05:40:49 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="113172" author="jhammond" created="Wed, 22 Apr 2015 22:33:06 +0000"  >&lt;p&gt;James, this is probably a struct kkuc_ct_data (it has the right size and smell). In mdc_precleanup() we call libcfs_kkuc_group_rem(0, KUC_GRP_HSM, NULL) which will leak any registered kcds. We should try to align this with what we do in lmv_hsm_ct_unregister(). But I&apos;m not sure this would be completely correct.&lt;/p&gt;</comment>
                            <comment id="113177" author="adilger" created="Wed, 22 Apr 2015 22:57:10 +0000"  >&lt;p&gt;I was just going to comment that debugging memory leaks in Lustre is fairly straightforward (at least until the upstream kernel forces us to rip out all the OBD_ALLOC*() macros).  Install the libcfs module, turn on allocation tracing, and increase the debug buffer size with &lt;tt&gt;lctl set_param debug=+malloc debug_mb=500&lt;/tt&gt; (or use debug_daemon if the test is long and generates a lot of logs).  Then run the problematic test and clean up, and dump the kernel debug log before removing the libcfs module (where kernel debug logs are handled).  This is done automatically by the test framework via &lt;tt&gt;lustre_rmmod&lt;/tt&gt;, which dumps the &lt;tt&gt;$TMP/debug&lt;/tt&gt; file just before &lt;tt&gt;libcfs&lt;/tt&gt; is removed.&lt;/p&gt;

&lt;p&gt;Once you have a debug log, then run &lt;tt&gt;perl lustre/tests/leak_finder.pl &amp;lt;debug_log&amp;gt;&lt;/tt&gt; and grep for &lt;tt&gt;Leak:&lt;/tt&gt; messages.  If you don&apos;t do a full mount/unmount around your test you may get some spurious lines from leak_finder.pl, which can be excluded based on the allocation size.&lt;/p&gt;

&lt;p&gt;If there isn&apos;t already a wiki page that describes this process it might be a good idea to create one?&lt;/p&gt;</comment>
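The debugging recipe in the comment above can be sketched end to end as a shell sequence. This is illustrative only: it assumes a test node with Lustre and the libcfs module loaded, a lustre-release tree in the current directory, and the test framework's usual `ONLY=` subtest-selection convention; adjust paths for your setup.

```shell
# Enable allocation tracing and enlarge the kernel debug buffer
lctl set_param debug=+malloc debug_mb=500

# Run the problematic test with a full setup/cleanup cycle around it
# (ONLY=500 restricts sanity-hsm to test 500; assumed standard tree layout)
ONLY=500 bash lustre/tests/sanity-hsm.sh

# Dump the kernel debug log before unloading modules
# (lustre_rmmod would also dump $TMP/debug just before removing libcfs)
lctl dk /tmp/debug

# Look for unmatched allocations in the dump
perl lustre/tests/leak_finder.pl /tmp/debug | grep 'Leak:'
```

As noted above, spurious `Leak:` lines can appear if the test is not bracketed by a full mount/unmount; those can usually be excluded by allocation size.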
                            <comment id="113328" author="pjones" created="Fri, 24 Apr 2015 17:54:09 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please look into this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="113734" author="hongchao.zhang" created="Wed, 29 Apr 2015 11:49:18 +0000"  >&lt;p&gt;Status update: the cause of the memory leak is clear, and the patch is under way.&lt;/p&gt;</comment>
                            <comment id="113753" author="gerrit" created="Wed, 29 Apr 2015 14:17:21 +0000"  >&lt;p&gt;Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/14638&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14638&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6485&quot; title=&quot;sanity-hsm test 500 memory leak&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6485&quot;&gt;&lt;del&gt;LU-6485&lt;/del&gt;&lt;/a&gt; libcfs: distinguish kernelcomm by serial&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 790de2ae7690fe5ae6c71813d5160713242c907c&lt;/p&gt;</comment>
                            <comment id="127879" author="gerrit" created="Sat, 19 Sep 2015 03:20:19 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/14638/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14638/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6485&quot; title=&quot;sanity-hsm test 500 memory leak&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6485&quot;&gt;&lt;del&gt;LU-6485&lt;/del&gt;&lt;/a&gt; libcfs: embed kr_data into kkuc_reg&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7f67aa42f9123caef3cee714f1e2cee3c6848892&lt;/p&gt;</comment>
                            <comment id="127906" author="pjones" created="Sat, 19 Sep 2015 05:40:49 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                            <comment id="151601" author="hongchao.zhang" created="Tue, 10 May 2016 06:24:56 +0000"  >&lt;p&gt;the patch against b2_7_fe is tracked at &lt;a href=&quot;http://review.whamcloud.com/#/c/20083/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/20083/&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxbd3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>