<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:39:21 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
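For instance (illustrative URL only, using JIRA's standard XML issue-view path):
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-4065/LU-4065.xml?field=key&field=summary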
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4065] sanity-hsm test_300 failure: &apos;cdt state is not stopped&apos; </title>
                <link>https://jira.whamcloud.com/browse/LU-4065</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The test results are at: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/8e9cca2c-2c8b-11e3-85ee-52540035b04c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the client test_log: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;== sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 14:22:47 (1380835367)
Stop coordinator and remove coordinator state at mount
mdt.scratch-MDT0000.hsm_control=shutdown
Changed after 0s: from &apos;&apos; to &apos;stopping&apos;
Waiting 10 secs for update
Updated after 8s: wanted &apos;stopped&apos; got &apos;stopped&apos;
Failing mds1 on mds
Stopping /lustre/scratch/mdt0 (opts:) on mds
pdsh@c15: mds: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to mds
14:23:15 (1380835395) waiting for mds network 900 secs ...
14:23:15 (1380835395) network interface is UP
mount facets: mds1
Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
Started scratch-MDT0000
c15: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 25 sec
Changed after 0s: from &apos;&apos; to &apos;enabled&apos;
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 20s: wanted &apos;stopped&apos; got &apos;enabled&apos;
 sanity-hsm test_300: @@@@@@ FAIL: cdt state is not stopped 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:298:cdt_check_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:3063:test_300()
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</description>
                <environment>Lustre master build # 1715&lt;br/&gt;
OpenSFS cluster with combined MGS/MDS, single OSS with two OSTs, and three clients: one agent + client, one running robinhood/db + client, and one running only as a Lustre client</environment>
        <key id="21269">LU-4065</key>
            <summary>sanity-hsm test_300 failure: &apos;cdt state is not stopped&apos; </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                            <label>HSM</label>
                    </labels>
                <created>Fri, 4 Oct 2013 16:30:14 +0000</created>
                <updated>Mon, 10 Apr 2017 15:26:54 +0000</updated>
                            <resolved>Fri, 2 Oct 2015 13:00:38 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                    <fixVersion>Lustre 2.5.2</fixVersion>
                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="68466" author="jcl" created="Sun, 6 Oct 2013 19:48:03 +0000"  >&lt;p&gt;We have recently seen that lctl set_param -P was not working as expected.&lt;/p&gt;</comment>
                            <comment id="68490" author="bfaccini" created="Mon, 7 Oct 2013 14:07:24 +0000"  >&lt;p&gt;Hello JC,&lt;/p&gt;

&lt;p&gt;Do you mean cdt_state permanent/on-disk change may have failed ??&lt;/p&gt;</comment>
                            <comment id="68510" author="jay" created="Mon, 7 Oct 2013 17:19:41 +0000"  >&lt;p&gt;Hi JC, if I remember correctly, we had a conversation about making procfs exported all the time so that we can set the parameter from the configuration. Can you refresh my memory on this?&lt;/p&gt;</comment>
                            <comment id="68632" author="jcl" created="Tue, 8 Oct 2013 19:52:43 +0000"  >&lt;p&gt;During recent &quot;by hand&quot; tests, we have seen lctl set_param -P not working, as in test 300. But Gerrit is usually OK. We have also seen, with the latest master, that sanity-hsm no longer even starts when run manually at home. Thomas or Aurelien should open a ticket with the details.&lt;/p&gt;</comment>
                            <comment id="68635" author="jhammond" created="Tue, 8 Oct 2013 20:03:53 +0000"  >&lt;p&gt;JC, when you test locally do you install Lustre on all nodes or do you run from a build directory? Using &apos;lctl set_param -P&apos; requires that lctl be installed at /usr/sbin/lctl on all nodes. See &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4041&quot; title=&quot;lctl upcall path is hard coded to /usr/sbin/lctl in process_param2_config()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4041&quot;&gt;LU-4041&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="68654" author="adegremont" created="Wed, 9 Oct 2013 07:58:47 +0000"  >&lt;p&gt;I confirm. We are doing this in our test environment where sanity-hsm is run without Lustre utils being installed in standard paths. This is really &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4041&quot; title=&quot;lctl upcall path is hard coded to /usr/sbin/lctl in process_param2_config()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4041&quot;&gt;LU-4041&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="68684" author="jhammond" created="Wed, 9 Oct 2013 17:51:49 +0000"  >&lt;p&gt;Using set_param ...hsm_control=shutdown on a CDT that is already stopped will create a state where the CDT cannot be started:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# llmount.sh
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopped
# lctl set_param mdt.lustre-MDT0000.hsm_control=shutdown 
mdt.lustre-MDT0000.hsm_control=shutdown
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
error: set_param: setting /proc/fs/lustre/mdt/lustre-MDT0000/hsm_control=enabled: Operation already in progress
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=stopping
# lctl set_param mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=disabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=disabled
# lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
mdt.lustre-MDT0000.hsm_control=enabled
# lctl get_param mdt.lustre-MDT0000.hsm_control
mdt.lustre-MDT0000.hsm_control=enabled
# pgrep hsm
#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Amusingly, if you try to remount the MDT to resolve this state, then you get the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Message from syslogd@t at Oct  9 13:00:45 ...
 kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) ASSERTION( cdt-&amp;gt;cdt_state == CDT_STOPPED ) failed: 

Message from syslogd@t at Oct  9 13:00:45 ...
 kernel:LustreError: 7626:0:(mdt_coordinator.c:417:hsm_cdt_procfs_fini()) LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="68685" author="jhammond" created="Wed, 9 Oct 2013 17:57:37 +0000"  >&lt;p&gt;I think we need some more protective logic in lprocfs_wr_hsm_cdt_control() and its callees. Shouldn&apos;t every write to hsm_control force a call to mdt_hsm_cdt_wakeup()?&lt;/p&gt;</comment>
                            <comment id="69870" author="bfaccini" created="Fri, 25 Oct 2013 09:51:26 +0000"  >&lt;p&gt;Assigned to me since I am working on a fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4093&quot; title=&quot;sanity-hsm test_24d: wanted &amp;#39;SUCCEED&amp;#39; got &amp;#39;WAITING&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4093&quot;&gt;&lt;del&gt;LU-4093&lt;/del&gt;&lt;/a&gt;, where I intend to use cdt_shutdown/cdt_restart upon each copytool_cleanup in the sanity-hsm test-suite. I am therefore strongly concerned with keeping cdt_state accurate, and I have already experienced the same LBUG as John!!&lt;/p&gt;</comment>
                            <comment id="69908" author="bfaccini" created="Fri, 25 Oct 2013 14:28:31 +0000"  >&lt;p&gt;Concerning the cdt_state that can easily become stale, I think I have a patch for it. The problem seems to be that both the shutdown/disabled commands assume the CDT is started and thus simply set cdt_state=STOPPING/DISABLED (expecting the CDT thread to check it and gracefully shut down or go to sleep!), and then if you umount/stop the MDT you simply trigger the ASSERTION( cdt-&amp;gt;cdt_state == CDT_STOPPED ) in hsm_cdt_procfs_fini(). Better/enforced checks in lprocfs_wr_hsm_cdt_control() should fix this.&lt;/p&gt;

&lt;p&gt;Even if I don&apos;t think it is fully safe (against concurrent hsm_control/hsm_state updates &#8230;), my patch at &lt;a href=&quot;http://review.whamcloud.com/8074&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/8074&lt;/a&gt; should prevent such inaccurate cdt_state values (a quick verification sketch follows).&lt;/p&gt;
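
&lt;p&gt;(A minimal verification sketch, assuming the same single-node llmount setup John used above; it is illustrative only and is not taken from the patch or the test suite.)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl get_param mdt.lustre-MDT0000.hsm_control          # expect: stopped
lctl set_param mdt.lustre-MDT0000.hsm_control=shutdown
lctl get_param mdt.lustre-MDT0000.hsm_control          # should settle back to stopped, not stay in stopping
lctl set_param mdt.lustre-MDT0000.hsm_control=enabled
lctl get_param mdt.lustre-MDT0000.hsm_control          # expect: enabled, and a later umount should not LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;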

</comment>
                            <comment id="71956" author="bfaccini" created="Wed, 20 Nov 2013 14:38:45 +0000"  >&lt;p&gt;The patch has landed, but I wonder if we need to add an extra check/test in sanity-hsm to verify that even with multiple/random/repeated hsm_control settings (like John&apos;s series to trigger the LBUG), cdt_state is kept accurate??&lt;/p&gt;</comment>
                            <comment id="72874" author="jamesanunez" created="Thu, 5 Dec 2013 04:03:36 +0000"  >&lt;p&gt;sanity-hsm test 300 is still failing on the OpenSFS cluster with master build #1790. I&apos;ve uploaded results from two runs of sanity-hsm where test 300 fails at &lt;a href=&quot;https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sessions/b223c634-5d51-11e3-ad71-52540035b04c&lt;/a&gt; and &lt;a href=&quot;https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sessions/643460bc-5d4e-11e3-956b-52540035b04c&lt;/a&gt; .&lt;/p&gt;

&lt;p&gt;Test 300 consistently fails on this cluster.&lt;/p&gt;</comment>
                            <comment id="73107" author="bfaccini" created="Mon, 9 Dec 2013 18:06:32 +0000"  >&lt;p&gt;In the two test sessions you just uploaded, sanity-hsm/test_300 seems to fail because the permanent/on-disk config llog setting (&quot;lctl set_param -d -P mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control&quot; in the sanity-hsm cdt_clear_mount_state() function) is mysteriously not re-read after umount/re-mount.&lt;br/&gt;
I first thought it could be a consequence of the following messages found in the MDS dmesg:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 12:30:45 (1386189045)
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params  &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;--- config log set
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Lustre: 9522:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386189056/real 1386189056]  req@ffff8803cc092000 x1453343831874984/t0(0) o251-&amp;gt;MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386189062 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts: 
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn&apos;t end in newline
LustreError: 9685:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
Lustre: MGS: non-config logname received: params  &amp;lt;&amp;lt;&amp;lt;&amp;lt;----- PB ??
Lustre: Skipped 6 previous similar messages
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect
LustreError: 9700:0:(ldlm_lib.c:1733:check_for_next_transno()) scratch-MDT0000: waking for gap in transno, VBR is OFF (skip: 77309411692, ql: 1, comp: 6, conn: 7, next: 77309411694, last_committed: 77309411691)
Lustre: scratch-MDT0000: Recovery over after 0:21, of 7 clients 7 recovered and 0 were evicted.
Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not &apos;stopped&apos; on mds1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But in fact I see the same messages on my test platform, where I am currently unable to reproduce any sanity-hsm/test_300 sub-test failure &#8230;&lt;/p&gt;

&lt;p&gt;James, I am not sure I remember correctly, but didn&apos;t you say that test_300 can fail even when run as the only sub-test of sanity-hsm? If so, can you run the following sequence manually with full debug logs enabled on the MDS/MGS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl set_param debug=-1
lctl clear
lctl mark shutdown
lctl set_param mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control=shutdown
lctl mark set_param-d-P
lctl set_param -d -P mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control
lctl mark get_param
lctl get_param mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control
lctl mark umount
umount &amp;lt;MDT&amp;gt;
lctl mark mount
mount &amp;lt;MDT&amp;gt;
sleep 10
lctl mark get_param
lctl get_param mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="73162" author="jamesanunez" created="Tue, 10 Dec 2013 00:47:03 +0000"  >&lt;p&gt;Yes, I can get test_300 to fail running just that test. &lt;/p&gt;

&lt;p&gt;I ran sanity-hsm test 300, it failed, and then I ran the commands you asked for, with the following output:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c08 ~]# lctl set_param debug=-1
debug=-1
[root@c08 ~]# lctl clear
[root@c08 ~]# lctl mark shutdown
[root@c08 ~]# lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
mdt.scratch-MDT0000.hsm_control=shutdown
[root@c08 ~]# lctl mark set_param-d-P
[root@c08 ~]# lctl set_param -d -P mdt.scratch-MDT0000.hsm_control
error: executing set_param: Success
[root@c08 ~]# lctl mark get_param
 [root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=stopped
[root@c08 ~]# umount /lustre/scratch/mdt0
[root@c08 ~]# lctl mark mount
[root@c08 ~]# mount -t lustre /dev/sda3 /lustre/scratch/mdt0
[root@c08 ~]# sleep 10
[root@c08 ~]# lctl mark get_param
[root@c08 ~]# lctl get_param mdt.scratch-MDT0000.hsm_control
mdt.scratch-MDT0000.hsm_control=enabled
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;dmesg on the MDS/MGS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: shutdown
Lustre: MGS: non-config logname received: params
Lustre: Skipped 8 previous similar messages
Lustre: DEBUG MARKER: set_param-d-P
Lustre: Disabling parameter general.mdt.scratch-MDT0000.hsm_control in log params
LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
Lustre: DEBUG MARKER: get_param
Lustre: Failing over scratch-MDT0000
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.113@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 1 previous similar message
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.112@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 137-5: scratch-MDT0000_UUID: not available for connect from 192.168.2.115@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 2 previous similar messages
Lustre: 7239:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1386634800/real 1386634800]  req@ffff880409bd9800 x1453343832768544/t0(0) o251-&amp;gt;MGC192.168.2.108@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1386634806 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: server umount scratch-MDT0000 complete
Lustre: DEBUG MARKER: mount
Lustre: 7244:0:(obd_mount.c:1246:lustre_fill_super()) VFS Op: sb ffff8803e594b400
LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts: 
format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn&apos;t end in newline
LustreError: 7308:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) scratch-MDT0000: cannot take the layout locks needed for registered restore: -2
LustreError: 11-0: scratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: scratch-MDD0000: changelog on
Lustre: scratch-MDT0000: Will be in recovery for at least 5:00, or until 4 clients reconnect
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7284:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7286:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x1:0x0]: returning size 4096
Lustre: 7324:0:(mdt_handler.c:449:mdt_pack_attr2body()) [0x200000002:0x2:0x0]: returning size 4096
Lustre: MGS: non-config logname received: params
Lustre: Skipped 12 previous similar messages
Lustre: scratch-MDT0000: Recovery over after 0:15, of 4 clients 4 recovered and 0 were evicted.
Lustre: DEBUG MARKER: get_param
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="73185" author="bfaccini" created="Tue, 10 Dec 2013 11:01:08 +0000"  >&lt;p&gt;Well, it seems I forgot to ask you to collect the MDS Lustre debug log at the end!! Sorry &#8230;&lt;br/&gt;
Can you re-run the same sequence of commands and add an &quot;lctl dk /tmp/lustre-debug.log&quot; at the end? Then please attach both the resulting lustre-debug.log file and the dmesg output.&lt;br/&gt;
Thanks again for your help.&lt;/p&gt;
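
&lt;p&gt;For reference, the complete sequence with the debug-log dump appended would look roughly like this (illustrative only; device and mount point taken from your earlier run):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl set_param debug=-1
lctl clear
lctl mark shutdown
lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
lctl mark set_param-d-P
lctl set_param -d -P mdt.scratch-MDT0000.hsm_control
lctl mark get_param
lctl get_param mdt.scratch-MDT0000.hsm_control
lctl mark umount
umount /lustre/scratch/mdt0
lctl mark mount
mount -t lustre /dev/sda3 /lustre/scratch/mdt0
sleep 10
lctl mark get_param
lctl get_param mdt.scratch-MDT0000.hsm_control
lctl dk /tmp/lustre-debug.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;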

&lt;p&gt;BTW, regarding the new and strange &quot;error: executing set_param: Success&quot; and &quot;LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1&quot; messages upon &quot;lctl set_param -d -P mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control&quot;: after checking the source code, they only indicate that the parameter&apos;s value was already the default/disabled. I may push a cosmetic change to avoid this.&lt;/p&gt;</comment>
                            <comment id="73197" author="jamesanunez" created="Tue, 10 Dec 2013 15:39:10 +0000"  >&lt;p&gt;Here are the logs you requested.&lt;/p&gt;</comment>
                            <comment id="73277" author="bfaccini" created="Wed, 11 Dec 2013 12:43:18 +0000"  >&lt;p&gt;James, I am afraid I forgot to ask you to also grow the debug log buffer size with &quot;lctl set_param debug_mb=2048&quot;, because the debug log is missing the traces for the first commands.&lt;/p&gt;

&lt;p&gt;Can you also &quot;mark&quot; the umount? I forgot that too.&lt;/p&gt;

&lt;p&gt;I would also like you to dump the config llog content after the &quot;lctl set_param -d -P mdt.scratch-MDT0000.hsm_control&quot;. To do so, you need to mount the MDT separately as &quot;-t ldiskfs&quot;, then cd to the CONFIGS sub-directory and run the command &quot;llog_reader params&quot;, as sketched below.&lt;/p&gt;
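
&lt;p&gt;A rough sketch of that procedure (the /mnt/mdt mount point is only an example, the MDT must not be mounted as Lustre at that time, and the device is /dev/sda3 as above):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mkdir -p /mnt/mdt
mount -t ldiskfs /dev/sda3 /mnt/mdt
cd /mnt/mdt/CONFIGS
llog_reader params
cd /
umount /mnt/mdt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;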

&lt;p&gt;Last, I created a separate ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4374&quot; title=&quot;lctl set_param -d -P mdt.&amp;lt;FSname&amp;gt;-MDT0000.hsm_control, returns a false error when default already set&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4374&quot;&gt;&lt;del&gt;LU-4374&lt;/del&gt;&lt;/a&gt; for the false error messages &quot;error: executing set_param: Success&quot; and &quot;LustreError: 7226:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1&quot; seen upon &quot;lctl set_param -d -P mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control&quot; when the default is already set.&lt;/p&gt;</comment>
                            <comment id="73359" author="bfaccini" created="Thu, 12 Dec 2013 14:28:47 +0000"  >&lt;p&gt;I have requested access to the OpenSFS cluster and am now waiting for it, in order to work on the issue on the platform where you can reproduce the problem!&lt;/p&gt;</comment>
                            <comment id="79036" author="jamesanunez" created="Tue, 11 Mar 2014 20:25:41 +0000"  >&lt;p&gt;Another instance of this failure at &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/7f2aafdc-a959-11e3-95fe-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/7f2aafdc-a959-11e3-95fe-52540035b04c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="79098" author="bfaccini" created="Wed, 12 Mar 2014 08:43:36 +0000"  >&lt;p&gt;Hmm, I thought (and hoped!) that this problem had disappeared, because this is the first failure of the sanity-hsm/test_300 sub-test since mid-December, according to the Maloo auto-test stats.&lt;/p&gt;

&lt;p&gt;I will try to find a way to reproduce it, but since I still strongly suspect it occurs only with a specific environment/platform, it may not be so easy.&lt;/p&gt;

&lt;p&gt;BTW James, the new instance you pointed to shows details that I am not familiar with: Test group = &quot;acc-sm-c20&quot;, Arch/Lustre version = &quot;2.5.1-RC3--PRISTINE-2.6.32-431.5.1.el6&lt;span class=&quot;error&quot;&gt;&amp;#91;_lustre&amp;#93;&lt;/span&gt;.x86_64&quot;. Can you describe in more detail the configuration/platform/... you were using? Is it again the OpenSFS cluster?&lt;/p&gt;

&lt;p&gt;Also, the test log shows the following error :&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn&apos;t end in newline
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which is due to a missing newline at the end of the message passed to CERROR, with the additional consequence of two merged lines/traces in the Lustre debug log.&lt;br/&gt;
This is still present in master and I pushed a patch to fix it at &lt;a href=&quot;http://review.whamcloud.com/9597&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9597&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="79114" author="jamesanunez" created="Wed, 12 Mar 2014 13:16:38 +0000"  >&lt;p&gt;Bruno, &lt;/p&gt;

&lt;p&gt;Yes, this was run on the OpenSFS platform with RHEL 6.5 Lustre 2.5.1 RC3, single MGS/MDS, single OSS with two OSTs, one node running the latest Robinhood, one agent/client and one client. I don&apos;t know why the test group is &quot;acc-sm-c20&quot;; I was just running sanity-hsm on its own with no other tests.&lt;/p&gt;</comment>
                            <comment id="79243" author="jamesanunez" created="Thu, 13 Mar 2014 15:27:48 +0000"  >&lt;p&gt;Since I&apos;m testing HSM on 2.5.1, I thought I would collect the information Bruno asked for on 09/Oct/13. I ran sanity-hsm test 300 alone and got the &quot;cdt state not stopped on mds1&quot; failure. On the MDT, the first few commands succeeded, but the unmount hung and the node later crashed.&lt;/p&gt;

&lt;p&gt;Here&apos;s what I see in dmesg captured by crash:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: == sanity-hsm test 300: On disk coordinator state kept between MDT umount/mount == 09:30:44 (1394641844)
&amp;lt;6&amp;gt;Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
&amp;lt;4&amp;gt;Lustre: Failing over lscratch-MDT0000
&amp;lt;3&amp;gt;LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
&amp;lt;4&amp;gt;Lustre: 16111:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641848/real 1394641848]  req@ffff88039ae00800 x1462381936924596/t0(0) o251-&amp;gt;MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641854 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
&amp;lt;4&amp;gt;Lustre: server umount lscratch-MDT0000 complete
&amp;lt;6&amp;gt;LDISKFS-fs (sda3): mounted filesystem with ordered data mode. quota=on. Opts:
 
&amp;lt;3&amp;gt;LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.119@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
&amp;lt;4&amp;gt;Lustre: lscratch-MDT0000: used disk, loading
&amp;lt;6&amp;gt;format at mdt_coordinator.c:977:mdt_hsm_cdt_start doesn&apos;t end in newline
&amp;lt;3&amp;gt;LustreError: 16267:0:(mdt_coordinator.c:977:mdt_hsm_cdt_start()) lscratch-MDT0000: cannot take the layout locks needed for registered restore: -2
&amp;lt;4&amp;gt;Lustre: MGS: non-config logname received: params
&amp;lt;3&amp;gt;LustreError: 11-0: lscratch-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
&amp;lt;6&amp;gt;Lustre: lscratch-MDD0000: changelog on
&amp;lt;4&amp;gt;Lustre: lscratch-MDT0000: Will be in recovery for at least 5:00, or until 5 clients reconnect
&amp;lt;6&amp;gt;Lustre: lscratch-MDT0000: Recovery over after 0:29, of 5 clients 5 recovered and 0 were evicted.
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: mdc.lscratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 29 sec
&amp;lt;4&amp;gt;Lustre: MGS: non-config logname received: params
&amp;lt;4&amp;gt;Lustre: Skipped 4 previous similar messages
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: sanity-hsm test_300: @@@@@@ FAIL: hsm_control state is not &apos;stopped&apos; on mds1
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: == sanity-hsm test complete, duration 92 sec == 09:32:06 (1394641926)
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: shutdown
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: set_param-d-P
&amp;lt;6&amp;gt;Lustre: Disabling parameter general.mdt.lscratch-MDT0000.hsm_control in log params
&amp;lt;3&amp;gt;LustreError: 17173:0:(mgs_handler.c:744:mgs_iocontrol()) MGS: setparam err: rc = 1
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: get_param
&amp;lt;4&amp;gt;Lustre: DEBUG MARKER: umount
&amp;lt;4&amp;gt;Lustre: Failing over lscratch-MDT0000
&amp;lt;3&amp;gt;LustreError: 137-5: lscratch-MDT0000_UUID: not available for connect from 192.168.2.118@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
&amp;lt;4&amp;gt;------------[ cut here ]------------
&amp;lt;4&amp;gt;WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
&amp;lt;4&amp;gt;Hardware name: X8DTT-H
&amp;lt;4&amp;gt;list_del corruption. prev-&amp;gt;next should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
&amp;lt;4&amp;gt;Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
&amp;lt;4&amp;gt;Pid: 17120, comm: hsm_cdtr Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81071e27&amp;gt;] ? warn_slowpath_common+0x87/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81071f16&amp;gt;] ? warn_slowpath_fmt+0x46/0x50
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81084220&amp;gt;] ? process_timeout+0x0/0x10
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8129489e&amp;gt;] ? list_del+0x6e/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109b681&amp;gt;] ? remove_wait_queue+0x31/0x50
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8ef6e&amp;gt;] ? mdt_coordinator+0xbce/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff810096f0&amp;gt;] ? __switch_to+0xd0/0x320
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81065df0&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81528090&amp;gt;] ? thread_return+0x4e/0x76e
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8e3a0&amp;gt;] ? mdt_coordinator+0x0/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109aee6&amp;gt;] ? kthread+0x96/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c20a&amp;gt;] ? child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109ae50&amp;gt;] ? kthread+0x0/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;---[ end trace 4c18e762bb5c5a65 ]---
&amp;lt;4&amp;gt;------------[ cut here ]------------
&amp;lt;4&amp;gt;WARNING: at lib/list_debug.c:51 list_del+0x8d/0xa0() (Tainted: G        W  --
-------------   )
&amp;lt;4&amp;gt;Hardware name: X8DTT-H
&amp;lt;4&amp;gt;list_del corruption. next-&amp;gt;prev should be ffff88039810de18, but was 5a5a5a5a5a5a5a5a
&amp;lt;4&amp;gt;Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
&amp;lt;4&amp;gt;Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81071e27&amp;gt;] ? warn_slowpath_common+0x87/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81071f16&amp;gt;] ? warn_slowpath_fmt+0x46/0x50
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81084220&amp;gt;] ? process_timeout+0x0/0x10
&amp;lt;4&amp;gt; [&amp;lt;ffffffff812948bd&amp;gt;] ? list_del+0x8d/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109b681&amp;gt;] ? remove_wait_queue+0x31/0x50
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8ef6e&amp;gt;] ? mdt_coordinator+0xbce/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff810096f0&amp;gt;] ? __switch_to+0xd0/0x320
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81065df0&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81528090&amp;gt;] ? thread_return+0x4e/0x76e
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8e3a0&amp;gt;] ? mdt_coordinator+0x0/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109aee6&amp;gt;] ? kthread+0x96/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c20a&amp;gt;] ? child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109ae50&amp;gt;] ? kthread+0x0/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;---[ end trace 4c18e762bb5c5a66 ]---
&amp;lt;4&amp;gt;Lustre: 17182:0:(client.c:1901:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394641933/real 1394641933]  req@ffff880392bf6c00 x1462381936924904/t0(0) o251-&amp;gt;MGC192.168.2.116@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1394641939 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
&amp;lt;4&amp;gt;general protection fault: 0000 [#1] SMP 
&amp;lt;4&amp;gt;last sysfs file: /sys/devices/system/cpu/online
&amp;lt;4&amp;gt;CPU 7 
&amp;lt;4&amp;gt;Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables nfsd exportfs nfs lockd fscache auth_rpc
gss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode iTCO_wdt iTCO_vendor_support serio_raw i2c_i801 i2c_core sg lpc_ich mfd_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_pi
ix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
&amp;lt;4&amp;gt;
&amp;lt;4&amp;gt;Pid: 17120, comm: hsm_cdtr Tainted: G        W  ---------------    2.6.32-431.5.1.el6_lustre.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
&amp;lt;4&amp;gt;RIP: 0010:[&amp;lt;ffffffff8128a039&amp;gt;]  [&amp;lt;ffffffff8128a039&amp;gt;] strnlen+0x9/0x40
&amp;lt;4&amp;gt;RSP: 0018:ffff88039810dae0  EFLAGS: 00010286
&amp;lt;4&amp;gt;RAX: ffffffff817b5a4e RBX: ffff8803a9176000 RCX: 0000000000000002
&amp;lt;4&amp;gt;RDX: 5a5a5a5a5a5a5a66 RSI: ffffffffffffffff RDI: 5a5a5a5a5a5a5a66
&amp;lt;4&amp;gt;RBP: ffff88039810dae0 R08: 0000000000000073 R09: ffff88043839cd40
&amp;lt;4&amp;gt;R10: 0000000000000001 R11: 000000000000000f R12: ffff8803a9175134
&amp;lt;4&amp;gt;R13: 5a5a5a5a5a5a5a66 R14: 00000000ffffffff R15: 0000000000000000
&amp;lt;4&amp;gt;FS:  0000000000000000(0000) GS:ffff880045ce0000(0000) knlGS:0000000000000000
&amp;lt;4&amp;gt;CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
&amp;lt;4&amp;gt;CR2: 0000000000448000 CR3: 000000082d703000 CR4: 00000000000007e0
&amp;lt;4&amp;gt;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
&amp;lt;4&amp;gt;DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
&amp;lt;4&amp;gt;Process hsm_cdtr (pid: 17120, threadinfo ffff88039810c000, task ffff88082d41a080)
&amp;lt;4&amp;gt;Stack:
&amp;lt;4&amp;gt; ffff88039810db20 ffffffff8128b2f0 ffff880000033b48 ffff8803a9175134
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; ffffffffa0ea175a ffffffffa0ea1758 ffff88039810dcb0 ffff8803a9176000
&amp;lt;4&amp;gt;&amp;lt;d&amp;gt; ffff88039810dbc0 ffffffff8128c738 0000000000000004 0000000affffffff
&amp;lt;4&amp;gt;Call Trace:
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8128b2f0&amp;gt;] string+0x40/0x100
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8128c738&amp;gt;] vsnprintf+0x218/0x5e0
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa052f27b&amp;gt;] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa053effa&amp;gt;] libcfs_debug_vmsg2+0x2ea/0xbc0 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81071e34&amp;gt;] ? warn_slowpath_common+0x94/0xc0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100bb8e&amp;gt;] ? apic_timer_interrupt+0xe/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa053f911&amp;gt;] libcfs_debug_msg+0x41/0x50 [libcfs]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81058d53&amp;gt;] ? __wake_up+0x53/0x70
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8f2af&amp;gt;] mdt_coordinator+0xf0f/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff810096f0&amp;gt;] ? __switch_to+0xd0/0x320
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81065df0&amp;gt;] ? default_wake_function+0x0/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff81528090&amp;gt;] ? thread_return+0x4e/0x76e
&amp;lt;4&amp;gt; [&amp;lt;ffffffffa0e8e3a0&amp;gt;] ? mdt_coordinator+0x0/0x16b0 [mdt]
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109aee6&amp;gt;] kthread+0x96/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c20a&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8109ae50&amp;gt;] ? kthread+0x0/0xa0
&amp;lt;4&amp;gt; [&amp;lt;ffffffff8100c200&amp;gt;] ? child_rip+0x0/0x20
&amp;lt;4&amp;gt;Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e &amp;lt;80&amp;gt; 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
&amp;lt;1&amp;gt;RIP  [&amp;lt;ffffffff8128a039&amp;gt;] strnlen+0x9/0x40
&amp;lt;4&amp;gt; RSP &amp;lt;ffff88039810dae0&amp;gt;

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="85140" author="jlevi" created="Thu, 29 May 2014 15:46:56 +0000"  >&lt;p&gt;Patch landed to Master. Please reopen ticket if more work is needed.&lt;/p&gt;</comment>
                            <comment id="93234" author="jamesanunez" created="Thu, 4 Sep 2014 20:38:49 +0000"  >&lt;p&gt;I&apos;ve run into this error again while testing HSM with the 2.5.3-RC1 version of Lustre on the OpenSFS cluster. The results are at &lt;a href=&quot;https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sessions/60308f12-3472-11e4-995a-5254006e85c2&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="98750" author="bfaccini" created="Mon, 10 Nov 2014 10:09:25 +0000"  >&lt;p&gt;James, according to this auto-test failure report it is unclear to me how this happened ...&lt;br/&gt;
BTW, I ran a Maloo auto-test search for sanity-hsm/test_300 failures, and it seems that it runs pretty well; the failure you reported is the only one in the past two months:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Name            Status          Run at                  Duration        Return code     Error                                           Test log        Bugs
test_300        FAIL    2014-09-05 01:47:58 UTC         31              1               Restart of mds1 failed!                         Preview 10
test_300        FAIL    2014-09-02 13:14:59 UTC         80              1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10
test_300        FAIL    2014-08-27 01:19:32 UTC         107             1               post-failover df: 1                             Preview 10
test_300        FAIL    2014-08-21 19:55:39 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-07-01 22:43:31 UTC         599             1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10
test_300        FAIL    2014-07-01 10:47:59 UTC         107             1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10
test_300        FAIL    2014-06-21 14:32:23 UTC         723             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-06-02 17:16:48 UTC         331             1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10
test_300        FAIL    2014-05-28 05:46:52 UTC         714             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-27 09:56:15 UTC         975             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-22 19:10:52 UTC         105             1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10      LU-4065
test_300        FAIL    2014-05-08 18:55:20 UTC         716             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-05-07 22:34:02 UTC         746             1               import is not in FULL state                     Preview 10      LU-4018
test_300        FAIL    2014-03-26 22:02:07 UTC         77              1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10      LU-4125, LU-4065
test_300        FAIL    2014-03-11 20:01:49 UTC         79              1               hsm_control state is not &apos;stopped&apos; on mds1      Preview 10      LU-4065, LU-4065
test_300        FAIL    2013-12-08 19:39:12 UTC         755             1               import is not in FULL state                                     LU-4018, LU-4361
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="99562" author="sergey" created="Wed, 19 Nov 2014 15:31:52 +0000"  >&lt;p&gt;Hello, we faced the same error at Seagate and are now using the following solution: &lt;a href=&quot;http://review.whamcloud.com/#/c/12783/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/12783/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Short explanation:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;&quot;cdt_set_mount_state enabled&quot; sets the parameter on the server. It uses lctl set_param -P.&lt;/li&gt;
	&lt;li&gt;copytool_cleanup sets hsm_control=shutdown and waits until hsm_control becomes &quot;stopped&quot; (after &quot;shutdown&quot;).&lt;/li&gt;
	&lt;li&gt;At this moment (while copytool_cleanup waits for &quot;stopped&quot;), the MGC retrieves and applies the configuration from the server with hsm_control=enabled, as set in step 1.&lt;/li&gt;
	&lt;li&gt;The MGC starts log processing with a delay of about 10 seconds (see mgc_requeue_thread).&lt;br/&gt;
copytool_cleanup waits 20 seconds and gets hsm_control==enabled because this parameter was modified in step 3 (see the sketch below).&lt;/li&gt;
&lt;/ol&gt;
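
&lt;p&gt;An illustrative way to observe this window (a sketch only, not part of the original report; substitute the real fsname for &amp;lt;FSNAME&amp;gt;):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# step 1: permanent setting stored in the config log
lctl set_param -P mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control=enabled
# step 2: ask the CDT to stop
lctl set_param mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control=shutdown
# steps 3-4: within roughly 10-20 seconds the MGC re-applies the config log and the state flips back to enabled
for i in $(seq 20); do lctl get_param mdt.&amp;lt;FSNAME&amp;gt;-MDT0000.hsm_control; sleep 1; done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;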
</comment>
                            <comment id="104316" author="bfaccini" created="Thu, 22 Jan 2015 11:25:10 +0000"  >&lt;p&gt;Sergei, thanks for this comment and patch, but I wonder if you hit the related issue/failure only when running the sanity-hsm/test_300 sub-test, or also with other uses of &quot;cdt_set_mount_state()&quot; within sanity-hsm?? BTW, I don&apos;t understand why you refer to copytool_cleanup() in your description.&lt;/p&gt;

&lt;p&gt;Also, did you find the detailed behavior you described by analyzing the MGS/MDS nodes&apos; debug logs?&lt;/p&gt;

&lt;p&gt;Also, your patch failed auto-tests due to unrelated failures, and I think you need to rebase it to avoid such issues.&lt;/p&gt;

&lt;p&gt;Last, a Maloo auto-test query indicates that the issue for this ticket has been encountered only once (on 2014-11-19 15:49:17 UTC) during the past four months. I am presently analyzing that failure&apos;s associated MDS/MGS debug logs, and in parallel I am trying to reproduce the problem in-house.&lt;/p&gt;
</comment>
                            <comment id="104690" author="sergey" created="Mon, 26 Jan 2015 14:30:04 +0000"  >&lt;p&gt;Yes, we hit this bug in a lot of sanity-hsm sub-tests: 402, 3, 106 ...&lt;br/&gt;
In our case this was a race between copytool_cleanup and cdt_mount_state, so usually copytool_cleanup failed.&lt;/p&gt;

&lt;p&gt;My view of why test_300 may fail:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;cdt_set_mount_state sets the parameter using -P.&lt;/li&gt;
	&lt;li&gt;&quot;cdt_check_state stopped&quot; waits until hsm_control becomes &quot;stopped&quot; (after cdt_shutdown and cdt_clear_mount_state).&lt;/li&gt;
	&lt;li&gt;At this moment (while &quot;cdt_check_state stopped&quot; waits for &quot;stopped&quot;), the MGC retrieves and applies the configuration from the server with hsm_control=enabled, as set in step 1.&lt;/li&gt;
&lt;/ol&gt;


&lt;blockquote&gt;&lt;p&gt;Also, did you find the detailed behavior you have described by analyzing MGS/MDS nodes debug-log?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;yes&lt;/p&gt;

&lt;p&gt;About reproducing the problem: you may try to make a custom build with MGC_TIMEOUT_MIN_SECONDS = 10 or 15. If it does not break something else, it may help.&lt;/p&gt;</comment>
                            <comment id="128659" author="sergey" created="Mon, 28 Sep 2015 18:29:40 +0000"  >&lt;p&gt;There is a +1 from Andreas in Gerrit. Can somebody else review it so we can move forward?&lt;/p&gt;</comment>
                            <comment id="129095" author="gerrit" created="Fri, 2 Oct 2015 04:14:39 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12783/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12783/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4065&quot; title=&quot;sanity-hsm test_300 failure: &amp;#39;cdt state is not stopped&amp;#39; &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4065&quot;&gt;&lt;del&gt;LU-4065&lt;/del&gt;&lt;/a&gt; tests: hsm copytool_cleanup improvement&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 73bca6c1f4923cdf673fa11486aec04ec3576051&lt;/p&gt;</comment>
                            <comment id="129120" author="jgmitter" created="Fri, 2 Oct 2015 13:00:38 +0000"  >&lt;p&gt;Landed for 2.8.0&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13911" name="lustre-debug.log.tgz" size="1282705" author="jamesanunez" created="Tue, 10 Dec 2013 15:39:10 +0000"/>
                            <attachment id="13910" name="lustre-dmesg.txt" size="2385" author="jamesanunez" created="Tue, 10 Dec 2013 15:39:10 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw4tr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10892</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>