<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:03:03 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6765] mds-survey triggers crash via BUG:sleeping function called from invalid context</title>
                <link>https://jira.whamcloud.com/browse/LU-6765</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running mds-survey on a newly created file system triggers a crash and reboot.&lt;/p&gt;

&lt;p&gt;The MDS and OSS nodes are up and Lustre is running.  Whether the filesystem is mounted on any clients has no effect on the problem; it occurs either way.  The backend is ZFS.  mds-survey was run with all defaults; no environment variables were set to control it.&lt;/p&gt;

&lt;p&gt;The shell shows almost nothing leading up to the crash:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@zwicky-lcrash-mds1:2015-06-24.3]# mds-survey
Wed Jun 24 12:35:05 PDT 2015 /usr/bin/mds-survey from zwicky-lcrash-mds1
mdt 1 file  100000 dir    4 thr    4 create
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Console output is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: Echo OBD driver; http://www.lustre.org/                                                                          
LustreError: 68263:0:(echo_client.c:1676:echo_md_lookup()) lookup MDT0000-tests: rc = -2                                 
LustreError: 68263:0:(echo_client.c:1875:echo_md_destroy_internal()) Can&apos;t find child MDT0000-tests: rc = -2             
Lustre: ctl-lcrash-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt               
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1106                                           
in_atomic(): 0, irqs_disabled(): 1, pid: 68300, name: lctl                                                               
Pid: 68300, comm: lctl Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1                    
Call Trace:                                                                                                              
 [&amp;lt;ffffffff8105e6aa&amp;gt;] ? __might_sleep+0xda/0x100                                                                         
 [&amp;lt;ffffffff8104e05b&amp;gt;] ? __do_page_fault+0x10b/0x510                                                                      
 [&amp;lt;ffffffffa07c0683&amp;gt;] ? libcfs_debug_vmsg2+0x5e3/0xbe0 [libcfs]                                                          
 [&amp;lt;ffffffff8153421e&amp;gt;] ? do_page_fault+0x3e/0xa0                                                                          
 [&amp;lt;ffffffff815315d5&amp;gt;] ? page_fault+0x25/0x30
 [&amp;lt;ffffffff8105d0e2&amp;gt;] ? task_rq_lock+0x42/0xa0
 [&amp;lt;ffffffff81065a3c&amp;gt;] ? try_to_wake_up+0x3c/0x3e0
 [&amp;lt;ffffffffa12dd263&amp;gt;] ? echo_object_free+0x2b3/0x460 [obdecho]
 [&amp;lt;ffffffff81065e35&amp;gt;] ? wake_up_process+0x15/0x20
 [&amp;lt;ffffffff8152efb2&amp;gt;] ? __mutex_unlock_slowpath+0x42/0x60
 [&amp;lt;ffffffff8152ef2b&amp;gt;] ? mutex_unlock+0x1b/0x20
 [&amp;lt;ffffffffa0968051&amp;gt;] ? lu_site_purge+0x411/0x500 [obdclass]
 [&amp;lt;ffffffffa0968581&amp;gt;] ? lu_object_limit+0x71/0x80 [obdclass]
 [&amp;lt;ffffffffa09686c0&amp;gt;] ? lu_object_find_try+0x130/0x260 [obdclass]
 [&amp;lt;ffffffffa09688a1&amp;gt;] ? lu_object_find_at+0xb1/0xe0 [obdclass]
 [&amp;lt;ffffffffa07bd2b8&amp;gt;] ? libcfs_log_return+0x28/0x40 [libcfs]
 [&amp;lt;ffffffffa12292f1&amp;gt;] ? mdd_lookup+0x111/0x180 [mdd]
 [&amp;lt;ffffffffa12dea33&amp;gt;] ? echo_md_create_internal+0x153/0x640 [obdecho]
 [&amp;lt;ffffffffa12e8bb2&amp;gt;] ? echo_md_handler+0x1302/0x1860 [obdecho]
 [&amp;lt;ffffffffa12ea98c&amp;gt;] ? echo_client_iocontrol+0x187c/0x29e0 [obdecho]
 [&amp;lt;ffffffff8113ca91&amp;gt;] ? lru_cache_add_lru+0x21/0x40
 [&amp;lt;ffffffff8115b2fd&amp;gt;] ? page_add_new_anon_rmap+0x9d/0xf0
 [&amp;lt;ffffffff81176e8c&amp;gt;] ? __kmalloc+0x22c/0x240
 [&amp;lt;ffffffffa093131c&amp;gt;] ? class_handle_ioctl+0x165c/0x21e0 [obdclass]
 [&amp;lt;ffffffffa09182ab&amp;gt;] ? obd_class_ioctl+0x4b/0x190 [obdclass]
 [&amp;lt;ffffffff811a5882&amp;gt;] ? vfs_ioctl+0x22/0xa0
 [&amp;lt;ffffffff811a5ea4&amp;gt;] ? do_vfs_ioctl+0x84/0x5e0
 [&amp;lt;ffffffff811a6481&amp;gt;] ? sys_ioctl+0x81/0xa0
 [&amp;lt;ffffffff8100b0b2&amp;gt;] ? system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

</description>
                <environment>Lustre 2.7.54&lt;br/&gt;
SPL/ZFS 0.6.4.1-1&lt;br/&gt;
TOSS kernel 2.6.32-504.8.1.2chaos.ch5.3.x86_64</environment>
        <key id="30815">LU-6765</key>
            <summary>mds-survey triggers crash via BUG:sleeping function called from invalid context</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>patch</label>
                    </labels>
                <created>Wed, 24 Jun 2015 21:08:04 +0000</created>
                <updated>Tue, 22 Dec 2015 03:19:48 +0000</updated>
                            <resolved>Tue, 11 Aug 2015 13:52:30 +0000</resolved>
                                    <version>Lustre 2.7.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="119528" author="ofaaland" created="Wed, 24 Jun 2015 21:09:17 +0000"  >&lt;p&gt;Crash dump is available if there is information you want from it.&lt;/p&gt;</comment>
                            <comment id="119531" author="ofaaland" created="Wed, 24 Jun 2015 21:46:50 +0000"  >&lt;p&gt;Perhaps this is a dupe of &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-5747&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-5747&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="119534" author="ofaaland" created="Wed, 24 Jun 2015 21:53:43 +0000"  >&lt;p&gt;There was a second BUG entry in the console output:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2015-06-24 12:35:07 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
2015-06-24 12:35:07 IP: [&amp;lt;ffffffff8105d0e2&amp;gt;] task_rq_lock+0x42/0xa0
2015-06-24 12:35:07 PGD fd61ea067 PUD fd61eb067 PMD 0
2015-06-24 12:35:07 Oops: 0000 [#1] SMP
2015-06-24 12:35:07 last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0/host0/port-0:0/expander-0:0/port-0:0:2/end_device-0:0:2/target0:0:2/0:0:2:0/state
2015-06-24 12:35:07 CPU 0
2015-06-24 12:35:07 Modules linked in: obdecho(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ptlrpc(U) obdclass(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_sa ib_mad ib_core ib_addr dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sg iTCO_wdt iTCO_vendor_support ses enclosure sd_mod crc_t10dif ipmi_devintf ipmi_si ipmi_msghandler sb_edac edac_core wmi lpc_ich mfd_core ahci i2c_i801 isci libsas ioatdma mpt2sas scsi_transport_sas raid_class ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core [last unloaded: cpufreq_ondemand]
2015-06-24 12:35:07
2015-06-24 12:35:07 Pid: 68300, comm: lctl Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1 Intel Corporation S2600GZ/S2600GZ
2015-06-24 12:35:07 RIP: 0010:[&amp;lt;ffffffff8105d0e2&amp;gt;]  [&amp;lt;ffffffff8105d0e2&amp;gt;] task_rq_lock+0x42/0xa0
2015-06-24 12:35:07 RSP: 0018:ffff880fd61f37c8  EFLAGS: 00010082
2015-06-24 12:35:07 RAX: 0000000000000282 RBX: 00000000000158c0 RCX: ffff880fe291ac78
2015-06-24 12:35:07 RDX: 0000000000000282 RSI: ffff880fd61f3820 RDI: 0000000000000000
2015-06-24 12:35:07 RBP: ffff880fd61f37e8 R08: 0000000000000c0e R09: 0000000000000000
2015-06-24 12:35:07 R10: 0000000000000001 R11: 000000000000000f R12: 0000000000000000
2015-06-24 12:35:07 R13: ffff880fd61f3820 R14: 0000000000000000 R15: 000000000000000f
2015-06-24 12:35:07 FS:  00002aaaabaebb20(0000) GS:ffff880060600000(0000) knlGS:0000000000000000
2015-06-24 12:35:07 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2015-06-24 12:35:07 CR2: 0000000000000008 CR3: 0000000fd61e9000 CR4: 00000000000407f0
2015-06-24 12:35:07 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2015-06-24 12:35:07 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2015-06-24 12:35:07 Process lctl (pid: 68300, threadinfo ffff880fd61f2000, task ffff8810260f1520)
2015-06-24 12:35:07 Stack:
2015-06-24 12:35:07  0000000000000000 ffff880fe0aa8ea0 0000000000000000 0000000000000000
2015-06-24 12:35:07 &amp;lt;d&amp;gt; ffff880fd61f3858 ffffffff81065a3c ffff880fd61f3818 ffffffffa12dd263
2015-06-24 12:35:07 &amp;lt;d&amp;gt; ffff880fd24e5a70 ffff880fd4ebbc78 ffff880fd61f3898 0000000000000282
2015-06-24 12:35:07 Call Trace:
2015-06-24 12:35:07  [&amp;lt;ffffffff81065a3c&amp;gt;] try_to_wake_up+0x3c/0x3e0
2015-06-24 12:35:07  [&amp;lt;ffffffffa12dd263&amp;gt;] ? echo_object_free+0x2b3/0x460 [obdecho]
2015-06-24 12:35:07  [&amp;lt;ffffffff81065e35&amp;gt;] wake_up_process+0x15/0x20
2015-06-24 12:35:07  [&amp;lt;ffffffff8152efb2&amp;gt;] __mutex_unlock_slowpath+0x42/0x60
2015-06-24 12:35:07  [&amp;lt;ffffffff8152ef2b&amp;gt;] mutex_unlock+0x1b/0x20
2015-06-24 12:35:07  [&amp;lt;ffffffffa0968051&amp;gt;] lu_site_purge+0x411/0x500 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffffa0968581&amp;gt;] lu_object_limit+0x71/0x80 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffffa09686c0&amp;gt;] lu_object_find_try+0x130/0x260 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffffa09688a1&amp;gt;] lu_object_find_at+0xb1/0xe0 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffffa07bd2b8&amp;gt;] ? libcfs_log_return+0x28/0x40 [libcfs]
2015-06-24 12:35:07  [&amp;lt;ffffffffa12292f1&amp;gt;] ? mdd_lookup+0x111/0x180 [mdd]
2015-06-24 12:35:07  [&amp;lt;ffffffffa12dea33&amp;gt;] echo_md_create_internal+0x153/0x640 [obdecho]
2015-06-24 12:35:07  [&amp;lt;ffffffffa12e8bb2&amp;gt;] echo_md_handler+0x1302/0x1860 [obdecho]
2015-06-24 12:35:07  [&amp;lt;ffffffffa12ea98c&amp;gt;] echo_client_iocontrol+0x187c/0x29e0 [obdecho]
2015-06-24 12:35:07  [&amp;lt;ffffffff8113ca91&amp;gt;] ? lru_cache_add_lru+0x21/0x40
2015-06-24 12:35:07  [&amp;lt;ffffffff8115b2fd&amp;gt;] ? page_add_new_anon_rmap+0x9d/0xf0
2015-06-24 12:35:07  [&amp;lt;ffffffff81176e8c&amp;gt;] ? __kmalloc+0x22c/0x240
2015-06-24 12:35:07  [&amp;lt;ffffffffa093131c&amp;gt;] class_handle_ioctl+0x165c/0x21e0 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffffa09182ab&amp;gt;] obd_class_ioctl+0x4b/0x190 [obdclass]
2015-06-24 12:35:07  [&amp;lt;ffffffff811a5882&amp;gt;] vfs_ioctl+0x22/0xa0
2015-06-24 12:35:07  [&amp;lt;ffffffff811a5ea4&amp;gt;] do_vfs_ioctl+0x84/0x5e0
2015-06-24 12:35:07  [&amp;lt;ffffffff811a6481&amp;gt;] sys_ioctl+0x81/0xa0
2015-06-24 12:35:07  [&amp;lt;ffffffff8100b0b2&amp;gt;] system_call_fastpath+0x16/0x1b
2015-06-24 12:35:07 Code: 89 74 24 18 0f 1f 44 00 00 48 c7 c3 c0 58 01 00 49 89 fc 49 89 f5 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 55 00 &amp;lt;49&amp;gt; 8b 44 24 08 49 89 de 8b 40 18 4c 03 34 c5 60 0c c0 81 4c 89
2015-06-24 12:35:07 RIP  [&amp;lt;ffffffff8105d0e2&amp;gt;] task_rq_lock+0x42/0xa0
2015-06-24 12:35:07  RSP &amp;lt;ffff880fd61f37c8&amp;gt;
2015-06-24 12:35:07 CR2: 0000000000000008
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="119745" author="pjones" created="Fri, 26 Jun 2015 21:18:12 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise on this ticket?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="120238" author="laisiyao" created="Fri, 3 Jul 2015 00:58:13 +0000"  >&lt;p&gt;This looks like memory corruption; I&apos;ll try to reproduce it and understand more.&lt;/p&gt;</comment>
                            <comment id="121362" author="ofaaland" created="Wed, 15 Jul 2015 16:43:41 +0000"  >&lt;p&gt;Lai,&lt;/p&gt;

&lt;p&gt;Any update?&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="121500" author="ofaaland" created="Thu, 16 Jul 2015 22:52:56 +0000"  >&lt;p&gt;I find mds-survey runs successfully at earlier commits, e.g.&lt;/p&gt;

&lt;p&gt;6d8c562 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3181&quot; title=&quot;open by FID for write with O_LOV_DELAY_CREATE fails for files on MDT1&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3181&quot;&gt;&lt;del&gt;LU-3181&lt;/del&gt;&lt;/a&gt; mdt: mdt_cross_open ...&lt;br/&gt;
0041b39 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4735&quot; title=&quot;Build Xeon Phi client RPMs for SuSE SLES SP2,SP3&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4735&quot;&gt;&lt;del&gt;LU-4735&lt;/del&gt;&lt;/a&gt; lbuild: Build Xeon Phi ...&lt;br/&gt;
e15e92d &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2675&quot; title=&quot;clang: code cleanups for sparse static analyzer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2675&quot;&gt;&lt;del&gt;LU-2675&lt;/del&gt;&lt;/a&gt; lmv: remove liblustre ...&lt;/p&gt;

&lt;p&gt;I&apos;m bisecting now, hope to finish today.&lt;/p&gt;</comment>
                            <comment id="121501" author="ofaaland" created="Thu, 16 Jul 2015 23:17:16 +0000"  >&lt;p&gt;Bisecting indicates the issue was introduced with this commit.  I&apos;ll run a few times with the commit prior to double-check and post here when I&apos;ve confirmed:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;bc34babc1765f6f99220256e96ce5dc5bb390676&amp;#93;&lt;/span&gt; &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5331&quot; title=&quot;qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5331&quot;&gt;&lt;del&gt;LU-5331&lt;/del&gt;&lt;/a&gt; obdclass: serialize lu_site purge&lt;/p&gt;</comment>
                            <comment id="121504" author="ofaaland" created="Fri, 17 Jul 2015 00:08:44 +0000"  >&lt;p&gt;I see at least one flaw in lu_site_purge().&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;CFS_INIT_LIST_HEAD(&amp;amp;dispose);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;occurs before ls_purge_mutex is taken, outside the critical section.  So one thread could call CFS_INIT_LIST_HEAD while another thread is adding entries to dispose via&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;cfs_list_move(&amp;amp;h-&amp;gt;loh_lru, &amp;amp;dispose);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m not sure there aren&apos;t other issues, but I&apos;ll submit a patch for that much.&lt;/p&gt;</comment>
                            <comment id="121506" author="ofaaland" created="Fri, 17 Jul 2015 00:54:29 +0000"  >&lt;p&gt;Nope, I was wrong.  dispose is local.  Looking further.&lt;/p&gt;</comment>
                            <comment id="121508" author="ofaaland" created="Fri, 17 Jul 2015 01:27:14 +0000"  >&lt;p&gt;I verified that I reliably encounter the crash when I run mds-survey on lustre built from&lt;/p&gt;

&lt;p&gt;bc34bab &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5331&quot; title=&quot;qsd_handler.c:1139:qsd_op_adjust()) ASSERTION( qqi ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5331&quot;&gt;&lt;del&gt;LU-5331&lt;/del&gt;&lt;/a&gt; obdclass: serialize lu_site purge&lt;/p&gt;

&lt;p&gt;and reliably run mds-survey successfully with lustre built from the prior commit,&lt;/p&gt;

&lt;p&gt;6f104f &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5061&quot; title=&quot;add lnb_ prefix to members of struct niobuf_local&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5061&quot;&gt;&lt;del&gt;LU-5061&lt;/del&gt;&lt;/a&gt; obd: add rnb_ prefix to struct niobuf_remote members&lt;/p&gt;</comment>
                            <comment id="121513" author="pjones" created="Fri, 17 Jul 2015 04:38:31 +0000"  >&lt;p&gt;Nice detective work, Olaf! Lai, any suggestions as to how to fix this issue?&lt;/p&gt;</comment>
                            <comment id="121563" author="pjones" created="Fri, 17 Jul 2015 17:47:57 +0000"  >&lt;p&gt;Lai is on vacation, so could you please advise, Niu?&lt;/p&gt;</comment>
                            <comment id="121643" author="ofaaland" created="Sun, 19 Jul 2015 19:04:23 +0000"  >&lt;p&gt;I also see output from the list_debug code in the kernel, in the console log:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P           ---------------   )
Hardware name: KVM
list_add corruption. prev-&amp;gt;next should be next (ffff880032f2d2a8), but was ffff88000f2f8a70. (prev=ffff88000f2f8a70).
Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) nodemap(U) osd_zfs(U) lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_nat bridge stp llc dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c autofs4 ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate vhost_net macvtap macvlan tun virtio_balloon virtio_net i2c_piix4 i2c_core sg ext4 jbd2 mbcache virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 18354, comm: lctl Tainted: P           ---------------    2.6.32-431.20.3.1chaos.ch5.2.x86_64 #1
Call Trace:
 [&amp;lt;ffffffff81071d87&amp;gt;] ? warn_slowpath_common+0x87/0xc0
 [&amp;lt;ffffffff81071e76&amp;gt;] ? warn_slowpath_fmt+0x46/0x50
 [&amp;lt;ffffffff8129729f&amp;gt;] ? __list_add+0x8f/0xa0
 [&amp;lt;ffffffff8152ccbf&amp;gt;] ? __mutex_lock_slowpath+0xcf/0x180
 [&amp;lt;ffffffff8152af79&amp;gt;] ? printk+0x41/0x48
 [&amp;lt;ffffffff8152cbce&amp;gt;] ? mutex_lock+0x3e/0x60
 [&amp;lt;ffffffffa060895c&amp;gt;] ? lu_site_purge+0xac/0x550 [obdclass]
 [&amp;lt;ffffffffa0609241&amp;gt;] ? lu_object_limit+0x71/0x80 [obdclass]
 [&amp;lt;ffffffffa0609414&amp;gt;] ? lu_object_find_at+0x1c4/0x360 [obdclass]
 [&amp;lt;ffffffffa0dc2b05&amp;gt;] ? lod_index_lookup+0x25/0x30 [lod]
 [&amp;lt;ffffffffa0c037a1&amp;gt;] ? osd_attr_get+0x121/0x1e0 [osd_zfs]
 [&amp;lt;ffffffffa0adfea3&amp;gt;] ? echo_md_create_internal+0x153/0x640 [obdecho]
 [&amp;lt;ffffffffa0ae89f5&amp;gt;] ? echo_md_handler+0x1225/0x1900 [obdecho]
 [&amp;lt;ffffffffa0aed164&amp;gt;] ? echo_client_iocontrol+0x24a4/0x30e0 [obdecho]
 [&amp;lt;ffffffff8128f146&amp;gt;] ? vsnprintf+0x336/0x5e0
 [&amp;lt;ffffffffa04bc27b&amp;gt;] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
 [&amp;lt;ffffffff811702ec&amp;gt;] ? __kmalloc+0x22c/0x240
 [&amp;lt;ffffffffa04ccfe1&amp;gt;] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [&amp;lt;ffffffffa05cd47c&amp;gt;] ? class_handle_ioctl+0x125c/0x1e10 [obdclass]
 [&amp;lt;ffffffffa05b42ab&amp;gt;] ? obd_class_ioctl+0x4b/0x190 [obdclass]
 [&amp;lt;ffffffff8119f1f2&amp;gt;] ? vfs_ioctl+0x22/0xa0
 [&amp;lt;ffffffff8103f9d8&amp;gt;] ? pvclock_clocksource_read+0x58/0xd0
 [&amp;lt;ffffffff8119f814&amp;gt;] ? do_vfs_ioctl+0x84/0x5e0
 [&amp;lt;ffffffff8103ea6c&amp;gt;] ? kvm_clock_read+0x1c/0x20
 [&amp;lt;ffffffff8103ea79&amp;gt;] ? kvm_clock_get_cycles+0x9/0x10
 [&amp;lt;ffffffff810a66f7&amp;gt;] ? getnstimeofday+0x57/0xe0
 [&amp;lt;ffffffff8119fdf1&amp;gt;] ? sys_ioctl+0x81/0xa0
 [&amp;lt;ffffffff810e20de&amp;gt;] ? __audit_syscall_exit+0x25e/0x290
 [&amp;lt;ffffffff8100b0b2&amp;gt;] ? system_call_fastpath+0x16/0x1b
---[ end trace 246d1f5db30ecb0d ]---
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="121682" author="niu" created="Mon, 20 Jul 2015 16:00:13 +0000"  >&lt;p&gt;I found something super suspicious in the echo_client, in echo_device_alloc():&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;                &lt;span class=&quot;code-comment&quot;&gt;/* For MD echo client, it will use the site in MDS stack */&lt;/span&gt;
                ed-&amp;gt;ed_site_myself.cs_lu = *ls;
                ed-&amp;gt;ed_site = &amp;amp;ed-&amp;gt;ed_site_myself;
                ed-&amp;gt;ed_cl.cd_lu_dev.ld_site = &amp;amp;ed-&amp;gt;ed_site_myself.cs_lu;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We copied the lu_site of the MDS to ed_site_myself, so ls_purge_mutex is copied as well... I&apos;m not sure of the purpose of this piece of code; apparently we should just set ed_site to point to the MDS lu_site.&lt;/p&gt;

&lt;p&gt;The code is from&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;commit 9f55850b884cac1c7bbde6d3b02764b712a2921f
Author: wangdi &amp;lt;di.wang@whamcloud.com&amp;gt;
Date:   Wed Nov 16 14:55:23 2011 -0800

    LU-593 obdclass: echo client &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MDS stack

    1. Add interfaces and tools &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; exercising a local MDT
       device &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; performance reasons, in a similar manner
       to obdfilter-survey.
    2. add test_create, test_mkdir, test_lookup, test_destroy,
       test_rmdir, test_setxattr, test_md_getattr in lctl &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt;
       md echo client test.

    Signed-off-by: Wang di &amp;lt;di.wang@whamcloud.com&amp;gt;
    Change-Id: Ibf774a567820ff36b3624e44371c63a9428d82a5
    Reviewed-on: http:&lt;span class=&quot;code-comment&quot;&gt;//review.whamcloud.com/1287
&lt;/span&gt;    Tested-by: Hudson
    Reviewed-by: Fan Yong &amp;lt;yong.fan@whamcloud.com&amp;gt;
    Tested-by: Maloo &amp;lt;whamcloud.maloo@gmail.com&amp;gt;
    Reviewed-by: Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Di, could you take a look? Is it ok to just set the pointer (ed_site) and not copy the lu_site here?&lt;/p&gt;</comment>
                            <comment id="121696" author="ofaaland" created="Mon, 20 Jul 2015 16:40:21 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;I think you&apos;re right that the code you found in echo_device_alloc() is incorrect.&lt;/p&gt;

&lt;p&gt;The kernel&apos;s Documentation/mutex-design.txt says:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;   * - a mutex object must not be initialized via memset or copying
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I haven&apos;t yet figured out what the mutex depends on that makes this bad, but I did look at echo_client.c and lu_site_init() is not called nor is ls_purge_mutex initialized directly via mutex_init().&lt;/p&gt;
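<!-- Editorial sketch, not part of the original comment. A likely reason that copying is harmful: the kernel mutex embeds a circular list head for waiters, and an empty head points at itself. A plain struct copy keeps the old addresses, so the links in the copy point into the original object. That would be consistent with the "prev->next should be next" warning from list_debug quoted earlier in this ticket. The toy_* types and functions below are hypothetical userspace stand-ins, not kernel or Lustre code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a circular doubly linked list head. */
struct toy_list_head {
	struct toy_list_head *next;
	struct toy_list_head *prev;
};

/* Toy stand-in for a mutex: a counter plus a list of waiters. */
struct toy_mutex {
	int count;
	struct toy_list_head wait_list;
};

/* An empty circular list is a head whose links point back at itself. */
void toy_mutex_init(struct toy_mutex *m)
{
	m->count = 1;
	m->wait_list.next = &m->wait_list;
	m->wait_list.prev = &m->wait_list;
}

/* Consistency check in the spirit of list debugging:
 * both neighbours of the head must link back to the head. */
bool toy_wait_list_is_sane(const struct toy_mutex *m)
{
	const struct toy_list_head *head = &m->wait_list;

	return head->next->prev == head && head->prev->next == head;
}

/* Copy an initialized toy mutex by value and check the copy. */
bool toy_copy_is_sane(void)
{
	struct toy_mutex original;
	struct toy_mutex copy;

	toy_mutex_init(&original);
	assert(toy_wait_list_is_sane(&original));

	copy = original;	/* same pattern as "cs_lu = *ls" */
	return toy_wait_list_is_sane(&copy);
}

/* Copy by value, then re-initialize the copy in place and check it. */
bool toy_reinit_is_sane(void)
{
	struct toy_mutex original;
	struct toy_mutex copy;

	toy_mutex_init(&original);
	copy = original;
	toy_mutex_init(&copy);	/* links now point at the copy itself */
	return toy_wait_list_is_sane(&copy);
}
```

In this sketch toy_copy_is_sane() returns false and toy_reinit_is_sane() returns true: the copied head still points at the original, and only re-initializing the copy makes it a valid empty list again. -->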

&lt;p&gt;I&apos;ll explicitly initialize the mutex as a test and see what happens in a few minutes.&lt;/p&gt;</comment>
                            <comment id="121718" author="ofaaland" created="Mon, 20 Jul 2015 17:57:28 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;I made two successful passes through the 100,000-file cycle of mds-survey with the patch to initialize ed_site_myself.cs_lu.ls_purge_mutex applied.  Without the patch, my VM crashes before completing even one cycle.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;diff --git a/lustre/obdecho/echo_client.c b/lustre/obdecho/echo_client.c
index 8b1a526..7d18f0f 100644
--- a/lustre/obdecho/echo_client.c
+++ b/lustre/obdecho/echo_client.c
@@ -857,6 +857,7 @@ &lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; struct lu_device *echo_device_alloc(&lt;span class=&quot;code-keyword&quot;&gt;const&lt;/span&gt; struct lu_env *env,
                 next = ld;
                 &lt;span class=&quot;code-comment&quot;&gt;/* For MD echo client, it will use the site in MDS stack */&lt;/span&gt;
                 ed-&amp;gt;ed_site_myself.cs_lu = *ls;
+                mutex_init(&amp;amp;ed-&amp;gt;ed_site_myself.cs_lu.ls_purge_mutex);
                 ed-&amp;gt;ed_site = &amp;amp;ed-&amp;gt;ed_site_myself;
                 ed-&amp;gt;ed_cl.cd_lu_dev.ld_site = &amp;amp;ed-&amp;gt;ed_site_myself.cs_lu;
                rc = echo_fid_init(ed, obd-&amp;gt;obd_name, lu_site2seq(ls));
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This isn&apos;t necessarily the proper fix, but I think supports your suspicion.&lt;/p&gt;</comment>
                            <comment id="121764" author="gerrit" created="Tue, 21 Jul 2015 00:00:50 +0000"  >&lt;p&gt;Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/15657&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/15657&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6765&quot; title=&quot;mds-survey triggers crash via BUG:sleeping function called from invalid context&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6765&quot;&gt;&lt;del&gt;LU-6765&lt;/del&gt;&lt;/a&gt; obdecho: initialize cs_lu.ls_purge_mutex&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d2b6aaa2b0d712c495f45d54f927ea228ba019f2&lt;/p&gt;</comment>
                            <comment id="121766" author="ofaaland" created="Tue, 21 Jul 2015 00:04:50 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;The patch I uploaded above is not intended as the actual fix; it&apos;s there so I can refer to it for a project I&apos;m working on.  You can disregard it.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="121770" author="niu" created="Tue, 21 Jul 2015 01:53:59 +0000"  >&lt;blockquote&gt;
&lt;p&gt;This isn&apos;t necessarily the proper fix, but I think supports your suspicion.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Right, the fix apparently is to just set ed_site to &apos;ls&apos;. I&apos;ll ask Di to confirm it.&lt;/p&gt;
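<!-- Editorial sketch, not part of the original comment. The difference between copying a shared structure and pointing at it can be illustrated in userspace C. The toy_site and toy_echo_dev types and all toy_* functions are hypothetical stand-ins for lu_site and the echo device. This is an illustration under those assumptions, not the patch that landed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a shared site owned by the MDS stack. */
struct toy_site {
	int nr_objects;
};

/* Toy stand-in for the echo device. */
struct toy_echo_dev {
	struct toy_site  site_copy;	/* private snapshot of the owner's site */
	struct toy_site *site;		/* the site the device actually uses */
};

/* Problematic pattern: snapshot the owner's site and use the snapshot. */
void toy_attach_by_copy(struct toy_echo_dev *ed, const struct toy_site *ls)
{
	ed->site_copy = *ls;
	ed->site = &ed->site_copy;
}

/* Suggested pattern: use the owner's site directly. */
void toy_attach_by_pointer(struct toy_echo_dev *ed, struct toy_site *ls)
{
	ed->site = ls;
}

/* Let the device add one object, then report the count the owner sees. */
int toy_owner_count_after_add(int by_pointer)
{
	struct toy_site mds_site = { .nr_objects = 0 };
	struct toy_echo_dev ed = { 0 };

	if (by_pointer)
		toy_attach_by_pointer(&ed, &mds_site);
	else
		toy_attach_by_copy(&ed, &mds_site);

	assert(ed.site != NULL);
	ed.site->nr_objects++;
	return mds_site.nr_objects;
}
```

toy_owner_count_after_add(1) returns 1 because the update lands in the owner's structure, while toy_owner_count_after_add(0) returns 0 because only the private snapshot changed. With a copy, the two sides also stop sharing any lock state embedded in the structure. -->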

&lt;p&gt;Olaf, thank you for posting a fix, I&apos;ll review it soon.&lt;/p&gt;</comment>
                            <comment id="123827" author="gerrit" created="Tue, 11 Aug 2015 11:36:41 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/15657/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/15657/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6765&quot; title=&quot;mds-survey triggers crash via BUG:sleeping function called from invalid context&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6765&quot;&gt;&lt;del&gt;LU-6765&lt;/del&gt;&lt;/a&gt; obdecho: don&apos;t copy lu_site&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: c45c8ad26004a577dd7ad4270f2756e1f2943639&lt;/p&gt;</comment>
                            <comment id="123836" author="niu" created="Tue, 11 Aug 2015 13:52:30 +0000"  >&lt;p&gt;landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="31109">LU-6860</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxgkv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>