<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:41:49 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11201] NMI watchdog: BUG: soft lockup in lfsck_namespace</title>
                <link>https://jira.whamcloud.com/browse/LU-11201</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;About 4 minutes after recovery ended post MDS startup, the console started reporting soft lockups as below. The node started refusing connections from pdsh and ltop running on the mgmt node started reporting stale data (not getting updates from cerebro).&lt;/p&gt;

&lt;p&gt;The watchdog is repeatedly reporting a stack about every 40 seconds. Usually it is the same stack, below, with the same PID and same CPU (CPU#6). Running &apos;lfs check servers&apos; on a compute node with the file system mounted still shows lquake-MDT0008 as active, and the node still responds to lctl ping. An ls of a directory stored on MDT0008 hangs in ptlrpc_set_wait() in&lt;br/&gt;
ldlm_cli_enqueue() in mdc_enqueue().&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl ib_ucm sb_edac intel_powerclamp coretemp rpcrdma intel_rapl rdma_ucm iosf_mbi ib_umad ib_uverbs ib_ipoib ib_iser kvm rdma_cm iw_cm libiscsi irqbypass ib_cm iTCO_wdt iTCO_vendor_support mlx5_ib ib_core joydev mlx5_core pcspkr mlxfw devlink lpc_ich i2c_i801 ioatdma ses enclosure sch_fq_codel sg ipmi_si shpchp acpi_cpufreq acpi_power_meter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) binfmt_misc msr_safe(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 crct10dif_pclmul crct10dif_common i2c_algo_bit crc32_pclmul drm_kms_helper scsi_transport_iscsi crc32c_intel syscopyarea sysfillrect ghash_clmulni_intel sysimgblt dm_multipath ixgbe fb_sys_fops aesni_intel ttm lrw mxm_wmi ahci dca gf128mul libahci glue_helper ablk_helper drm mpt3sas cryptd ptp raid_class libata i2c_core pps_core scsi_transport_sas mdio ipmi_devintf ipmi_msghandler wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
CPU: 3 PID: 26532 Comm: lfsck_namespace Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.1chaos.ch6.x86_64 #1
Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
task: ffffa0df6ab94f10 ti: ffffa0df32650000 task.ti: ffffa0df32650000
RIP: 0010:[&amp;lt;ffffffffc1402f1e&amp;gt;]  [&amp;lt;ffffffffc1402f1e&amp;gt;] lfsck_namespace_filter_linkea_entry.isra.64+0x8e/0x180 [lfsck]
RSP: 0018:ffffa0df32653af0  EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffffa0df32653ab8 RCX: ffffa0df2e1d4971
RDX: 0000000000000000 RSI: ffffa0df30ec8010 RDI: ffffa0df32653bc8
RBP: ffffa0df32653b38 R08: 0000000000000000 R09: 0000000000000025
R10: ffffa0df30ec8010 R11: 0000000000000000 R12: ffffa0df32653a70
R13: ffffa0df32653acc R14: ffffffffc144d09c R15: ffffa0df33d5e240
FS:  0000000000000000(0000) GS:ffffa0df7e6c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffff7ff8000 CR3: 0000005cd280e000 CR4: 00000000001607e0
Call Trace:
 [&amp;lt;ffffffffc140d63d&amp;gt;] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
 [&amp;lt;ffffffffc14159af&amp;gt;] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
 [&amp;lt;ffffffffc087d50e&amp;gt;] ? dmu_buf_rele+0xe/0x10 [zfs]
 [&amp;lt;ffffffffc090225f&amp;gt;] ? zap_unlockdir+0x3f/0x60 [zfs]
 [&amp;lt;ffffffffc1416d82&amp;gt;] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
 [&amp;lt;ffffffffc141a7bd&amp;gt;] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
 [&amp;lt;ffffffffb2e03426&amp;gt;] ? kfree+0x136/0x180
 [&amp;lt;ffffffffc11a9ad8&amp;gt;] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
 [&amp;lt;ffffffffc13fe6a4&amp;gt;] lfsck_assistant_engine+0x13e4/0x21a0 [lfsck]
 [&amp;lt;ffffffffb2cd5c20&amp;gt;] ? wake_up_state+0x20/0x20
 [&amp;lt;ffffffffc13fd2c0&amp;gt;] ? lfsck_master_engine+0x1450/0x1450 [lfsck]
 [&amp;lt;ffffffffb2cc0ad1&amp;gt;] kthread+0xd1/0xe0
 [&amp;lt;ffffffffb2cc0a00&amp;gt;] ? insert_kthread_work+0x40/0x40
 [&amp;lt;ffffffffb3344837&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
 [&amp;lt;ffffffffb2cc0a00&amp;gt;] ? insert_kthread_work+0x40/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>lustre 2.10.4_1.chaos&lt;br/&gt;
kernel 3.10.0-862.9.1.1chaos.ch6.x86_64&lt;br/&gt;
RHEL 7.5 based&lt;br/&gt;
MDT</environment>
        <key id="52888">LU-11201</key>
            <summary>NMI watchdog: BUG: soft lockup in lfsck_namespace</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 3 Aug 2018 02:36:49 +0000</created>
                <updated>Thu, 3 Jan 2019 19:14:26 +0000</updated>
                            <resolved>Thu, 23 Aug 2018 12:57:05 +0000</resolved>
                                    <version>Lustre 2.10.4</version>
                                    <fixVersion>Lustre 2.12.0</fixVersion>
                    <fixVersion>Lustre 2.10.6</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                    <comments>
                            <comment id="231359" author="ofaaland" created="Fri, 3 Aug 2018 02:38:23 +0000"  >&lt;p&gt;I&apos;ve left the system in this state for now, in case there&apos;s more data I can usefully gather.  It is a test system.&lt;/p&gt;</comment>
                            <comment id="231360" author="ofaaland" created="Fri, 3 Aug 2018 02:39:11 +0000"  >&lt;p&gt;The later watchdog reports show a few different stacks.  The one in lfsck_namespace_double_scan_one() appears 10 or 20x as often as the other stacks.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;==&amp;gt; /tmp/t5.lfsck_namespace_double_scan_one &amp;lt;==
Call Trace:
 [&amp;lt;ffffffffc140d63d&amp;gt;] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
 [&amp;lt;ffffffffc14159af&amp;gt;] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
 [&amp;lt;ffffffffc087d50e&amp;gt;] ? dmu_buf_rele+0xe/0x10 [zfs]
 [&amp;lt;ffffffffc090225f&amp;gt;] ? zap_unlockdir+0x3f/0x60 [zfs]
 [&amp;lt;ffffffffc1416d82&amp;gt;] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
 [&amp;lt;ffffffffc141a7bd&amp;gt;] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
 [&amp;lt;ffffffffb2e03426&amp;gt;] ? kfree+0x136/0x180

==&amp;gt; /tmp/t5.sched_show_task &amp;lt;==
Call Trace:
 &amp;lt;IRQ&amp;gt;  [&amp;lt;ffffffffb2cd4d9f&amp;gt;] sched_show_task+0xbf/0x120
 [&amp;lt;ffffffffb2cd8b99&amp;gt;] dump_cpu_task+0x39/0x70
 [&amp;lt;ffffffffb2d51e60&amp;gt;] rcu_dump_cpu_stacks+0x90/0xd0
 [&amp;lt;ffffffffb2d55952&amp;gt;] rcu_check_callbacks+0x482/0x770
 [&amp;lt;ffffffffb2d09130&amp;gt;] ? tick_sched_do_timer+0x50/0x50
 [&amp;lt;ffffffffb2d09130&amp;gt;] ? tick_sched_do_timer+0x50/0x50
 [&amp;lt;ffffffffb2ca95f6&amp;gt;] update_process_times+0x46/0x80

==&amp;gt; /tmp/t5.schedule &amp;lt;==
Call Trace:
 [&amp;lt;ffffffff99cd2148&amp;gt;] ? check_preempt_curr+0x78/0xa0
 [&amp;lt;ffffffff9a337859&amp;gt;] schedule+0x29/0x70
 [&amp;lt;ffffffff9a3350d9&amp;gt;] schedule_timeout+0x289/0x310
 [&amp;lt;ffffffff99f6dda3&amp;gt;] ? number.isra.2+0x323/0x360
 [&amp;lt;ffffffff99cd5b75&amp;gt;] ? wake_up_process+0x15/0x20
 [&amp;lt;ffffffff99cb5a94&amp;gt;] ? wake_up_worker+0x24/0x30
 [&amp;lt;ffffffff99cb62d2&amp;gt;] ? insert_work+0x62/0xa0

==&amp;gt; /tmp/t5.zone_statistics &amp;lt;==
Call Trace:
 [&amp;lt;ffffffff99dc0bf8&amp;gt;] ? zone_statistics+0x88/0xa0
 [&amp;lt;ffffffff99da8d92&amp;gt;] ? get_page_from_freelist+0x502/0xa00
 [&amp;lt;ffffffffc0646ac3&amp;gt;] ? dbuf_find+0x1f3/0x220 [zfs]
 [&amp;lt;ffffffff99da940f&amp;gt;] ? __alloc_pages_nodemask+0x17f/0x470
 [&amp;lt;ffffffff99df5ba8&amp;gt;] ? alloc_pages_current+0x98/0x110
 [&amp;lt;ffffffffc0530798&amp;gt;] ? nvpair_value_common.part.19+0x38/0x170 [znvpair]
 [&amp;lt;ffffffffc0532122&amp;gt;] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="231361" author="ofaaland" created="Fri, 3 Aug 2018 02:39:53 +0000"  >&lt;p&gt;SysRQ show-backtrace-all-active-cpus(l) reports all CPUs idle except for the one with process lfsck_namespace and the one processing the SysRQ.&lt;/p&gt;</comment>
                            <comment id="231367" author="pjones" created="Fri, 3 Aug 2018 04:55:54 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="231380" author="laisiyao" created="Fri, 3 Aug 2018 09:15:18 +0000"  >&lt;p&gt;Does the lfsck_namespace process consume 100% of a CPU in the &apos;top&apos; command?&lt;/p&gt;</comment>
                            <comment id="231404" author="ofaaland" created="Fri, 3 Aug 2018 17:16:29 +0000"  >&lt;p&gt;I&apos;m unable to log in to the node. Attempts to rsh or ssh to it report connection refused. I don&apos;t know why that would be, since the console log does not indicate the OOM killer ran. But whatever is going on is apparently affecting userspace processes on the node.&lt;/p&gt;

&lt;p&gt;The console responds to SysRq, but does not emit a login prompt in response to input.&lt;/p&gt;

&lt;p&gt;So SysRq is the only tool I know of right now to get information about the state of the node.&lt;/p&gt;</comment>
                            <comment id="231406" author="ofaaland" created="Fri, 3 Aug 2018 17:21:25 +0000"  >&lt;p&gt;Alternatively, I could crash the node and then examine the crash dump, likely recovering the debug log, for example.&lt;/p&gt;</comment>
                            <comment id="231409" author="ofaaland" created="Fri, 3 Aug 2018 17:41:07 +0000"  >&lt;blockquote&gt;&lt;p&gt;does the lfsck_namespace process consume 100% of cpu in command &apos;top&apos;?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I performed SysRq show-backtrace-all-active-cpus(l) 12 times in 16 seconds. All 12 times, I got back the same results:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The lfsck_namespace process is on the same core in the same stack (CPU: 6 PID: 28738 Comm: lfsck_namespace)&lt;/li&gt;
	&lt;li&gt;One core is in the SysRq / serial console code, handling the request&lt;/li&gt;
	&lt;li&gt;The other 14 cores are reported as idle: &quot;NMI backtrace for cpu 9 skipped: idling at ...&quot;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The NMI watchdog continues to report very regularly. Here is a sample:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2018-08-03 10:19:17 [59097.755662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:19:45 [59125.753629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:20:13 [59153.751597] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:20:41 [59181.749565] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:21:21 [59221.746662] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:21:49 [59249.744629] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:22:17 [59277.742596] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:22:45 [59305.740562] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:23:13 [59333.738528] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:23:41 [59361.736494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:24:21 [59401.733593] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:24:49 [59429.731557] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:25:17 [59457.729523] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [lfsck_namespace:28738]
2018-08-03 10:25:45 [59485.727494] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:26:13 [59513.725459] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:26:41 [59541.723427] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:27:21 [59581.720524] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:27:49 [59609.718499] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
2018-08-03 10:28:17 [59637.716461] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [lfsck_namespace:28738]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the last 29 instances, in 24 cases the interval between reports is 27.998 seconds, and the watchdog tells us that for at least 22 of those seconds the process was on the core continuously.&lt;/p&gt;

&lt;p&gt;So the process is using nearly 100% of the CPU, if not exactly 100%.&lt;/p&gt;</comment>
                            <comment id="231443" author="laisiyao" created="Sat, 4 Aug 2018 14:36:16 +0000"  >&lt;p&gt;I see; it looks like the lfsck_namespace thread falls into an infinite loop in lfsck_namespace_filter_linkea_entry(). I&apos;m reviewing the related code.&lt;/p&gt;</comment>
                            <comment id="231539" author="ofaaland" created="Mon, 6 Aug 2018 18:22:37 +0000"  >&lt;p&gt;I&apos;ve crashed the node.  If there&apos;s anything useful I can get for you, from the crashdump, let me know.&lt;/p&gt;</comment>
                            <comment id="231543" author="ofaaland" created="Mon, 6 Aug 2018 19:55:23 +0000"  >&lt;p&gt;I also notice every 3 minutes I was getting rcu INFO messages.  Here&apos;s a subset:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2018-08-06 09:24:02 [314964.276905] INFO: rcu_sched self-detected stall on CPU[314964.277946] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314348731 jiffies, g=13439, c=13438, q=148193)                                                                                                      
2018-08-06 09:27:02 [315144.268862] INFO: rcu_sched self-detected stall on CPU[315144.269911] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314528736 jiffies, g=13439, c=13438, q=148199)                                                                                                      
2018-08-06 09:30:02 [315324.260816] INFO: rcu_sched self-detected stall on CPU[315324.261864] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314708741 jiffies, g=13439, c=13438, q=148211)                                                                                                      
2018-08-06 09:33:02 [315504.252765] INFO: rcu_sched self-detected stall on CPU[315504.253806] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=314888746 jiffies, g=13439, c=13438, q=148219)                                                                                                      
2018-08-06 09:36:02 [315684.244711] INFO: rcu_sched self-detected stall on CPU[315684.245760] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 9, t=315068751 jiffies, g=13439, c=13438, q=148241)                                                                                                      
2018-08-06 09:39:02 [315864.236657] INFO: rcu_sched self-detected stall on CPU[315864.237702] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315248756 jiffies, g=13439, c=13438, q=148245)                                                                                                     
2018-08-06 09:42:02 [316044.228615] INFO: rcu_sched self-detected stall on CPU[316044.229654] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 12, t=315428761 jiffies, g=13439, c=13438, q=148259)                                                                                                     
2018-08-06 09:45:02 [316224.220568] INFO: rcu_sched self-detected stall on CPU[316224.221613] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 14, t=315608766 jiffies, g=13439, c=13438, q=148265)                                                                                                     
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After crashing the node I imported the pool and mounted Lustre. lfsck started and I began seeing the NMI watchdog warnings and associated stacks on the console (the same stacks as before). The node is still responsive to rsh, and running top I see one core at 100% sys all the time.&lt;/p&gt;
</comment>
                            <comment id="231545" author="ofaaland" created="Mon, 6 Aug 2018 20:03:45 +0000"  >&lt;p&gt;While the node was responsive but emitting the NMI watchdog warnings, I changed the debug and subsystem_debug masks and gathered debug logs. I&apos;ll attach them with a &quot;responsive-&quot; filename prefix.&lt;/p&gt;

&lt;p&gt;After several minutes of the node being responsive while the NMI watchdog warnings continued to appear, I attempted to stop the MDT, which appeared to hang:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@jeti:127.0.0.1-2018-08-06-11:18:08]# pdsh -w e9 umount -t lustre -a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Within several seconds of the umount (maybe immediately, not sure) the node stopped responding to rsh and now seems to be in the same state I was reporting originally.&lt;/p&gt;

&lt;p&gt;So it seems as if attempting to stop the MDT while lfsck is stuck in the loop you mention creates the symptoms that got my attention originally.&lt;/p&gt;</comment>
                            <comment id="231551" author="ofaaland" created="Mon, 6 Aug 2018 22:13:14 +0000"  >&lt;p&gt;I ran with a debug patch and confirmed:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;lfsck is stuck in the while loop within lfsck_namespace_filter_linkea_entry().&lt;/li&gt;
	&lt;li&gt;It always takes the else path and calls linkea_next_entry(ldata).&lt;/li&gt;
	&lt;li&gt;ldata-&amp;gt;ld_lee is non-NULL but never changes.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="231552" author="ofaaland" created="Mon, 6 Aug 2018 22:33:43 +0000"  >&lt;p&gt;I confirmed the scan is not advancing because lfsck_namespace_filter_linkea_entry() is calling linkea_next_entry() while ldata-&amp;gt;ld_lee != NULL and ldata-&amp;gt;ld_reclen == 0.&lt;/p&gt;</comment>
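The two findings above pin down the hang: ld_lee stays non-NULL while ld_reclen is 0, so advancing the cursor by the record length never moves it. Here is a minimal Python sketch (hypothetical names only, not the actual Lustre code) of that failure mode and of the unpack-time validity check that change 32958 later introduced:

```python
# Hypothetical model of the LU-11201 hang; none of these names are real
# Lustre identifiers.  Each link-EA entry is represented only by its
# record length (reclen).  The scanner advances by reclen, so a corrupt
# entry with reclen == 0 leaves the cursor stuck forever.

def scan(reclens, validate=False, step_limit=1000):
    """Walk entries by record length until the end of the buffer."""
    idx = 0
    pos = 0
    end = sum(reclens)
    steps = 0
    while pos != end:
        steps += 1
        if steps == step_limit:
            return "stuck"      # stands in for the NMI soft lockup
        reclen = reclens[idx]
        if validate and reclen == 0:
            return "rejected"   # unpack-time validity check (the fix)
        pos += reclen           # advance, like linkea_next_entry()
        if reclen != 0:
            idx += 1
    return "done"
```

With a healthy buffer, scan([24, 16]) completes; with a zero-length entry, scan([24, 0, 16]) spins until the step limit, while validate=True rejects the bad entry up front instead of looping.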
                            <comment id="231588" author="laisiyao" created="Tue, 7 Aug 2018 16:00:41 +0000"  >&lt;p&gt;I&apos;m working on a patch to check linkea entry validity on unpack, will commit later.&lt;/p&gt;</comment>
                            <comment id="231613" author="gerrit" created="Wed, 8 Aug 2018 01:34:28 +0000"  >&lt;p&gt;Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32958&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32958&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11201&quot; title=&quot;NMI watchdog: BUG: soft lockup in lfsck_namespace&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11201&quot;&gt;&lt;del&gt;LU-11201&lt;/del&gt;&lt;/a&gt; lfsck: check linkea entry validity&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 487aff8854a59e897b1511f3389196573ca38c3f&lt;/p&gt;</comment>
                            <comment id="231677" author="adilger" created="Wed, 8 Aug 2018 21:18:11 +0000"  >&lt;p&gt;The patch addresses the stuck thread doing the linkea iteration, but do we have any idea how/why the link xattr was broken in the first place?  Is this a particularly old ZFS filesystem that might have some ancient bug that created bad link xattrs?  Has LFSCK previously run successfully on this MDT with this version of Lustre?&lt;/p&gt;</comment>
                            <comment id="231848" author="ofaaland" created="Mon, 13 Aug 2018 05:45:39 +0000"  >&lt;p&gt;I don&apos;t know how the link xattr was broken.&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The pools were created and the targets formatted 9-12 months ago. I could go back and get specific version information. The zfs version at the time was a 0.7 stable release, after the parallel object allocation fix.&lt;/li&gt;
	&lt;li&gt;LFSCK has been run successfully on this MDT in the past, with a prior 2.8.x Lustre, but I&apos;ll have to look it up for more details. I do not think LFSCK has been run on this MDT with this exact version of Lustre.&lt;/li&gt;
	&lt;li&gt;This file system has been bounced back and forth between Lustre 2.8 and Lustre 2.10 a few times. I believe that has been without using any 2.10-specific features (e.g. PFL), but it&apos;s possible I&apos;m mistaken about that.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="232478" author="gerrit" created="Thu, 23 Aug 2018 07:18:00 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/32958/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32958/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11201&quot; title=&quot;NMI watchdog: BUG: soft lockup in lfsck_namespace&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11201&quot;&gt;&lt;del&gt;LU-11201&lt;/del&gt;&lt;/a&gt; lfsck: check linkea entry validity&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: a5441a717c3a97071494ff51cfb72a117d12d96e&lt;/p&gt;</comment>
                            <comment id="232494" author="pjones" created="Thu, 23 Aug 2018 12:57:05 +0000"  >&lt;p&gt;Landed for 2.12&lt;/p&gt;</comment>
                            <comment id="232600" author="gerrit" created="Sat, 25 Aug 2018 16:29:59 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/33078&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33078&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11201&quot; title=&quot;NMI watchdog: BUG: soft lockup in lfsck_namespace&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11201&quot;&gt;&lt;del&gt;LU-11201&lt;/del&gt;&lt;/a&gt; lfsck: check linkea entry validity&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 919685c9a1dc33067f1d5ac21ccc00fd8a42642d&lt;/p&gt;</comment>
                            <comment id="233354" author="gerrit" created="Tue, 11 Sep 2018 20:36:46 +0000"  >&lt;p&gt;John L. Hammond (jhammond@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/33078/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33078/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11201&quot; title=&quot;NMI watchdog: BUG: soft lockup in lfsck_namespace&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11201&quot;&gt;&lt;del&gt;LU-11201&lt;/del&gt;&lt;/a&gt; lfsck: check linkea entry validity&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 4829fb05c6ca672775701de85bc495344ac619e9&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="53393">LU-11419</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0007r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>