<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:56:03 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12832] soft lockup in ldlm_bl_xx threads at read for a single shared strided file</title>
                <link>https://jira.whamcloud.com/browse/LU-12832</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Many clients hit soft lockups in the ldlm_bl_xx threads during reads of a single shared strided file, the same workload as the io500 ior_hard phase. In the end, ior failed because the clients were evicted.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mkdir /scratch0/io500.out/ior_hard -p
# lfs setstripe -C 240 -S 16M /scratch0/io500.out/ior_hard

# salloc --nodes=10 --ntasks-per-node=24 mpirun --allow-run-as-root /scratch0/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Log from one of the servers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct  6 21:27:46 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.151@o2ib10  ns: filter-ai400-OST0000_UUID lock: ffff995029f90480/0xe6d707f7408ef34e lrc: 3/0,0 mode: PW/PW res: [0x19:0x0:0x0].0x0 rrc: 30741 type: EXT [5842382848-&amp;gt;5842853887] (req 5842382848-&amp;gt;5842431999) flags: 0x60000400000020 nid: 10.0.13.151@o2ib10 remote: 0xeff507f1ffb50fc0 expref: 158923 pid: 27145 timeout: 11179 lvb_type: 0
Oct  6 21:27:46 es400nv-vm1 kernel: Lustre: ai400-OST0000: Connection restored to 85106299-51f6-4 (at 10.0.13.154@o2ib10)
Oct  6 21:27:46 es400nv-vm1 kernel: Lustre: Skipped 127 previous similar messages
Oct  6 21:27:49 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.152@o2ib10  ns: filter-ai400-OST0001_UUID lock: ffff9950aee0d8c0/0xe6d707f740dbdd52 lrc: 3/0,0 mode: PW/PW res: [0x8:0x0:0x0].0x0 rrc: 26996 type: EXT [5140467712-&amp;gt;5140746239] (req 5140467712-&amp;gt;5140516863) flags: 0x60000400020020 nid: 10.0.13.152@o2ib10 remote: 0xc5564b8fed3e7bd7 expref: 166664 pid: 26904 timeout: 11182 lvb_type: 0
Oct  6 21:27:49 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) Skipped 2 previous similar messages
Oct  6 21:27:53 es400nv-vm1 kernel: LustreError: 26986:0:(ldlm_lockd.c:2437:ldlm_cancel_handler()) ldlm_cancel from 10.0.13.152@o2ib10 arrived at 1570364871 with bad export cookie 16633772507216549573
Oct  6 21:27:59 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.159@o2ib10  ns: filter-ai400-OST0000_UUID lock: ffff9950587b6400/0xe6d707f7402a10dc lrc: 3/0,0 mode: PW/PW res: [0x15:0x0:0x0].0x0 rrc: 18417 type: EXT [6164672512-&amp;gt;6164860927] (req 6164672512-&amp;gt;6164721663) flags: 0x60000400020020 nid: 10.0.13.159@o2ib10 remote: 0xd7d9cb14933b97f6 expref: 153927 pid: 26985 timeout: 11192 lvb_type: 0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Log from one of the clients:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct  6 21:27:54 c05 kernel: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [ldlm_bl_01:151183]
Oct  6 21:27:54 c05 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ip6table_nat ip6_tables xt_statistic xt_physdev xt_NFLOG nfnetlink_log veth dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ip_set_hash_ip xt_set ip_set xt_nat ipt_REJECT nf_reject_ipv4 xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc nfsv3 nfs_acl nfs lockd grace fscache overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE)
Oct  6 21:27:54 c05 kernel: mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sunrpc iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev sg i2c_i801 mei_me lpc_ich ioatdma mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci igb crct10dif_pclmul crct10dif_common libata crc32c_intel mlx4_core(OE) ptp pps_core devlink dca i2c_algo_bit drm_panel_orientation_quirks mlx_compat(OE) [last unloaded: libcfs]
Oct  6 21:27:54 c05 kernel: CPU: 9 PID: 151183 Comm: ldlm_bl_01 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.27.2.el7.x86_64 #1
Oct  6 21:27:54 c05 kernel: Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
Oct  6 21:27:54 c05 kernel: task: ffff8eb4c1eb4100 ti: ffff8eb64c178000 task.ti: ffff8eb64c178000
Oct  6 21:27:54 c05 kernel: RIP: 0010:[&amp;lt;ffffffff89d13536&amp;gt;]  [&amp;lt;ffffffff89d13536&amp;gt;] native_queued_spin_lock_slowpath+0x126/0x200
Oct  6 21:27:54 c05 kernel: RSP: 0018:ffff8eb64c17bc70  EFLAGS: 00000246
Oct  6 21:27:54 c05 kernel: RAX: 0000000000000000 RBX: ffff8ec614df8280 RCX: 0000000000490000
Oct  6 21:27:54 c05 kernel: RDX: ffff8eb67e79b780 RSI: 0000000000310001 RDI: ffff8ec614de0058
Oct  6 21:27:54 c05 kernel: RBP: ffff8eb64c17bc70 R08: ffff8eb67e85b780 R09: 0000000000000000
Oct  6 21:27:54 c05 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8eb64c17bc38
Oct  6 21:27:54 c05 kernel: R13: ffff8eb1aa758000 R14: ffff8eb2af3e46c8 R15: ffff8eb1aa758098
Oct  6 21:27:54 c05 kernel: FS:  0000000000000000(0000) GS:ffff8eb67e840000(0000) knlGS:0000000000000000
Oct  6 21:27:54 c05 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  6 21:27:54 c05 kernel: CR2: 00000000029cb000 CR3: 0000000f51610000 CR4: 00000000003607e0
Oct  6 21:27:54 c05 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  6 21:27:54 c05 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct  6 21:27:54 c05 kernel: Call Trace:
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff8a35e2cb&amp;gt;] queued_spin_lock_slowpath+0xb/0xf
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff8a36c7a0&amp;gt;] _raw_spin_lock+0x20/0x30
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0c3b32a&amp;gt;] cl_object_attr_lock+0x1a/0x20 [obdclass]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0fea746&amp;gt;] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0e3e54a&amp;gt;] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0c14979&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0e49460&amp;gt;] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0e4d97a&amp;gt;] ldlm_cli_cancel_list_local+0xea/0x260 [ptlrpc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0e54370&amp;gt;] ldlm_bl_thread_main+0x5a0/0xa40 [ptlrpc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff89cd7c40&amp;gt;] ? wake_up_state+0x20/0x20
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffffc0e53dd0&amp;gt;] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff89cc2e81&amp;gt;] kthread+0xd1/0xe0
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff89cc2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff8a376c1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Oct  6 21:27:54 c05 kernel: [&amp;lt;ffffffff89cc2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
Oct  6 21:27:54 c05 kernel: Code: 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 a0 bf 94 8a 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 08 &amp;lt;85&amp;gt; c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 
Oct  6 21:27:56 c05 kernel: LustreError: 152233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) ai400-OST0006-osc-ffff8eafdda3d800: namespace resource [0x10:0x0:0x0].0x0 (ffff8ebf4ddcc840) refcount nonzero (1404) after lock cleanup; forcing cleanup.
Oct  6 21:27:56 c05 kernel: Lustre: ai400-OST0006-osc-ffff8eafdda3d800: Connection restored to 10.0.10.178@o2ib10 (at 10.0.10.178@o2ib10)
Oct  6 21:27:58 c05 kernel: LustreError: 152229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) ai400-OST0002-osc-ffff8eafdda3d800: namespace resource [0x10:0x0:0x0].0x0 (ffff8ec4263355c0) refcount nonzero (1729) after lock cleanup; forcing cleanup.
Oct  6 21:27:58 c05 kernel: LustreError: 152229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 59 previous similar messages
Oct  6 21:27:58 c05 kernel: Lustre: ai400-OST0002-osc-ffff8eafdda3d800: Connection restored to 10.0.10.176@o2ib10 (at 10.0.10.176@o2ib10)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>master (commit fcf219d) on both server and client</environment>
        <key id="57098">LU-12832</key>
            <summary>soft lockup in ldlm_bl_xx threads at read for a single shared strided file</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="wshilong">Wang Shilong</assignee>
                                    <reporter username="sihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Sun, 6 Oct 2019 12:30:01 +0000</created>
                <updated>Sat, 1 Aug 2020 00:42:37 +0000</updated>
                            <resolved>Wed, 13 Nov 2019 15:24:29 +0000</resolved>
                                    <version>Lustre 2.13.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="255976" author="simmonsja" created="Sun, 6 Oct 2019 14:56:08 +0000"  >&lt;p&gt;can you try patch&#160;&lt;a href=&quot;https://review.whamcloud.com/#/c/36229/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/36229/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="255981" author="sihara" created="Mon, 7 Oct 2019 04:12:02 +0000"  >&lt;blockquote&gt;&lt;p&gt;can you try patch &lt;a href=&quot;https://review.whamcloud.com/#/c/36229/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/36229/&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Tested this patch. The problem is still there, but with the patch applied the call trace is different, now in __write_lock_failed.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct  7 12:45:10 c10 kernel: NMI watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [ldlm_bl_13:31247]
Oct  7 12:45:10 c10 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) xt_statistic xt_physdev xt_NFLOG nfnetlink_log veth dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ip_set_hash_ip xt_set ip_set xt_nat ipt_REJECT nf_reject_ipv4 xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc nfsv3 nfs_acl nfs lockd grace fscache overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE)
Oct  7 12:45:10 c10 kernel: mlxfw(OE) mlx4_en(OE) sunrpc iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev i2c_i801 sg lpc_ich mei_me ioatdma mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci igb crct10dif_pclmul crct10dif_common crc32c_intel mlx4_core(OE) libata ptp pps_core devlink dca i2c_algo_bit drm_panel_orientation_quirks mlx_compat(OE) [last unloaded: libcfs]
Oct  7 12:45:10 c10 kernel: CPU: 10 PID: 31247 Comm: ldlm_bl_13 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.27.2.el7.x86_64 #1
Oct  7 12:45:10 c10 kernel: Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
Oct  7 12:45:10 c10 kernel: task: ffff9fd7a9180000 ti: ffff9fe5e8364000 task.ti: ffff9fe5e8364000
Oct  7 12:45:10 c10 kernel: RIP: 0010:[&amp;lt;ffffffff82d88889&amp;gt;]  [&amp;lt;ffffffff82d88889&amp;gt;] __write_lock_failed+0x9/0x20
Oct  7 12:45:10 c10 kernel: RSP: 0018:ffff9fe5e8367c80  EFLAGS: 00000297
Oct  7 12:45:10 c10 kernel: RAX: ffff9fd56365a1d8 RBX: 0000000000000140 RCX: ffff9fd56365a1b0
Oct  7 12:45:10 c10 kernel: RDX: 0000000000000001 RSI: ffff9fd6a7258150 RDI: ffff9fd56365a1cc
Oct  7 12:45:10 c10 kernel: RBP: ffff9fe5e8367c80 R08: ffff9fd6463d8bb0 R09: 0000000000000001
Oct  7 12:45:10 c10 kernel: R10: 0000000000000000 R11: ffffe16951a6ab00 R12: ffff9fd2833c1180
Oct  7 12:45:10 c10 kernel: R13: ffffffffc12ae36f R14: ffff9fe5e8367c38 R15: ffff9fd34eb66150
Oct  7 12:45:10 c10 kernel: FS:  0000000000000000(0000) GS:ffff9fd6fe880000(0000) knlGS:0000000000000000
Oct  7 12:45:10 c10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  7 12:45:10 c10 kernel: CR2: 00000000023b6000 CR3: 000000028e610000 CR4: 00000000003607e0
Oct  7 12:45:10 c10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  7 12:45:10 c10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct  7 12:45:10 c10 kernel: Call Trace:
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff8316c577&amp;gt;] _raw_write_lock+0x17/0x20
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc0ee635a&amp;gt;] cl_object_attr_write_lock+0x1a/0x20 [obdclass]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc1295746&amp;gt;] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc10e954a&amp;gt;] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc0ebf979&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc10f4460&amp;gt;] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc10f897a&amp;gt;] ldlm_cli_cancel_list_local+0xea/0x260 [ptlrpc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc10ff370&amp;gt;] ldlm_bl_thread_main+0x5a0/0xa40 [ptlrpc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff82ad7c40&amp;gt;] ? wake_up_state+0x20/0x20
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffffc10fedd0&amp;gt;] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff82ac2e81&amp;gt;] kthread+0xd1/0xe0
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff82ac2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff83176c1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Oct  7 12:45:10 c10 kernel: [&amp;lt;ffffffff82ac2db0&amp;gt;] ? insert_kthread_work+0x40/0x40
Oct  7 12:45:10 c10 kernel: Code: 00 00 0f 1f 44 00 00 41 ff e7 e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 3c 24 c3 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 &amp;lt;83&amp;gt; 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 90 66 2e 0f 1f 84 00 00 00 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="256078" author="pjones" created="Tue, 8 Oct 2019 17:14:48 +0000"  >&lt;p&gt;Shilong&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="256100" author="sihara" created="Wed, 9 Oct 2019 00:30:11 +0000"  >&lt;p&gt;it enabled kernel.softlockup_panic and we got crash and crash dump when softlockup happens on client. &lt;/p&gt;</comment>
                            <comment id="256109" author="wshilong" created="Wed, 9 Oct 2019 07:06:49 +0000"  >&lt;p&gt;Wang Shilong (wshilong@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/36413&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36413&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12832&quot; title=&quot;soft lockup in ldlm_bl_xx threads at read for a single shared strided file&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12832&quot;&gt;&lt;del&gt;LU-12832&lt;/del&gt;&lt;/a&gt; osc: invalidate kms upon blocking ast&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 92f87e3a4b25392930e3b6df9b5c6a46eb6191ff&lt;/p&gt;</comment>
                            <comment id="256165" author="wshilong" created="Thu, 10 Oct 2019 02:10:14 +0000"  >&lt;p&gt;Just re-think the problem, IOR seems try to write file firstly stridely and in this case we might have every&lt;br/&gt;
client full with many PW extents, more than 60K each client.&lt;/p&gt;

&lt;p&gt;And before the read phase starts, lock cancellation may kick in and try to cancel a huge amount of cached locks, which could make our spinlock very hot.&lt;/p&gt;
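&lt;p&gt;As a quick way to check this theory (a minimal sketch; the *osc* glob is illustrative), the per-namespace lock counts and LRU settings can be read on a client with lctl:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# how many LDLM locks each OSC namespace currently holds on this client
lctl get_param ldlm.namespaces.*osc*.lock_count

# current LRU size and maximum age for cached locks
lctl get_param ldlm.namespaces.*osc*.lru_size
lctl get_param ldlm.namespaces.*osc*.lru_max_age
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>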
                            <comment id="256167" author="wshilong" created="Thu, 10 Oct 2019 02:42:27 +0000"  >&lt;p&gt;So to walk around the problem, maybe use a smaller LRU SIZE shall be fine.&lt;/p&gt;

&lt;p&gt;But we might need to add some debug information to the extent tree locks to understand the details.&lt;/p&gt;
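&lt;p&gt;For example (a sketch only; the value 800 is arbitrary and the *osc* glob is illustrative), a fixed, smaller LRU size can be set on the clients with:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# pin the LRU to a fixed, small size (a non-zero value disables dynamic LRU resizing)
lctl set_param ldlm.namespaces.*osc*.lru_size=800

# setting it back to 0 should re-enable dynamic LRU resizing
lctl set_param ldlm.namespaces.*osc*.lru_size=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>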
                            <comment id="256170" author="sihara" created="Thu, 10 Oct 2019 06:00:04 +0000"  >&lt;p&gt;yes, fixed lru_size or short age for lru cache could be workaround, but the other hand, some application prefer keeping locks in caches as much as possible. For just an benchmark with ior for strided write/read is fine, but we need a sophisticate logic for both workload reasonably works properly &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="256173" author="gerrit" created="Thu, 10 Oct 2019 07:33:48 +0000"  >&lt;p&gt;Wang Shilong (wshilong@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/36419&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36419&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12832&quot; title=&quot;soft lockup in ldlm_bl_xx threads at read for a single shared strided file&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12832&quot;&gt;&lt;del&gt;LU-12832&lt;/del&gt;&lt;/a&gt; ldlm: debug extent tree&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ec320272aa8d3f3b6353aff2f556b9641f9245a9&lt;/p&gt;</comment>
                            <comment id="256177" author="wshilong" created="Thu, 10 Oct 2019 10:47:31 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=dongyang&quot; class=&quot;user-hover&quot; rel=&quot;dongyang&quot;&gt;dongyang&lt;/a&gt;  found following problem might fix the problem:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/ldlm/ldlm_extent.c b/lustre/ldlm/ldlm_extent.c
index 9e046c91ff..36998cb5f9 100644
--- a/lustre/ldlm/ldlm_extent.c
+++ b/lustre/ldlm/ldlm_extent.c
@@ -890,7 +890,7 @@ static enum interval_iter ldlm_kms_shift_cb(struct interval_node *n,
    /* If we find a lock with a greater or equal kms, we are not the
     * highest lock (or we share that distinction with another lock), and
     * don&apos;t need to update KMS.  Return old_kms and stop looking. */
-   if (lock-&amp;gt;l_policy_data.l_extent.end &amp;gt;= arg-&amp;gt;old_kms) {
+   if (lock-&amp;gt;l_policy_data.l_extent.end + 1 &amp;gt;= arg-&amp;gt;old_kms) {
        arg-&amp;gt;kms = arg-&amp;gt;old_kms;
        arg-&amp;gt;complete = true;
        RETURN(INTERVAL_ITER_STOP);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The point is that it is very likely that arg-&amp;gt;old_kms = last extent end + 1, and if the check fails on that last lock,&lt;br/&gt;
we have to iterate over all PR extent locks in the tree, which is O(N).&lt;/p&gt;
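&lt;p&gt;A small worked example with made-up numbers shows the off-by-one: a lock on [0, 4095] protects a KMS of up to 4096, so in that case old_kms = 4096.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lock [0, 4095], old_kms = 4096

old check: end &amp;gt;= old_kms        4095 &amp;gt;= 4096  -&amp;gt; false, keep iterating (O(N))
new check: end + 1 &amp;gt;= old_kms    4096 &amp;gt;= 4096  -&amp;gt; true, stop immediately
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>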
                            <comment id="256252" author="adilger" created="Fri, 11 Oct 2019 19:24:19 +0000"  >&lt;p&gt;It definitely seems possible that the above change could fix the problem, as the comment before &lt;tt&gt;ldlm_extent_shift_kms()&lt;/tt&gt; says:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;NB: A lock on [x,y] protects a KMS of up to y + 1 bytes!&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That said, the above-referenced code in &lt;tt&gt;ldlm_kms_shift_cb()&lt;/tt&gt; hasn&apos;t changed since patch &lt;a href=&quot;https://review.whamcloud.com/20779&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/20779&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8272&quot; title=&quot;Use granted extent tree to update kms&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8272&quot;&gt;&lt;del&gt;LU-8272&lt;/del&gt;&lt;/a&gt; ldlm: Use interval tree to update kms&lt;/tt&gt;&quot; in 2016-10-17, so it shouldn&apos;t be the &lt;em&gt;immediate&lt;/em&gt; cause of a regression since the last time this code was working; at most it is a bad interaction with a patch that landed more recently. I&apos;m not saying the above code is &lt;b&gt;not&lt;/b&gt; the source of the problem, just that it would be good to know why this problem suddenly started being hit. It doesn&apos;t look like it is directly related to lock conversion, since that patch landed as v2_11_52_0-78-g541902a3f9 in 2018-07-03, before the 2.12.0 release, and we haven&apos;t been seeing this problem for very long.&lt;/p&gt;

&lt;p&gt;It looks like there are 3 patches to &lt;tt&gt;ldlm/ldlm_extent.c&lt;/tt&gt; in the 2.13 development cycle:&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/35057&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35057&lt;/a&gt; 2019-08-27 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12017&quot; title=&quot;Truncate vs setxattr deadlock with DoM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12017&quot;&gt;&lt;del&gt;LU-12017&lt;/del&gt;&lt;/a&gt; ldlm: DoM truncate deadlock&lt;/tt&gt;&quot;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/35486&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35486&lt;/a&gt; 2019-07-20 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12542&quot; title=&quot;LDLM improvements form linux lustre client work&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12542&quot;&gt;&lt;del&gt;LU-12542&lt;/del&gt;&lt;/a&gt; ldlm: simplify lock_mode_to_index()&lt;/tt&gt;&quot;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/19665&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/19665&lt;/a&gt; 2019-03-21 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8047&quot; title=&quot;optimize ll_md_blocking_ast() for a failed lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8047&quot;&gt;&lt;del&gt;LU-8047&lt;/del&gt;&lt;/a&gt; llite: optimizations for not granted lock processing&lt;/tt&gt;&quot;&lt;/p&gt;

&lt;p&gt;There are also a few other patches in 2.13 that are related to KMS handling:&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/36200&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36200&lt;/a&gt; 2019-09-30 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12681&quot; title=&quot;Data corruption - due incorrect KMS with SEL files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12681&quot;&gt;&lt;del&gt;LU-12681&lt;/del&gt;&lt;/a&gt; osc: wrong cache of LVB attrs, part2&lt;/tt&gt;&quot;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/36199&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/36199&lt;/a&gt; 2019-09-30 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12681&quot; title=&quot;Data corruption - due incorrect KMS with SEL files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12681&quot;&gt;&lt;del&gt;LU-12681&lt;/del&gt;&lt;/a&gt; osc: wrong cache of LVB attrs&lt;/tt&gt;&quot;&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/34095&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34095&lt;/a&gt; 2019-06-25 &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12043&quot; title=&quot;improve Lustre single thread read performances&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12043&quot;&gt;&lt;del&gt;LU-12043&lt;/del&gt;&lt;/a&gt; llite: improve single-thread read performance&lt;/tt&gt;&quot;&lt;/p&gt;

&lt;p&gt;Those would be a good place to start: check whether reverting each of those patches, or testing the commit immediately before each one, avoids this problem. If we knew when this problem was first seen, it might be easy to track down the problematic patch.&lt;/p&gt;

&lt;p&gt;If there are problems tracking down the source of this bug then it would be possible to use &quot;&lt;tt&gt;git bisect&lt;/tt&gt;&quot; at the cost of running more tests. There are currently 1008 patches in &lt;tt&gt;2.12.50..fcf219d&lt;/tt&gt;, so if the problem is reliably hit during testing it would be possible to find the culprit in 10 test runs or fewer. Since it is likely to be a client-side problem, it (probably) wouldn&apos;t be necessary to update the server version each time, which would make each test run faster.&lt;/p&gt;
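&lt;p&gt;A minimal sketch of that bisection (the build and install step is site-specific and omitted; tag spelling as referenced above):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;git bisect start
git bisect bad fcf219d      # the known-bad commit from this report
git bisect good 2.12.50     # assumed-good baseline

# at each step: build/install the client, rerun the ior_hard workload, then mark
git bisect good             # or: git bisect bad

# log2(1008) is about 10, hence 10 runs or fewer; when done:
git bisect reset
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>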
                            <comment id="256290" author="wshilong" created="Sat, 12 Oct 2019 06:40:04 +0000"  >&lt;p&gt;Yup, it is intresting to know why we did not hit the problem before, this could potentially related to LRU SIZE? or testing file size?&lt;br/&gt;
or some locks changes regression raised the problem.&lt;/p&gt;</comment>
                            <comment id="256294" author="sihara" created="Sat, 12 Oct 2019 12:29:10 +0000"  >&lt;p&gt;I&apos;ve tested above change with same workload, but it caused another regression at write (not read). Please see console messages whne client crashed.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[295995.970954] LustreError: 92739:0:(osc_cache.c:1153:osc_extent_make_ready()) ASSERTION( last_oap_count &amp;gt; 0 ) failed: 
[295995.971011] LustreError: 92739:0:(osc_cache.c:1153:osc_extent_make_ready()) LBUG
[295995.971047] Pid: 92739, comm: ldlm_bl_02 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
[295995.971048] Call Trace:
[295995.971058]  [&amp;lt;ffffffffc0cb58ac&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[295995.971071]  [&amp;lt;ffffffffc0cb595c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[295995.971077]  [&amp;lt;ffffffffc123204b&amp;gt;] osc_extent_make_ready+0x91b/0xe50 [osc]
[295995.971097]  [&amp;lt;ffffffffc123499f&amp;gt;] osc_io_unplug0+0xd9f/0x1910 [osc]
[295995.971105]  [&amp;lt;ffffffffc12391e0&amp;gt;] osc_cache_writeback_range+0x9c0/0xff0 [osc]
[295995.971113]  [&amp;lt;ffffffffc1224275&amp;gt;] osc_lock_flush+0x195/0x290 [osc]
[295995.971120]  [&amp;lt;ffffffffc1224733&amp;gt;] osc_ldlm_blocking_ast+0x2e3/0x3a0 [osc]
[295995.971126]  [&amp;lt;ffffffffc107854a&amp;gt;] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[295995.971161]  [&amp;lt;ffffffffc1083460&amp;gt;] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
[295995.971181]  [&amp;lt;ffffffffc10891c1&amp;gt;] ldlm_cli_cancel+0x161/0x650 [ptlrpc]
[295995.971205]  [&amp;lt;ffffffffc12245ca&amp;gt;] osc_ldlm_blocking_ast+0x17a/0x3a0 [osc]
[295995.971213]  [&amp;lt;ffffffffc108d9dd&amp;gt;] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
[295995.971233]  [&amp;lt;ffffffffc108e5d0&amp;gt;] ldlm_bl_thread_main+0x800/0xa40 [ptlrpc]
[295995.971252]  [&amp;lt;ffffffff932c2e81&amp;gt;] kthread+0xd1/0xe0
[295995.971258]  [&amp;lt;ffffffff93976c1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
[295995.971264]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[295995.971294] Kernel panic - not syncing: LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="256303" author="sihara" created="Sat, 12 Oct 2019 15:34:55 +0000"  >&lt;p&gt;I wonder if overstriping could be related. I have been enabling overstriping for this testing, but will test without overstriping in master. If the problem is still exist, we can test old commit. It takes too long time to finish this ior job without overstriping though... &lt;/p&gt;</comment>
                            <comment id="256913" author="sihara" created="Wed, 23 Oct 2019 06:01:53 +0000"  >&lt;p&gt;&lt;font color=&quot;#000000&quot;&gt;background aggressive lock canceling in advance helps to avoid this problem.&lt;/font&gt;&lt;br/&gt;
&lt;font color=&quot;#000000&quot;&gt;e.g. setting 30sec to lru_max_age &apos;lctl set_param ldlm.namespaces.*.lru_max_age=30000&apos;&#160;&lt;/font&gt;&lt;br/&gt;
After this change, I didn&apos;t see see problem.&lt;br/&gt;
However, now I&apos;m reproducing &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12786&quot; title=&quot;io hard read fails due to data verification fails&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12786&quot;&gt;&lt;del&gt;LU-12786&lt;/del&gt;&lt;/a&gt; again.. I will re-open &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12786&quot; title=&quot;io hard read fails due to data verification fails&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12786&quot;&gt;&lt;del&gt;LU-12786&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="258237" author="pjones" created="Wed, 13 Nov 2019 15:24:29 +0000"  >&lt;p&gt;seems to be hw issue&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="56947">LU-12786</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57793">LU-13131</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="55823">LU-12377</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="33641" name="vmcore-dmesg.txt" size="120692" author="sihara" created="Sat, 12 Oct 2019 12:30:58 +0000"/>
                            <attachment id="33626" name="vmcore-dmesg.txt" size="1045999" author="sihara" created="Wed, 9 Oct 2019 00:29:10 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00nov:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>