<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:40:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11092] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]</title>
                <link>https://jira.whamcloud.com/browse/LU-11092</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I&apos;m running a robinhood scan on a lustre filesystem with 80M files. As I&apos;m getting memory pressure on the system (80M * ~1kB per lustre_inode_cache is big) and the system tries to shrink the slabs, I get those errors in /var/log/messages:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;2018-06-20T10:49:34.165875-04:00 btsotbal3000 kernel: [ 4889.674359] Lustre: lustre-MDT0000-mdc-ffff883fd53bb000: Connection to lustre-MDT0000 (at 10.11.201.11@o2ib) was lost; in progress operations using this service will wait for recovery to complete&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:38.493202-04:00 btsotbal3000 kernel: [ 5014.015342] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505581/real 0&amp;#93;&lt;/span&gt; req@ffff881f7b3513c0 x1603798909054112/t0(0) o103-&amp;gt;lustre-OST008c-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:38.493217-04:00 btsotbal3000 kernel: [ 5014.017476] Lustre: lustre-OST008c-osc-ffff883fd53bb000: Connection to lustre-OST008c (at 10.11.201.18@o2ib) was lost; in progress operations using this service will wait for recovery to complete&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:39.012492-04:00 btsotbal3000 kernel: [ 5014.527203] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505581/real 0&amp;#93;&lt;/span&gt; req@ffff881e37d8f080 x1603798909214432/t0(0) o103-&amp;gt;lustre-OST0084-osc-ffff883fd53bb000@10.11.201.17@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:39.012505-04:00 btsotbal3000 kernel: [ 5014.527207] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 262 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:40.008987-04:00 btsotbal3000 kernel: [ 5015.533078] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505582/real 0&amp;#93;&lt;/span&gt; req@ffff881f2e08acc0 x1603798909677472/t0(0) o103-&amp;gt;lustre-OST0089-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:40.008998-04:00 btsotbal3000 kernel: [ 5015.533083] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 553 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:42.022152-04:00 btsotbal3000 kernel: [ 5017.542234] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505582/real 0&amp;#93;&lt;/span&gt; req@ffff8818937e8340 x1603798910485936/t0(0) o103-&amp;gt;lustre-OST00a9-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:42.022165-04:00 btsotbal3000 kernel: [ 5017.542239] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1102 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:46.030748-04:00 btsotbal3000 kernel: [ 5021.548246] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505584/real 0&amp;#93;&lt;/span&gt; req@ffff881e0945b640 x1603798911754992/t0(0) o103-&amp;gt;lustre-OST00ab-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505648 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:46.030762-04:00 btsotbal3000 kernel: [ 5021.548249] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2104 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:54.040347-04:00 btsotbal3000 kernel: [ 5029.557301] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1529505589/real 0&amp;#93;&lt;/span&gt; req@ffff881e054d39c0 x1603798914360720/t0(0) o103-&amp;gt;lustre-OST009f-osc-ffff883fd53bb000@10.11.200.14@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505658 ref 2 fl Rpc:X/0/ffffffff rc 0/-1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:51:54.040360-04:00 btsotbal3000 kernel: [ 5029.557307] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4325 previous similar messages&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579715-04:00 btsotbal3000 kernel: [ 5037.097043] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpcd_00_18:4222&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579728-04:00 btsotbal3000 kernel: [ 5037.106479] Modules linked in: nfsv3 nfs fscache mpt3sas raid_class scsi_transport_sas mptctl mptbase osc(OEN) mgc(OEN) lustre(OEN) lmv(OEN) fld(OEN) mdc(OEN) fid(OEN) lov(OEN) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) af_packet iscsi_ibft iscsi_boot_sysfs rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs ib_ipoib(OEX) ib_cm(OEX) ib_uverbs(OEX) ib_umad(OEX) mlx4_en(OEX) mlx4_ib(OEX) mlx4_core(OEX) msr xfs libcrc32c intel_rapl sb_edac edac_core ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp ixgbe kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mdio drbg ansi_cprng aesni_intel igb mgag200 ttm drm_kms_helper ipmi_si drm ipmi_devintf mei_me ipmi_msghandler pcspkr i2c_algo_bit fjes aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops iTCO_wdt iTCO_vendor_support dcdbas(X) dca shpchp lpc_ich mfd_core processor mei wmi button nfsd nfs_acl lockd auth_rpcgss grace sunrpc mlx5_ib(OEX) ext4 crc16 jbd2 mbcache ib_core(OEX) sd_mod mlx5_core(OEX) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core devlink inet_lro ehci_pci mlx_compat(OEX) ehci_hcd ahci libahci usbcore libata megaraid_sas usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 &lt;span class=&quot;error&quot;&gt;&amp;#91;last unloaded: dell_rbu&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579731-04:00 btsotbal3000 kernel: [ 5037.106543] Supported: No, Unsupported modules are loaded&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579732-04:00 btsotbal3000 kernel: [ 5037.106546] CPU: 12 PID: 4222 Comm: ptlrpcd_00_18 Tainted: G OE NX 4.4.132-94.33-default #1&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579733-04:00 btsotbal3000 kernel: [ 5037.106548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.0.2 03/15/2016&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579734-04:00 btsotbal3000 kernel: [ 5037.106549] task: ffff883f99d95640 ti: ffff883f99d98000 task.ti: ffff883f99d98000&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579735-04:00 btsotbal3000 kernel: [ 5037.106550] RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810cb01f&amp;gt;&amp;#93;&lt;/span&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810cb01f&amp;gt;&amp;#93;&lt;/span&gt; native_queued_spin_lock_slowpath+0x10f/0x1a0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RSP: 0018:ffff883f99d9bc00 EFLAGS: 00000246&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000340000&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106558] RDX: ffff883fff05af80 RSI: 0000000000100001 RDI: ffff881ffae04500&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106559] RBP: ffff883f99d9bce0 R08: ffff881fff99af80 R09: 0000000000000000&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579738-04:00 btsotbal3000 kernel: [ 5037.106560] R10: 000000000000002d R11: ffff881efa09381b R12: 00000000147f4371&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579739-04:00 btsotbal3000 kernel: [ 5037.106561] R13: 00000000051fd0dc R14: 0000000000000000 R15: ffff883f99d95640&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106562] FS: 0000000000000000(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106564] CR2: 00007f0e6c018898 CR3: 0000000001e0c000 CR4: 00000000003606f0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579741-04:00 btsotbal3000 kernel: [ 5037.106565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579742-04:00 btsotbal3000 kernel: [ 5037.106566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579743-04:00 btsotbal3000 kernel: [ 5037.106567] Stack:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106568] ffffffff811915d2 ffffffffa0b98f15 ffffffffa0eee8db ffffffffa0eee8db&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106569] ffffffffa0eee8db ffffffffa0eee8db 0000000000000000 00080000ffffffff&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106571] 5b2a6a1100000000 00000000ffffffff ffff883f92821048 ffffffffa0eee8db&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106573] Call Trace:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106581] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811915d2&amp;gt;&amp;#93;&lt;/span&gt; queued_spin_lock_slowpath+0x7/0xa&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579746-04:00 btsotbal3000 kernel: [ 5037.106605] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0b98f15&amp;gt;&amp;#93;&lt;/span&gt; LNetMDUnlink+0x65/0x150 &lt;span class=&quot;error&quot;&gt;&amp;#91;lnet&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579747-04:00 btsotbal3000 kernel: [ 5037.106677] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0e854be&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_unregister_reply+0xee/0x6b0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106703] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0e89de3&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_expire_one_request+0xb3/0x420 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106726] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0e8a1f9&amp;gt;&amp;#93;&lt;/span&gt; ptlrpc_expired_set+0xa9/0x180 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106752] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0eb7679&amp;gt;&amp;#93;&lt;/span&gt; ptlrpcd+0x229/0x490 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106760] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8109ebc9&amp;gt;&amp;#93;&lt;/span&gt; kthread+0xc9/0xe0&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579750-04:00 btsotbal3000 kernel: [ 5037.106765] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff816177ef&amp;gt;&amp;#93;&lt;/span&gt; ret_from_fork+0x3f/0x80&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579751-04:00 btsotbal3000 kernel: [ 5037.107943] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x80&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107944]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107945] Leftover inexact backtrace:&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107945]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107947] &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8109eb00&amp;gt;&amp;#93;&lt;/span&gt; ? kthread_park+0x50/0x50&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2018-06-20T10:52:01.579754-04:00 btsotbal3000 kernel: [ 5037.107948] Code: c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 80 af 01 00 48 03 14 c5 20 0d f5 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 &amp;lt;41&amp;gt; 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;At that point, all the CPUs go to 100% and it seems like we are not making progress anymore.&lt;/p&gt;

&lt;p&gt;I can repro by running robinhood for a few minutes and dropping the cache (echo 2 &amp;gt; /proc/sys/vm/drop_caches).&lt;/p&gt;
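As a sketch, the repro described above amounts to the following shell sequence (the robinhood invocation, mount point, and timing are illustrative assumptions, not taken verbatim from the ticket):

```shell
# Illustrative repro sketch: build up cached inodes and DLM locks with a
# robinhood scan, then force slab reclaim. Paths and flags are assumptions.
robinhood --scan /mnt/lustre &      # metadata-heavy scan of the filesystem
sleep 300                           # let the caches grow for a few minutes
echo 2 > /proc/sys/vm/drop_caches   # reclaim slab objects (dentries/inodes)
# per the report, the ptlrpcd soft lockup follows shortly after the reclaim
```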

&lt;p&gt;I managed to fix it by limiting lru_max_age to 10s (ldlm.namespaces.*.lru_max_age).&lt;/p&gt;
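Expressed as lctl commands, the workaround amounts to something like this (assumption: lru_max_age is in milliseconds, consistent with the 3900000 default quoted later in the ticket):

```shell
# Sketch of the workaround from this ticket, run on the client.
lctl set_param ldlm.namespaces.*.lru_max_age=10000   # age locks out after 10s
# check how many DLM locks remain cached afterwards
lctl get_param ldlm.namespaces.*.lock_count
```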


&lt;p&gt;Overall, it feels like when the slab shrinker runs, if there are too many locks, reclaiming those locks takes a while, which prevents the RPC stack from making progress, and then we hit a race condition in LNetMDUnlink.&lt;/p&gt;</description>
                <environment>SLES12SP3 -- 4.4.132-94.33-default client -- lustre 2.10.4&lt;br/&gt;
OSS &amp;amp; MDS on centos 7.3 -- lustre 2.10.0 (with a couple of hand picked patches)</environment>
        <key id="52586">LU-11092</key>
            <summary>NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jwallior">Julien Wallior</reporter>
                        <labels>
                    </labels>
                <created>Wed, 20 Jun 2018 19:39:54 +0000</created>
                <updated>Fri, 21 Jan 2022 00:05:11 +0000</updated>
                                            <version>Lustre 2.10.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="231655" author="adilger" created="Wed, 8 Aug 2018 18:08:38 +0000"  >&lt;p&gt;There is a patch under development that might be able to help this - it reduces the number of client RPCs in flight if it is not getting a timely response from the servers.  This helps reduce the load on the servers, and avoids timeouts on the clients.&lt;/p&gt;</comment>
                            <comment id="231873" author="adilger" created="Mon, 13 Aug 2018 17:19:48 +0000"  >&lt;p&gt;There are patches for master and b2_10 in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11079&quot; title=&quot;Control concurrent statahead instances&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11079&quot;&gt;&lt;del&gt;LU-11079&lt;/del&gt;&lt;/a&gt; that may address this issue.&lt;/p&gt;</comment>
                            <comment id="231927" author="jwallior" created="Tue, 14 Aug 2018 17:02:52 +0000"  >&lt;p&gt;We tried with the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11079&quot; title=&quot;Control concurrent statahead instances&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11079&quot;&gt;&lt;del&gt;LU-11079&lt;/del&gt;&lt;/a&gt;. Same issue (after putting lru_max_age back to the default).&lt;/p&gt;

&lt;p&gt;Overall, since we lowered lru_max_age, we haven&apos;t had any issue related to this bug.&lt;/p&gt;</comment>
                            <comment id="231959" author="adilger" created="Wed, 15 Aug 2018 04:06:41 +0000"  >&lt;p&gt;When you run without the reduced &lt;tt&gt;lru_max_age&lt;/tt&gt;, how many locks accumulate on the clients/MDS?  This can be seen on the clients and servers with &lt;tt&gt;lctl get_param ldlm.namespaces.&amp;#42;MDT&amp;#42;.lock_count&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="231960" author="adilger" created="Wed, 15 Aug 2018 04:25:22 +0000"  >&lt;p&gt;Is it the MDS or the client that is running short of memory?  It looks like the client is getting timeouts on the MDS RPCs, so finding out where the memory pressure is will determine what to investigate/fix.  In addition to the &lt;tt&gt;lock_count&lt;/tt&gt;, please provide the output of &lt;tt&gt;/proc/meminfo&lt;/tt&gt; and &lt;tt&gt;/proc/slabinfo&lt;/tt&gt; on the RBH client and MDS.&lt;/p&gt;</comment>
                            <comment id="231972" author="jwallior" created="Wed, 15 Aug 2018 15:15:42 +0000"  >&lt;p&gt;When I revert lru_max_age to its default (3900000), I only do it on the client, and I can repro. (This is a live prod filesystem, so I can only break it so much.)&lt;/p&gt;

&lt;p&gt;Regarding who&apos;s running out of memory, I think it&apos;s the client (I can either wait for it to run out of memory, or trigger the issue earlier by dropping the cache on the client). The client has 256GB of RAM, so even for an 80M-file filesystem, I would hope that should be enough to run rbh.&lt;/p&gt;

&lt;p&gt;Finally, regarding the timeouts, they say &quot;@@@ Request sent has timed out for sent delay&quot;. I was thinking that meant the client was stuck on something (clearing the memory) and didn&apos;t get to send the request on time. But maybe it means the client can&apos;t put the request on the wire because the server told it to wait (I&apos;m not sure if that&apos;s possible).&lt;/p&gt;

&lt;p&gt;The client was freshly rebooted before starting the run. After running rbh --scan for ~20min, I got the following results.&lt;br/&gt;
mds831.txt: data on the mds hosting mdt0&lt;br/&gt;
mds832.txt: data on the mds hosting mdt1&lt;br/&gt;
client.txt: data on the client&lt;/p&gt;

&lt;p&gt;I can get the data on the OSS too if you need them. &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/30765/30765_mds832.txt&quot; title=&quot;mds832.txt attached to LU-11092&quot;&gt;mds832.txt&lt;/a&gt;&lt;/span&gt;  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/30766/30766_mds831.txt&quot; title=&quot;mds831.txt attached to LU-11092&quot;&gt;mds831.txt&lt;/a&gt;&lt;/span&gt;  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/30767/30767_client.txt&quot; title=&quot;client.txt attached to LU-11092&quot;&gt;client.txt&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="232092" author="adilger" created="Thu, 16 Aug 2018 22:26:23 +0000"  >&lt;p&gt;It might be that, with so much RAM on the client, it isn&apos;t trying very hard to cancel locks on the client.  Setting the lock max age to 10s is probably too short to be useful for other locks on the system (e.g. directories that are being accessed repeatedly).  Another option would be to set a limit on the number of locks on the client and increase the max age to something better like 10 minutes, like &lt;tt&gt;lctl set_param ldlm.namespaces.&amp;#42;.lru_size=100000 ldlm.namespaces.&amp;#42;.lru_max_age=600000&lt;/tt&gt; or similar.  If we assumed all of the RAM in use on the client was used by Lustre, the 1M inodes cached would be using about 13KB/inode (including 5M locks), which means that it would consume at most 13MB per OST for the inode/lock cache.  It may be that you are seeing more locks than inodes if e.g. you have a default stripe count = 4 (one MDT lock, and 4 OST locks per file).&lt;/p&gt;

&lt;p&gt;It probably makes sense to put some reasonable upper limit on the LRU size, since I&apos;ve seen similar problems of too many cached locks on other systems with a lot of RAM.  Even if there is enough free RAM it just slows things down when there are many millions of files cached for no real reason.  We might also need to take a look at the kernel slab cache shrinker to see if it is working effectively to shrink the DLM cache size under memory pressure.&lt;/p&gt;</comment>
                            <comment id="232133" author="jwallior" created="Fri, 17 Aug 2018 18:07:07 +0000"  >&lt;p&gt;Yes, we use stripe count &amp;gt; 1 in different directories.&lt;/p&gt;

&lt;p&gt;I tried lru_size = 100k and lru_max_age = 600k and that works fine too. &lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="52849">LU-11189</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="52535">LU-11079</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="60851">LU-13970</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="52525">LU-11075</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="30767" name="client.txt" size="16873" author="jwallior" created="Wed, 15 Aug 2018 15:15:22 +0000"/>
                            <attachment id="30766" name="mds831.txt" size="13743" author="jwallior" created="Wed, 15 Aug 2018 15:15:22 +0000"/>
                            <attachment id="30765" name="mds832.txt" size="13801" author="jwallior" created="Wed, 15 Aug 2018 15:15:22 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzyfb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>