<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:52:55 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12476] ldlm_bl_ processes running at 100% causing client issues</title>
                <link>https://jira.whamcloud.com/browse/LU-12476</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The symptom is that clients cannot access Lustre filesystem data. We are seeing timeouts in the logs, e.g.:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1561643289/real 1561643289]  req@ffff9bffcbd3b900 x1637338661347344/t0(0) o36-&amp;gt;echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1

Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 4 previous similar messages

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection to echo-MDT0000 (at 10.23.22.104@tcp) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will wait &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; recovery to complete

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection restored to 10.23.22.104@tcp (at 10.23.22.104@tcp)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the MDS we see:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 06:55:08 emds1 kernel: LustreError: 27539:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1561643408, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-echo-MDT0000_UUID lock: ffff88522ed35800/0x7d046634332b9f1e lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27539 timeout: 0 lvb_type: 0

Jun 27 07:00:04 emds1 kernel: Lustre: 27723:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (4/4), not sending early reply#012  req@ffff88231ac4f500 x1637338661348352/t0(0) o36-&amp;gt;e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:498/0 lens 928/3128 e 1 to 0 dl 1561644008 ref 2 fl Interpret:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: Lustre: 49916:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (118:3300s); client may timeout.  req@ffff8823586ce900 x1637332991328208/t910888684941(0) o36-&amp;gt;e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:247/0 lens 680/424 e 3 to 0 dl 1561640737 ref 1 fl Complete:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed after 3417.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We see an ldlm_bl_01 (or ldlm_bl_02) thread at 100% on a CPU core for extended periods (over an hour). It will settle down for a few minutes, then max out a CPU core again.&lt;/p&gt;

&lt;p&gt;What might be causing this?&lt;/p&gt;</description>
                <environment></environment>
        <key id="56206">LU-12476</key>
            <summary>ldlm_bl_ processes running at 100% causing client issues</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="pjones">Peter Jones</assignee>
                                    <reporter username="cmcl">Campbell Mcleay</reporter>
                        <labels>
                    </labels>
                <created>Thu, 27 Jun 2019 16:34:45 +0000</created>
                <updated>Wed, 2 Dec 2020 23:53:25 +0000</updated>
                                            <version>Lustre 2.10.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="250148" author="cmcl" created="Thu, 27 Jun 2019 16:50:47 +0000"  >&lt;p&gt;I noticed the &apos;system was overloaded (too many service threads, or there were not enough hardware resources)&apos; message in the error, though this system has been running with a very consistent load on the same hardware. We are running an lfsck on the MDS (it has been running for about a week: lctl lfsck_query -M echo-MDT0000) but have only noticed the above issues recently.&lt;/p&gt;</comment>
                            <comment id="250297" author="pfarrell" created="Fri, 28 Jun 2019 17:53:26 +0000"  >&lt;p&gt;Campbell,&lt;/p&gt;

&lt;p&gt;This doesn&apos;t obviously match up with any known issue from what we&apos;ve got here, but it should be easy to get a bunch more info.&lt;/p&gt;



&lt;p&gt;Can you do this on the system with the ldlm_bl thread maxing out (that&apos;s the MDS, right?) &lt;b&gt;while that&apos;s happening&lt;/b&gt;, and then provide the resulting log to us:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;DEBUGMB=`lctl get_param -n debug_mb`
lctl set_param *debug=-1 debug_mb=10000
lctl clear
lctl mark &quot;start&quot;
sleep 1
lctl dk &amp;gt; /tmp/log
#Set debug back to defaults
lctl set_param debug=&quot;super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck&quot;
lctl set_param debug_mb=$DEBUGMB&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, also on the MDS, output of:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl get_param ldlm.namespaces.*.lock_count &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(It would be interesting to get that lock_count info both while the bl thread is acting up, and when it is not.)&lt;/p&gt;


&lt;p&gt;It&apos;s also odd that lfsck has been running for so long - can you tell us more about that?&#160; Why are you running it - is there a problem that came up?&lt;br/&gt;
And let&apos;s get the status of lfsck on the MDS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lctl get_param mdd.*.lfsck_*&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="250298" author="pfarrell" created="Fri, 28 Jun 2019 17:54:54 +0000"  >&lt;p&gt;FWIW, the overloaded message you noted simply means something took too long.&#160; Sometimes it means there is too much demand on the system resources, but it can also reflect a delay in completing a specific operation for other reasons.&#160; So if you have some thread which is stuck for whatever reason, it will eventually print this message.&#160; (We could perhaps improve it to be clearer)&lt;/p&gt;</comment>
                            <comment id="250481" author="cmcl" created="Tue, 2 Jul 2019 09:43:13 +0000"  >&lt;p&gt;Thanks Patrick. I was away on the Friday and most of Monday, so apologies for the delay in replying. The problem seemed to sort itself out without intervention before I could run the commands you suggested. We haven&apos;t seen this issue on our other MDSs, which are comparatively busy, but there might have been an unusual operation or set of operations (millions of tiny files, a massive file, or something else). If it happens again, I will run the commands you suggested and see what they reveal.&lt;/p&gt;</comment>
                            <comment id="250618" author="cmcl" created="Wed, 3 Jul 2019 16:41:59 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;This is happening now, attached log (emds1-log.gz) and the other info requested.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Campbell&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                          
 21718 root      20   0       0      0      0 R 100.0  0.0 208:06.75 ldlm_bl_03  

[root@emds1 cmcl]# lctl get_param ldlm.namespaces.*.lock_count 
ldlm.namespaces.MGC10.23.22.104@tcp.lock_count=6
ldlm.namespaces.MGS.lock_count=328
ldlm.namespaces.echo-MDT0000-lwp-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0000-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0001-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0002-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0003-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0004-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0005-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0006-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0007-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0008-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0009-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0010-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0011-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0012-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0013-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0014-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0015-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0016-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0017-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0018-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0019-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0020-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0021-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0022-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0023-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0024-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0025-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0026-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0027-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0028-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0029-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0030-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0031-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0032-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0033-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0034-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0035-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0036-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0037-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0038-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0039-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0040-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0041-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0042-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0043-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0044-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0045-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0046-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0047-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0048-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0049-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0050-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0051-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0052-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0053-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0054-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0055-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0056-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0057-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0058-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0059-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0060-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0061-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0062-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0063-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0064-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0065-osc-MDT0000.lock_count=0
ldlm.namespaces.mdt-echo-MDT0000_UUID.lock_count=15540771
[root@emds1 cmcl]# lctl get_param mdd.*.lfsck_*
mdd.echo-MDT0000.lfsck_async_windows=1024
mdd.echo-MDT0000.lfsck_layout=
name: lfsck_layout
magic: 0xb17371b9
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1561056883
time_since_latest_start: 1113713 seconds
last_checkpoint_time: 1562170542
time_since_last_checkpoint: 54 seconds
latest_start_position: 77
last_checkpoint_position: 1666220057
first_failure_position: 262195188
success_count: 0
repaired_dangling: 0
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 0
repaired_inconsistent_owner: 17
repaired_others: 0
skipped: 0
failed_phase1: 1
failed_phase2: 0
checked_phase1: 319917891
checked_phase2: 0
run_time_phase1: 1105579 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 289 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 522 items/sec
real-time_speed_phase2: N/A
current_position: 1666711854
mdd.echo-MDT0000.lfsck_namespace=
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param:
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1561056883
time_since_latest_start: 1113713 seconds
last_checkpoint_time: 1562170542
time_since_last_checkpoint: 54 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 1666220057, [0x20003e671:0x1e10:0x0], 0x511cd47b1a8309c2
first_failure_position: 73962505, [0x20003cb09:0x449c:0x0], 0x712fba5e57a3ada8
checked_phase1: 1924518312
checked_phase2: 0
updated_phase1: 710433
updated_phase2: 0
failed_phase1: 5
failed_phase2: 0
directories: 97008120
dirent_repaired: 0
linkea_repaired: 710433
nlinks_repaired: 0
multiple_linked_checked: 1674560030
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 0
run_time_phase1: 1112435 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 1730 items/sec
average_speed_phase2: N/A
average_speed_total: 1730 items/sec
real_time_speed_phase1: 3808 items/sec
real_time_speed_phase2: N/A
current_position: 1666711854, [0x20003e1f9:0x1e423:0x0], 0x54bdf08ba56f6996
mdd.echo-MDT0000.lfsck_speed_limit=0
[root@emds1 cmcl]#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
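The dump above shows essentially all locks concentrated in the single MDT namespace (15.5M in mdt-echo-MDT0000_UUID). As a side note, a small hypothetical helper like the following (the function name and threshold are my own, not from the ticket) can pick the hot namespaces out of `lctl get_param ldlm.namespaces.*.lock_count` output:

```shell
# Hypothetical helper: flag LDLM namespaces whose lock_count exceeds a
# threshold. Feed it the output of:
#   lctl get_param ldlm.namespaces.*.lock_count
flag_hot_namespaces() {
  # $1 = threshold; reads "name=value" lines on stdin
  awk -F= -v max="$1" '$2+0 > max+0 { print $1, $2 }'
}

# Sample input abridged from the comment above:
flag_hot_namespaces 1000000 <<'EOF'
ldlm.namespaces.MGS.lock_count=328
ldlm.namespaces.echo-OST0000-osc-MDT0000.lock_count=0
ldlm.namespaces.mdt-echo-MDT0000_UUID.lock_count=15540771
EOF
# -> ldlm.namespaces.mdt-echo-MDT0000_UUID.lock_count 15540771
```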
                            <comment id="250619" author="cmcl" created="Wed, 3 Jul 2019 16:47:24 +0000"  >&lt;p&gt;Could the lfsck be blocking a resource?&lt;/p&gt;</comment>
                            <comment id="250620" author="pfarrell" created="Wed, 3 Jul 2019 16:57:24 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=cmcl&quot; class=&quot;user-hover&quot; rel=&quot;cmcl&quot;&gt;cmcl&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;I&apos;ll take a look.&#160; I think it&apos;s more likely the problem is related to the number of active locks - 15 million is quite a few.&#160; It&apos;s not impossible the lfsck scan is causing that, but...&lt;/p&gt;

&lt;p&gt;I am wondering again by the way:&lt;br/&gt;
It&apos;s also odd that lfsck has been running for so long - Can you tell us more about that?&#160; Why are you running it - Is there a problem that came up, or did it start automatically?&lt;/p&gt;</comment>
                            <comment id="250621" author="cmcl" created="Wed, 3 Jul 2019 17:09:55 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Errors were reported by e2fsck when it was rebooted (it had been crashing). It was running 2.10.6, so we upgraded it and ran an lfsck, since a check seemed sensible.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Campbell&lt;/p&gt;</comment>
                            <comment id="250837" author="cmcl" created="Mon, 8 Jul 2019 16:50:24 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Could the errors on the MDS filesystem be causing problems on the client? The client was unresponsive and the last thing in the log on the client was:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul  8 09:14:50 vanlustre3 kernel: LustreError: 146130:0:(file.c:3644:ll_inode_revalidate_fini()) echo: revalidate FID [0x20001dcd0:0x1382e:0x0] error: rc = -4
Jul  8 09:14:50 vanlustre3 kernel: LustreError: 146130:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 841563 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the MDS:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul  8 09:13:49 emds1 kernel: LustreError: 32492:0:(osp_object.c:582:osp_attr_get()) echo-OST0006-osc-MDT0000:osp_attr_get update error [0x100060000:0x2c5f05e:0x0]: rc = -4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We stopped the lfsck to see if it might be the cause of the problems, but I think that was not a good idea if the filesystem has errors, and we&apos;re still not sure what is causing the problems (this seems to be a different issue). It also seems like a bad idea to run a fsck on an active filesystem (we&apos;re only doing so because it is an important system). Should we perhaps stop all filesystem activity and just run the fsck? The problem is we don&apos;t know how long it would be offline for.&lt;/p&gt;

&lt;p&gt;Thanks for any help,&lt;/p&gt;

&lt;p&gt;Campbell&lt;/p&gt;</comment>
                            <comment id="250839" author="pfarrell" created="Mon, 8 Jul 2019 17:04:22 +0000"  >&lt;p&gt;Ah, sorry - I managed to lose track of this one.&#160; I&apos;ll look at your logs.&lt;/p&gt;

&lt;p&gt;Certainly the errors could be causing client side problems.&#160; -4 (EINTR) is an odd error to be getting from osp_attr_get, I&apos;ll have to poke around.&lt;/p&gt;</comment>
                            <comment id="250888" author="cmcl" created="Tue, 9 Jul 2019 11:51:50 +0000"  >&lt;p&gt;Hi Patrick, &lt;/p&gt;

&lt;p&gt;We had another one of these incidents after we stopped the lfsck, so the lfsck doesn&apos;t look like the cause. I&apos;ve gathered some more data - if you want it, let me know and I&apos;ll upload it. Would you recommend restarting the lfsck, or leaving it for now?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Campbell&lt;/p&gt;</comment>
                            <comment id="251622" author="cmcl" created="Thu, 18 Jul 2019 10:41:31 +0000"  >&lt;p&gt;Just an amendment: the fsck had actually finished. Any updates from your side?&lt;/p&gt;</comment>
                            <comment id="251963" author="cmcl" created="Wed, 24 Jul 2019 18:05:59 +0000"  >&lt;p&gt;Hi Patrick,&lt;/p&gt;

&lt;p&gt;Any recommendations? The issue is becoming so bad that we are getting a bit desperate. Should we try 2.12?&lt;/p&gt;

&lt;p&gt;Thanks for any help,&lt;/p&gt;

&lt;p&gt;Campbell&lt;/p&gt;</comment>
                            <comment id="286551" author="sthiell" created="Wed, 2 Dec 2020 23:53:25 +0000"  >&lt;p&gt;Even with the latest Lustre 2.12.5, ldlm_bl doesn&apos;t seem to scale very well when the lock count per MDT is above 10M, as in this ticket. This happens when users trigger large parallel copies, for example of hundreds of millions of files (esp. when using the --delete flag of rsync; not sure why). We&apos;ve seen that again today, with ldlm_bl at 100% and a lock_count above 15M, generating a lot of evictions and misc. errors. The workaround is to reduce lru_max_age on clients as mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12832&quot; title=&quot;soft lockup in ldlm_bl_xx threads at read for a single shared strided file&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12832&quot;&gt;&lt;del&gt;LU-12832&lt;/del&gt;&lt;/a&gt;. It would be nice if the MDT could adjust this dynamically when ldlm_bl becomes CPU bound.&lt;/p&gt;</comment>
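A minimal sketch of the client-side workaround described in the comment above, assuming standard `lctl` tunables; the values are illustrative, not tuned recommendations, and note that lru_max_age units vary by Lustre version (seconds in older releases, milliseconds with an optional unit suffix in newer ones):

```shell
# Client-side workaround sketch (illustrative values, not recommendations).
# Shorten how long unused DLM locks linger in each client's LRU so the
# server-side lock count stays bounded:
lctl set_param ldlm.namespaces.*.lru_max_age=600s  # unit suffix on newer releases
# Optionally also cap the number of cached locks per client namespace:
lctl set_param ldlm.namespaces.*.lru_size=10000
# Then verify on the MDS that the per-MDT lock count drops:
lctl get_param ldlm.namespaces.mdt-*.lock_count
```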
                    </comments>
                    <attachments>
                            <attachment id="33069" name="emds1-log.gz" size="4599254" author="cmcl" created="Wed, 3 Jul 2019 16:44:27 +0000"/>
                            <attachment id="33024" name="messages-vanlustre3.gz" size="630759" author="cmcl" created="Thu, 27 Jun 2019 17:02:42 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00itb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>