<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:55:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12735] MGS misbehaving in 2.12.2+</title>
                <link>https://jira.whamcloud.com/browse/LU-12735</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our first issue on Fir after upgrading to 2.12.2_119 (Lustre b2_12 2.12.2_116 + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;del&gt;LU-11285&lt;/del&gt;&lt;/a&gt;,&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12017&quot; title=&quot;Truncate vs setxattr deadlock with DoM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12017&quot;&gt;&lt;del&gt;LU-12017&lt;/del&gt;&lt;/a&gt;,&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11761&quot; title=&quot;blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11761&quot;&gt;&lt;del&gt;LU-11761&lt;/del&gt;&lt;/a&gt;): the MGS seems to have gone crazy and put down the server along with 2 MDTs. Quite annoying if that is actually a MGS problem.&lt;/p&gt;

&lt;p&gt;Since the upgrade of the filesystem, I&apos;ve noticed those messages:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[Sun Sep  8 14:12:26 2019][291910.295291] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail^M
[Sun Sep  8 14:12:26 2019][291910.308156] LustreError: Skipped 1 previous similar message^M
[Sun Sep  8 14:12:26 2019][291910.313838] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0^M
[Sun Sep  8 14:12:26 2019][291910.351487] LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message^M
[Sun Sep  8 14:12:27 2019][291911.201631] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[Sun Sep  8 14:12:27 2019][291911.221161] LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message^M
[Sun Sep  8 14:12:53 2019][291936.837519] Lustre: MGS: Connection restored to eb318ae2-201e-a222-0b8e-3d4d1220bc21 (at 10.9.106.8@o2ib4)^M
[Sun Sep  8 14:12:53 2019][291936.837577] LNetError: 22274:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don&apos;t perform health checking (-103, 0)^M
[Sun Sep  8 14:12:53 2019][291936.861716] Lustre: Skipped 3411 previous similar messages^M
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
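As a quick sanity check (not part of the original report), the epoch in the dmesg line above ("enqueued at 1567976846, 300s ago") can be converted to confirm it is consistent with the bracketed wall-clock timestamps (Sun Sep 8 14:12:26 2019 local time, UTC-7):

```python
# Hedged sanity check: convert the epoch from the ldlm_expired_completion_wait
# message and verify that enqueue + 300 s lands on the logged wall-clock time.
from datetime import datetime, timezone

enqueued = datetime.fromtimestamp(1567976846, tz=timezone.utc)
timed_out = datetime.fromtimestamp(1567976846 + 300, tz=timezone.utc)
print(enqueued.strftime("%Y-%m-%d %H:%M:%S"))   # 2019-09-08 21:07:26 UTC
print(timed_out.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-09-08 21:12:26 UTC = 14:12:26 UTC-7
```

The timeout instant matches the 14:12:26 dmesg timestamp, so the lock really did sit enqueued for the full 300 s LDLM completion timeout.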

&lt;p&gt;Because everything seemed to work OK, we didn&apos;t do anything. I think I&apos;ve occasionally seen this even with 2.10, and one solution is to remount the MGS.&lt;/p&gt;

&lt;p&gt;So last night, Fir was inaccessible and we got alerts. There wasn&apos;t any crash, but the primary MGS/MDS &lt;tt&gt;fir-md1-s1&lt;/tt&gt; was under heavy load:&lt;br/&gt;
&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;top - 22:35:21 up 3 days, 17:27,  1 user,  load average: 499.00, 487.24, 455.13
Tasks: 2201 total, 101 running, 2100 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us, 30.2 sy,  0.0 ni, 69.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 26356524+total, 99668392 free,  4215400 used, 15968145+buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 23785484+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 22485 root      20   0       0      0      0 R  37.8  0.0 181:28.81 ll_mgs_0029
 22394 root      20   0       0      0      0 R  35.9  0.0 175:39.61 ll_mgs_0001
 22476 root      20   0       0      0      0 R  33.9  0.0 171:57.25 ll_mgs_0020
 22467 root      20   0       0      0      0 R  31.9  0.0 171:41.76 ll_mgs_0011
 22479 root      20   0       0      0      0 R  31.6  0.0 171:59.09 ll_mgs_0023
 22483 root      20   0       0      0      0 R  27.6  0.0 166:36.77 ll_mgs_0027
 22471 root      20   0       0      0      0 R  21.4  0.0 172:23.91 ll_mgs_0015
 22451 root      20   0       0      0      0 R  19.7  0.0 170:26.72 ll_mgs_0003
 22455 root      20   0       0      0      0 R  19.7  0.0 160:29.77 ll_mgs_0007
 22478 root      20   0       0      0      0 R  15.5  0.0 174:40.18 ll_mgs_0022
 22457 root      20   0       0      0      0 R  13.8  0.0 172:22.42 ll_mgs_0008
 22484 root      20   0       0      0      0 R  11.5  0.0 186:13.69 ll_mgs_0028
 22473 root      20   0       0      0      0 R   9.9  0.0 159:36.69 ll_mgs_0017
 16487 root      20   0  164268   4572   1604 R   8.2  0.0   0:00.29 top
 22459 root      20   0       0      0      0 R   7.2  0.0 171:38.46 ll_mgs_0010
 22475 root      20   0       0      0      0 R   6.6  0.0 167:11.56 ll_mgs_0019
 22487 root      20   0       0      0      0 R   6.2  0.0 162:49.27 ll_mgs_0031
 24401 root      20   0       0      0      0 S   4.6  0.0  15:56.76 mdt00_098
 22472 root      20   0       0      0      0 R   3.9  0.0 170:10.49 ll_mgs_0016
 22482 root      20   0       0      0      0 R   3.6  0.0 168:09.82 ll_mgs_0026
 22269 root      20   0       0      0      0 S   3.3  0.0 101:06.47 kiblnd_sd_00_00
 22452 root      20   0       0      0      0 R   3.3  0.0 170:29.48 ll_mgs_0004
 22453 root      20   0       0      0      0 R   3.0  0.0 155:18.83 ll_mgs_0005
 22486 root      20   0       0      0      0 R   3.0  0.0 172:00.85 ll_mgs_0030
 24415 root      20   0       0      0      0 S   3.0  0.0   3:12.69 mdt00_109
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I should have taken a crash dump there, but instead I tried to unmount the MGS, which led to many soft lockups, and then I took a crash dump.&lt;/p&gt;

&lt;p&gt;I&apos;m attaching the &lt;tt&gt;foreach bt&lt;/tt&gt; output of the crash dump as  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/33466/33466_fir-md1-s1-foreachbt-2019-09-08-22-38-08.log&quot; title=&quot;fir-md1-s1-foreachbt-2019-09-08-22-38-08.log attached to LU-12735&quot;&gt;fir-md1-s1-foreachbt-2019-09-08-22-38-08.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;

&lt;p&gt;Crash dump (taken while the MGS was unmounting) is available on the WC FTP site as vmcore-fir-md1-s1-2019-09-08-22-38-08&lt;br/&gt;
Kernel version is 3.10.0-957.27.2.el7_lustre.pl1.x86_64&lt;/p&gt;

&lt;p&gt;Thanks!&lt;br/&gt;
Stephane&lt;/p&gt;</description>
                <environment>CentOS 7.6, 2.12.2_116 + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11285&quot; title=&quot;don&amp;#39;t stop on the first blocked lock in ldlm_reprocess_queue()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11285&quot;&gt;&lt;strike&gt;LU-11285&lt;/strike&gt;&lt;/a&gt;,&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12017&quot; title=&quot;Truncate vs setxattr deadlock with DoM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12017&quot;&gt;&lt;strike&gt;LU-12017&lt;/strike&gt;&lt;/a&gt;,&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11761&quot; title=&quot;blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11761&quot;&gt;&lt;strike&gt;LU-11761&lt;/strike&gt;&lt;/a&gt;</environment>
        <key id="56858">LU-12735</key>
            <summary>MGS misbehaving in 2.12.2+</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Mon, 9 Sep 2019 17:25:55 +0000</created>
                <updated>Tue, 6 Jul 2021 15:30:35 +0000</updated>
                            <resolved>Tue, 11 May 2021 15:37:25 +0000</resolved>
                                    <version>Lustre 2.12.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="254406" author="pjones" created="Mon, 9 Sep 2019 18:20:50 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="255256" author="tappro" created="Mon, 23 Sep 2019 12:56:52 +0000"  >&lt;p&gt;Is this occurring often, or is this the only such situation so far?&lt;/p&gt;</comment>
                            <comment id="292233" author="sthiell" created="Wed, 17 Feb 2021 21:53:50 +0000"  >&lt;p&gt;Hi Mike,&lt;br/&gt;
This issue is still around with 2.12.6 servers. Sometimes, after an OST failover, the MGS gets into a weird state and keeps disconnecting from clients. From a 2.12.6 client, I can see this when the issue occurs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594272, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e44d608f000) refcount nonzero (1) after lock cleanup; forcing
Feb 17 12:42:52 oak-h01v12.sunet kernel: LustreError: 5624:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:47:58 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:47:58 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message

Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613594886, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e4292446900) refcount nonzero (1) after lock cleanup; forcin
Feb 17 12:53:06 oak-h01v12.sunet kernel: LustreError: 13169:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 12:58:12 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: MGC10.0.2.51@o2ib5: Connection restored to 10.0.2.51@o2ib5 (at 10.0.2.51@o2ib5)
Feb 17 12:58:12 oak-h01v12.sunet kernel: Lustre: Skipped 1 previous similar message

Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1613595498, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock:
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 2020:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8e461f1dac00) refcount nonzero (1) after lock cleanup; forcin
Feb 17 13:03:18 oak-h01v12.sunet kernel: LustreError: 20180:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
Feb 17 13:08:25 oak-h01v12.sunet kernel: LustreError: Skipped 1 previous similar message
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
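As an illustration (computed from the log above, not part of the ticket itself), the client-side messages repeat on a fixed cadence: each "lock timed out ... 300s ago" / reconnect pair recurs roughly every 306 s, i.e. the 300 s LDLM completion timeout plus a few seconds of reconnect overhead.

```python
# Compute the gaps between successive timeout cycles in the syslog excerpt.
timestamps = ["12:42:52", "12:47:58", "12:53:06", "12:58:12", "13:03:18", "13:08:25"]

def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

gaps = [to_seconds(b) - to_seconds(a) for a, b in zip(timestamps, timestamps[1:])]
print(gaps)  # [306, 308, 306, 306, 307] -- the 300 s timeout plus reconnect overhead
```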

&lt;p&gt;etc.&lt;/p&gt;

&lt;p&gt;Then, when unmounting the MGS, the server gets stuck with some kind of ldlm_bl deadlock (see  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/37524/37524_vmcore-dmesg-oak-md1-s1-lu12735.txt&quot; title=&quot;vmcore-dmesg-oak-md1-s1-lu12735.txt attached to LU-12735&quot;&gt;vmcore-dmesg-oak-md1-s1-lu12735.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; ). When I take a crash dump, I see this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
crash&amp;gt; ps | grep &quot;&amp;gt;&quot;
&amp;gt;  3840      2   7  ffff8a6569feb180  RU   0.0       0      0  [ldlm_bl_01]
&amp;gt;  3841      2   5  ffff8a6569fec200  RU   0.0       0      0  [ldlm_bl_02]
&amp;gt; 22779      2   8  ffff8a65e97a9080  RU   0.0       0      0  [ldlm_bl_03]
&amp;gt; 22908      2   6  ffff8a5644bca100  RU   0.0       0      0  [ldlm_bl_04]
&amp;gt; 27290      2  19  ffff8a5642ba0000  RU   0.0       0      0  [ldlm_bl_05]
&amp;gt; 27291      2  21  ffff8a5642ba5280  RU   0.0       0      0  [ldlm_bl_06]
&amp;gt; 27292      2   2  ffff8a5642ba2100  RU   0.0       0      0  [ldlm_bl_07]
&amp;gt; 27293      2  12  ffff8a5642ba3180  RU   0.0       0      0  [ldlm_bl_08]
&amp;gt; 27294      2  11  ffff8a5642ba4200  RU   0.0       0      0  [ldlm_bl_09]
&amp;gt; 27295      2  15  ffff8a5664fa6300  RU   0.0       0      0  [ldlm_bl_10]
&amp;gt; 28505      2  17  ffff8a56bdb60000  RU   0.0       0      0  [ldlm_bl_11]
&amp;gt; 28506      2  10  ffff8a56bdb63180  RU   0.0       0      0  [ldlm_bl_12]
&amp;gt; 28507      2   4  ffff8a65c47a2100  RU   0.0       0      0  [ldlm_bl_13]
&amp;gt; 28508      2  16  ffff8a65c37bb180  RU   0.0       0      0  [ldlm_bl_14]
&amp;gt; 28509      2   1  ffff8a65c37ba100  RU   0.0       0      0  [ldlm_bl_15]
&amp;gt; 28510      2  13  ffff8a56517ad280  RU   0.0       0      0  [ldlm_bl_16]
&amp;gt; 28511      2   3  ffff8a56517ae300  RU   0.0       0      0  [ldlm_bl_17]
&amp;gt; 28512      2   0  ffff8a65cab9b180  RU   0.0       0      0  [ldlm_bl_18]
&amp;gt; 28513      2  14  ffff8a65cab9e300  RU   0.0       0      0  [ldlm_bl_19]
&amp;gt; 28514      2  18  ffff8a65cab9d280  RU   0.0       0      0  [ldlm_bl_20]
&amp;gt; 28515      2   9  ffff8a65cab9c200  RU   0.0       0      0  [ldlm_bl_21]
&amp;gt; 28516      2  23  ffff8a65cab9a100  RU   0.0       0      0  [ldlm_bl_22]
&amp;gt; 28517      2  20  ffff8a65cab99080  RU   0.0       0      0  [ldlm_bl_23]
&amp;gt; 28518      2  22  ffff8a65cab98000  RU   0.0       0      0  [ldlm_bl_24]
crash&amp;gt; bt 3840
PID: 3840   TASK: ffff8a6569feb180  CPU: 7   COMMAND: &quot;ldlm_bl_01&quot;
 #0 [ffff8a757dac8e48] crash_nmi_callback at ffffffffaa458537
 #1 [ffff8a757dac8e58] nmi_handle at ffffffffaab8c93c
 #2 [ffff8a757dac8eb0] do_nmi at ffffffffaab8cc18
 #3 [ffff8a757dac8ef0] end_repeat_nmi at ffffffffaab8bd9c
    [exception RIP: native_queued_spin_lock_slowpath+294]
    RIP: ffffffffaa517626  RSP: ffff8a6569ffbc10  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff8a6559155e80  RCX: 0000000000390000
    RDX: ffff8a657e29b8c0  RSI: 0000000000210001  RDI: ffff8a56673df99c
    RBP: ffff8a6569ffbc10   R8: ffff8a757dadb8c0   R9: 0000000000000000
    R10: ffff8a75692b7658  R11: ffff8a75692b7648  R12: ffff8a6569ffbdb8
    R13: ffff8a5652b91c00  R14: ffff8a6559155e80  R15: ffff8a756fc07800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- &amp;lt;NMI exception stack&amp;gt; ---
 #4 [ffff8a6569ffbc10] native_queued_spin_lock_slowpath at ffffffffaa517626
 #5 [ffff8a6569ffbc18] queued_spin_lock_slowpath at ffffffffaab7b43b
 #6 [ffff8a6569ffbc28] _raw_spin_lock at ffffffffaab89b50
 #7 [ffff8a6569ffbc38] lock_res_and_lock at ffffffffc123a02c [ptlrpc]
 #8 [ffff8a6569ffbc50] ldlm_lock_cancel at ffffffffc1242f8d [ptlrpc]
 #9 [ffff8a6569ffbc70] ldlm_cancel_lock_for_export at ffffffffc12431c4 [ptlrpc]
#10 [ffff8a6569ffbcb8] ldlm_cancel_locks_for_export_cb at ffffffffc12434ec [ptlrpc]
#11 [ffff8a6569ffbce0] cfs_hash_for_each_relax at ffffffffc0d2b1a0 [libcfs]
#12 [ffff8a6569ffbd60] cfs_hash_for_each_empty at ffffffffc0d2e700 [libcfs]
#13 [ffff8a6569ffbda0] ldlm_export_cancel_locks at ffffffffc12437fa [ptlrpc]
#14 [ffff8a6569ffbe10] ldlm_bl_thread_main at ffffffffc126ccf8 [ptlrpc]
#15 [ffff8a6569ffbec8] kthread at ffffffffaa4c5c21
#16 [ffff8a6569ffbf50] ret_from_fork_nospec_begin at ffffffffaab94ddd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="295245" author="delbaryg" created="Wed, 17 Mar 2021 16:06:34 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Unfortunately, we hit the same issue. Our Lustre version is a bit different: 2.12.4 for servers and 2.12.5/6 for clients.&lt;/p&gt;

&lt;p&gt;The context is a bit different because we have found some steps that trigger the MGS&apos;s weird state.&lt;/p&gt;

&lt;p&gt;We fall into this issue when a client holding a lock on the MGS disappears from the LNet network, a modification is then made on the MGS before the LDLM timeout expires (e.g.&#160;an OST add, set_param -P, ...), and the MGS is then unmounted.&lt;/p&gt;

&lt;p&gt;On the crash we have the same backtrace as Stephane, but this time the crash was immediate on unmount. We think&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; could help. We will confirm this with a deep analysis of the crash dump.&lt;/p&gt;

&lt;p&gt;One way to avoid this is to make sure all exports are connected... @Stephane, before the unmount, did you have any other misbehaving clients besides the failed OST?&lt;/p&gt;

</comment>
                            <comment id="295248" author="sthiell" created="Wed, 17 Mar 2021 16:46:50 +0000"  >&lt;p&gt;Glad to know that I&apos;m not alone, Ga&#235;l. I don&apos;t think we got weird clients apart from the OST failover. At the moment, our workaround with Lustre 2.12.6 is to unmount the MGS before stopping any MDT or OST (for failover), just to avoid a MGS crash. Remounting the MGS before starting OST/MDTs seems to be fine. Next time, I will check the exports.&#160;It&apos;s an annoying problem especially if you have MDTs co-located with the MGS.&lt;/p&gt;</comment>
                            <comment id="301176" author="eaujames" created="Tue, 11 May 2021 14:15:55 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;We have applied the patch&#160;&lt;a href=&quot;https://review.whamcloud.com/41309&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41309&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt; client: don&apos;t use OBD_CONNECT_MNE_SWAB&quot;) in production for several weeks now, and the issue has not appeared again.&lt;/p&gt;

&lt;p&gt;The issue used to occur after an OST failover.&lt;/p&gt;
                            <comment id="301186" author="sthiell" created="Tue, 11 May 2021 15:08:11 +0000"  >&lt;p&gt;Etienne &#8211;&#160;Great news, thanks! Hope this patch can make it to 2.12.7.&lt;/p&gt;</comment>
                            <comment id="301190" author="pjones" created="Tue, 11 May 2021 15:37:25 +0000"  >&lt;p&gt;Good news - thanks&lt;/p&gt;</comment>
                            <comment id="305881" author="mrb" created="Wed, 30 Jun 2021 12:58:03 +0000"  >&lt;p&gt;Hello, sorry to comment on a resolved issue, but I believe I&apos;m hitting this issue as well at the moment.&lt;/p&gt;

&lt;p&gt;Can I ask if using the patch mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; is the full solution here?&lt;/p&gt;

&lt;p&gt;@Stephane - are you using this patch on your 2.12 servers at the moment? I don&apos;t see it applied to b2_12 upstream at the moment, which is a bit concerning - this has been quite debilitating for us, so I&apos;m curious why it hasn&apos;t made it into 2.12.7.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Matt&lt;/p&gt;</comment>
                            <comment id="305883" author="delbaryg" created="Wed, 30 Jun 2021 13:15:11 +0000"  >&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;I won&apos;t answer in place of St&#233;phane, but on our side it fully fixes this issue. We have the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; backport on all our production Lustre systems.&lt;/p&gt;

&lt;p&gt;Gael&lt;/p&gt;</comment>
                            <comment id="305889" author="mrb" created="Wed, 30 Jun 2021 13:53:26 +0000"  >&lt;p&gt;Thanks Gael, that&apos;s all I wanted to hear! Super - I&apos;m going to test this on our environment now and hopefully fix this issue.&lt;/p&gt;

&lt;p&gt;Cheers!&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="306226" author="mrb" created="Mon, 5 Jul 2021 14:59:50 +0000"  >&lt;p&gt;Hi Gael, sorry I missed this before, but is this a client &lt;b&gt;and&lt;/b&gt; server patch? I tried deploying just to our MGS server initially, but it didn&apos;t have any effect, before looking more closely. Have you deployed this across both clients &amp;amp; servers in your environment?&lt;/p&gt;

&lt;p&gt;Thanks again,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="306273" author="delbaryg" created="Tue, 6 Jul 2021 09:53:53 +0000"  >&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;Unfortunately, that&apos;s a server patch &lt;b&gt;only&lt;/b&gt;. In our case, the effect was immediate. Perhaps you have a similar issue but not exactly the same one. Have you got an open Jira ticket for your issue?&lt;/p&gt;

&lt;p&gt;Gael&lt;/p&gt;</comment>
                            <comment id="306276" author="mrb" created="Tue, 6 Jul 2021 10:42:34 +0000"  >&lt;p&gt;Thanks for confirming. Yes perhaps I have something else going on too, I&apos;ve opened my own issue under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14802&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-14802&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks again,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="306315" author="sthiell" created="Tue, 6 Jul 2021 15:30:35 +0000"  >&lt;p&gt;Hey Matt, sorry to hear about your current issues. We have not applied the patch &lt;a href=&quot;https://review.whamcloud.com/#/c/41309/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/41309/&lt;/a&gt; (backport for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;) in production yet, I was waiting for it to land to 2.12.7. I also thought it was a server-only patch.&lt;br/&gt;
As you have seen in my ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14695&quot; title=&quot;New OST not visible by MDTs. MGS problem or corrupt catalog llog?&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14695&quot;&gt;LU-14695&lt;/a&gt;, we have been having issues with the MGS propagating the configuration to the other targets when new OSTs are added, and we also have some kind of llog corruption that we&apos;re still trying to understand. But in our case, new clients can still mount properly (could be some luck). I hope Whamcloud will be able to help us understand these config llog / MGS-related problems. I&apos;m going to watch your ticket! Fingers crossed.&lt;/p&gt;
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="58344">LU-13356</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="33466" name="fir-md1-s1-foreachbt-2019-09-08-22-38-08.log" size="1568390" author="sthiell" created="Mon, 9 Sep 2019 17:22:12 +0000"/>
                            <attachment id="37524" name="vmcore-dmesg-oak-md1-s1-lu12735.txt" size="795896" author="sthiell" created="Wed, 17 Feb 2021 21:53:01 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00mfr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>