<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:06:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14046] lov tgt 0 not cleaned! deathrow=0, lovrc=1</title>
                <link>https://jira.whamcloud.com/browse/LU-14046</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Today we upgraded Oak servers from 2.10.8 to 2.12.5, and now we have ~50 clients (2.13) out of ~1,500 that cannot mount Oak at all after reboot. Example with client&#160;10.50.0.63@o2ib2:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 19 13:31:26 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(lov_obd.c:828:lov_cleanup()) oak-clilov-ffffa0d562f8a800: lov tgt 0 not cleaned! deathrow=0, lovrc=1
Oct 19 13:31:26 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(lov_obd.c:828:lov_cleanup()) Skipped 291 previous similar messages
Oct 19 13:31:27 sh02-ln03.stanford.edu kernel: Lustre: Unmounted oak-client
Oct 19 13:31:27 sh02-ln03.stanford.edu kernel: LustreError: 94181:0:(obd_mount.c:1669:lustre_fill_super()) Unable to mount&#160; (-5) &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the MGS side, I can see this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/tmp/dk:00010000:02000400:7.0:1603137393.190601:0:7903:0:(ldlm_lib.c:1151:target_handle_connect()) MGS: Received new LWP connection from 10.50.0.63@o2ib2, removing former export from same NID
/tmp/dk:00010000:00080000:7.0:1603137393.190602:0:7903:0:(ldlm_lib.c:1227:target_handle_connect()) MGS: connection from f3832037-ce6f-4@10.50.0.63@o2ib2 t0 exp ffff88f2a4e59c00 cur 12765 last 1603137393 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;2.10 servers with 2.13 clients worked fine; the problem appeared only once we moved to 2.12 servers with 2.13 clients.&lt;/p&gt;

&lt;p&gt;Please advise. Is it the same as in&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13719&quot; title=&quot;lov tgt 36 not cleaned! deathrow=0, lovrc=1&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13719&quot;&gt;&lt;del&gt;LU-13719&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

&lt;p&gt;Stephane&lt;/p&gt;</description>
                <environment>2.12.5 servers + 2.13 clients, CentOS 7.6</environment>
        <key id="61262">LU-14046</key>
            <summary>lov tgt 0 not cleaned! deathrow=0, lovrc=1</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Mon, 19 Oct 2020 20:49:00 +0000</created>
                <updated>Fri, 23 Oct 2020 21:47:51 +0000</updated>
                                            <version>Lustre 2.13.0</version>
                    <version>Lustre 2.12.5</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="282647" author="sthiell" created="Mon, 19 Oct 2020 20:54:30 +0000"  >&lt;p&gt;Sorry, I didn&apos;t mean to open this as an improvement; it&apos;s a bug we would like to report, with clients unable to mount the filesystem. Please let me know what you think. Thanks!&lt;/p&gt;</comment>
                            <comment id="282648" author="sthiell" created="Mon, 19 Oct 2020 21:22:27 +0000"  >&lt;p&gt;After reviewing the attached client logs (&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/36390/36390_sh02-ln03.client.dk.log&quot; title=&quot;sh02-ln03.client.dk.log attached to LU-14046&quot;&gt;sh02-ln03.client.dk.log&lt;/a&gt;) again, it looks like this could be due to something else.&lt;/p&gt;


&lt;p&gt;On the MGS/MDS, we can see endless disconnections:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603137677, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88f240925c40/0xe88ce0ce9dbc849f lrc: 4/1,0 mode: 
Oct 19 13:06:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:09:26 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2a8b2b400, cur 1603138166 expire 1603138016 last 1603137939
Oct 19 13:11:17 oak-md1-s1 kernel: LustreError: 29618:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88f1ef5e66c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:11:17 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:12:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to 0a2acb4f-5e35-e84f-7137-12310b3b17d8 (at 10.12.4.25@o2ib)
Oct 19 13:12:47 oak-md1-s1 kernel: Lustre: Skipped 3213 previous similar messages
Oct 19 13:14:26 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88e2baae9400, cur 1603138466 expire 1603138316 last 1603138239
Oct 19 13:15:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:15:22 oak-md1-s1 kernel: Lustre: Skipped 3237 previous similar messages
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603138277, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e04ef26e40/0xe88ce0ce9f2b45ae lrc: 4/1,0 mode: 
Oct 19 13:16:17 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:19:35 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2a7f22c00, cur 1603138775 expire 1603138625 last 1603138548
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 29746:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88e02528f680) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:21:18 oak-md1-s1 kernel: LustreError: 29746:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Oct 19 13:22:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to bf3e962b-e521-22c4-b7d4-b2c82f971648 (at 10.12.4.86@o2ib)
Oct 19 13:22:47 oak-md1-s1 kernel: Lustre: Skipped 3211 previous similar messages
Oct 19 13:24:35 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2aaa09c00, cur 1603139075 expire 1603138925 last 1603138848
Oct 19 13:25:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:25:22 oak-md1-s1 kernel: Lustre: Skipped 3253 previous similar messages
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:26:26 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603138886, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e1ad4618c0/0xe88ce0cea0bd8bfc lrc: 4/1,0 mode: 
Oct 19 13:26:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:29:41 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2a6e94400, cur 1603139381 expire 1603139231 last 1603139154
Oct 19 13:30:21 oak-md1-s1 kernel: Lustre: oak-MDT0001: Client c50f4c48-a8d0-2d5f-ff90-7efef0b098e9 (at 10.210.9.195@tcp1) reconnecting
Oct 19 13:30:21 oak-md1-s1 kernel: Lustre: Skipped 21 previous similar messages
Oct 19 13:31:27 oak-md1-s1 kernel: LustreError: 29879:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88dfc0b3fc80) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:31:27 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:32:47 oak-md1-s1 kernel: Lustre: MGS: Connection restored to e3ecfc0a-db4b-4 (at 10.50.10.3@o2ib2)
Oct 19 13:32:47 oak-md1-s1 kernel: Lustre: Skipped 3241 previous similar messages
Oct 19 13:34:41 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2aaadf400, cur 1603139681 expire 1603139531 last 1603139454
Oct 19 13:35:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:35:22 oak-md1-s1 kernel: Lustre: Skipped 3238 previous similar messages
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: Skipped 1 previous similar message
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1603139487, 300s ago), entering recovery for MGS@10.0.2.51@o2ib5 ns: MGC10.0.2.51@o2ib5 lock: ffff88e15230e540/0xe88ce0cea2ae56e9 lrc: 4/1,0 mode: 
Oct 19 13:36:27 oak-md1-s1 kernel: LustreError: 7862:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Oct 19 13:39:46 oak-md1-s1 kernel: Lustre: MGS: haven&apos;t heard from client 38b0ac60-6c23-4 (at 10.49.27.12@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88f2a4c9b800, cur 1603139986 expire 1603139836 last 1603139759
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 29993:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x736d61726170:0x3:0x0].0x0 (ffff88e130e895c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 7862:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Oct 19 13:41:27 oak-md1-s1 kernel: LustreError: 29993:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
Oct 19 13:42:48 oak-md1-s1 kernel: Lustre: MGS: Connection restored to 4151aadc-4857-e0bc-f1e5-8c97714919e5 (at 10.210.12.107@tcp1)
Oct 19 13:42:48 oak-md1-s1 kernel: Lustre: Skipped 3233 previous similar messages
Oct 19 13:45:22 oak-md1-s1 kernel: Lustre: MGS: Received new LWP connection from 10.0.2.102@o2ib5, removing former export from same NID
Oct 19 13:45:22 oak-md1-s1 kernel: Lustre: Skipped 3203 previous similar messages 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Do you think this issue is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13667&quot; title=&quot;ptlrpc_pinger_main is stuck in endless loop&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13667&quot;&gt;&lt;del&gt;LU-13667&lt;/del&gt;&lt;/a&gt;, for which a patch has landed in b2_12?&lt;/p&gt;</comment>
                            <comment id="282711" author="pjones" created="Tue, 20 Oct 2020 14:24:47 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Does this seem to be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13367&quot; title=&quot;lnet_handle_local_failure messages every 10 min ?&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13367&quot;&gt;&lt;del&gt;LU-13367&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="282712" author="sthiell" created="Tue, 20 Oct 2020 14:25:49 +0000"  >&lt;p&gt;We restarted the MGS, which had become quite loaded, and it crashed when we tried to stop it. This already used to happen with 2.10, and the bug is still present in 2.12.5. We have applied Hongchao&apos;s patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13667&quot; title=&quot;ptlrpc_pinger_main is stuck in endless loop&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13667&quot;&gt;&lt;del&gt;LU-13667&lt;/del&gt;&lt;/a&gt; (&quot;ptlrpc: fix endless loop issue&quot;) and restarted the MGS/MDS. After that, our 2.13 clients could mount the filesystem again, and we haven&apos;t seen lock timeout issues on the MGS even after failing over more OSTs.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="36390" name="sh02-ln03.client.dk.log" size="98137" author="sthiell" created="Mon, 19 Oct 2020 20:45:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01cp3:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>