<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:38:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10828] OBD devices and exports not cleaned up after llog processing failures</title>
                <link>https://jira.whamcloud.com/browse/LU-10828</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;When a failure occurs while processing the configuration llog, Lustre does not properly clean up OBD devices.  This has been an issue for a long time; I usually see it when attempting to mount with an incorrect configuration or something else out of the ordinary.  Recovering usually requires a reboot because an OBD is left with references held against it.  The particular case I am running into is when an OSS doesn&apos;t have a key loaded for SSK, so the MDT&apos;s OSP fails to connect to the OST.&lt;/p&gt;

&lt;p&gt;When ptlrpc_connect_import() returns an error, the export created earlier in osp_obd_connect() is not released.  This seems to be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7184&quot; title=&quot;(lod_dev.c:1493:lod_device_free()) ASSERTION( atomic_read(&amp;amp;lu-&amp;gt;ld_ref) == 0 ) failed: lu is ffff88010cf8a000&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7184&quot;&gt;&lt;del&gt;LU-7184&lt;/del&gt;&lt;/a&gt;: previously this case would LBUG, but now the export is simply never cleaned up.  I will submit a patch to call obd_disconnect() if osp_obd_connect()-&amp;gt;ptlrpc_connect_import() fails.&lt;/p&gt;
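
&lt;p&gt;The leak and the proposed fix can be illustrated with a small self-contained model.  This is a hypothetical sketch, not Lustre code.  sketch_obd, sketch_export and the sketch_* functions are invented stand-ins for the OBD device, its export, the export creation at the start of osp_obd_connect(), and obd_disconnect().  The model covers only the reference counting.  It assumes, as described above, that creating the export pins the device until the export is disconnected.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
```c
#include "assert.h"

/* Hypothetical stand-ins for an OBD device and its export.  These are
 * NOT the real Lustre structures, just enough state to show the leak. */
struct sketch_obd {
	int refcount;		/* references pinning the device (and module) */
};

struct sketch_export {
	struct sketch_obd *obd;
	int connected;
};

/* Models the early part of osp_obd_connect(): creating the export
 * takes a reference on the device. */
static void sketch_export_create(struct sketch_obd *obd,
				 struct sketch_export *exp)
{
	exp->obd = obd;
	exp->connected = 1;
	obd->refcount += 1;
}

/* Models obd_disconnect(): drops the reference taken at creation. */
static void sketch_disconnect(struct sketch_export *exp)
{
	assert(exp->connected);
	assert(exp->obd->refcount > 0);
	exp->obd->refcount -= 1;
	exp->connected = 0;
}

/* import_rc stands in for the return value of ptlrpc_connect_import(),
 * e.g. -111 (-ECONNREFUSED on Linux) as in the log in this report. */
static int sketch_osp_connect(struct sketch_obd *obd,
			      struct sketch_export *exp, int import_rc)
{
	sketch_export_create(obd, exp);
	if (import_rc != 0) {
		/* The proposed fix: release the export on the error path
		 * instead of returning with its reference still held. */
		sketch_disconnect(exp);
		return import_rc;
	}
	return 0;
}
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model, a failed sketch_osp_connect() leaves the device refcount at zero.  Without the sketch_disconnect() call on the error path, the refcount would stay at one.  That leftover reference plays the role of the reference that keeps the osp module from being unloaded.&lt;/p&gt;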

&lt;p&gt;There is another issue that I don&apos;t have an easy solution for, and I would welcome ideas from someone more familiar with llog processing.  Because this failure aborts llog processing for the MDT, no further devices are added, but the first OSP, whose connection caused the failure, has already been attached and set up.  Is there a proper place in llog processing where something like class_manual_cleanup() could be called?&lt;/p&gt;
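
&lt;p&gt;One possible shape for such a cleanup is to unwind, in reverse order, every device set up by the records processed so far, including the device whose connection failed.  The following is only a hypothetical model of that idea, not a description of how Lustre&apos;s llog processing is structured.  sketch_dev, sketch_handle_record(), sketch_process_log() and sketch_manual_cleanup() are invented names, and sketch_manual_cleanup() stands in for class_manual_cleanup().&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
```c
#include "assert.h"

/* Hypothetical device states, loosely mirroring attach/setup. */
enum sketch_state { SKETCH_FREE, SKETCH_ATTACHED, SKETCH_SETUP };

struct sketch_dev {
	enum sketch_state state;
};

/* Hypothetical record handler: attaches and sets up device idx, then
 * "connects" it.  A nonzero connect_rc simulates a failure that happens
 * after the device is already attached and set up. */
static int sketch_handle_record(struct sketch_dev *devs, int idx,
				int connect_rc)
{
	devs[idx].state = SKETCH_SETUP;	/* attach + setup succeeded */
	return connect_rc;
}

/* Stand-in for class_manual_cleanup(): the same two steps as the
 * manual lctl cleanup/detach commands shown in this report. */
static void sketch_manual_cleanup(struct sketch_dev *dev)
{
	if (dev->state == SKETCH_SETUP)
		dev->state = SKETCH_ATTACHED;	/* like "cleanup" */
	if (dev->state == SKETCH_ATTACHED)
		dev->state = SKETCH_FREE;	/* like "detach" */
}

/* Processes nrec records in order; record fail_at (or none, if -1)
 * returns fail_rc.  On the first failure, every device set up so far,
 * including the failing one, is cleaned up in reverse order. */
static int sketch_process_log(struct sketch_dev *devs, int nrec,
			      int fail_at, int fail_rc)
{
	int i;
	int rc = 0;

	for (i = 0; i != nrec; i++) {
		rc = sketch_handle_record(devs, i, i == fail_at ? fail_rc : 0);
		if (rc != 0)
			break;
	}
	if (rc == 0)
		return 0;

	/* Unwind from the failed record back to the first one. */
	for (int j = i; j >= 0; j--)
		sketch_manual_cleanup(devs + j);
	return rc;
}
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The model gets one thing for free that the real code may not: the processing loop knows exactly which devices the earlier records created.  Finding where that information is available is the open question above.&lt;/p&gt;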


&lt;p&gt;The failure from the logs looks something like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[432236.495419] LustreError: 16927:0:(sec_gss.c:2036:gss_svc_handle_init()) target &lt;span class=&quot;code-quote&quot;&gt;&apos;SiteA2-MDT0000_UUID&apos;&lt;/span&gt; is not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; context init (no target)
[432236.669171] LustreError: 17057:0:(gss_keyring.c:849:gss_sec_lookup_ctx_kr()) failed request key: -126
[432236.679584] LustreError: 17057:0:(sec.c:448:sptlrpc_req_get_ctx()) req ffff880f05f10600: fail to get context
[432236.690680] LustreError: 17057:0:(osp_dev.c:1452:osp_obd_connect()) SiteA2-OST0001-osc-MDT0000: can&apos;t connect obd: rc = -111
[432236.703333] LustreError: 17057:0:(lod_lov.c:302:lod_add_device()) SiteA2-OST0001-osc-MDT0000: cannot connect to next dev SiteA2-OST0001_UUID (-111)
[432236.718584] LustreError: 17057:0:(obd_config.c:1716:class_config_llog_handler()) MGC10.10.10.19@o2ib: cfg command failed: rc = -111
[432236.731905] Lustre:    cmd=cf00d 0:SiteA2-MDT0000-mdtlov  1:SiteA2-OST0001_UUID  2:1  3:1  

[432236.732069] LustreError: 15c-8: MGC10.10.10.19@o2ib: The configuration from log &lt;span class=&quot;code-quote&quot;&gt;&apos;SiteA2-MDT0000&apos;&lt;/span&gt; failed (-111). This may be the result of communication errors between &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; node and the MGS, a bad configuration, or other errors. See the syslog &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more information.
[432236.758276] LustreError: 16836:0:(obd_mount_server.c:1383:server_start_targets()) failed to start server SiteA2-MDT0000: -111
[432236.771206] LustreError: 16836:0:(obd_mount_server.c:1934:server_fill_super()) Unable to start targets: -111
[432236.782390] Lustre: Failing over SiteA2-MDT0000
[432236.940170] Lustre: server umount SiteA2-MDT0000 complete
[432236.940180] LustreError: 16836:0:(obd_mount.c:1583:lustre_fill_super()) Unable to mount  (-111)
[432239.712652] Lustre: MGS: Connection restored to 0b840ce7-6105-bed2-c3ed-d79df6a411a4 (at 10.10.4.11@o2ib)
[432248.702230] Lustre: 17432:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1521563934/real 1521563934]  req@ffff880f132b0000 x1595465066874656/t0(0) o251-&amp;gt;MGC10.10.10.19@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1521563940 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[432248.704582] Lustre: server umount MGS complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Trying to clean things up manually after the fact:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[root@sitea-oss-1 ~]# lctl dl
  3 UP osd-zfs SiteA2-MDT0000-osd SiteA2-MDT0000-osd_UUID 3
  9 IN osp SiteA2-OST0001-osc-MDT0000 SiteA2-MDT0000-mdtlov_UUID 3

[root@sitea-oss-1 ~]# lctl --device SiteA2-OST0001-osc-MDT0000 cleanup
[root@sitea-oss-1 ~]# lctl --device SiteA2-OST0001-osc-MDT0000 detach
[root@sitea-oss-1 ~]# lctl dl
[root@sitea-oss-1 ~]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The device is cleaned up and detached, but without the patch I am submitting, the osp module can&apos;t be unloaded because references to it are still held.&lt;/p&gt;

&lt;p&gt;From looking at the code, there are a handful of other places where exports may not be disconnected in failure cases that occur after obd_connect(), but I don&apos;t have time to investigate them.&lt;/p&gt;</description>
                <environment></environment>
        <key id="51438">LU-10828</key>
            <summary>OBD devices and exports not cleaned up after llog processing failures</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="jfilizetti">Jeremy Filizetti</assignee>
                                    <reporter username="jfilizetti">Jeremy Filizetti</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Tue, 20 Mar 2018 17:02:39 +0000</created>
                <updated>Wed, 6 Apr 2022 14:51:32 +0000</updated>
                            <resolved>Wed, 6 Apr 2022 14:51:32 +0000</resolved>
                                    <version>Lustre 2.10.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="224049" author="gerrit" created="Tue, 20 Mar 2018 17:17:51 +0000"  >&lt;p&gt;Jeremy Filizetti (jeremy.filizetti@gmail.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31696&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31696&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10828&quot; title=&quot;OBD devices and exports not cleaned up after llog processing failures&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10828&quot;&gt;&lt;del&gt;LU-10828&lt;/del&gt;&lt;/a&gt; osp: Disconnect export during failure case&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 7cb68e1c6745570bb741e8d9c09d366b1b3d1812&lt;/p&gt;</comment>
                            <comment id="224051" author="adilger" created="Tue, 20 Mar 2018 17:28:02 +0000"  >&lt;p&gt;Thanks for the patch.  I think the two Alexeys (CC&apos;d) are the ones most familiar with this code that might be able to answer your question.&lt;/p&gt;</comment>
                            <comment id="224054" author="shadow" created="Tue, 20 Mar 2018 17:33:59 +0000"  >&lt;p&gt;Jeremy,&lt;/p&gt;

&lt;p&gt;Can you check the patch at &lt;a href=&quot;https://review.whamcloud.com/#/c/27753/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/27753/&lt;/a&gt;?  Does it help in your case?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="46797">LU-9699</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="45063">LU-9267</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzukv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>