<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:48:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11969] sanity-lfsck test_36c: (5) MDS3 is not the expected &apos;completed&apos;</title>
                <link>https://jira.whamcloud.com/browse/LU-11969</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was created by maloo for Andreas Dilger  &amp;lt;adilger@whamcloud.com&amp;gt;&lt;/p&gt;

&lt;p&gt;This issue relates to the following test suite run: &lt;a href=&quot;https://testing.whamcloud.com/test_sets/52b653d4-3018-11e9-bd83-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/52b653d4-3018-11e9-bd83-52540065bddc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;test_36c failed with the following error on the MDS1 log:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lfsck           S ffff8a82a228d140     0  8776      2 0x00000080
Call Trace:
schedule+0x29/0x70
lfsck_double_scan_generic+0x22e/0x2c0 [lfsck]
lfsck_layout_master_double_scan+0x30/0x1e0 [lfsck]
lfsck_double_scan+0x5f/0x210 [lfsck]
lfsck_master_engine+0x4c6/0x1370 [lfsck]
kthread+0xd1/0xe0

umount          S ffff8a829b391040     0 13270  13269 0x00000080
Call Trace:
schedule+0x29/0x70
lfsck_stop+0x485/0x5c0 [lfsck]
mdd_iocontrol+0x327/0xb40 [mdd]
mdt_device_fini+0x75/0x930 [mdt]
class_cleanup+0x862/0xbd0 [obdclass]
class_process_config+0x65c/0x2830 [obdclass]
class_manual_cleanup+0x1c6/0x710 [obdclass]
server_put_super+0x8de/0xcd0 [obdclass]
generic_shutdown_super+0x6d/0x100
kill_anon_super+0x12/0x20
lustre_kill_super+0x32/0x50 [obdclass]
deactivate_locked_super+0x4e/0x70
deactivate_super+0x46/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0xbb/0xe0
do_notify_resume+0xa5/0xc0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;







&lt;p&gt;VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV&lt;br/&gt;
sanity-lfsck test_36c - (5) MDS3 is not the expected &apos;completed&apos;&lt;/p&gt;</description>
                <environment></environment>
        <key id="54888">LU-11969</key>
            <summary>sanity-lfsck test_36c: (5) MDS3 is not the expected &apos;completed&apos;</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="maloo">Maloo</reporter>
                        <labels>
                    </labels>
                <created>Thu, 14 Feb 2019 06:40:56 +0000</created>
                <updated>Wed, 10 Apr 2019 08:49:32 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="241979" author="pfarrell" created="Thu, 14 Feb 2019 16:54:31 +0000"  >&lt;p&gt;Master:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://testing.whamcloud.com/test_sessions/b9c7607a-51ca-4bc2-89bb-accfb41935ef&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sessions/b9c7607a-51ca-4bc2-89bb-accfb41935ef&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="241985" author="pfarrell" created="Thu, 14 Feb 2019 18:01:17 +0000"  >&lt;p&gt;For whenever an lfsck-familiar person takes a crack at this, here are a few notes.&lt;/p&gt;

&lt;p&gt;This looks to be a race condition, but I&apos;m not at all sure how to fix it.&lt;/p&gt;

&lt;p&gt;Here are the log snippets:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;MDT0002 master is 8467

00100000:10000000:0.0:1550117810.581394:0:8467:0:(lfsck_lib.c:2584:lfsck_post_generic()) lustre-MDT0002-osd: waiting for assistant to do lfsck_layout post, rc = 1
00100000:10000000:0.0:1550117810.581416:0:8467:0:(lfsck_lib.c:2614:lfsck_double_scan_generic()) lustre-MDT0002-osd: waiting for assistant to do lfsck_layout double_scan, status 2
00100000:10000000:0.0:1550117810.581420:0:8469:0:(lfsck_engine.c:1665:lfsck_assistant_engine()) lustre-MDT0002-osd: lfsck_layout LFSCK assistant thread post

8469 is the child thread...
00100000:10000000:0.0:1550117810.584905:0:8469:0:(lfsck_engine.c:1684:lfsck_assistant_engine()) lustre-MDT0002-osd: LFSCK assistant notified others for lfsck_layout post: rc = 0
00100000:10000000:0.0:1550117810.584914:0:8469:0:(lfsck_engine.c:1702:lfsck_assistant_engine()) lustre-MDT0002-osd: LFSCK assistant sync before the second-stage scaning

8467 again:
00100000:10000000:0.0:1550117810.585025:0:8467:0:(lfsck_lib.c:2624:lfsck_double_scan_generic()) lustre-MDT0002-osd: the assistant has done lfsck_layout double_scan, status 0
And this completes normally.

MDT0000, 8464 is master:
00100000:10000000:0.0:1550117810.584824:0:8464:0:(lfsck_lib.c:2584:lfsck_post_generic()) lustre-MDT0000-osd: waiting for assistant to do lfsck_layout post, rc = 1
00100000:10000000:0.0:1550117810.584831:0:8464:0:(lfsck_lib.c:2596:lfsck_post_generic()) lustre-MDT0000-osd: the assistant has done lfsck_layout post, rc = 1
00100000:10000000:0.0:1550117810.584841:0:8464:0:(lfsck_layout.c:5881:lfsck_layout_master_post()) lustre-MDT0000-osd: layout LFSCK master post done: rc = 0
Assistant is 8466:
00100000:10000000:1.0:1550117810.584842:0:8466:0:(lfsck_engine.c:1665:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_layout LFSCK assistant thread post
00100000:10000000:0.0:1550117810.584843:0:8464:0:(lfsck_lib.c:2614:lfsck_double_scan_generic()) lustre-MDT0000-osd: waiting for assistant to do lfsck_layout double_scan, status 2

This thread (8464, the master thread here) doesn&apos;t wake up again.

Notice, we&apos;ve got &quot;assistant thread post&quot; *before* lfsck_layout double_scan waiting...

Various things happen, then eventually:
00100000:10000000:1.0:1550117810.585975:0:8466:0:(lfsck_engine.c:1684:lfsck_assistant_engine()) lustre-MDT0000-osd: LFSCK assistant notified others for lfsck_layout post: rc = 0

The assistant thread never logs this message:
00100000:10000000:0.0:1550117810.584914:0:8469:0:(lfsck_engine.c:1702:lfsck_assistant_engine()) lustre-MDT0002-osd: LFSCK assistant sync before the second-stage scaning

And the master thread :8464: never wakes up. &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The condition for doing &quot;second-stage scanning&quot; (in the assistant) is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                if (lad-&amp;gt;lad_to_double_scan) {
                        lad-&amp;gt;lad_to_double_scan = 0;
                        atomic_inc(&amp;amp;lfsck-&amp;gt;li_double_scan_count); &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Which is set in the master thread:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lfsck_double_scan_generic:
        if (status != LS_SCANNING_PHASE2)
                lad-&amp;gt;lad_exit = 1;
        else
                lad-&amp;gt;lad_to_double_scan = 1;        

        CDEBUG(D_LFSCK, &quot;%s: waiting for assistant to do %s double_scan, &quot;
               &quot;status %d\n&quot;,
               lfsck_lfsck2name(com-&amp;gt;lc_lfsck), lad-&amp;gt;lad_name, status);

        wake_up_all(&amp;amp;athread-&amp;gt;t_ctl_waitq);
        l_wait_event(mthread-&amp;gt;t_ctl_waitq,
                     lad-&amp;gt;lad_in_double_scan ||
                     thread_is_stopped(athread),
                     &amp;amp;lwi); &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Basically, it looks like on MDT0000 the assistant thread completed the first-phase scanning before the master thread had set this variable and gone to sleep. And the assistant thread never enters the second-scan path that would cause it to wake up the master thread, because it has already passed that point.&lt;/p&gt;</comment>
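Editor's note: the handshake Patrick describes can be modelled in a minimal userspace sketch. This is an illustration only, not Lustre code: Python's threading.Condition stands in for the kernel's l_wait_event/wake_up_all, and the flag names merely mirror lad_to_double_scan / lad_in_double_scan. The point it demonstrates is the standard cure for a lost wakeup: the sleeper re-checks its predicate under the lock, so a notification that fires before the waiter arrives (the ordering seen in the MDT0000 log above) cannot be lost.

```python
# Hypothetical model of the master/assistant double-scan handshake.
# Names mirror the Lustre fields but this is NOT the kernel implementation.
import threading
import time

lock = threading.Lock()
cond = threading.Condition(lock)
to_double_scan = False   # cf. lad_to_double_scan, set by the master
in_double_scan = False   # cf. lad_in_double_scan, set by the assistant

def assistant():
    """Assistant thread: wait for the double-scan request, then ack it."""
    global to_double_scan, in_double_scan
    with cond:
        # Predicate checked in a loop under the lock: even if the
        # assistant reaches this point before the master sets the flag,
        # the request is observed once the master does set it.
        while not to_double_scan:
            cond.wait()
        to_double_scan = False
        in_double_scan = True
        print("assistant: entering double scan")
        cond.notify_all()            # wake the waiting master

t = threading.Thread(target=assistant)
t.start()
time.sleep(0.1)                      # let the assistant block first

with cond:
    to_double_scan = True            # cf. lad->lad_to_double_scan = 1
    cond.notify_all()                # cf. wake_up_all(&athread->t_ctl_waitq)
    while not in_double_scan:        # cf. l_wait_event(mthread->t_ctl_waitq, ...)
        cond.wait()
print("master: assistant reached double scan")
t.join()
```

In the failing trace the assistant had already passed its to_double_scan check when the master set the flag and went to sleep; with the predicate re-checked under the lock as above, the arrival order of the two threads cannot deadlock either side.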
                            <comment id="245501" author="bruno" created="Wed, 10 Apr 2019 08:49:19 +0000"  >&lt;p&gt;I think I have triggered the same issue during an autotest session at &lt;a href=&quot;https://testing.whamcloud.com/test_sessions/3955c094-f818-44fb-877f-de6d0d9fd73a&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sessions/3955c094-f818-44fb-877f-de6d0d9fd73a&lt;/a&gt;. This time it happened during sanity-lfsck/test_20a.&lt;/p&gt;

&lt;p&gt;I haven&apos;t gone through the debug log to check that it is the same scenario Patrick has already detailed above, but the following stacks, found dumped afterward on the MDS console, clearly indicate a very similar situation:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[13917.818020] lfsck           S ffff988690c00000     0  4111      2 0x00000080
[13917.819302] Call Trace:
[13917.819740]  [&amp;lt;ffffffffa9b67c49&amp;gt;] schedule+0x29/0x70
[13917.820677]  [&amp;lt;ffffffffc121766e&amp;gt;] lfsck_double_scan_generic+0x22e/0x2c0 [lfsck]
[13917.822075]  [&amp;lt;ffffffffa94d67b0&amp;gt;] ? wake_up_state+0x20/0x20
[13917.823103]  [&amp;lt;ffffffffc123ec90&amp;gt;] lfsck_layout_master_double_scan+0x30/0x1e0 [lfsck]
[13917.824458]  [&amp;lt;ffffffffc1217f0f&amp;gt;] lfsck_double_scan+0x5f/0x210 [lfsck]
[13917.825637]  [&amp;lt;ffffffffc0cb0631&amp;gt;] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[13917.826884]  [&amp;lt;ffffffffc121cee6&amp;gt;] lfsck_master_engine+0x4c6/0x1370 [lfsck]
[13917.828107]  [&amp;lt;ffffffffa94d67b0&amp;gt;] ? wake_up_state+0x20/0x20
[13917.829110]  [&amp;lt;ffffffffc121ca20&amp;gt;] ? lfsck_master_oit_engine+0x1510/0x1510 [lfsck]
[13917.830420]  [&amp;lt;ffffffffa94c1c31&amp;gt;] kthread+0xd1/0xe0
[13917.831359]  [&amp;lt;ffffffffa94c1b60&amp;gt;] ? insert_kthread_work+0x40/0x40
[13917.832429]  [&amp;lt;ffffffffa9b74c37&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
[13917.833567]  [&amp;lt;ffffffffa94c1b60&amp;gt;] ? insert_kthread_work+0x40/0x40
[13917.834656] lfsck_layout    S ffff988690c030c0     0  4117      2 0x00000080
[13917.835969] Call Trace:
[13917.836415]  [&amp;lt;ffffffffc0b93f17&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[13917.837620]  [&amp;lt;ffffffffa9b67c49&amp;gt;] schedule+0x29/0x70
[13917.838516]  [&amp;lt;ffffffffc121eedd&amp;gt;] lfsck_assistant_engine+0x114d/0x2090 [lfsck]
[13917.839784]  [&amp;lt;ffffffffa94e0eee&amp;gt;] ? dequeue_task_fair+0x41e/0x660
[13917.840872]  [&amp;lt;ffffffffa942a59e&amp;gt;] ? __switch_to+0xce/0x580
[13917.841898]  [&amp;lt;ffffffffa94d67b0&amp;gt;] ? wake_up_state+0x20/0x20
[13917.842895]  [&amp;lt;ffffffffc121dd90&amp;gt;] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
[13917.844140]  [&amp;lt;ffffffffa94c1c31&amp;gt;] kthread+0xd1/0xe0
[13917.845022]  [&amp;lt;ffffffffa94c1b60&amp;gt;] ? insert_kthread_work+0x40/0x40
[13917.846099]  [&amp;lt;ffffffffa9b74c37&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
[13917.847282]  [&amp;lt;ffffffffa94c1b60&amp;gt;] ? insert_kthread_work+0x40/0x40

..........................

[13917.925158] umount          S ffff9886a3696180     0 26668  26667 0x00000080
[13917.926436] Call Trace:
[13917.926880]  [&amp;lt;ffffffffa9b67c49&amp;gt;] schedule+0x29/0x70
[13917.927743]  [&amp;lt;ffffffffc1213925&amp;gt;] lfsck_stop+0x485/0x5c0 [lfsck]
[13917.928894]  [&amp;lt;ffffffffa94d67b0&amp;gt;] ? wake_up_state+0x20/0x20
[13917.929952]  [&amp;lt;ffffffffc146b7c7&amp;gt;] mdd_iocontrol+0x327/0xb40 [mdd]
[13917.931049]  [&amp;lt;ffffffffc0cb0509&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[13917.932411]  [&amp;lt;ffffffffc12e2d15&amp;gt;] mdt_device_fini+0x75/0x930 [mdt]
[13917.933540]  [&amp;lt;ffffffffc0ccfc73&amp;gt;] ? lu_context_init+0xd3/0x1f0 [obdclass]
[13917.934807]  [&amp;lt;ffffffffc0cbd542&amp;gt;] class_cleanup+0x862/0xbd0 [obdclass]
[13917.935987]  [&amp;lt;ffffffffc0cbe53c&amp;gt;] class_process_config+0x65c/0x2830 [obdclass]
[13917.937251]  [&amp;lt;ffffffffc0b93f17&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[13917.938544]  [&amp;lt;ffffffffc0cc08d6&amp;gt;] class_manual_cleanup+0x1c6/0x720 [obdclass]
[13917.939862]  [&amp;lt;ffffffffc0cf0c8e&amp;gt;] server_put_super+0x8de/0xcd0 [obdclass]
[13917.941058]  [&amp;lt;ffffffffa9643dbd&amp;gt;] generic_shutdown_super+0x6d/0x100
[13917.942172]  [&amp;lt;ffffffffa96441b2&amp;gt;] kill_anon_super+0x12/0x20
[13917.943187]  [&amp;lt;ffffffffc0cc34f2&amp;gt;] lustre_kill_super+0x32/0x50 [obdclass]
[13917.944362]  [&amp;lt;ffffffffa964456e&amp;gt;] deactivate_locked_super+0x4e/0x70
[13917.945473]  [&amp;lt;ffffffffa9644cf6&amp;gt;] deactivate_super+0x46/0x60
[13917.946538]  [&amp;lt;ffffffffa966327f&amp;gt;] cleanup_mnt+0x3f/0x80
[13917.947481]  [&amp;lt;ffffffffa9663312&amp;gt;] __cleanup_mnt+0x12/0x20
[13917.948446]  [&amp;lt;ffffffffa94be79b&amp;gt;] task_work_run+0xbb/0xe0
[13917.949416]  [&amp;lt;ffffffffa942bc65&amp;gt;] do_notify_resume+0xa5/0xc0
[13917.950440]  [&amp;lt;ffffffffa9b75124&amp;gt;] int_signal+0x12/0x17
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00bnz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>