<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:31:46 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10068] OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check</title>
                <link>https://jira.whamcloud.com/browse/LU-10068</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Reformatted and created a completely new filesystem.&lt;br/&gt;
The MDS/MGS mounts.&lt;br/&gt;
The OSS mount fails.&lt;br/&gt;
OSS errors:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[  584.585335] Lustre: 13317:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507069853/real 1507069853]  req@ffff88083f758300 x1580277278179360/t0(0) o253-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507069860 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[  584.680525] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
[  584.727353] LustreError: 15f-b: soaked-OST0000: cannot register &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; server with the MGS: rc = -5. Is the MGS running?
[  584.755152] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[  584.796681] LustreError: 13317:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[  584.828634] LustreError: 13317:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-OST0000
[  584.858501] LustreError: 13317:0:(obd_mount_server.c:132:server_deregister_mount()) soaked-OST0000 not registered
[  585.118058] Lustre: server umount soaked-OST0000 complete
[  585.135868] LustreError: 13317:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Errors on the MDS:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[17524.112385] Lustre: soaked-MDT0000: &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; disk, initializing
[17524.138680] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[17524.153739] Lustre: ctl-soaked-MDT0000: &lt;span class=&quot;code-keyword&quot;&gt;super&lt;/span&gt;-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400]:0:mdt
[17528.212947] Lustre: MGS: Connection restored to aa8414d2-f089-bf2e-b8c3-406370f048cc (at 192.168.1.102@o2ib)
[17528.273221] LustreError: 12811:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
[17528.287356] LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; magic/version check
[17528.305726] LustreError: 14558:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.102@o2ib x1580277278179360
[17528.418152] Lustre: MGS: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.102@o2ib, removing former export from same NID
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Not sure where to go from here. &lt;/p&gt;</description>
                <environment>Soak cluster - latest lustre-master build (3650) version=2.10.53_32_g20ffe21</environment>
        <key id="48582">LU-10068</key>
            <summary>OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Tue, 3 Oct 2017 22:40:12 +0000</created>
                <updated>Fri, 12 Jan 2018 19:37:40 +0000</updated>
                            <resolved>Tue, 19 Dec 2017 10:12:15 +0000</resolved>
                                    <version>Lustre 2.11.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="210305" author="cliffw" created="Wed, 4 Oct 2017 16:11:14 +0000"  >&lt;p&gt;I re-loaded b2.10 GA - filesystem formatted and mounted with no issues.&lt;br/&gt;
I then loaded lustre-master build 3650 and attempted to mount the previously formatted filesystem. This filesystem was formatted and mounted once; no work other than mount/umount was performed. The OST mounts do complete, but there are many errors:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132883/real 1507132883]  req@ffff8800b2960300 x1580342920085536/t0(0) o253-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507132890 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Oct  4 16:01:24 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:24 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082acd7e00
Oct  4 16:01:24 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507132884/real 1507132884]  req@ffff8803ff218000 x1580342920085632/t0(0) o256-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 304/240 e 0 to 1 dl 1507132891 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  4 16:01:24 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Oct  4 16:01:25 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:251:do_config_log_add()) MGC192.168.1.108@o2ib: failed processing log, type 4: rc = -5
Oct  4 16:01:25 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:32 soak-2 kernel: Lustre: 24621:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1507132885/real 1507132885]  req@ffff8803ff218000 x1580342920085696/t0(0) o503-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507132892 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct  4 16:01:32 soak-2 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Oct  4 16:01:33 soak-2 kernel: LustreError: 24621:0:(mgc_request.c:1885:mgc_llog_local_copy()) MGC192.168.1.108@o2ib: failed to copy remote log soaked-OST0000: rc = -5
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC192.168.1.108@o2ib: namespace resource [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x64656b616f73:0x0:0x0].0x0 (ffff8804184e1200) refcount = 2
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1685:ldlm_resource_dump()) Granted locks (in reverse order):
Oct  4 16:01:33 soak-2 kernel: LustreError: 24937:0:(ldlm_resource.c:1688:ldlm_resource_dump()) ### ### ns: MGC192.168.1.108@o2ib lock: ffff88040eb20400/0xc3f9d16c2df9943e lrc: 2/1,0 mode: CR/CR res: [0x64656b616f73:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0xc8dccd9f1ede04d5 expref: -99 pid: 24621 timeout: 0 lvb_type: 0
Oct  4 16:01:33 soak-2 kernel: Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
Oct  4 16:01:34 soak-2 kernel: LustreError: 23440:0:(events.c:199:client_bulk_callback()) event type 2, status -103, desc ffff88082cc40800
Oct  4 16:01:34 soak-2 kernel: LustreError: 24665:0:(mgc_request.c:603:do_requeue()) failed processing log: -5
Oct  4 16:01:45 soak-2 kernel: LustreError: 137-5: soaked-OST0001_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 192.168.1.108@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210306" author="cliffw" created="Wed, 4 Oct 2017 16:12:29 +0000"  >&lt;p&gt;In addition, the OST mounts are very, very slow.&lt;/p&gt;</comment>
                            <comment id="210307" author="cliffw" created="Wed, 4 Oct 2017 16:21:18 +0000"  >&lt;p&gt;However, now all client mount attempts fail:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[ 1891.859459] Lustre: 9069:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1507133982/real 1507133982]  req@ffff880813ea0300 x1580342921134176/t0(0) o503-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507133989 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 1891.952822] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
[ 1891.999529] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log &lt;span class=&quot;code-quote&quot;&gt;&apos;soaked-client&apos;&lt;/span&gt; failed (-5). This may be the result of communication errors between &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; node and the MGS, a bad configuration, or other errors. See the syslog &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more information.
[ 1892.002601] Lustre: MGC192.168.1.108@o2ib: Connection restored to MGC192.168.1.108@o2ib_0 (at 192.168.1.108@o2ib)
[ 1892.110166] Lustre: Unmounted soaked-client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210315" author="jamesanunez" created="Wed, 4 Oct 2017 17:12:38 +0000"  >&lt;p&gt;The last master build that ran successfully on soak was build #3637.&lt;/p&gt;</comment>
                            <comment id="210320" author="cliffw" created="Wed, 4 Oct 2017 18:52:30 +0000"  >&lt;p&gt;Reverted back to build 3649, and it&apos;s even worse! The initial MGT/MDT mount hangs, big time.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct  4 18:40:43 soak-8 kernel: INFO: task mount.lustre:2489 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Oct  4 18:40:43 soak-8 kernel: &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
Oct  4 18:40:43 soak-8 kernel: mount.lustre    D 0000000000000001     0  2489   2488 0x00000082
Oct  4 18:40:44 soak-8 kernel: ffff8804102d77b0 0000000000000082 ffff88041140dee0 ffff8804102d7fd8
Oct  4 18:40:44 soak-8 kernel: ffff8804102d7fd8 ffff8804102d7fd8 ffff88041140dee0 ffff88082c4c4810
Oct  4 18:40:44 soak-8 kernel: 7fffffffffffffff ffff88082c4c4808 ffff88041140dee0 0000000000000001
Oct  4 18:40:44 soak-8 kernel: Call Trace:
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff816a94a9&amp;gt;] schedule+0x29/0x70
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff816a6fb9&amp;gt;] schedule_timeout+0x239/0x2c0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff81050b5c&amp;gt;] ? native_smp_send_reschedule+0x4c/0x70
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff810c0548&amp;gt;] ? resched_curr+0xa8/0xc0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff810c12c8&amp;gt;] ? check_preempt_curr+0x78/0xa0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff810c1309&amp;gt;] ? ttwu_do_wakeup+0x19/0xd0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff816a985d&amp;gt;] wait_for_completion+0xfd/0x140
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a496b4&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a498d4&amp;gt;] llog_process+0x14/0x20 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a79e95&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc14465a8&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc1449413&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a826b0&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc1449b18&amp;gt;] ? do_config_log_add+0x248/0x580 [mgc]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc144a9f0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a861ea&amp;gt;] lustre_process_log+0x2da/0xae0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0927ba7&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a713e9&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0ab2da2&amp;gt;] server_start_targets+0x1352/0x2a70 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0cd0b48&amp;gt;] ? ptlrpc_pinger_wake_up+0x28/0x30 [ptlrpc]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a71511&amp;gt;] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a826b0&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0ab554d&amp;gt;] server_fill_super+0x108d/0x184c [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a8c168&amp;gt;] lustre_fill_super+0x328/0x950 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a8be40&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff81204afd&amp;gt;] mount_nodev+0x4d/0xb0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffffc0a83f78&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff81205589&amp;gt;] mount_fs+0x39/0x1b0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff81222067&amp;gt;] vfs_kern_mount+0x67/0x110
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff81224573&amp;gt;] do_mount+0x233/0xaf0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff812251b6&amp;gt;] SyS_mount+0x96/0xf0
Oct  4 18:40:44 soak-8 kernel: [&amp;lt;ffffffff816b4fc9&amp;gt;] system_call_fastpath+0x16/0x1b
Oct  4 18:42:44 soak-8 kernel: INFO: task mount.lustre:2489 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210321" author="cliffw" created="Wed, 4 Oct 2017 18:54:58 +0000"  >&lt;p&gt;Did a debug=-1 dump of soak-8 (the MDS) while the mount was hanging. Attached. &lt;/p&gt;</comment>
                            <comment id="210336" author="cliffw" created="Wed, 4 Oct 2017 21:46:38 +0000"  >&lt;p&gt;Tested build 3637, worked fine.&lt;br/&gt;
Build 3638 fails immediately:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[  230.588325] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
[  230.635151] LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log &lt;span class=&quot;code-quote&quot;&gt;&apos;soaked-MDT0000&apos;&lt;/span&gt; failed (-5). This may be the result of communication errors between &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; node and the MGS, a bad configuration, or other errors. See the syslog &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more information.
[  230.638598] Lustre: MGS: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.108@o2ib, removing former export from same NID
[  230.638625] Lustre: MGS: Connection restored to 8c9d0ffb-0dac-9c76-1262-80c392d2d0ce (at 192.168.1.108@o2ib)
[  230.638628] Lustre: Skipped 2 previous similar messages
[  230.796182] LustreError: 2039:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
[  230.873594] LustreError: 2039:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
[  230.905264] LustreError: 2039:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210343" author="adilger" created="Wed, 4 Oct 2017 23:05:16 +0000"  >&lt;p&gt;What patches were landed between these two builds?&lt;/p&gt;</comment>
                            <comment id="210344" author="jamesanunez" created="Thu, 5 Oct 2017 00:12:17 +0000"  >&lt;p&gt;Between builds 3637 and 3638, the following patches landed:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9907&quot; title=&quot;lbuild to support patchless server&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9907&quot;&gt;&lt;del&gt;LU-9907&lt;/del&gt;&lt;/a&gt; build: add patchless server for lbuild &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7988&quot; title=&quot;HSM: high lock contention for cdt_llog_lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7988&quot;&gt;&lt;del&gt;LU-7988&lt;/del&gt;&lt;/a&gt; hsm: update many cookie status at once &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7001&quot; title=&quot;osp_sync.c: 1139: osp_sync_thread&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7001&quot;&gt;&lt;del&gt;LU-7001&lt;/del&gt;&lt;/a&gt; osp: fix llog processing &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9810&quot; title=&quot;Melanox OFED 4.1 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9810&quot;&gt;&lt;del&gt;LU-9810&lt;/del&gt;&lt;/a&gt; lnet: fix build with M-OFED 4.1 &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9810&quot; title=&quot;Melanox OFED 4.1 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9810&quot;&gt;&lt;del&gt;LU-9810&lt;/del&gt;&lt;/a&gt; lnet: prefer Fast Reg &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9260&quot; title=&quot;posix failure: access.43 Unresolved&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9260&quot;&gt;&lt;del&gt;LU-9260&lt;/del&gt;&lt;/a&gt; test: Use the correct mount device when test against lustre &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9941&quot; title=&quot;lsm_is_composite() isn&amp;#39;t right&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9941&quot;&gt;&lt;del&gt;LU-9941&lt;/del&gt;&lt;/a&gt; lov: lsm_is_composite isn&apos;t right &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8342&quot; title=&quot;ZFS dnodesize and recordsize should be set at file system creation&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8342&quot;&gt;&lt;del&gt;LU-8342&lt;/del&gt;&lt;/a&gt; utils: Set dnodesize/recordsize at zfs dataset create &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9044&quot; title=&quot;conf-sanity test cases 24b remove from ALWAYS_EXCEPT&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9044&quot;&gt;&lt;del&gt;LU-9044&lt;/del&gt;&lt;/a&gt; test: remove conf-sanity tests from ALWAYS_EXCEPT &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9870&quot; title=&quot;rpms fail to build when SNMP is missing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9870&quot;&gt;&lt;del&gt;LU-9870&lt;/del&gt;&lt;/a&gt; build: handle SNMP missing on build box &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9891&quot; title=&quot;replay-ost-single test_7: 15995648 &amp;gt; 15995136 + logsize 400&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9891&quot;&gt;&lt;del&gt;LU-9891&lt;/del&gt;&lt;/a&gt; tests: Increase space not released for ZFS &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7746&quot; title=&quot;skip test of new functionality on upstream client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7746&quot;&gt;&lt;del&gt;LU-7746&lt;/del&gt;&lt;/a&gt; tests: skip tests for older (upstream) client &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9921&quot; title=&quot;LNet peer discovery list handling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9921&quot;&gt;&lt;del&gt;LU-9921&lt;/del&gt;&lt;/a&gt; lnet: resolve unsafe list access &#8212; oleg.drokin / gitweb&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9950&quot; title=&quot;add support for Ubuntu(debian) arm64&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9950&quot;&gt;&lt;del&gt;LU-9950&lt;/del&gt;&lt;/a&gt; build: add support for Ubuntu(debian) arm64 &#8212; oleg.drokin / gitweb&lt;/p&gt;</comment>
                            <comment id="210348" author="cliffw" created="Thu, 5 Oct 2017 01:01:55 +0000"  >&lt;p&gt;I am also having some problems with lustre-review builds, so the cause may be soak itself. I will re-load 3637 tomorrow to confirm.&lt;/p&gt;</comment>
                            <comment id="210358" author="adilger" created="Thu, 5 Oct 2017 07:29:31 +0000"  >&lt;p&gt;Based on the above patch descriptions, my first guess would be &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7001&quot; title=&quot;osp_sync.c: 1139: osp_sync_thread&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7001&quot;&gt;&lt;del&gt;LU-7001&lt;/del&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Alternately, if this is only being seen on IB then &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9810&quot; title=&quot;Melanox OFED 4.1 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9810&quot;&gt;&lt;del&gt;LU-9810&lt;/del&gt;&lt;/a&gt; seems likely. &lt;/p&gt;</comment>
                            <comment id="210386" author="jhammond" created="Thu, 5 Oct 2017 15:48:36 +0000"  >&lt;p&gt;It&apos;s likely that &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9810&quot; title=&quot;Melanox OFED 4.1 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9810&quot;&gt;&lt;del&gt;LU-9810&lt;/del&gt;&lt;/a&gt; is broken for OPA, since I seem to recall that OPA supports FMR but not FastReg. Amir?&lt;/p&gt;</comment>
                            <comment id="210391" author="ashehata" created="Thu, 5 Oct 2017 16:30:49 +0000"  >&lt;p&gt;That&apos;s true. OPA supports FMR in software but not FastReg.&lt;/p&gt;

&lt;p&gt;Was this run on OPA?&lt;/p&gt;</comment>
                            <comment id="210395" author="ashehata" created="Thu, 5 Oct 2017 16:51:21 +0000"  >&lt;p&gt;Do we see this on startup?&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;LCONSOLE_INFO(&lt;span class=&quot;code-quote&quot;&gt;&quot;Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration\n&quot;&lt;/span&gt;);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210413" author="cliffw" created="Thu, 5 Oct 2017 18:14:30 +0000"  >&lt;p&gt;We sure do.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/scratch/logs/syslog/soak-8.log:Oct  4 23:19:28 soak-8 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:06:13 soak-8 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-8.log:Oct  5 00:30:19 soak-8 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-8.log:Oct  5 18:12:54 soak-8 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  3 18:03:02 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  4 15:55:02 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  4 18:36:43 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  4 23:19:30 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  5 00:30:19 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
/scratch/logs/syslog/soak-9.log:Oct  5 18:12:55 soak-9 kernel: LNet: Using FastReg &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; registration
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210414" author="cliffw" created="Thu, 5 Oct 2017 18:20:07 +0000"  >&lt;p&gt;Just to be double-certain, this morning I re-loaded 3637 (the last good build), did a complete re-format to obtain a clean FS, and verified that everything mounted. Then I installed build 3638 and power-cycled all systems. I verified that LNET was up on all systems and that all systems could lctl ping the MGS/MDS (soak-8), then attempted to mount the filesystem. The initial MGS/MDS mount fails.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct  5 18:16:28 soak-8 kernel: ZFS: Loaded module v0.7.1-1, ZFS pool version 5000, ZFS filesystem version 5
Oct  5 18:16:29 soak-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Oct  5 18:16:30 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:31 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507227390/real 1507227390]  req@ffff8808142d0000 x1580442241204256/t0(0) o253-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 4768/4768 e 0 to 1 dl 1507227397 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Oct  5 18:16:31 soak-8 kernel: LustreError: 2216:0:(events.c:304:request_in_callback()) event type 2, status -103, service mgs
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; magic/version check
Oct  5 18:16:31 soak-8 kernel: LustreError: 2417:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-192.168.1.108@o2ib x1580442241204256
Oct  5 18:16:31 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:31 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:38 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1507227391/real 1507227391]  req@ffff88080ef08000 x1580442241204384/t0(0) o503-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 272/8416 e 0 to 1 dl 1507227398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct  5 18:16:38 soak-8 kernel: LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.108@o2ib) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Oct  5 18:16:38 soak-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib: The configuration from log &lt;span class=&quot;code-quote&quot;&gt;&apos;soaked-MDT0000&apos;&lt;/span&gt; failed (-5). This may be the result of communication errors between &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; node and the MGS, a bad configuration, or other errors. See the syslog &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more information.
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.108@o2ib, removing former export from same NID
Oct  5 18:16:39 soak-8 kernel: Lustre: MGS: Connection restored to 0e7fadb0-8a70-1717-c708-3785eb9ea4ec (at 192.168.1.108@o2ib)
Oct  5 18:16:39 soak-8 kernel: Lustre: Skipped 1 previous similar message
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1370:server_start_targets()) failed to start server soaked-MDT0000: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1863:server_fill_super()) Unable to start targets: -5
Oct  5 18:16:39 soak-8 kernel: LustreError: 2365:0:(obd_mount_server.c:1573:server_put_super()) no obd soaked-MDT0000
Oct  5 18:16:46 soak-8 kernel: Lustre: 2365:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1507227399/real 1507227399]  req@ffff88080ef08000 x1580442241204416/t0(0) o251-&amp;gt;MGC192.168.1.108@o2ib@192.168.1.108@o2ib:26/25 lens 224/224 e 0 to 1 dl 1507227405 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct  5 18:16:46 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
Oct  5 18:16:46 soak-8 kernel: LustreError: 2365:0:(obd_mount.c:1504:lustre_fill_super()) Unable to mount  (-5)
Oct  5 18:16:46 soak-8 sshd[2342]: Received disconnect from 192.168.1.116 port 35786:11: disconnected by user
Oct  5 18:16:46 soak-8 sshd[2342]: Disconnected from 192.168.1.116 port 35786
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="210443" author="gerrit" created="Thu, 5 Oct 2017 21:44:35 +0000"  >&lt;p&gt;James Nunez (james.a.nunez@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/29341&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29341&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10068&quot; title=&quot;OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10068&quot;&gt;&lt;del&gt;LU-10068&lt;/del&gt;&lt;/a&gt; lnet: Revert prefer Fast Reg&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 004295627f99c906f360acc0624285b4631782a1&lt;/p&gt;</comment>
                            <comment id="210459" author="ashehata" created="Thu, 5 Oct 2017 23:46:41 +0000"  >&lt;p&gt;On MLX5 I&apos;m able to mount, but I&apos;m getting the following error:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[ 1157.516112] LNetError: 34447:0:(o2iblnd.c:1940:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
[ 1157.516122] LNetError: 34528:0:(o2iblnd_cb.c:560:kiblnd_fmr_map_tx()) Can&apos;t map 41033 pages: -22
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That&apos;s again due to the fastreg code.&lt;/p&gt;</comment>
                            <comment id="210497" author="jhammond" created="Fri, 6 Oct 2017 14:18:17 +0000"  >&lt;p&gt;Fails on OPA as well.&lt;/p&gt;</comment>
                            <comment id="210511" author="simmonsja" created="Fri, 6 Oct 2017 15:37:49 +0000"  >&lt;p&gt;I tried the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9983&quot; title=&quot;LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9983&quot;&gt;&lt;del&gt;LU-9983&lt;/del&gt;&lt;/a&gt; to see if it would address this, but it didn&apos;t &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="210519" author="cliffw" created="Fri, 6 Oct 2017 16:16:03 +0000"  >&lt;p&gt;Trying &lt;a href=&quot;https://review.whamcloud.com/29341&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29341&lt;/a&gt; on soak - servers mount okay, some clients are having timeouts&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;soak-20 login: [ 3363.772469] INFO: task mount.lustre:2191 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[ 3363.782165] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[ 3363.792785] mount.lustre    D 0000000000000001     0  2191   2190 0x00000080
[ 3363.802418]  ffff88103fad3940 0000000000000082 ffff881013fedee0 ffff88103fad3fd8
[ 3363.812340]  ffff88103fad3fd8 ffff88103fad3fd8 ffff881013fedee0 ffff881013b87270
[ 3363.822136]  7fffffffffffffff ffff881013b87268 ffff881013fedee0 0000000000000001
[ 3363.831817] Call Trace:
[ 3363.835841]  [&amp;lt;ffffffff816a94e9&amp;gt;] schedule+0x29/0x70
[ 3363.842720]  [&amp;lt;ffffffff816a6ff9&amp;gt;] schedule_timeout+0x239/0x2c0
[ 3363.850481]  [&amp;lt;ffffffff810c76f5&amp;gt;] ? sched_clock_cpu+0x85/0xc0
[ 3363.858146]  [&amp;lt;ffffffff810c12c8&amp;gt;] ? check_preempt_curr+0x78/0xa0
[ 3363.866025]  [&amp;lt;ffffffff810c1309&amp;gt;] ? ttwu_do_wakeup+0x19/0xd0
[ 3363.873501]  [&amp;lt;ffffffff816a989d&amp;gt;] wait_for_completion+0xfd/0x140
[ 3363.881287]  [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
[ 3363.888596]  [&amp;lt;ffffffffc09de6b4&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
[ 3363.897603]  [&amp;lt;ffffffffc09de8d4&amp;gt;] llog_process+0x14/0x20 [obdclass]
[ 3363.905633]  [&amp;lt;ffffffffc0a0ed65&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
[ 3363.914874]  [&amp;lt;ffffffffc091c368&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 3363.923229]  [&amp;lt;ffffffffc091f1e3&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
[ 3363.931153]  [&amp;lt;ffffffffc0a17480&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3363.940736]  [&amp;lt;ffffffffc091f8e8&amp;gt;] ? do_config_log_add+0x248/0x580 [mgc]
[ 3363.949020]  [&amp;lt;ffffffffc09207c0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
[ 3363.957326]  [&amp;lt;ffffffffc0a1ad7e&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
[ 3363.965969]  [&amp;lt;ffffffff816aba0e&amp;gt;] ? _raw_spin_unlock_bh+0x1e/0x20
[ 3363.973647]  [&amp;lt;ffffffff8132318b&amp;gt;] ? fprop_local_init_percpu+0x1b/0x30
[ 3363.981681]  [&amp;lt;ffffffffc0e20608&amp;gt;] ll_fill_super+0xaf8/0x1220 [lustre]
[ 3363.989721]  [&amp;lt;ffffffffc0a20ce6&amp;gt;] lustre_fill_super+0x286/0x910 [obdclass]
[ 3363.998214]  [&amp;lt;ffffffffc0a20a60&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 3364.007482]  [&amp;lt;ffffffff81204b0d&amp;gt;] mount_nodev+0x4d/0xb0
[ 3364.014099]  [&amp;lt;ffffffffc0a18ba8&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
[ 3364.021874]  [&amp;lt;ffffffff81205599&amp;gt;] mount_fs+0x39/0x1b0
[ 3364.028291]  [&amp;lt;ffffffff81222067&amp;gt;] vfs_kern_mount+0x67/0x110
[ 3364.035288]  [&amp;lt;ffffffff81224573&amp;gt;] do_mount+0x233/0xaf0
[ 3364.041786]  [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
[ 3364.048971]  [&amp;lt;ffffffff812251b6&amp;gt;] SyS_mount+0x96/0xf0
[ 3364.055364]  [&amp;lt;ffffffff816b5009&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Re-trying clients after power-cycling does work&lt;/p&gt;</comment>
                            <comment id="210523" author="ashehata" created="Fri, 6 Oct 2017 16:43:07 +0000"  >&lt;p&gt;There appear to be 3 separate issues:&lt;br/&gt;
1. Fastreg is not supported on OPA so reverting &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9810&quot; title=&quot;Melanox OFED 4.1 support&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9810&quot;&gt;&lt;del&gt;LU-9810&lt;/del&gt;&lt;/a&gt; works (as a side note, I also tried John&apos;s patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9983&quot; title=&quot;LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9983&quot;&gt;&lt;del&gt;LU-9983&lt;/del&gt;&lt;/a&gt; and that resolves the problem we were seeing there).&lt;br/&gt;
2. Fastreg is broken on MLX-5. I was debugging this problem yesterday, and we already have a few bugs that are all related. For this particular problem, ib_map_mr_sg() is called to map 11 fragments but ends up mapping only 10. I&apos;m looking at the mlx5 driver code to understand why it stops before mapping all fragments. The fragments look like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;length = 4096, offset = 0, addr = 105c43c000, page offset = 0, page_addr = 105c43c000
length = 4096, offset = 0, addr = 84cfb0000, page offset = 0, page_addr = 84cfb0000
length = 4096, offset = 0, addr = 84cfb1000, page offset = 0, page_addr = 84cfb1000
length = 4096, offset = 0, addr = 84cfb2000, page offset = 0, page_addr = 84cfb2000
length = 4096, offset = 0, addr = 84cfb3000, page offset = 0, page_addr = 84cfb3000
length = 4096, offset = 0, addr = 84cfb4000, page offset = 0, page_addr = 84cfb4000
length = 4096, offset = 0, addr = 84cfb5000, page offset = 0, page_addr = 84cfb5000
length = 4096, offset = 0, addr = 84cfb6000, page offset = 0, page_addr = 84cfb6000
length = 4096, offset = 0, addr = 84cfb7000, page offset = 0, page_addr = 84cfb7000
length = 73, offset = 0, addr = 84cfb8000, page offset = 0, page_addr = 84cfb8000
length = 4096, offset = 0, addr = 105c43d000, page offset = 0, page_addr = 105c43d000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it looks like it stops on the fragment of length 73.&lt;/p&gt;
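
&lt;p&gt;One possible explanation, offered as an assumption rather than a confirmed diagnosis: a page-granular mapping can only describe fragments that start and end on page boundaries, unless neighbouring fragments are exactly contiguous. The sketch below is a simplified Python model loosely based on the gap check in the kernel&apos;s generic ib_sg_to_pages() helper; whether the mlx5 fastreg path used here follows that logic is not confirmed in this ticket. The name count_mappable and the 4096-byte page size are hypothetical choices for illustration; the addresses and lengths are copied from the dump above.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # assumed MR page size, matching the 4096-byte fragments

# (address, length) pairs copied from the fragment dump in this comment
FRAGMENTS = [
    (0x105c43c000, 4096),
    (0x84cfb0000, 4096),
    (0x84cfb1000, 4096),
    (0x84cfb2000, 4096),
    (0x84cfb3000, 4096),
    (0x84cfb4000, 4096),
    (0x84cfb5000, 4096),
    (0x84cfb6000, 4096),
    (0x84cfb7000, 4096),
    (0x84cfb8000, 73),
    (0x105c43d000, 4096),
]


def count_mappable(fragments, page_size=PAGE_SIZE):
    """Return how many leading fragments fit in one page-granular mapping.

    Simplified rule (an assumption, not the verified driver behaviour):
    if the previous fragment ended mid-page, or this one starts mid-page,
    the two must be exactly contiguous; otherwise mapping stops.
    """
    mapped = 0
    last_end = 0
    for index, (addr, length) in enumerate(fragments):
        prev_ended_mid_page = last_end % page_size != 0
        starts_mid_page = addr % page_size != 0
        if index != 0 and (prev_ended_mid_page or starts_mid_page):
            if addr != last_end:
                break  # gap: whole pages cannot describe this layout
        mapped += 1
        last_end = addr + length
    return mapped


if __name__ == "__main__":
    print(count_mappable(FRAGMENTS), "of", len(FRAGMENTS))
```

&lt;p&gt;Under this model the script prints 10 of 11: the 73-byte fragment ends mid-page and the fragment after it is not contiguous with it, so mapping stops before the last fragment. That would be consistent with the &quot;Failed to map mr 10/11 elements&quot; message, but it remains a hypothesis until checked against the driver code.&lt;/p&gt;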

&lt;p&gt;3. MLX-4 failure. I still need to investigate further, because it could be different than both of the above.&lt;/p&gt;</comment>
                            <comment id="210565" author="cliffw" created="Fri, 6 Oct 2017 23:42:48 +0000"  >&lt;p&gt;Tested James&apos;s patch reverting Fast Reg: &lt;a href=&quot;https://review.whamcloud.com/29341&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29341&lt;/a&gt;. Soak is running okay, but routers have hit an LBUG (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10103&quot; title=&quot;LBUG: lib-move.c:2121:lnet_send()) ASSERTION( msg-&amp;gt;msg_txpeer == ((void *)0) ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10103&quot;&gt;LU-10103&lt;/a&gt;)&lt;/p&gt;</comment>
                            <comment id="210738" author="gerrit" created="Tue, 10 Oct 2017 18:08:44 +0000"  >&lt;p&gt;Amir Shehata (amir.shehata@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/29547&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/29547&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10068&quot; title=&quot;OST fails to mount:LustreError: 14558:0:(pack_generic.c:588:__lustre_unpack_msg()) message length 0 too small for magic/version check&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10068&quot;&gt;&lt;del&gt;LU-10068&lt;/del&gt;&lt;/a&gt; lnet: Combined patch for testing on soak&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f4009ebfecc1cfc4066a4b83808b224eae2ab852&lt;/p&gt;</comment>
                            <comment id="216640" author="adilger" created="Mon, 18 Dec 2017 18:41:49 +0000"  >&lt;p&gt;The combined patch is no longer needed (part reverted, part landed). Can this ticket be closed?&lt;/p&gt;</comment>
                            <comment id="216641" author="cliffw" created="Mon, 18 Dec 2017 18:47:43 +0000"  >&lt;p&gt;I am ok with closing it.&lt;/p&gt;</comment>
                            <comment id="216707" author="adilger" created="Tue, 19 Dec 2017 10:12:15 +0000"  >&lt;p&gt;This was fixed by the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9983&quot; title=&quot;LBUG llog_osd.c:327:llog_osd_declare_write_rec() - all DNE MDS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9983&quot;&gt;&lt;del&gt;LU-9983&lt;/del&gt;&lt;/a&gt; patch.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="48300">LU-9983</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="28406" name="soak-8.lustre.log.txt" size="1468637" author="cliffw" created="Wed, 4 Oct 2017 18:54:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzl7z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>