<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:12:55 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14802] MGS configuration problems - cannot add new OST, change parameters, hanging</title>
                <link>https://jira.whamcloud.com/browse/LU-14802</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello,&lt;br/&gt;
Last Sunday evening I attempted to add 16 additional OSTs to this filesystem and immediately ran into difficulties, with the majority of these OSTs failing to be added to the filesystem. Furthermore, since then the MGS has been showing a number of errors in its logs and is failing to accept &lt;b&gt;any&lt;/b&gt; new configuration (lctl set_param/conf_param), and all clients and server OST/MDT targets take a very long time to connect to the MGS, or sometimes fail to connect at all, leaving them unable to mount.&lt;/p&gt;

&lt;p&gt;We&apos;ve had a number of lengthy periods of filesystem unavailability since the weekend because of this, mostly from when I&apos;ve tried to unmount/remount the MGS, which is colocated on an MDS server with some of the MDTs in the filesystem; the act of unmounting has caused the MDS load to spike and access to those MDTs to hang from all clients.&lt;/p&gt;

&lt;p&gt;The filesystem is currently up for existing clients, but rebooted clients are failing to mount the filesystem again, hence the severity here.&lt;/p&gt;

&lt;p&gt;Here are the MDS and OSS server kernel logs at the time the first OSTs were added on Sunday:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS Server logs Sun 27th June&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 22:28:18 rds-mds7 kernel: Lustre: ctl-rds-d6-MDT0000: &lt;span class=&quot;code-keyword&quot;&gt;super&lt;/span&gt;-sequence allocation rc = 0 [0x0000001300000400-0x0000001340000400]:40:ost
Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 177 previous similar messages
Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 2 previous similar messages
Jun 27 22:30:37 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Connection restored to 0bb09763-547b-56c8-0a7a-c2613e26070a (at 10.47.2.157@o2ib1)
Jun 27 22:30:37 rds-mds7 kernel: Lustre: Skipped 216 previous similar messages
Jun 27 22:31:55 rds-mds7 kernel: Lustre: rds-d6-MDT0000: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
Jun 27 22:31:55 rds-mds7 kernel: Lustre: Skipped 186 previous similar messages
Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 4 previous similar messages
Jun 27 22:32:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Jun 27 22:32:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829272, 300s ago), entering recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff975ed48c3840/0x184fda0e60b72a0e lrc: 4/1,0 m
Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829289, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff975c83dbd100/0x184fda0e60b67c50 lrc: 3/0,1
Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 43 previous similar messages
Jun 27 22:33:37 rds-mds7 kernel: LNet: Service thread pid 12985 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
Jun 27 22:33:37 rds-mds7 kernel: LNet: Skipped 3 previous similar messages
Jun 27 22:33:37 rds-mds7 kernel: Pid: 12985, comm: ll_mgs_0005 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Sun Aug 23 13:52:28 UTC 2020
Jun 27 22:33:37 rds-mds7 kernel: Call Trace:
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc1313c90&amp;gt;] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a076c&amp;gt;] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a0983&amp;gt;] mgs_completion_ast_config+0x13/0x20 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc13147b1&amp;gt;] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a53b4&amp;gt;] mgs_revoke_lock+0x104/0x380 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a5ad2&amp;gt;] mgs_target_reg+0x4a2/0x1320 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc13b29da&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc135748b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc135adf4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffff964c6691&amp;gt;] kthread+0xd1/0xe0
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffff96b92d1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Jun 27 22:33:37 rds-mds7 kernel: LustreError: dumping log to /tmp/lustre-log.1624829616.12985
Jun 27 22:34:53 rds-mds7 kernel: Lustre: MGS: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 10.47.1.153@o2ib1, removing former export from same NID
Jun 27 22:34:53 rds-mds7 kernel: Lustre: Skipped 1998 previous similar messages
Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
Jun 27 22:35:26 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:26 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829572, 300s ago), entering recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9745624b8240/0x184fda0e60cb24e5 lrc: 4/1,0 m
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 205913:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff973ee9f4cc00) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 3 previous similar messages
Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply
 req@ffff974d89171850 x1703756897980928/t0(0) o253-&amp;gt;dec8c782-f0e7-81a2-d397-7bf60e27e1ff@10.44.241.25@o2ib2:556/0 lens 4768/4768 e 20 to 0 dl 1624830016 ref 2 fl Interpret:/0/0 rc 0/0
Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 131 previous similar messages
Jun 27 22:40:49 rds-mds7 kernel: Lustre: rds-d6-MDT0003: Connection restored to 2f43bc44-4d4e-e6e8-8ca0-dd0bfa20fa04 (at 10.47.2.114@o2ib1)
Jun 27 22:40:49 rds-mds7 kernel: Lustre: Skipped 4151 previous similar messages
Jun 27 22:41:59 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Client 03c49b2b-134c-390d-9059-dc732d34be0b (at 10.47.1.253@o2ib1) reconnecting
Jun 27 22:41:59 rds-mds7 kernel: Lustre: Skipped 179 previous similar messages
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829872, 300s ago), entering recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9738f01418c0/0x184fda0e60faac19 lrc: 4/1,0 m
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 212692:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9724c6f7a6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;OSS Server logs&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jun 27 22:26:35 rds-oss59 kernel: Lustre: Lustre: Build Version: 2.12.5
Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.47.0.2@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 135 previous similar messages
Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.44.163.107@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 1088 previous similar messages
Jun 27 22:27:25 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.44.161.43@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:25 rds-oss59 kernel: LustreError: Skipped 588 previous similar messages
Jun 27 22:27:27 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.44.18.2@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:27 rds-oss59 kernel: LustreError: Skipped 55 previous similar messages
Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; disk, initializing
Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 10.44.161.6@o2ib2 (not set up)
Jun 27 22:27:43 rds-oss59 kernel: Lustre: srv-rds-d6-OST0040: No data found on store. Initialize space
Jun 27 22:27:43 rds-oss59 kernel: Lustre: rds-d6-OST0040: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.2.15@o2ib1)
Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.1.172@o2ib1)
Jun 27 22:28:14 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.21.88@o2ib1)
Jun 27 22:28:14 rds-oss59 kernel: Lustre: Skipped 8 previous similar messages
Jun 27 22:28:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.74.52@o2ib2)
Jun 27 22:28:15 rds-oss59 kernel: Lustre: Skipped 1456 previous similar messages
Jun 27 22:28:17 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.12.31@tcp2)
Jun 27 22:28:17 rds-oss59 kernel: Lustre: Skipped 447 previous similar messages
Jun 27 22:28:18 rds-oss59 kernel: Lustre: cli-rds-d6-OST0040-&lt;span class=&quot;code-keyword&quot;&gt;super&lt;/span&gt;: Allocated &lt;span class=&quot;code-keyword&quot;&gt;super&lt;/span&gt;-sequence [0x0000001300000400-0x0000001340000400]:40:ost]
Jun 27 22:28:32 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.161.6@o2ib2)
Jun 27 22:28:32 rds-oss59 kernel: Lustre: Skipped 45 previous similar messages
Jun 27 22:29:34 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.240.201@tcp2)
Jun 27 22:29:34 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
Jun 27 22:29:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
Jun 27 22:30:15 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jun 27 22:30:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.6.58@o2ib2)
Jun 27 22:30:35 rds-oss59 kernel: Lustre: Skipped 2 previous similar messages
Jun 27 22:32:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.13@o2ib1)
Jun 27 22:33:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.15@o2ib1)
Jun 27 22:33:35 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 10.44.241.1@o2ib2) was lost; in progress operations using &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; service will fail
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829423, 300s ago), entering recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9bc4d83c8b40/0xc34bf6a4691f74a9 lrc: 4/1,
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 263517:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9bc4d59bf5c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 15f-b: rds-d6-OST0041: cannot register &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; server with the MGS: rc = -98. Is the MGS running?
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -98
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0041
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0041 not registered
Jun 27 22:35:25 rds-oss59 kernel: Lustre: server umount rds-d6-OST0041 complete
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-98)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So the logs above show OST0040 being added successfully, but then OST0041 failing to be added, and all subsequent OSTs (they were started together) failing in the same way.&lt;/p&gt;

&lt;p&gt;The Lustre internal log dump file mentioned above, lustre-log.1624829616.12985, is attached.&lt;/p&gt;

&lt;p&gt;Since Sunday, I have managed to mount some of the new OSTs successfully, through a mixture of restarting the MGS and then reformatting the new OSTs with the &apos;--replace&apos; flag, to get around the error mentioned in the log:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
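For anyone hitting the same error, the per-OST workaround described above can be sketched roughly as follows. The fsname, index, and MGS NID are taken from the logs above; the device path and mount point are placeholders, not our actual values:

```shell
# Rough sketch of the workaround (placeholder paths): reformat the new, still
# empty OST with --replace so it re-registers under the index the MGS already
# has on record, then mount it again.
mkfs.lustre --ost --reformat --replace \
    --fsname=rds-d6 \
    --index=65 \
    --mgsnode=10.44.241.1@o2ib2 \
    /dev/mapper/ost0041
mount -t lustre /dev/mapper/ost0041 /mnt/lustre/ost0041
```

Note that --replace is only appropriate here because these OSTs are brand new and have never held data; it discards the old registration state rather than preserving it.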

&lt;p&gt;This worked for a couple of OSTs at a time, before the MGS would hang again and refuse any further configuration changes. So far I have managed to mount about 8 of the 16 OSTs that I was intending to add on Sunday.&lt;/p&gt;

&lt;p&gt;I&apos;ve been looking around on Jira since Sunday for similar tickets and I came across:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12735&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-12735&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14695&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-14695&lt;/a&gt;&lt;br/&gt;
which both look very much like what I&apos;m experiencing here.&lt;/p&gt;

&lt;p&gt;Following the discussion in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12735&quot; title=&quot;MGS misbehaving in 2.12.2+&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12735&quot;&gt;&lt;del&gt;LU-12735&lt;/del&gt;&lt;/a&gt;, I tried the patch mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; (&lt;a href=&quot;https://review.whamcloud.com/41309&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41309&lt;/a&gt;), applied on top of 2.12.5 (which is the current server release we are running), and installed this patched version &lt;b&gt;only&lt;/b&gt; on the server running the MGS.&lt;/p&gt;

&lt;p&gt;Unfortunately this doesn&apos;t appear to have fixed the issue for us; this is me subsequently trying to add an OST after restarting the MGS onto this new patched version:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 30 22:29:10 rds-mds8 kernel: LustreError: 142-7: The target rds-d6-OST004b has not registered yet. It must be started before failnids can be added.
Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_llog.c:4301:mgs_write_log_param()) err -2 on param &lt;span class=&quot;code-quote&quot;&gt;&apos;failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2&apos;&lt;/span&gt;
Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST004b log (-2)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
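For context, the failover.node value in the error above is what mkfs.lustre records from the service/failover NIDs given at format time. A placeholder sketch of how such an OST would typically have been formatted (the device path is hypothetical; the NIDs are the ones from the error message, and the index assumes OST004b = 75 decimal):

```shell
# Hypothetical format command that would produce the failover.node parameter
# seen in the error above. Device path is a placeholder.
mkfs.lustre --ost --fsname=rds-d6 --index=75 \
    --mgsnode=10.44.241.1@o2ib2 \
    --servicenode=10.44.241.26@o2ib2 \
    --servicenode=10.44.241.25@o2ib2 \
    /dev/mapper/ost004b
```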

&lt;p&gt;I am also getting errors from rebooted clients that are failing to mount the filesystem:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) MGC10.44.241.1@o2ib2: failed processing log, type 1: rc = -5
Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) Skipped 2 previous similar messages
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: root_squash is set to 99:99
Jul 01 10:28:33 lustre-transfer-02 kernel: LustreError: 22964:0:(llite_lib.c:516:client_common_fill_super()) cannot mds_connect: rc = -2
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: nosquash_nids set to 10.44.241.[1-2]@o2ib2 10.44.161.[81-84]@o2ib2 10.44.71.[2-3]@o2ib2 10.43.240.[201-202]@tcp2 10.43.240.[198-199]@tcp2 10.144.9.[50-51]@o2ib 10.47.4.[210-213]@o2ib1
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: Unmounted rds-d6-client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m not sure what to try next. Can I get some support in tracking down what is wrong with the MGS?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Matt&#160;&lt;/p&gt;

</description>
                <environment>{code:title=Server Environment}&lt;br/&gt;
# Lustre 2.12.5&lt;br/&gt;
[&lt;a href=&apos;mailto:root@rds-mds7&apos;&gt;root@rds-mds7&lt;/a&gt; ~]# lfs --version&lt;br/&gt;
lfs 2.12.5&lt;br/&gt;
&lt;br/&gt;
# Kernel&lt;br/&gt;
[&lt;a href=&apos;mailto:root@rds-mds7&apos;&gt;root@rds-mds7&lt;/a&gt; ~]# uname -r&lt;br/&gt;
3.10.0-1127.8.2.el7_lustre.x86_64&lt;br/&gt;
&lt;br/&gt;
# MOFED 4.9&lt;br/&gt;
[&lt;a href=&apos;mailto:root@rds-mds7&apos;&gt;root@rds-mds7&lt;/a&gt; ~]# ofed_info -n&lt;br/&gt;
4.9-0.1.7.0.202008231437&lt;br/&gt;
&lt;br/&gt;
# Filesystem&lt;br/&gt;
4 MDTs, rds-d6-MDT000[0-3]&lt;br/&gt;
64 OSTs initially - attempting to expand this with additional 16 OSTs&lt;br/&gt;
{code}&lt;br/&gt;
&lt;br/&gt;
{code:title=Client Environment}&lt;br/&gt;
# Lustre 2.12.6 almost everywhere &lt;br/&gt;
[&lt;a href=&apos;mailto:root@gpu-e-80&apos;&gt;root@gpu-e-80&lt;/a&gt; ~]# lfs --version&lt;br/&gt;
lfs 2.12.6&lt;br/&gt;
&lt;br/&gt;
# Mixture of RHEL 7.9 and RHEL 8.3 clients&lt;br/&gt;
[&lt;a href=&apos;mailto:root@gpu-e-80&apos;&gt;root@gpu-e-80&lt;/a&gt; ~]# uname -r&lt;br/&gt;
3.10.0-1160.31.1.el7.csd3.x86_64&lt;br/&gt;
&lt;br/&gt;
# Mixture of MOFED 4.9 and MOFED 5.1&lt;br/&gt;
[&lt;a href=&apos;mailto:root@gpu-e-80&apos;&gt;root@gpu-e-80&lt;/a&gt; ~]# ofed_info -n&lt;br/&gt;
4.9-2.2.4.0&lt;br/&gt;
{code}</environment>
        <key id="64938">LU-14802</key>
            <summary>MGS configuration problems - cannot add new OST, change parameters, hanging</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="mrb">Matt R&#225;s&#243;-Barnett</reporter>
                        <labels>
                    </labels>
                <created>Thu, 1 Jul 2021 10:04:06 +0000</created>
                <updated>Sat, 27 Aug 2022 13:29:17 +0000</updated>
                            <resolved>Mon, 31 Jan 2022 04:42:20 +0000</resolved>
                                                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
<comment id="305978" author="mrb" created="Thu, 1 Jul 2021 10:58:57 +0000"  >&lt;p&gt;Another data point on this problem: we have a second identical filesystem that also shows it, again triggered by attempting to add OSTs to it on Sunday evening.&lt;/p&gt;

&lt;p&gt;I just now tried unmounting the MGS, as clients were unable to mount the filesystem, and it generated the following stack trace and Lustre debug log as it hung for some time before unmounting. Just thought I would add it in case this particular trace helps:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre&gt;
Jul 01 11:27:11 rds-mds10 kernel: LNet: Service thread pid 204490 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.34s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
Jul 01 11:27:11 rds-mds10 kernel: Pid: 204490, comm: ll_mgs_0002 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Sun Aug 23 13:52:28 UTC 2020
Jul 01 11:27:11 rds-mds10 kernel: Call Trace:
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc108ac90&amp;gt;] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc141776c&amp;gt;] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc1417983&amp;gt;] mgs_completion_ast_config+0x13/0x20 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc108b7b1&amp;gt;] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc141c3b4&amp;gt;] mgs_revoke_lock+0x104/0x380 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc141cad2&amp;gt;] mgs_target_reg+0x4a2/0x1320 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc11299da&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc10ce48b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffc10d1df4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffff82ac6691&amp;gt;] kthread+0xd1/0xe0
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffff83192d1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Jul 01 11:27:11 rds-mds10 kernel:  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
Jul 01 11:27:11 rds-mds10 kernel: LustreError: dumping log to /tmp/lustre-log.1625135231.204490
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/39379/39379_lustre-log.1625135231.204490.gz&quot; title=&quot;lustre-log.1625135231.204490.gz attached to LU-14802&quot;&gt;lustre-log.1625135231.204490.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</comment>
                            <comment id="306013" author="pjones" created="Thu, 1 Jul 2021 14:56:07 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="306225" author="mrb" created="Mon, 5 Jul 2021 14:56:38 +0000"  >&lt;p&gt;Hi Mike,&lt;br/&gt;
My suspicion before opening this ticket was that I was experiencing the same as this ticket: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12735&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-12735&lt;/a&gt; which references a patch under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; that should fix the problem.&lt;/p&gt;

&lt;p&gt;I tried to test this by only updating the MGS server to a 2.12.5 version with this patch backported, and nothing else. However, looking more closely at the commit, is this actually a &lt;b&gt;client&lt;/b&gt; patch? If so, should I update all the clients in our environment with this patch to see the effect? How about the servers: is there any reason for them to have this patch as well?&lt;/p&gt;

&lt;p&gt;Thanks for your time,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
<comment id="306227" author="tappro" created="Mon, 5 Jul 2021 15:59:26 +0000"  >&lt;p&gt;Matt, yes, the patch you are referring to is a client/server patch; both parts are actually needed. Meanwhile, I tend to think that we originally had an LNET problem here, and that caused the related problem with the MGS. Could you also collect the LNET configuration:&lt;br/&gt;
&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl global show
lnetctl net show -v 4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;also the configuration defined in &lt;tt&gt;ko2iblnd.conf&lt;/tt&gt;, and the values of any parameters set in &lt;tt&gt;/sys/module/lnet/parameters/&lt;/tt&gt; and &lt;tt&gt;/sys/module/ko2iblnd/parameters/&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="306275" author="mrb" created="Tue, 6 Jul 2021 10:41:00 +0000"  >&lt;p&gt;Here is the LNET configuration for the servers:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;Server Config&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[root@rds-mds8 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 2
    transaction_timeout: 50
    health_sensitivity: 100
    recovery_interval: 1

[root@rds-mds8 ~]# lnetctl net show -v 4
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 38220
              recv_count: 38220
              drop_count: 0
          sent_stats:
              put: 38220
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 38220
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: &lt;span class=&quot;code-quote&quot;&gt;&quot;[0,1]&quot;&lt;/span&gt;
      local NI(s):
        - nid: 10.44.241.2@o2ib2
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 41253701
              recv_count: 40228442
              drop_count: 1038348
          sent_stats:
              put: 41184595
              get: 69106
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 40150520
              get: 3158
              reply: 65947
              ack: 8817
              hello: 0
          dropped_stats:
              put: 1038348
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              interrupts: 0
              dropped: 311
              aborted: 0
              no route: 0
              timeouts: 102
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 16
              peer_buffer_credits: 0
              credits: 2048
              peercredits_hiw: 8
              map_on_demand: 256
              concurrent_sends: 16
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: &lt;span class=&quot;code-quote&quot;&gt;&quot;[0,1]&quot;&lt;/span&gt;

[root@rds-mds8 ~]# cat /etc/modprobe.d/ko2iblnd.conf
# Ansible managed

# Currently it isn&apos;t possible to auto-tune the o2iblnd parameters optimally
# inside the kernel since the OFED API hides the details from us.
# Unfortunately, there isn&apos;t a single set of parameters that provide optimal
# performance on different HCA/HFI types. This file provides optimized
# tunables &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; the OPA cards.
#
# ** Please note that the below settings are the recommended settings only &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt;
#    OPA cards. If other IB cards are also present along with OPA cards then
#    these settings will be applied across all the configured IB interfaces.
#
# Card detection and tunable selection is handled via /usr/sbin/ko2iblnd-probe
# at runtime when the ko2iblnd module is installed, either at boot or when
# Lustre is first mounted.

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
install ko2iblnd /usr/sbin/ko2iblnd-probe

[root@rds-mds8 ~]# cat /etc/modprobe.d/lnet.conf 
# Ansible managed
options lnet networks=&lt;span class=&quot;code-quote&quot;&gt;&quot;o2ib2(ib0)&quot;&lt;/span&gt; auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60

[root@rds-mds8 ~]# cat /etc/lnet.conf
# Ansible managed
---
# lnet.conf - configuration file &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; lnet routes to be imported by lnetctl
#
# This configuration file is formatted as YAML and can be imported
# by lnetctl.
net:
    - net type: o2ib2
      local NI(s):
        - nid: 10.44.241.2@o2ib2
          interfaces:
            0: ib0
          tunables:
              peer_credits: 16
              peer_credits_hiw: 8
              concurrent_sends: 16
              credits: 2048
          lnd tunables:
              map_on_demand: 256
              ntx: 2048
route:
    - net: o2ib0
      gateway: 10.44.240.165@o2ib2
    - net: o2ib0
      gateway: 10.44.240.166@o2ib2
    - net: o2ib0
      gateway: 10.44.240.167@o2ib2
    - net: o2ib0
      gateway: 10.44.240.168@o2ib2
    - net: o2ib1
      gateway: 10.44.240.161@o2ib2
    - net: o2ib1
      gateway: 10.44.240.162@o2ib2
    - net: o2ib1
      gateway: 10.44.240.163@o2ib2
    - net: o2ib1
      gateway: 10.44.240.164@o2ib2
    - net: o2ib1
      gateway: 10.44.240.165@o2ib2
    - net: o2ib1
      gateway: 10.44.240.166@o2ib2
    - net: o2ib1
      gateway: 10.44.240.167@o2ib2
    - net: o2ib1
      gateway: 10.44.240.168@o2ib2
    - net: tcp2
      gateway: 10.44.240.161@o2ib2
    - net: tcp2
      gateway: 10.44.240.162@o2ib2
    - net: tcp2
      gateway: 10.44.240.163@o2ib2
    - net: tcp2
      gateway: 10.44.240.164@o2ib2
    - net: tcp2
      gateway: 10.44.240.165@o2ib2
    - net: tcp2
      gateway: 10.44.240.166@o2ib2
    - net: tcp2
      gateway: 10.44.240.167@o2ib2
    - net: tcp2
      gateway: 10.44.240.168@o2ib2
    - net: tcp4
      gateway: 10.44.240.161@o2ib2
    - net: tcp4
      gateway: 10.44.240.162@o2ib2
    - net: tcp4
      gateway: 10.44.240.163@o2ib2
    - net: tcp4
      gateway: 10.44.240.164@o2ib2
    - net: tcp4
      gateway: 10.44.240.165@o2ib2
    - net: tcp4
      gateway: 10.44.240.166@o2ib2
    - net: tcp4
      gateway: 10.44.240.167@o2ib2
    - net: tcp4
      gateway: 10.44.240.168@o2ib2

[root@rds-mds8 ~]# grep . /sys/module/lnet/parameters/*                                                                                                                                                                                                                                    
/sys/module/lnet/parameters/accept:secure
/sys/module/lnet/parameters/accept_backlog:127
/sys/module/lnet/parameters/accept_port:988
/sys/module/lnet/parameters/accept_timeout:5
/sys/module/lnet/parameters/auto_down:1
/sys/module/lnet/parameters/avoid_asym_router_failure:1
/sys/module/lnet/parameters/check_routers_before_use:1
/sys/module/lnet/parameters/config_on_load:0
/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/large_router_buffers:0
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/lnet_drop_asym_route:0
/sys/module/lnet/parameters/lnet_health_sensitivity:100
/sys/module/lnet/parameters/lnet_interfaces_max:200
/sys/module/lnet/parameters/lnet_numa_range:0
/sys/module/lnet/parameters/lnet_peer_discovery_disabled:0
/sys/module/lnet/parameters/lnet_recovery_interval:1
/sys/module/lnet/parameters/lnet_retry_count:2
/sys/module/lnet/parameters/lnet_transaction_timeout:50
/sys/module/lnet/parameters/local_nid_dist_zero:1
/sys/module/lnet/parameters/networks:o2ib2(ib0)
/sys/module/lnet/parameters/peer_buffer_credits:0
/sys/module/lnet/parameters/portal_rotor:3
/sys/module/lnet/parameters/rnet_htable_size:128
/sys/module/lnet/parameters/router_ping_timeout:60
/sys/module/lnet/parameters/small_router_buffers:0
/sys/module/lnet/parameters/tiny_router_buffers:0
/sys/module/lnet/parameters/use_tcp_bonding:0

[root@rds-mds8 ~]# grep . /sys/module/ko2iblnd/parameters/*
/sys/module/ko2iblnd/parameters/cksum:0
/sys/module/ko2iblnd/parameters/concurrent_sends:0
/sys/module/ko2iblnd/parameters/conns_per_peer:1
/sys/module/ko2iblnd/parameters/credits:256
/sys/module/ko2iblnd/parameters/dev_failover:0
/sys/module/ko2iblnd/parameters/fmr_cache:1
/sys/module/ko2iblnd/parameters/fmr_flush_trigger:384
/sys/module/ko2iblnd/parameters/fmr_pool_size:512
/sys/module/ko2iblnd/parameters/ib_mtu:0
/sys/module/ko2iblnd/parameters/ipif_name:ib0
/sys/module/ko2iblnd/parameters/keepalive:100
/sys/module/ko2iblnd/parameters/map_on_demand:0
/sys/module/ko2iblnd/parameters/nscheds:0
/sys/module/ko2iblnd/parameters/ntx:512
/sys/module/ko2iblnd/parameters/peer_buffer_credits:0
/sys/module/ko2iblnd/parameters/peer_credits:8
/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
/sys/module/ko2iblnd/parameters/peer_timeout:180
/sys/module/ko2iblnd/parameters/require_privileged_port:0
/sys/module/ko2iblnd/parameters/retry_count:5
/sys/module/ko2iblnd/parameters/rnr_retry_count:6
/sys/module/ko2iblnd/parameters/service:987
/sys/module/ko2iblnd/parameters/timeout:50
/sys/module/ko2iblnd/parameters/use_fastreg_gaps:0
/sys/module/ko2iblnd/parameters/use_privileged_port:1
/sys/module/ko2iblnd/parameters/wrq_sge:2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
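The route: section above is highly repetitive (each gateway is repeated for every destination net). Since the file is Ansible-managed, it could plausibly be produced by a short loop like the following sketch. The net names and gateway addresses are copied from the dump above; the loop itself is hypothetical, not our actual template:

```shell
# Hypothetical generator for the repetitive route: section of /etc/lnet.conf.
# o2ib0 uses only the last four gateways; the other nets use all eight.
gen_routes() {
    printf 'route:\n'
    for net in o2ib0 o2ib1 tcp2 tcp4; do
        case $net in
            o2ib0) last_octets="165 166 167 168" ;;
            *)     last_octets="161 162 163 164 165 166 167 168" ;;
        esac
        for o in $last_octets; do
            printf '    - net: %s\n      gateway: 10.44.240.%s@o2ib2\n' "$net" "$o"
        done
    done
}
gen_routes
```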

&lt;p&gt;The servers here have Mellanox ConnectX-6 HDR cards, and we are currently still using MOFED 4.9 on all the servers:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[root@rds-mds8 ~]# lspci -vvv | grep Mellanox
5e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
        Subsystem: Mellanox Technologies Device 0006
[root@rds-mds8 ~]# ibstatus
Infiniband device &lt;span class=&quot;code-quote&quot;&gt;&apos;mlx5_0&apos;&lt;/span&gt; port 1 status:
        &lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt; gid:     fe80:0000:0000:0000:1c34:da03:0055:3a44
        base lid:        0x24e
        sm lid:          0x53
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

[root@rds-mds8 ~]# ofed_info -n
4.9-0.1.7.0.202008231437
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The client environment is much more diverse, so if the logs indicate any particular clients/NIDs that are involved here I can provide more details on those. We also have LNET routers in the environment, and these are routing between IB and Intel OPA and Ethernet.&lt;/p&gt;

&lt;p&gt;The IB fabric is also relatively diverse, with generations of clients spanning FDR to EDR to HDR. Generally, though, these are all running up-to-date Lustre 2.12.6, on a mixture of RHEL 7.9 and RHEL 8.3, with MOFED 4.9 or 5.1.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="306659" author="tappro" created="Thu, 8 Jul 2021 21:35:50 +0000"  >&lt;p&gt;At the LNET level, there are health stats for NID&#160;10.44.241.2@o2ib2:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 health stats:
              health value: 1000
              interrupts: 0
              dropped: 311
              aborted: 0
              no route: 0
              timeouts: 102
              error: 0&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It shows non-zero &apos;dropped&apos; and &apos;timeouts&apos; values; though that is not the major problem you have seen, it is still worth tuning the LNET values to the following:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
peer_credits = 32
peer_credits_hiw = 16
concurrent_sends = 64

lnet_transaction_timeout = 100
lnet_retry_count = 2 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;As for the main MGS issue, the main patch change is in &lt;tt&gt;lustre_start_mgc()&lt;/tt&gt;, which is used by both servers and clients to access the MGS, so ideally the patch should be applied to all nodes. Can you apply it on all the servers at least?&lt;/p&gt;</comment>
                            <comment id="306950" author="mrb" created="Mon, 12 Jul 2021 13:48:38 +0000"  >&lt;p&gt;Hi Mikhail, I can certainly adjust the LNET values to those recommended across our estate, and similarly deploy the patch. We have a general maintenance window coming up next week on the 21st, so we will aim to deploy these changes at that time.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Matt&lt;/p&gt;</comment>
                            <comment id="307341" author="mrb" created="Wed, 14 Jul 2021 13:45:49 +0000"  >&lt;p&gt;Hi Mikhail, a slight side-question here: when setting the parameter &lt;tt&gt;lnet_transaction_timeout&lt;/tt&gt;, I don&apos;t see a way to set it in the &lt;tt&gt;/etc/lnet.conf&lt;/tt&gt; file.&lt;/p&gt;

&lt;p&gt;It seems it can be configured dynamically via &lt;tt&gt;lnetctl set transaction_timeout&lt;/tt&gt;, but I was hoping to find a configuration-file way to do it, e.g. something like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# cat /etc/lnet.conf
...
global:
    transaction_timeout = 100
    retry_count = 2 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can obviously configure the setting dynamically via a script, but we&apos;ve moved all our tunables into the managed /etc/lnet.conf file now, so having a facility to set it via this file would be desirable, just for simpler management.&lt;/p&gt;

&lt;p&gt;Otherwise we will roll out the patch and updated settings next week so will feed-back after this how it&apos;s looking.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Matt&lt;/p&gt;</comment>
                            <comment id="307541" author="tappro" created="Fri, 16 Jul 2021 10:07:32 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt;, could you please help with LNET settings here?&lt;/p&gt;</comment>
                            <comment id="307577" author="ssmirnov" created="Fri, 16 Jul 2021 16:02:08 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;You should be able to set the transaction timeout and retry count in lnet.conf by adding the following line:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 options lnet lnet_retry_count=3 lnet_transaction_timeout=150&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;the o2ib lnd parameters can be set like this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Please make sure that the above is set uniformly across the cluster.&lt;/p&gt;
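As a sketch of staging those two module-option files for uniform deployment, the following writes them to a scratch directory rather than /etc/modprobe.d (so they can be reviewed/diffed first), using the timeout/retry values Mikhail recommended earlier; file names and staging approach are illustrative, not a prescribed procedure:

```shell
# Stage the uniform LNet module options in a scratch dir instead of
# /etc/modprobe.d, for review before pushing out with config management.
stage=$(mktemp -d)

cat > "$stage/lnet.conf" <<'EOF'
options lnet lnet_retry_count=2 lnet_transaction_timeout=100
EOF

cat > "$stage/ko2iblnd.conf" <<'EOF'
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64
EOF

# Show what would be installed on each node
grep -H '^options' "$stage"/*.conf
```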

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="307589" author="mrb" created="Fri, 16 Jul 2021 18:00:16 +0000"  >&lt;p&gt;Thanks Serguei, I forgot about adding it to the module options - I&apos;ve been using the dynamic YAML /etc/lnet.conf config so much recently that it was my instinctive first choice for adding the configuration.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="307943" author="eaujames" created="Wed, 21 Jul 2021 10:38:46 +0000"  >&lt;p&gt;Hello &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mrb&quot; class=&quot;user-hover&quot; rel=&quot;mrb&quot;&gt;mrb&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Are you sure you have the same logs before and after applying the patch &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;The typical dmesg on MGC nodes:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The MGCs (all the client nodes) keep trying to reconnect to the MGS indefinitely, because the MGS waits indefinitely for an MGC to release a lock: it cannot evict the client (the OBD_CONNECT_MDS_MDS flag is set by the client).&lt;/p&gt;

&lt;p&gt;So the following call trace can be observed on the MGS:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jun 27 22:33:37 rds-mds7 kernel: Call Trace:
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc1313c90&amp;gt;] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a076c&amp;gt;] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a0983&amp;gt;] mgs_completion_ast_config+0x13/0x20 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc13147b1&amp;gt;] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a53b4&amp;gt;] mgs_revoke_lock+0x104/0x380 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc16a5ad2&amp;gt;] mgs_target_reg+0x4a2/0x1320 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc13b29da&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc135748b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffc135adf4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffff964c6691&amp;gt;] kthread+0xd1/0xe0
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffff96b92d1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
Jun 27 22:33:37 rds-mds7 kernel: [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;After some hours/days (depending on the number of clients), we observe that the load increases on the MGS node, because res-&amp;gt;lr_waiting becomes too large. The node becomes unresponsive, and HA kills it.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; title=&quot;lctl conf_param hung on the MGS node&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13356&quot;&gt;&lt;del&gt;LU-13356&lt;/del&gt;&lt;/a&gt; patch is a client and server patch, but after applying it the MGS should clear the connection flags OBD_CONNECT_MDS_MDS/OBD_CONNECT_MNE_SWAB by checking MGS_CONNECT_SUPPORTED.&lt;/p&gt;

&lt;p&gt;The known workaround for this issue is to umount/mount the MGT, but it might crash the MGS.&lt;/p&gt;

&lt;p&gt;The main difference between our issue and this one is the resource.&lt;br/&gt;
 In our case it was an &quot;Imperative recovery&quot;/CONFIG_T_RECOVER MGS resource, e.g. &lt;span class=&quot;error&quot;&gt;&amp;#91;0x6b616f:*0x2*:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
 In this case it is a &quot;Config&quot;/CONFIG_T_CONFIG MGS resource, because you are adding new OSTs: &lt;span class=&quot;error&quot;&gt;&amp;#91;0x36642d736472:*0x0*:0x0&amp;#93;&lt;/span&gt;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
xxd -p -r &amp;lt;&amp;lt;&amp;lt; &lt;span class=&quot;code-quote&quot;&gt;&quot;36642d736472&quot;&lt;/span&gt; &amp;amp;&amp;amp; echo
6d-sdr (rds-6d)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
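The same decode can be done without xxd, in case it is not installed on the node (a pure-shell sketch; the hex string is the fsname portion of the resource ID above):

```shell
# Decode a Lustre MGS resource name from its hex form (pure-shell
# equivalent of the xxd one-liner above; no xxd dependency).
hex2ascii() {
    hex=$1
    out=""
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}              # take the first two hex digits
        out="$out$(printf "\\x$byte")"       # append the decoded byte
        hex=${hex#??}                        # drop the consumed digits
    done
    printf '%s\n' "$out"
}

hex2ascii 36642d736472   # prints 6d-sdr
```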
&lt;p&gt;This issue could definitely be caused by network errors (the MGC&#160;becomes unresponsive while holding a read lock on the resource).&lt;/p&gt;</comment>
                            <comment id="308927" author="mrb" created="Fri, 30 Jul 2021 14:39:19 +0000"  >&lt;p&gt;Hi all, I have an update now post-upgrade of our servers and clients last week.&lt;/p&gt;

&lt;p&gt;I have now updated all servers and the majority of our clients to 2.12.7 + &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13356&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13356&lt;/a&gt; patch.&lt;/p&gt;

&lt;p&gt;Since that update we haven&apos;t had the same issues in &lt;b&gt;most&lt;/b&gt; cases, which is good, so it appears the patch has worked for us. I have been able to update the MGS configuration and add new OSTs for one of the new filesystems, which was the original trigger for this ticket.&lt;/p&gt;

&lt;p&gt;However, unfortunately, I still have 6 OSTs on one of the other upgraded filesystems that are &lt;b&gt;still&lt;/b&gt; refusing to mount and register. This is the case even after I re-format the OSTs:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MGS logs - post format of OST0049&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Skipped 2 previous similar messages
Jul 30 15:22:36 rds-mds8 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Jul 30 15:22:36 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;OSS logs - post format of OST0049&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul 30 15:22:28 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 15f-b: rds-d6-OST0049: cannot register &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; server with the MGS: rc = -98. Is the MGS running?
Jul 30 15:22:35 rds-oss60 kernel: LustreError: Skipped 1 previous similar message
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -98
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0049
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0049 not registered
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:134:server_deregister_mount()) Skipped 1 previous similar message
Jul 30 15:22:35 rds-oss60 kernel: Lustre: server umount rds-d6-OST0049 complete
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount  (-98)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;OK, so it&apos;s complaining that the requested index is in use, so I reformat again, now with the &apos;--replace&apos; flag:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MGS logs - post format with --replace&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul 30 15:27:26 rds-mds8 kernel: Lustre: Found index 73 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; rds-d6-OST0049, updating log
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 142-7: The target rds-d6-OST0049 has not registered yet. It must be started before failnids can be added.
Jul 30 15:27:26 rds-mds8 kernel: LustreError: Skipped 2 previous similar messages
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_llog.c:4304:mgs_write_log_param()) err -2 on param &lt;span class=&quot;code-quote&quot;&gt;&apos;failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2&apos;&lt;/span&gt;
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_llog.c:4304:mgs_write_log_param()) Skipped 2 previous similar messages
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-2)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;OSS logs - post format with --replace&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Jul 30 15:27:20 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jul 30 15:27:32 rds-oss60 kernel: LustreError: 15f-b: rds-d6-OST0049: cannot register &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; server with the MGS: rc = -2. Is the MGS running?
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -2
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0049
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0049 not registered
Jul 30 15:27:33 rds-oss60 kernel: Lustre: server umount rds-d6-OST0049 complete
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount  (-2)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&apos;t see any log files on the MGT for the OSTs I&apos;m trying to mount (for reference these are rds-d6-OST004&lt;span class=&quot;error&quot;&gt;&amp;#91;9abdef&amp;#93;&lt;/span&gt;):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[root@rds-mds8 ~]# lctl --device MGS llog_catlist | grep -E &lt;span class=&quot;code-quote&quot;&gt;&quot;rds-d6-OST004[9abdef]&quot;&lt;/span&gt;
[root@rds-mds8 ~]#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m not really sure what else to try at this point. Is there any information in the above logs that would indicate the problem here?&lt;br/&gt;
Thanks,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="308949" author="tappro" created="Fri, 30 Jul 2021 17:56:45 +0000"  >&lt;p&gt;Matt, could you show the exact commands you were using to format the OST? &lt;tt&gt;mgs_write_log_param()&lt;/tt&gt; failed on the &lt;tt&gt;failover.node&lt;/tt&gt; parameter setting; probably it is just wrong syntax. Also, to get more details you could collect Lustre debug logs; just do the following steps:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;On the MGS node, set the debug level with &apos;&lt;tt&gt;lctl set_param debug=0xffffffff&lt;/tt&gt;&apos; (maximum level)&lt;/li&gt;
	&lt;li&gt;Run your command to add the OST&lt;/li&gt;
	&lt;li&gt;On the MGS again, collect the log with: &apos;&lt;tt&gt;lctl dk &amp;gt; your_logfilename&lt;/tt&gt;&apos;&lt;/li&gt;
	&lt;li&gt;Analyse the logfile or just attach it here.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Before that, note the old debug level on the MGS (get it via&#160;&apos;&lt;tt&gt;lctl get_param debug&lt;/tt&gt;&apos; and save it), and set it back afterwards: the maximum debug level will slow the server down, so it is better to keep the default setting and use the maximum level only while debugging.&lt;/p&gt;</comment>
                            <comment id="309146" author="mrb" created="Tue, 3 Aug 2021 14:01:54 +0000"  >&lt;p&gt;Here are the format commands used:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/sbin/mkfs.lustre --ost --replace --fsname rds-d6 --index 73 --mgsnode=&apos;10.44.241.1@o2ib2&apos; --mgsnode=&apos;10.44.241.2@o2ib2&apos; --servicenode=&apos;10.44.241.26@o2ib2&apos; --servicenode=&apos;10.44.241.25@o2ib2&apos; --mkfsoptions &apos;-i 2097152 -E stride=32,stripe-width=256&apos; &apos;/dev/mapper/v2-rds-ost-jb60&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The only difference between the tests above is whether I include the &apos;--replace&apos; flag or not.&lt;/p&gt;

&lt;p&gt;I have taken logs from both the MGS and OSS during an attempted mount after running the above format command. I&apos;ve set a debug marker just before this: &apos;DEBUG MARKER: START rds-OST0049 mount&apos;. The OSS NID is:&lt;br/&gt;
10.44.241.26@o2ib2 and the OST is rds-OST0049, index 73.&lt;/p&gt;

&lt;p&gt;Unfortunately I think the MGS log is too big to upload to this ticket (even compressed it&apos;s 100M); do you have an SFTP upload facility where I can put the log files?&lt;/p&gt;

&lt;p&gt;Here is (I think), the relevant section from the MGS log:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
20000000:00000001:25.0:1627998228.698165:0:182803:0:(mgs_handler.c:364:mgs_target_reg()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698167:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698167:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698168:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698169:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698169:0:182803:0:(mgs_handler.c:414:mgs_target_reg()) fs: rds-d6 index: 73 is registered to MGS.
20000000:00000001:25.0:1627998228.698170:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698170:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698171:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698171:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698172:0:182803:0:(mgs_handler.c:519:mgs_target_reg()) updating rds-d6-OST0049, index=73
20000000:00000001:25.0:1627998228.698172:0:182803:0:(mgs_llog.c:4315:mgs_write_log_target()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:666:mgs_set_index()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000001:25.0:1627998228.698174:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698174:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698175:0:182803:0:(mgs_llog.c:710:mgs_set_index()) Server rds-d6-OST0049 updating index 73
20000000:00000001:25.0:1627998228.698175:0:182803:0:(mgs_llog.c:711:mgs_set_index()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via out_up (rc=114 : 114 : 0x72)
20000000:02000400:25.0:1627998228.698176:0:182803:0:(mgs_llog.c:4324:mgs_write_log_target()) Found index 73 &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; rds-d6-OST0049, updating log
20000000:01000000:25.0:1627998228.698178:0:182803:0:(mgs_llog.c:4356:mgs_write_log_target()) Update params &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; rds-d6-OST0049
20000000:00000010:25.0:1627998228.698179:0:182803:0:(mgs_llog.c:4362:mgs_write_log_target()) kmalloced &lt;span class=&quot;code-quote&quot;&gt;&apos;(buf)&apos;&lt;/span&gt;: 97 at ffff96c084bf1700.
20000000:01000000:25.0:1627998228.698180:0:182803:0:(mgs_llog.c:4376:mgs_write_log_target()) remaining string: &lt;span class=&quot;code-quote&quot;&gt;&apos; failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2 &apos;&lt;/span&gt;, param: &lt;span class=&quot;code-quote&quot;&gt;&apos;mgsnode=10.44.241.1@o2ib2:10.44.241.2@o2ib2&apos;&lt;/span&gt;
20000000:00000001:25.0:1627998228.698181:0:182803:0:(mgs_llog.c:3886:mgs_write_log_param()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:01000000:25.0:1627998228.698181:0:182803:0:(mgs_llog.c:3890:mgs_write_log_param()) next param &lt;span class=&quot;code-quote&quot;&gt;&apos;mgsnode=10.44.241.1@o2ib2:10.44.241.2@o2ib2&apos;&lt;/span&gt;
20000000:00000001:25.0:1627998228.698182:0:182803:0:(mgs_llog.c:3897:mgs_write_log_param()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via end (rc=0 : 0 : 0x0)
20000000:00000001:25.0:1627998228.698183:0:182803:0:(mgs_llog.c:4306:mgs_write_log_param()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698183:0:182803:0:(mgs_llog.c:4376:mgs_write_log_target()) remaining string: &lt;span class=&quot;code-quote&quot;&gt;&apos; &apos;&lt;/span&gt;, param: &lt;span class=&quot;code-quote&quot;&gt;&apos;failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2&apos;&lt;/span&gt;
20000000:00000001:25.0:1627998228.698184:0:182803:0:(mgs_llog.c:3886:mgs_write_log_param()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:01000000:25.0:1627998228.698184:0:182803:0:(mgs_llog.c:3890:mgs_write_log_param()) next param &lt;span class=&quot;code-quote&quot;&gt;&apos;failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2&apos;&lt;/span&gt;
20000000:01000000:25.0:1627998228.698185:0:182803:0:(mgs_llog.c:3925:mgs_write_log_param()) Adding failnode
20000000:00000001:25.0:1627998228.698185:0:182803:0:(mgs_llog.c:3203:mgs_write_log_add_failnid()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
20000000:00000040:25.0:1627998228.698186:0:182803:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff96bf0f3a6600 : &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; refcount 2
00000040:00000001:25.0:1627998228.698187:0:182803:0:(llog.c:1280:llog_open()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000040:00000010:25.0:1627998228.698188:0:182803:0:(llog.c:61:llog_alloc_handle()) kmalloced &lt;span class=&quot;code-quote&quot;&gt;&apos;(loghandle)&apos;&lt;/span&gt;: 280 at ffff96c104bcce00.
00000040:00000001:25.0:1627998228.698190:0:182803:0:(llog_osd.c:1236:llog_osd_open()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000020:00000001:25.0:1627998228.698190:0:182803:0:(local_storage.c:146:ls_device_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000020:00000001:25.0:1627998228.698191:0:182803:0:(local_storage.c:151:ls_device_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via out_ls (rc=18446628553290942976 : -115520418608640 : 0xffff96ef4e7c0a00)
00000020:00000001:25.0:1627998228.698192:0:182803:0:(local_storage.c:173:ls_device_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=18446628553290942976 : -115520418608640 : ffff96ef4e7c0a00)
00080000:00000001:25.0:1627998228.698194:0:182803:0:(osd_handler.c:7423:osd_index_ea_lookup()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00080000:00000001:25.0:1627998228.698195:0:182803:0:(osd_handler.c:5972:osd_ea_lookup_rec()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00080000:00000001:25.0:1627998228.698200:0:182803:0:(osd_handler.c:6047:osd_ea_lookup_rec()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00080000:00000001:25.0:1627998228.698202:0:182803:0:(osd_handler.c:7431:osd_index_ea_lookup()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000001:25.0:1627998228.698203:0:182803:0:(llog_osd.c:1298:llog_osd_open()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00000040:00000001:25.0:1627998228.698203:0:182803:0:(llog_osd.c:1354:llog_osd_open()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000010:25.0:1627998228.698204:0:182803:0:(llog.c:91:llog_free_handle()) kfreed &lt;span class=&quot;code-quote&quot;&gt;&apos;loghandle&apos;&lt;/span&gt;: 280 at ffff96c104bcce00.
00000040:00000001:25.0:1627998228.698205:0:182803:0:(llog.c:1306:llog_open()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000001:25.0:1627998228.698205:0:182803:0:(llog.c:1337:llog_is_empty()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving via out (rc=0 : 0 : 0x0)
20000000:00000040:25.0:1627998228.698206:0:182803:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff96bf0f3a6600 : &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; refcount 1
20000000:02020000:25.0:1627998228.698206:0:182803:0:(mgs_llog.c:3214:mgs_write_log_add_failnid()) 142-7: The target rds-d6-OST0049 has not registered yet. It must be started before failnids can be added.
00000800:00000200:36.2F:1627998228.706154:0:0:0:(o2iblnd_cb.c:3769:kiblnd_cq_completion()) conn[ffff96c19bbf3a00] (36)++
00000800:00000200:43.0:1627998228.706160:0:2905:0:(o2iblnd_cb.c:3891:kiblnd_scheduler()) conn[ffff96c19bbf3a00] (37)++
00000800:00000200:43.0:1627998228.706161:0:2905:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d1[1] from 10.44.240.162@o2ib2
00000400:00000200:43.0:1627998228.706163:0:2905:0:(lib-move.c:4196:lnet_parse()) TRACE: 10.44.241.2@o2ib2(10.44.241.2@o2ib2) &amp;lt;- 10.43.10.24@tcp2 : PUT - &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; me
00000800:00000200:45.0:1627998228.706164:0:2903:0:(o2iblnd_cb.c:3907:kiblnd_scheduler()) conn[ffff96c19bbf3a00] (38)--
00000400:00000010:43.0:1627998228.706164:0:2905:0:(lib-lnet.h:490:lnet_msg_alloc()) alloc &lt;span class=&quot;code-quote&quot;&gt;&apos;(msg)&apos;&lt;/span&gt;: 440 at ffff96c10fb88e00 (tot 357391524).
00000400:00000200:43.0:1627998228.706165:0:2905:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.43.10.24@tcp2 of length 224 into portal 26 MB=0x60f8b5d916f80
00000400:00000200:43.0:1627998228.706167:0:2905:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 1a from 12345-10.43.10.24@tcp2 of length 224/224 into md 0x4a433cc9 [1] + 0
00000400:00000010:43.0:1627998228.706167:0:2905:0:(lib-lnet.h:295:lnet_me_free()) slab-freed &lt;span class=&quot;code-quote&quot;&gt;&apos;me&apos;&lt;/span&gt; at ffff96c0e179aa80.
00000400:00000200:43.0:1627998228.706168:0:2905:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff96c0a439a990
00000400:00000200:43.0:1627998228.706169:0:2905:0:(lib-msg.c:918:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
00000100:00000001:43.0:1627998228.706169:0:2905:0:(events.c:300:request_in_callback()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000200:43.0:1627998228.706169:0:2905:0:(events.c:310:request_in_callback()) event type 2, status 0, service mgs
00000100:00000040:43.0:1627998228.706170:0:2905:0:(events.c:353:request_in_callback()) incoming req@ffff96bf0fa13850 x1705941104947072 msgsize 224
00000100:00100000:43.0:1627998228.706171:0:2905:0:(events.c:356:request_in_callback()) peer: 12345-10.43.10.24@tcp2 (source: 12345-10.43.10.24@tcp2)
00000100:00000040:43.0:1627998228.706172:0:2905:0:(events.c:365:request_in_callback()) Buffer complete: 62 buffers still posted
00000100:00000001:43.0:1627998228.706173:0:2905:0:(events.c:389:request_in_callback()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving
00000400:00000200:43.0:1627998228.706174:0:2905:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff96c0a439a990
00000400:00000010:43.0:1627998228.706174:0:2905:0:(lib-lnet.h:270:lnet_md_free()) slab-freed &lt;span class=&quot;code-quote&quot;&gt;&apos;md&apos;&lt;/span&gt; at ffff96c0a439a990.
00000400:00000010:43.0:1627998228.706175:0:2905:0:(lib-lnet.h:500:lnet_msg_free()) kfreed &lt;span class=&quot;code-quote&quot;&gt;&apos;msg&apos;&lt;/span&gt;: 440 at ffff96c10fb88e00 (tot 357391084).
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;ll provide the full logs if there is somewhere I can upload them.&lt;/p&gt;</comment>
                            <comment id="310638" author="mrb" created="Thu, 19 Aug 2021 12:27:57 +0000"  >&lt;p&gt;Hi Mikhail, just wondering if you have any time to take another look at this for me? I&apos;m still unable to add these OSTs to the filesystem currently and not sure what else to try right now.&lt;/p&gt;

&lt;p&gt;I can provide the full logs to the previous comment if there is an FTP server I can upload them to?&lt;/p&gt;</comment>
                            <comment id="312615" author="delbaryg" created="Mon, 13 Sep 2021 15:26:37 +0000"  >&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;I don&apos;t know if you have finally figured out your issue. On a custom Lustre version (2.12.6 + some patches) we hit a strange issue: bad Lustre configuration files, and we were unable to start the fs.&lt;/p&gt;

&lt;p&gt;The root cause was a bad patch (outside LTS), but the behaviour was not clean (in ldiskfs the params files had a link count of 0, and Lustre complained about config corruption). Even after a clean writeconf there was no way to get a working fs. When I debugged a bit, I saw that a number of clients (~13000) were generating a lot of noise. Normally in the writeconf procedure you have to unmount and remount all clients (plus targets), but I was surprised that the MDT/OST configs were in a &quot;bad state&quot;.&lt;/p&gt;

&lt;p&gt;Anyway, to fix it I had to stop all clients, and then everything was fine: the writeconf procedure completed successfully, and after that clients mounted without any issue (and no further corruption in the config files). If you can, you could test this (just in case), but it is not a seamless solution...&lt;/p&gt;</comment>
                            <comment id="313682" author="mrb" created="Wed, 22 Sep 2021 14:33:11 +0000"  >&lt;p&gt;Hi Gael, that&apos;s interesting thanks for sharing.&lt;/p&gt;

&lt;p&gt;Unfortunately I am still stuck on this issue somewhat, although the impact has lessened a little bit. The current status is that I have a single OST that fails to mount and join the filesystem, the one mentioned above that should be index 73.&lt;/p&gt;

&lt;p&gt;The only new development since my last messages is that 5 other OSTs that weren&apos;t mounting previously have now mounted successfully and joined the filesystem - indices 74 to 79.&lt;/p&gt;

&lt;p&gt;In the last month we had a lot of unrelated LNET routing related issues over in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14987&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-14987&lt;/a&gt; and as part of that, we did in fact restart a large number of both the clients and also all the servers in the filesystems.&lt;/p&gt;

&lt;p&gt;It was after these restarts that I noticed the above OSTs 74 to 79 that were finally able to mount and join the filesystem.&lt;/p&gt;

&lt;p&gt;Unfortunately that still leaves OST 73, which I see the exact same errors with as before.&lt;/p&gt;

&lt;p&gt;I will generate new debug logs for this OST on both OSS and MDS and try to share as before.&lt;/p&gt;</comment>
                            <comment id="313689" author="mrb" created="Wed, 22 Sep 2021 15:04:33 +0000"  >&lt;p&gt;Actually I didn&apos;t notice this in my previous logs, but I now see a failing call to &apos;mgs_revoke_lock()&apos; after attempting to start the missing OST 73:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Sep 22 15:46:53 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Sep 22 15:46:53 rds-mds7 kernel: LustreError: 5636:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
Sep 22 15:47:04 rds-mds7 kernel: Lustre: 5636:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1632322013/real 1632322013]  req@ffff962705810000 x1710915848718400/t0(0) o104-&amp;gt;MGS@10.47.2.177@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1632322024 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 22 15:47:04 rds-mds7 kernel: Lustre: 5636:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 22 15:47:07 rds-mds7 kernel: Lustre: rds-d6-MDT0002: haven&apos;t heard from client df0f1789-b047-52d8-233e-98e43aab8799 (at 10.47.2.177@o2ib1) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff95f07ed38c00, cur 1632322027 expire 1632321877 last 1632321800
Sep 22 15:47:07 rds-mds7 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 15:47:07 rds-mds7 kernel: LustreError: 5636:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can&apos;t take cfg lock for 0x36642d736472/0x0 : rc = -11
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Not sure if this last message is of any significance here?&lt;/p&gt;</comment>
                            <comment id="314553" author="eaujames" created="Sat, 2 Oct 2021 13:00:49 +0000"  >
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This line is displayed when a newly formatted target is first mounted but the requested target index is already allocated to another target in the MGS configuration.&lt;br/&gt;
In your case, perhaps the MGS registered OST0049 in its configuration but did not update OST0049&apos;s local configuration in return (because of a network issue, a locking issue, etc.).&lt;/p&gt;

&lt;p&gt;You can check the llog configuration on the MGS to verify whether your OST appears there, with (for example):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl --device MGS llog_print rds-d6-client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
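If the registration half-completed, the OST should show up in that log; a quick way to check (the grep is just a convenience, not required):

```shell
# Print the client config llog held by the MGS and search for the
# problem OST; a match means index 73 is already recorded there.
lctl --device MGS llog_print rds-d6-client | grep -i ost0049
```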

&lt;p&gt;Then you could try cleaning your MGS configuration with &quot;lctl clear_conf&quot;:&lt;br/&gt;
doc: &lt;a href=&quot;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.clear_conf&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.clear_conf&lt;/a&gt;&lt;/p&gt;
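As a hedged sketch of that procedure (the device path and mount point below are placeholders, and the filesystem must be stopped first; follow the linked manual section for the exact steps):

```shell
# With all targets and clients stopped, mount the MGS device, then drop
# the invalidated configuration records for the filesystem or one target.
mount -t lustre /dev/mgs_device /mnt/mgs   # /dev/mgs_device is a placeholder
lctl clear_conf rds-d6                     # or a single target, e.g. rds-d6-OST0049
umount /mnt/mgs
```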

&lt;p&gt;Then you could try to replace the existing OST target in the configuration with your new one: &quot;mkfs.lustre --replace&quot;&lt;br/&gt;
doc: &lt;a href=&quot;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.restore_ost&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.restore_ost&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="318711" author="gerrit" created="Fri, 19 Nov 2021 23:12:09 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/45626&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45626&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14802&quot; title=&quot;MGS configuration problems - cannot add new OST, change parameters, hanging&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14802&quot;&gt;&lt;del&gt;LU-14802&lt;/del&gt;&lt;/a&gt; nodemap: return proper error code&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 05a33d6c10921b630237c570a5b989463e90ca65&lt;/p&gt;</comment>
                            <comment id="324523" author="gerrit" created="Mon, 31 Jan 2022 01:43:55 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/45626/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45626/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14802&quot; title=&quot;MGS configuration problems - cannot add new OST, change parameters, hanging&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14802&quot;&gt;&lt;del&gt;LU-14802&lt;/del&gt;&lt;/a&gt; nodemap: return proper error code&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7d6cddff24de2f79e539eae462554dc21e674511&lt;/p&gt;</comment>
                            <comment id="324545" author="pjones" created="Mon, 31 Jan 2022 04:42:20 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="65246">LU-14855</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="39372" name="lustre-log.1624829616.12985.gz" size="21547877" author="mrb" created="Thu, 1 Jul 2021 09:51:36 +0000"/>
                            <attachment id="39379" name="lustre-log.1625135231.204490.gz" size="2819482" author="mrb" created="Thu, 1 Jul 2021 10:59:41 +0000"/>
                            <attachment id="39373" name="rds-md7.kernel.2021-06-26.log.gz" size="66284" author="mrb" created="Thu, 1 Jul 2021 09:50:24 +0000"/>
                            <attachment id="39374" name="rds-oss59.kernel.2021-06-27.log.gz" size="4437" author="mrb" created="Thu, 1 Jul 2021 09:50:24 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01y7z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>