<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:41:19 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4282] some OSTs reported as inactive in lfs df, UP with lctl dl, data not accessible</title>
                <link>https://jira.whamcloud.com/browse/LU-4282</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;As indicated in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt;, I now have a problem on our preproduction file system that stops users from accessing their data, prevents servers from cleanly rebooting, and blocks any further testing.&lt;/p&gt;

&lt;p&gt;After upgrading the servers from 2.3 to 2.4.1 (MDT build #51 of b2_4 from jenkins), our clients can no longer fully access this file system. The clients can mount the file system and can access one OST on each of the two OSSes, but the other OSTs are not accessible: they are shown as inactive in lfs df output and in /proc/fs/lustre/lov/*/target_obd, yet are shown as UP in lctl dl.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-serv-07 ~]$ lctl dl |grep play01
 91 UP lov play01-clilov-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 4
 92 UP mdc play01-MDT0000-mdc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 93 UP osc play01-OST0000-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 94 UP osc play01-OST0001-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 95 UP osc play01-OST0002-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 96 UP osc play01-OST0003-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 97 UP osc play01-OST0004-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
 98 UP osc play01-OST0005-osc-ffff810076ae2000 9186608e-d432-283c-0e6e-47b800427d3e 5
[bnh65367@cs04r-sc-serv-07 ~]$ lfs df /mnt/play01
UUID                   1K-blocks        Used   Available Use% Mounted on
play01-MDT0000_UUID     78636320     3502948    75133372   4% /mnt/play01[MDT:0]
play01-OST0000_UUID   7691221300  4506865920  3184355380  59% /mnt/play01[OST:0]
play01-OST0001_UUID   7691221300  3765688064  3925533236  49% /mnt/play01[OST:1]
play01-OST0002_UUID : inactive device
play01-OST0003_UUID : inactive device
play01-OST0004_UUID : inactive device
play01-OST0005_UUID : inactive device

filesystem summary:  15382442600  8272553984  7109888616  54% /mnt/play01

[bnh65367@cs04r-sc-serv-07 ~]$ cat /proc/fs/lustre/lov/play01-clilov-ffff810076ae2000/target_obd 
0: play01-OST0000_UUID ACTIVE
1: play01-OST0001_UUID ACTIVE
2: play01-OST0002_UUID INACTIVE
3: play01-OST0003_UUID INACTIVE
4: play01-OST0004_UUID INACTIVE
5: play01-OST0005_UUID INACTIVE
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As expected, the fail-over OSS for each OST does see connection attempts and correctly reports that the OST is not available on that OSS.&lt;/p&gt;

&lt;p&gt;I have confirmed that the OSTs are mounted on the OSSes correctly.&lt;/p&gt;

&lt;p&gt;For the other client that I have tried to bring back, the situation is similar, but a slightly different set of OSTs is inactive:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-serv-06 ~]$ lfs df /mnt/play01
UUID                   1K-blocks        Used   Available Use% Mounted on
play01-MDT0000_UUID     78636320     3502948    75133372   4% /mnt/play01[MDT:0]
play01-OST0000_UUID : inactive device
play01-OST0001_UUID   7691221300  3765688064  3925533236  49% /mnt/play01[OST:1]
play01-OST0002_UUID   7691221300  1763305508  5927915792  23% /mnt/play01[OST:2]
play01-OST0003_UUID : inactive device
play01-OST0004_UUID : inactive device
play01-OST0005_UUID : inactive device

filesystem summary:  15382442600  5528993572  9853449028  36% /mnt/play01

[bnh65367@cs04r-sc-serv-06 ~]$ 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;play01-OST0000, play01-OST0002, play01-OST0004 are on one OSS;&lt;br/&gt;
play01-OST0001, play01-OST0003, play01-OST0005 are on a different OSS (but both OSSes share the same storage array).&lt;/p&gt;

&lt;p&gt;I have tested the network and don&apos;t see any errors; lnet_selftest between the clients and the OSSes runs at line rate, at least for the first client (a 1GigE client...), and there is nothing obvious on the second client either.&lt;/p&gt;

&lt;p&gt;For completeness I should probably mention that all the servers (MDS and OSSes) changed IP addresses at the same time as the upgrade. I have verified that the information is correctly updated on the targets, and both clients have been rebooted multiple times since the IP address change, without any change in the symptoms.&lt;/p&gt;
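The client-side state shown above can be picked out mechanically. The following is only a sketch (a hypothetical awk one-liner, not a Lustre tool); the sample input lines are copied from the lfs df listing in this report so the filter can be demonstrated without a live mount:

```shell
# Hypothetical helper: list the OSTs a client reports as inactive by
# parsing 'lfs df' output. Sample lines from the report stand in for a
# live 'lfs df /mnt/play01' run.
lfs_df_output='play01-OST0001_UUID   7691221300  3765688064  3925533236  49% /mnt/play01[OST:1]
play01-OST0002_UUID : inactive device
play01-OST0003_UUID : inactive device'

printf '%s\n' "$lfs_df_output" |
awk '/: inactive device/ { sub(/_UUID$/, "", $1); print $1 }'
```

On a live client the same filter can be applied directly to the output of lfs df.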
</description>
                <environment>MDS and OSS on Lustre 2.4.1, clients lustre 1.8.9, all Red Hat Enterprise Linux.</environment>
        <key id="22184">LU-4282</key>
            <summary>some OSTs reported as inactive in lfs df, UP with lctl dl, data not accessible</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="mdiep">Minh Diep</assignee>
                                    <reporter username="ferner">Frederik Ferner</reporter>
                        <labels>
                    </labels>
                <created>Wed, 20 Nov 2013 20:04:33 +0000</created>
                <updated>Thu, 20 Feb 2014 21:16:25 +0000</updated>
                            <resolved>Fri, 24 Jan 2014 16:20:48 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="72007" author="mdiep" created="Wed, 20 Nov 2013 23:23:11 +0000"  >&lt;p&gt;Hi Frederik,&lt;/p&gt;

&lt;p&gt;Could you show us the command you used to change IP address on the servers?&lt;/p&gt;</comment>
                            <comment id="72010" author="ferner" created="Wed, 20 Nov 2013 23:47:59 +0000"  >
&lt;p&gt;I unmounted the MDT, the MGT, all OSTs, and the two clients I&apos;m currently trying to use (other clients were left up to be rebooted later), changed the options etc. in /etc/modprobe.d/lustre.conf to bring up the correct NIDs on the servers, confirmed with lctl list_nids that the correct IPs were in place, and then ran this on the MDS/MGS:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;tunefs.lustre --erase-params --writeconf /dev/vg_play01/mgs
tunefs.lustre --erase-params --writeconf --mgsnode=172.23.144.5@tcp0 --mgsnode=172.23.144.6@tcp0 --servicenode=172.23.144.5@tcp0 --servicenode=172.23.144.6@tcp0 --param mdt.quota_type=ug --param mdt.group_upcall=/usr/sbin/l_getgroups --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl /dev/vg_play01/mdt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the OSSes I ran the following for each OST:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;tunefs.lustre --erase-params --writeconf --mgsnode=172.23.144.5@tcp0 --mgsnode=172.23.144.6@tcp0 --servicenode=172.23.144.14@tcp0 --servicenode=172.23.144.18@tcp0 --param ost.quota_type=ug /dev/mapper/ost_play01_0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I mounted first the MGS, then the MDT, then all OSTs, and then tried to bring those two clients back... I&apos;ve actually run those commands a few more times now, partly because the LBUG on the MDT seemed to confuse things...&lt;/p&gt;</comment>
                            <comment id="72011" author="mdiep" created="Wed, 20 Nov 2013 23:58:40 +0000"  >&lt;p&gt;Why did you include two mgsnode entries?&lt;/p&gt;

&lt;p&gt;could you show tunefs.lustre --dryrun &amp;lt;mdt&amp;gt;, and tunefs.lustre --dryrun &amp;lt;ost&amp;gt; ?&lt;/p&gt;</comment>
                            <comment id="72014" author="ferner" created="Thu, 21 Nov 2013 00:25:41 +0000"  >&lt;p&gt;Two mgsnode entries because the MGS will fail over between the same two machines as the MDT even though it is on a separate partition.&lt;/p&gt;

&lt;p&gt;tunefs.lustre --dryrun for mdt and ost:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-mds02-03 ~]$ sudo tunefs.lustre --dryrun /dev/mapper/vg_play01-mdt 
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-MDT0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1001
              (MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro,acl
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.5@tcp failover.node=172.23.144.6@tcp mdt.quota_type=ug mdt.group_upcall=/usr/sbin/l_getgroups


   Permanent disk data:
Target:     play01-MDT0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1001
              (MDT no_primnode )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro,acl
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.5@tcp failover.node=172.23.144.6@tcp mdt.quota_type=ug mdt.group_upcall=/usr/sbin/l_getgroups

exiting before disk write.
[bnh65367@cs04r-sc-mds02-03 ~]$ 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;OSTs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs04r-sc-oss01-04 ~]$ for i in /dev/mapper/ost_play01_* ; do sudo tunefs.lustre --dryrun $i ; done
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1402
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0000
Index:      0
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1402
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug

exiting before disk write.
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0001
Index:      1
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0001
Index:      1
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug

exiting before disk write.
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0002
Index:      2
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0002
Index:      2
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug

exiting before disk write.
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0003
Index:      3
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0003
Index:      3
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug

exiting before disk write.
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0004
Index:      4
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0004
Index:      4
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug

exiting before disk write.
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     play01-OST0005
Index:      5
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug


   Permanent disk data:
Target:     play01-OST0005
Index:      5
Lustre FS:  play01
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.23.144.5@tcp mgsnode=172.23.144.6@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug

exiting before disk write.
[bnh65367@cs04r-sc-oss01-04 ~]$ 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
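One thing worth comparing across the dry-run output above is the failover.node ordering per target. A hypothetical awk sketch (fed a few sample lines from the dump, not a real tunefs.lustre invocation) summarises each target with its failover NIDs in order:

```shell
# Hypothetical summary: print each target together with its failover.node
# NIDs, in order, from 'tunefs.lustre --dryrun' output. Sample lines from
# the dump above stand in for the full output.
printf '%s\n' \
  'Target:     play01-OST0000' \
  'Parameters: mgsnode=172.23.144.5@tcp failover.node=172.23.144.14@tcp failover.node=172.23.144.18@tcp ost.quota_type=ug' \
  'Target:     play01-OST0001' \
  'Parameters: mgsnode=172.23.144.5@tcp failover.node=172.23.144.18@tcp failover.node=172.23.144.14@tcp ost.quota_type=ug' |
awk '
  /^Target:/ { target = $2 }
  /^Parameters:/ {
    n = split($0, f, "failover.node=")
    out = target ":"
    for (i = 2; i != n + 1; i++) { split(f[i], g, " "); out = out " " g[1] }
    print out
  }'
```

Note that in real dry-run output each target prints its Parameters line twice (previous values and permanent disk data), so each target would appear twice in the summary.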
                            <comment id="72016" author="mdiep" created="Thu, 21 Nov 2013 00:32:36 +0000"  >&lt;p&gt;I don&apos;t understand &quot;MGS will fail over between the same two machines as the MDT even though it is on a separate partition.&quot;&lt;/p&gt;

&lt;p&gt;I assume that cs04r-sc-mds02-03 = 172.23.144.5@tcp&lt;/p&gt;

&lt;p&gt;and another-mds-host = 172.23.144.6@tcp&lt;/p&gt;

&lt;p&gt;are these two sharing the same storage/device?&lt;/p&gt;</comment>
                            <comment id="72017" author="ferner" created="Thu, 21 Nov 2013 00:36:34 +0000"  >&lt;p&gt;Sorry, should have provided a bit more background...&lt;/p&gt;

&lt;p&gt;The file system has two OSS in active-active configuration, with the new IPs 172.23.144.14 and 172.23.144.18 sharing a storage array. For the MDS we also have two servers sharing a storage array, the new IPs for those are indeed 172.23.144.5 (cs04r-sc-mds02-03) and 172.23.144.6 (cs04r-sc-mds02-04).&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Frederik&lt;/p&gt;</comment>
                            <comment id="72019" author="mdiep" created="Thu, 21 Nov 2013 00:49:39 +0000"  >&lt;p&gt;OK, thanks. Ah, I also see you share the mgs? Or is it a typo?&lt;/p&gt;

&lt;p&gt;tunefs.lustre --erase-params --writeconf /dev/vg_play01/mgs  &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt;&lt;br/&gt;
tunefs.lustre --erase-params --writeconf --mgsnode=172.23.144.5@tcp0 --mgsnode=172.23.144.6@tcp0 --servicenode=172.23.144.5@tcp0 --servicenode=172.23.144.6@tcp0 --param mdt.quota_type=ug --param mdt.group_upcall=/usr/sbin/l_getgroups --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl /dev/vg_play01/mdt&lt;/p&gt;

&lt;p&gt;If you have combined mgs/mdt, then you should only have 1 mgsnode&lt;br/&gt;
tunefs.lustre --erase-params --writeconf /dev/vg_play01/mdt&lt;br/&gt;
tunefs.lustre --writeconf --mgsnode=172.23.144.6@tcp0 --servicenode=172.23.144.5@tcp0 --servicenode=172.23.144.6@tcp0 --param mdt.quota_type=ug --param mdt.group_upcall=/usr/sbin/l_getgroups --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl --mgs --mdt /dev/vg_play01/mdt&lt;/p&gt;

&lt;p&gt;Note: no --erase-params on second tunefs.lustre cmd&lt;/p&gt;
</comment>
                            <comment id="72020" author="mdiep" created="Thu, 21 Nov 2013 00:51:28 +0000"  >&lt;p&gt;Don&apos;t forget to unmount all clients and OSTs while you --writeconf the MDS.&lt;/p&gt;

&lt;p&gt;Then run&lt;br/&gt;
tunefs.lustre --writeconf --ost /dev/mapper/ost_play01_0&lt;/p&gt;</comment>
                            <comment id="72052" author="ferner" created="Thu, 21 Nov 2013 18:04:24 +0000"  >&lt;p&gt;The MGS is on the same shared storage as the MDT, same LVM volume group but a separate logical volume.&lt;/p&gt;

&lt;p&gt;So I think I need your first set of commands, though I don&apos;t see how they are much different from mine. In any case, I&apos;ve run them again after unmounting everything and brought everything back up; no change.&lt;/p&gt;

&lt;p&gt;This time I noticed the following -16 errors in the logs; I assume they occur because the OSTs are still in recovery, but I thought I&apos;d mention them. There is also an initial error about communication with 0@lo that I don&apos;t recall seeing before.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: Lustre: MGS: Logs for fs play01 were removed by user request.  All servers must be restarted in order to regenerate the logs.
Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: Lustre: Setting parameter play01-MDT0000.mdt.quota_type in log play01-MDT0000
Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: Lustre: Skipped 1 previous similar message
Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: Lustre: play01-MDT0000: used disk, loading
Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: Lustre: 4012:0:(mdt_handler.c:4948:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
Nov 21 17:30:06 cs04r-sc-mds02-03 kernel: LustreError: 11-0: play01-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Nov 21 17:30:32 cs04r-sc-mds02-03 kernel: Lustre: MGS: Regenerating play01-OST0000 log by user request.
Nov 21 17:30:32 cs04r-sc-mds02-03 kernel: Lustre: Setting parameter play01-OST0000.ost.quota_type in log play01-OST0000
Nov 21 17:30:32 cs04r-sc-mds02-03 kernel: Lustre: Skipped 1 previous similar message
Nov 21 17:30:38 cs04r-sc-mds02-03 kernel: Lustre: MGS: Regenerating play01-OST0001 log by user request.
Nov 21 17:30:38 cs04r-sc-mds02-03 kernel: Lustre: Setting parameter play01-OST0001.ost.quota_type in log play01-OST0001
Nov 21 17:30:39 cs04r-sc-mds02-03 kernel: LustreError: 11-0: play01-OST0000-osc-MDT0000: Communicating with 172.23.144.14@tcp, operation ost_connect failed with -16.
Nov 21 17:30:54 cs04r-sc-mds02-03 kernel: Lustre: MGS: Regenerating play01-OST0002 log by user request.
Nov 21 17:30:54 cs04r-sc-mds02-03 kernel: Lustre: Setting parameter play01-OST0002.ost.quota_type in log play01-OST0002
Nov 21 17:31:02 cs04r-sc-mds02-03 kernel: LustreError: 11-0: play01-OST0003-osc-MDT0000: Communicating with 172.23.144.18@tcp, operation ost_connect failed with -16.
Nov 21 17:31:02 cs04r-sc-mds02-03 kernel: LustreError: Skipped 1 previous similar message
Nov 21 17:31:07 cs04r-sc-mds02-03 kernel: Lustre: MGS: Regenerating play01-OST0004 log by user request.
Nov 21 17:31:07 cs04r-sc-mds02-03 kernel: Lustre: Skipped 1 previous similar message
Nov 21 17:31:10 cs04r-sc-mds02-03 kernel: LustreError: 11-0: play01-OST0001-osc-MDT0000: Communicating with 172.23.144.18@tcp, operation ost_connect failed with -16.
Nov 21 17:31:11 cs04r-sc-mds02-03 kernel: Lustre: Setting parameter play01-OST0005.ost.quota_type in log play01-OST0005
Nov 21 17:31:11 cs04r-sc-mds02-03 kernel: Lustre: Skipped 2 previous similar messages
Nov 21 17:31:44 cs04r-sc-mds02-03 kernel: LustreError: 11-0: play01-OST0000-osc-MDT0000: Communicating with 172.23.144.14@tcp, operation ost_connect failed with -16.
Nov 21 17:31:44 cs04r-sc-mds02-03 kernel: LustreError: Skipped 2 previous similar messages
Nov 21 17:32:34 cs04r-sc-mds02-03 kernel: Lustre: play01-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Nov 21 17:32:34 cs04r-sc-mds02-03 kernel: Lustre: play01-MDT0000: Denying connection for new client 45cd72fa-56c3-f257-0ed7-154d629ee603 (at 172.23.136.7@tcp), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
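The negative codes in the log above are negated Linux errno values: -11 is EAGAIN and -16 is EBUSY (consistent with the target still being busy in recovery). A hypothetical awk sketch that annotates the two codes seen in this log, fed two sample lines copied from the syslog above:

```shell
# Hypothetical helper: annotate Lustre 'failed with -N' log lines with the
# matching errno name. The lookup table only covers the two codes seen in
# this log (-11 = EAGAIN, -16 = EBUSY).
printf '%s\n' \
  'LustreError: 11-0: play01-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.' \
  'LustreError: 11-0: play01-OST0000-osc-MDT0000: Communicating with 172.23.144.14@tcp, operation ost_connect failed with -16.' |
awk '
  BEGIN { name["-11."] = "EAGAIN"; name["-16."] = "EBUSY" }
  /failed with/ { print $NF, name[$NF] }'
```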
                            <comment id="72075" author="ferner" created="Thu, 21 Nov 2013 21:20:24 +0000"  >&lt;p&gt;OK, more testing showed that for all OSTs shown as inactive on the clients, the fail-over OSS was seeing connection attempts. So failing those OSTs over to the other node makes them available on the clients; however, failing over OSTs that were active makes them unavailable on the clients.&lt;/p&gt;

&lt;p&gt;So to make this clearer:&lt;/p&gt;

&lt;p&gt;Initially the OSTs were distributed like this:&lt;/p&gt;

&lt;p&gt;OSTs mounted on 172.23.144.14: play01-OST0000, play01-OST0002, play01-OST0004&lt;br/&gt;
OSTs mounted on 172.23.144.18: play01-OST0001, play01-OST0003, play01-OST0005&lt;/p&gt;

&lt;p&gt;In this configuration the clients were only able to access play01-OST0000 and play01-OST0001.&lt;/p&gt;

&lt;p&gt;The following distribution of OSTs makes them available on both clients I tested today:&lt;/p&gt;

&lt;p&gt;OSTs mounted on 172.23.144.14: play01-OST0000, play01-OST0003, play01-OST0005&lt;br/&gt;
OSTs mounted on 172.23.144.18: play01-OST0001, play01-OST0002, play01-OST0004&lt;/p&gt;

&lt;p&gt;As soon as any of the OSTs is mounted on the other OSS, it appears that none of the clients will connect to it (with a possible exception of clients that have not been rebooted recently; unloading the Lustre modules and starting again on such a client seems to bring it into the first category).&lt;/p&gt;

&lt;p&gt;The same parameters/failnode setup works without problems so far on our other file systems, where all servers are still running 1.8.&lt;/p&gt;</comment>
                            <comment id="72240" author="mdiep" created="Mon, 25 Nov 2013 16:52:45 +0000"  >&lt;p&gt;Hi Frederik,&lt;/p&gt;

&lt;p&gt;Is it working now? I believe there might be a small step that we missed somewhere during the process. Please let me know if everything is working. Thanks&lt;/p&gt;</comment>
                            <comment id="72248" author="ferner" created="Mon, 25 Nov 2013 20:02:39 +0000"  >&lt;p&gt;Minh,&lt;/p&gt;

&lt;p&gt;it is sort of working. I have one configuration/setup where all OSTs can be accessed by all the clients I&apos;ve tried to bring up. However, if I try to bring any of the OSTs up on a different OSS than the one they are on now, none of my clients even tries to contact that OSS. Recovery doesn&apos;t even start...&lt;/p&gt;

&lt;p&gt;So I wouldn&apos;t say everything is working, but the urgency is lower as we have a workaround (which is valid only until one of the servers fails...).&lt;/p&gt;

&lt;p&gt;I would appreciate help in fully resolving this. Let me know if there are any diagnostics that I should provide...&lt;/p&gt;

&lt;p&gt;Kind regards,&lt;br/&gt;
Frederik&lt;/p&gt;</comment>
                            <comment id="72252" author="mdiep" created="Mon, 25 Nov 2013 20:43:05 +0000"  >&lt;p&gt;This seems to relate to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;. could you remount the mdt with ldiskfs and dump the config log?&lt;/p&gt;</comment>
                            <comment id="72383" author="ferner" created="Wed, 27 Nov 2013 11:40:59 +0000"  >&lt;p&gt;CONFIGS directories for MDT and MGS,  including llog_reader output&lt;/p&gt;</comment>
                            <comment id="72384" author="ferner" created="Wed, 27 Nov 2013 11:41:19 +0000"  >&lt;p&gt;Minh,&lt;/p&gt;

&lt;p&gt;I wasn&apos;t quite sure which logs you wanted, so I remounted both the MDT and the MGS with ldiskfs, copied all files in the CONFIGS directories to a different location, and ran llog_reader over them. The result is in the attached file.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Frederik&lt;/p&gt;</comment>
                            <comment id="72545" author="mdiep" created="Sat, 30 Nov 2013 01:04:02 +0000"  >&lt;p&gt;Hongchao,&lt;/p&gt;

&lt;p&gt;Could you check if this is a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;? Thanks&lt;/p&gt;</comment>
                            <comment id="72546" author="mdiep" created="Sat, 30 Nov 2013 01:09:16 +0000"  >&lt;p&gt;I looked at the client log&lt;/p&gt;

&lt;p&gt;Header size : 8192&lt;br/&gt;
Time : Thu Nov 21 20:51:25 2013&lt;br/&gt;
Number of records: 21&lt;br/&gt;
Target uuid : play01-client &lt;br/&gt;
-----------------------&lt;br/&gt;
#01 (224)marker   4 (flags=0x01, v2.4.1.0) play01-clilov   &apos;lov setup&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#02 (120)attach    0:play01-clilov  1:lov  2:play01-clilov_UUID  &lt;br/&gt;
#03 (168)lov_setup 0:play01-clilov  1:(struct lov_desc)&lt;br/&gt;
		uuid=play01-clilov_UUID  stripe:cnt=1 size=1048576 offset=18446744073709551615 pattern=0x1&lt;br/&gt;
#04 (224)marker   4 (flags=0x02, v2.4.1.0) play01-clilov   &apos;lov setup&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#05 (224)marker   5 (flags=0x01, v2.4.1.0) play01-clilmv   &apos;lmv setup&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#06 (120)attach    0:play01-clilmv  1:lmv  2:play01-clilmv_UUID  &lt;br/&gt;
#07 (168)lov_setup 0:play01-clilmv  1:(struct lov_desc)&lt;br/&gt;
		uuid=play01-clilmv_UUID  stripe:cnt=0 size=0 offset=0 pattern=0&lt;br/&gt;
#08 (224)marker   5 (flags=0x02, v2.4.1.0) play01-clilmv   &apos;lmv setup&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#09 (224)marker   6 (flags=0x01, v2.4.1.0) play01-MDT0000  &apos;add mdc&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#10 (088)add_uuid  nid=172.23.144.5@tcp(0x20000ac179005)  0:  1:172.23.144.5@tcp  &lt;br/&gt;
#11 (128)attach    0:play01-MDT0000-mdc  1:mdc  2:play01-clilmv_UUID  &lt;br/&gt;
#12 (144)setup     0:play01-MDT0000-mdc  1:play01-MDT0000_UUID  2:172.23.144.5@tcp  &lt;br/&gt;
#13 (088)add_uuid  nid=172.23.144.5@tcp(0x20000ac179005)  0:  1:172.23.144.5@tcp  &lt;br/&gt;
#14 (112)add_conn  0:play01-MDT0000-mdc  1:172.23.144.5@tcp  &lt;br/&gt;
#15 (088)add_uuid  nid=172.23.144.6@tcp(0x20000ac179006)  0:  1:172.23.144.5@tcp  &lt;br/&gt;
#16 (112)add_conn  0:play01-MDT0000-mdc  1:172.23.144.5@tcp  &lt;br/&gt;
#17 (160)modify_mdc_tgts add 0:play01-clilmv  1:play01-MDT0000_UUID  2:0  3:1  4:play01-MDT0000-mdc_UUID  &lt;br/&gt;
#18 (224)marker   6 (flags=0x02, v2.4.1.0) play01-MDT0000  &apos;add mdc&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#19 (224)marker   7 (flags=0x01, v2.4.1.0) play01-client   &apos;mount opts&apos; Thu Nov 21 20:51:25 2013-&lt;br/&gt;
#20 (120)mount_option 0:  1:play01-client  2:play01-clilov  3:play01-clilmv  &lt;br/&gt;
#21 (224)marker   7 (flags=0x02, v2.4.1.0) play01-client   &apos;mount opts&apos; Thu Nov 21 20:51:25 2013-&lt;/p&gt;

&lt;p&gt;Line #15&apos;s add_uuid should have 172.23.144.6 as the second NID (the 1: field) instead of *.5.&lt;/p&gt;

&lt;p&gt;This is a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt; IMHO&lt;/p&gt;</comment>
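The mismatch called out above can be checked mechanically across a whole llog_reader dump: each add_uuid record names the NID twice, once in the nid=... field and once in the 1: column, and the two should agree. A hypothetical awk sketch, fed records #13 and #15 from the dump above as sample input:

```shell
# Hypothetical check: print add_uuid records from llog_reader output where
# the 'nid=' field and the '1:' column disagree (the symptom diagnosed in
# this ticket). Two sample records stand in for a real config log dump.
printf '%s\n' \
  '#13 (088)add_uuid  nid=172.23.144.5@tcp(0x20000ac179005)  0:  1:172.23.144.5@tcp' \
  '#15 (088)add_uuid  nid=172.23.144.6@tcp(0x20000ac179006)  0:  1:172.23.144.5@tcp' |
awk '/add_uuid/ {
  split($3, a, "\\(")          # a[1] = "nid=..." part before the hex NID
  sub(/^nid=/, "", a[1])       # NID from the nid= field
  uuid = $NF
  sub(/^1:/, "", uuid)         # NID from the 1: column
  if (a[1] != uuid) print "mismatch:", $1, a[1], "vs", uuid
}'
```

Record #13 is consistent and produces no output; record #15 is flagged.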
                            <comment id="73921" author="hongchao.zhang" created="Fri, 20 Dec 2013 09:58:39 +0000"  >&lt;p&gt;Yes, it should be a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hi Frederik, could you please try with the patch &lt;a href=&quot;http://review.whamcloud.com/#/c/8372/?&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8372/?&lt;/a&gt; thanks&lt;/p&gt;</comment>
                            <comment id="75560" author="mdiep" created="Fri, 24 Jan 2014 16:20:48 +0000"  >&lt;p&gt;dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4243&quot; title=&quot;multiple servicenodes or failnids: wrong client llog registration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4243&quot;&gt;&lt;del&gt;LU-4243&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="77523" author="jfc" created="Thu, 20 Feb 2014 21:10:19 +0000"  >&lt;p&gt;Frederik &amp;#8211; can I check whether this is now resolved? If so, I will mark it as such. Thanks ~ jfc.&lt;/p&gt;</comment>
                            <comment id="77525" author="jfc" created="Thu, 20 Feb 2014 21:16:25 +0000"  >&lt;p&gt;Frederik &amp;#8211; my error. I see this is already resolved, so no action required. ~ jfc.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="21965">LU-4242</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13880" name="play01_configs.tar.gz" size="8112" author="ferner" created="Wed, 27 Nov 2013 11:40:59 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw9tj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11756</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>