<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:22:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9023] Second opinion on MDT inode recovery requested</title>
                <link>https://jira.whamcloud.com/browse/LU-9023</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This is a sanity-check question. NSC sees no reason the method described below should not work, but due to the high impact a failure would have, we&apos;d like a second opinion. We have scheduled downtime to execute it Thursday next week, 26 Jan.&lt;/p&gt;

&lt;p&gt;To sort out the fallout of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8953&quot; title=&quot;ZFS-MDT 100% full. Request for verification of plan to fix&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8953&quot;&gt;&lt;del&gt;LU-8953&lt;/del&gt;&lt;/a&gt; (out of inodes on a ZFS MDT, solved by adding more disks to the pool) we need to recreate the original pool. The reason we ran out of inodes was that when the vendor sent us hardware for the latest expansion that was supposed to be equivalent to the last shipment, the SSDs had switched from reporting 512b blocks to 4k blocks. Since I had not hardcoded ashift, we ended up with 6-8 times fewer inodes, and this was missed in testing.&lt;/p&gt;

&lt;p&gt;There aren&apos;t enough slots in the MDSs to solve this by throwing HW at it as a permanent solution, so I need to move all data from pools with ashift=12 to pools with ashift=9. Do you see any problem with just doing the following:&lt;/p&gt;

&lt;p&gt;(The funny device names come from running LVM just to get more easily identifiable names)&lt;/p&gt;

&lt;p&gt;Unmount the filesystem on all nodes, then run something like this for each MDT that needs fixing:&lt;/p&gt;

&lt;p&gt;umount lustre-mdt0/fouo6&lt;br/&gt;
zfs snapshot lustre-mdt0/fouo6@copythis&lt;br/&gt;
zpool create lustre-mdt-tmp -o ashift=9 mirror \&lt;br/&gt;
    /dev/new_sdr/mdt_fouo6new_sdr \&lt;br/&gt;
    /dev/new_sdu/mdt_fouo6new_sdu&lt;br/&gt;
zfs send -R lustre-mdt0/fouo6@copythis | zfs recv lustre-mdt-tmp/fouo6tmp&lt;br/&gt;
zpool destroy lustre-REMOVETHIS-mdt0&lt;br/&gt;
zpool create lustre-mdt0/fouo6 -o ashift=9 \&lt;br/&gt;
    mirror /dev/mds9_sdm/mdt_fouo6_sdm /dev/mds9_sdn/mdt_fouo6_sdn \&lt;br/&gt;
    mirror /dev/mds9_sdo/mdt_fouo6_sdo /dev/mds9_sdp/mdt_fouo6_sdp&lt;br/&gt;
zfs send -R lustre-mdt-tmp/fouo6tmp@copythis | zfs recv lustre-mdt0/fouo6&lt;br/&gt;
mount -t lustre lustre-mdt0/fouo6 /mnt/lustre/local/fouo6&lt;br/&gt;
zpool destroy lustre-mdt-tmp&lt;/p&gt;

&lt;p&gt;The &quot;REMOVETHIS-&quot; inserted due to desktop copy-buffer paranoia should of course be removed before running.&lt;/p&gt;</description>
                <environment></environment>
        <key id="43008">LU-9023</key>
            <summary>Second opinion on MDT inode recovery requested</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="gabriele.paciucci">Gabriele Paciucci</assignee>
                                    <reporter username="zino">Peter Bortas</reporter>
                        <labels>
                    </labels>
                <created>Mon, 16 Jan 2017 14:56:37 +0000</created>
                <updated>Tue, 31 Jan 2017 13:50:56 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="180906" author="zino" created="Mon, 16 Jan 2017 15:42:17 +0000"  >&lt;p&gt;That create line is incorrect. It should be just &quot;zpool create lustre-mdt0&quot; without the extra filesystem part.&lt;/p&gt;</comment>
                            <comment id="181176" author="jgmitter" created="Wed, 18 Jan 2017 18:09:57 +0000"  >&lt;p&gt;Hi Zhiqi,&lt;/p&gt;

&lt;p&gt;Do you have any recommendation here?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="181178" author="jgmitter" created="Wed, 18 Jan 2017 18:13:13 +0000"  >&lt;p&gt;Peter,&lt;br/&gt;
While we wait for Zhiqi to comment, you can also see the commentary in &lt;a href=&quot;https://jira.whamcloud.com/browse/LUDOC-161&quot; title=&quot;document backup/restore process for ZFS backing filesystems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LUDOC-161&quot;&gt;&lt;del&gt;LUDOC-161&lt;/del&gt;&lt;/a&gt; as a 2nd source of information.&lt;br/&gt;
Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="181390" author="pjones" created="Thu, 19 Jan 2017 12:56:23 +0000"  >&lt;p&gt;Just a test to check access for &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=zino&quot; class=&quot;user-hover&quot; rel=&quot;zino&quot;&gt;zino&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="181391" author="zino" created="Thu, 19 Jan 2017 13:21:21 +0000"  >&lt;p&gt;Appreciated, Joseph,&lt;/p&gt;

&lt;p&gt;That doc, to my mind, confirms that we are on the right track with this procedure. I&apos;ll wait for Zhiqi to see if he has any further insight.&lt;/p&gt;

&lt;p&gt;(And thanks Peter, I can see the tickets again now.)&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Peter B&lt;/p&gt;</comment>
                            <comment id="181400" author="zhiqi" created="Thu, 19 Jan 2017 14:16:56 +0000"  >&lt;p&gt;Hi Peter B, &lt;/p&gt;

&lt;p&gt;Gabriele will try this procedure in an internal develop lab and update this ticket with his experience. We understand your timing, &quot;We have scheduled downtime to execute it Thursday next week, 26 Jan.&quot;&lt;/p&gt;

&lt;p&gt;We should have results in a day or two.&lt;/p&gt;

&lt;p&gt;Best Regards,&lt;br/&gt;
Zhiqi&lt;/p&gt;</comment>
                            <comment id="181401" author="gabriele.paciucci" created="Thu, 19 Jan 2017 14:22:16 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=zino&quot; class=&quot;user-hover&quot; rel=&quot;zino&quot;&gt;zino&lt;/a&gt;,&lt;br/&gt;
I would say maybe 3 or 4 days &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;BTW I&apos;m London-based, so we can organize a call to double-check the procedure.&lt;/p&gt;</comment>
                            <comment id="181418" author="zino" created="Thu, 19 Jan 2017 15:41:21 +0000"  >&lt;p&gt;Hi Gabriele,&lt;/p&gt;

&lt;p&gt;Sounds great. Let&apos;s check in on Monday and see how things have progressed. And we can schedule a call then if we feel it&apos;s needed.&lt;/p&gt;</comment>
                            <comment id="181576" author="gabriele.paciucci" created="Fri, 20 Jan 2017 14:58:49 +0000"  >&lt;p&gt;Okay, this is a first procedure that doesn&apos;t need a second pool; we save our backup in a gzip file. In my environment, I have the MDT and MGT in the same pool.&lt;br/&gt;
The name of the pool is &lt;tt&gt;MDS&lt;/tt&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#  zfs snap -r MDS@backup

#    zfs list -t snapshot
NAME              USED  AVAIL  REFER  MOUNTPOINT
MDS@backup           0      -    96K  -
MDS/mdt0@backup      0      -   489M  -
MDS/mgt@backup       0      -  5.27M  -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#  zfs send -R MDS@backup | gzip &amp;gt; /tmp/backup.gz

#  zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
MDS            495M   360G    96K  /MDS
MDS/mdt0       489M   360G   489M  /MDS/mdt0
MDS/mgt       5.27M   360G  5.27M  /MDS/mgt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# zpool destroy MDS

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#  zpool create -o ashift=9 MDS mirror  /dev/sdc /dev/sde

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
MDS       372G    50K   372G         -     0%     0%  1.00x  ONLINE  -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#   zcat /tmp/backup.gz | zfs recv -F MDS

# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
MDS            121M   360G    19K  /MDS
MDS/mdt0       118M   360G   118M  /MDS/mdt0
MDS/mgt       3.22M   360G  3.22M  /MDS/mgt
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#   zfs list -t snapshot
NAME              USED  AVAIL  REFER  MOUNTPOINT
MDS@backup           0      -    19K  -
MDS/mdt0@backup      0      -   118M  -
MDS/mgt@backup       0      -  3.22M  -
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#  mount -t lustre MDS/mgt /mnt/mgt/
#  mount -t lustre MDS/mdt0 /mnt/mdt0
# df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda4      897134592 3685648 893448944   1% /
devtmpfs        32823216       0  32823216   0% /dev
tmpfs           32836956   39648  32797308   1% /dev/shm
tmpfs           32836956    9444  32827512   1% /run
tmpfs           32836956       0  32836956   0% /sys/fs/cgroup
/dev/sda2       10471424  176380  10295044   2% /boot
/dev/sda1        1046516    9644   1036872   1% /boot/efi
tmpfs            6567392       0   6567392   0% /run/user/0
MDS/mgt        374806016    3328 374800640   1% /mnt/mgt
MDS/mdt0       374922752  120960 374799744   1% /mnt/mdt0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


</comment>
                            <comment id="181577" author="gabriele.paciucci" created="Fri, 20 Jan 2017 15:00:09 +0000"  >&lt;p&gt;Do you need the same procedure using another zpool ?&lt;/p&gt;</comment>
                            <comment id="181583" author="zino" created="Fri, 20 Jan 2017 15:30:39 +0000"  >&lt;p&gt;Not really. I like your method better. It does invalidate some of my testing, though, so I&apos;ll run some new tests over the weekend.&lt;/p&gt;</comment>
                            <comment id="181585" author="gabriele.paciucci" created="Fri, 20 Jan 2017 15:44:03 +0000"  >&lt;p&gt;Okay, I&apos;m now on hold waiting for your feedback. &lt;/p&gt;</comment>
                            <comment id="181760" author="zino" created="Mon, 23 Jan 2017 16:43:46 +0000"  >&lt;p&gt;Hi Gabriele,&lt;/p&gt;

&lt;p&gt;The weekend&apos;s tests look good. I have some tests I will run overnight, and I will lock down the plans tomorrow. A couple of questions:&lt;/p&gt;

&lt;p&gt;1. Did you have any reason to think sending the whole pool would be better than sending individual filesystems, other than that it was easier because you also had the MGT there? Unless there is a reason not to, I will send the individual filesystems, purely for clarity&apos;s sake. The pools have anonymous names while the MDTs are named after the filesystems. I will be doing this for 3 pools on the same machine, so keeping the names reduces the chance of recv:ing or destroying the wrong filesystem. These will be the actual sends on my end:&lt;/p&gt;

&lt;p&gt;    zfs send -vR lustre-mdt0/fouo6@copythis | gzip &amp;gt; /lustre-mdt-tmpfs/mds0-fouo6.gz &lt;br/&gt;
    zfs send -vR lustre-mdt1/rossby20@copythis | gzip &amp;gt; /lustre-mdt-tmpfs/mds1-rossby20.gz &lt;br/&gt;
    zfs send -vR lustre-mdt2/smhid13@copythis | gzip &amp;gt; /lustre-mdt-tmpfs/mds2-smhid13.gz &lt;/p&gt;

&lt;p&gt;2. I will not be moving the MGT from ashift=12 to ashift=9. Will this cause any problems? I know the question is borderline insane, but this is really the original reason I opened this ticket with you. I&apos;m OK with sorting out everything on the zfs level, but I&apos;m trying to fish for half-insane things like hard-coded offsets, set at MDT creation time based on the number of blocks, somewhere deep in Lustre.&lt;/p&gt;</comment>
                            <comment id="181825" author="pjones" created="Mon, 23 Jan 2017 23:56:31 +0000"  >&lt;p&gt;Peter&lt;/p&gt;

&lt;p&gt;Gabriele is unexpectedly out of the office at short notice. Can this wait until he is available again (hopefully next week)?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="181882" author="zino" created="Tue, 24 Jan 2017 12:32:27 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;That&apos;s unfortunate. Of course it&apos;s technically possible to delay this to another week, but it is now too late to cancel this week&apos;s scheduled cluster downtime. I will also have to mount the filesystems ro for a few weeks, since the users will run out of inodes before the next window.&lt;/p&gt;

&lt;p&gt;I&apos;d be happy with an answer to just this question: as far as Intel engineers know, is there anything in the filesystem that stores a structure that would be affected by a change in block size, i.e. that could cause problems during this data move? We&apos;ll assume for the sake of this discussion that I&apos;ll be able to flawlessly take care of the bit shuffling on disk.&lt;/p&gt;</comment>
                            <comment id="182229" author="gabriele.paciucci" created="Thu, 26 Jan 2017 09:06:24 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=zino&quot; class=&quot;user-hover&quot; rel=&quot;zino&quot;&gt;zino&lt;/a&gt;,&lt;br/&gt;
I&apos;m back... sorry for this. I don&apos;t know if this is too late:&lt;br/&gt;
1. Yes, I was sending the whole pool for that reason, but testing with only the individual volume also worked. Still, I suggest that having a backup of the whole file system is not a bad idea... just in case.&lt;br/&gt;
2. We are not expecting any performance or large capacity requirements on the MGT, so I don&apos;t see any problem with leaving it at the original ashift.&lt;/p&gt;

&lt;p&gt;Making Lustre decide the ashift layout at format time is something that maybe &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=adilger&quot; class=&quot;user-hover&quot; rel=&quot;adilger&quot;&gt;adilger&lt;/a&gt; can evaluate. I&apos;m not sure whether Lustre can evaluate the physical layout of the disks.&lt;/p&gt;</comment>
                            <comment id="182266" author="zino" created="Thu, 26 Jan 2017 15:43:43 +0000"  >&lt;p&gt;Hi Gabriele,&lt;/p&gt;

&lt;p&gt;You are in time. We got a bit delayed by hardware failing elsewhere in the cluster, so the procedure has only just started. We&apos;ll know today whether I lost the filesystems or not.&lt;/p&gt;

&lt;p&gt;I&apos;ll make an extra backup of the whole filesystems. It just adds about 1h to the procedure, and that&apos;s worth it.&lt;/p&gt;

&lt;p&gt;I don&apos;t think the formatting tools really need any intelligence here; this was an operator error. But if there are no performance problems with running ashift=9 on 4k-block SSDs in the general case, it might be a good idea to default to ashift=9 there. In my tests I&apos;ve not seen any performance advantage outside of the error margin from using ashift=12 on SSDs on the MDS.&lt;/p&gt;</comment>
                            <comment id="182321" author="gabriele.paciucci" created="Thu, 26 Jan 2017 21:28:46 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=zino&quot; class=&quot;user-hover&quot; rel=&quot;zino&quot;&gt;zino&lt;/a&gt;,&lt;br/&gt;
you don&apos;t see a performance improvement because the bottleneck is in the code and not in the underlying hw performance. We saw very different performance on the OSTs when not using ashift=12.&lt;/p&gt;</comment>
                            <comment id="182556" author="zino" created="Mon, 30 Jan 2017 16:09:00 +0000"  >&lt;p&gt;This operation was somewhat delayed by unrelated failures in one of the attached compute clusters, but completed without problems on Friday.&lt;/p&gt;

&lt;p&gt;I have noted one oddity with ZFS snapshots today, but nothing that affects production. I&apos;ll try to figure out that one by tomorrow and then we can close this.&lt;/p&gt;</comment>
                            <comment id="182746" author="zino" created="Tue, 31 Jan 2017 13:50:56 +0000"  >&lt;p&gt;The ZFS oddity seems to be unrelated to the recreation of the filesystems, so I&apos;ll track that separately if needed.&lt;/p&gt;

&lt;p&gt;This concludes this issue from my side. Thanks for the help everyone!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="19505">LUDOC-161</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz0sf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>