<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:53:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12574] Replicating lustre&apos;s metadata only with changelog records + DNEv2</title>
                <link>https://jira.whamcloud.com/browse/LU-12574</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I recently noticed that RENME changelog records are emitted by the MDT of the target directory (and only this MDT):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#&amp;gt; MDSCOUNT=2 lustre/tests/llmount.sh
#&amp;gt; lctl --device lustre-MDT0000 changelog_register
#&amp;gt; lctl --device lustre-MDT0001 changelog_register
#&amp;gt;
#&amp;gt; lfs mkdir -i 0 /mnt/lustre/mdt-0
#&amp;gt; lfs mkdir -i 1 /mnt/lustre/mdt-1
#&amp;gt; touch /mnt/lustre/mdt-0/file-0
#&amp;gt; mv /mnt/lustre/mdt-0/file-0 /mnt/lustre/mdt-1
#&amp;gt;
#&amp;gt; lfs changelog lustre-MDT0000
1 02MKDIR 14:28:45.489241745 2019.07.22 0x0 t=[0x200000402:0x1:0x0] j=lt-lfs.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000007:0x1:0x0] mdt-0
2 01CREAT 14:28:54.679826073 2019.07.22 0x0 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000402:0x1:0x0] file-0
3 11CLOSE 14:28:54.717755997 2019.07.22 0x42 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=10.200.0.1@tcp
#&amp;gt; lfs changelog lustre-MDT0001
1 02MKDIR 14:28:48.788225263 2019.07.22 0x0 t=[0x240000402:0x1:0x0] j=lt-lfs.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000007:0x1:0x0] mdt-1
2 08RENME 14:29:03.315736883 2019.07.22 0x0 t=[0:0x0:0x0] j=mv.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x240000402:0x1:0x0] file-0 s=[0x200000402:0x2:0x0] sp=[0x200000402:0x1:0x0] file-0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The HLINK changelog record behaves similarly. But CREAT and UNLNK are still emitted by the MDT that is in charge of them.&lt;/p&gt;

&lt;p&gt;Now, I may be wrong, but I think that for an application that relies on changelog records to mirror a filesystem&apos;s metadata (e.g. RobinHood), it is not always possible to order those records. At the very least, I would argue that performance would take a serious hit, as changelog consumers would have to synchronize with one another.&lt;/p&gt;

&lt;p&gt;As an example, consider the following series of renames: &lt;tt&gt;A/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;B/file&lt;/tt&gt;, &lt;tt&gt;B/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;A/file&lt;/tt&gt;; where &lt;tt&gt;A&lt;/tt&gt; and &lt;tt&gt;B&lt;/tt&gt; are directories that live on different MDTs. Let &lt;tt&gt;R1&lt;/tt&gt; and &lt;tt&gt;R2&lt;/tt&gt; be the changelog consumers for &lt;tt&gt;A&lt;/tt&gt;&apos;s MDT and &lt;tt&gt;B&lt;/tt&gt;&apos;s MDT, respectively.&lt;/p&gt;

&lt;p&gt;Without any synchronization between &lt;tt&gt;R1&lt;/tt&gt; and &lt;tt&gt;R2&lt;/tt&gt;, &lt;tt&gt;R1&lt;/tt&gt; may process its RENME record first: it will try to delete &lt;tt&gt;B/file&lt;/tt&gt; from whatever backend it uses, and then insert (/create) &lt;tt&gt;A/file&lt;/tt&gt;. If &lt;tt&gt;R1&lt;/tt&gt; chooses to fail the transaction because &lt;tt&gt;B/file&lt;/tt&gt; does not exist, it effectively waits for &lt;tt&gt;R2&lt;/tt&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;*&amp;#93;&lt;/span&gt;. Otherwise, some time later, &lt;tt&gt;R2&lt;/tt&gt; discovers its own RENME record: it will delete &lt;tt&gt;A/file&lt;/tt&gt; from its backend and add &lt;tt&gt;B/file&lt;/tt&gt; to it.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;*&amp;#93;&lt;/span&gt; And even then, if &lt;tt&gt;R1&lt;/tt&gt; waits for &lt;tt&gt;B/file&lt;/tt&gt; to exist before deleting it, it is possible that the original series of renames is followed by: &lt;tt&gt;A/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;C/file&lt;/tt&gt;; in which case, &lt;tt&gt;R3&lt;/tt&gt; the changelog consumer for &lt;tt&gt;C&lt;/tt&gt;&apos;s MDT will compete with &lt;tt&gt;R2&lt;/tt&gt; when trying to delete &lt;tt&gt;A/file&lt;/tt&gt; (if &lt;tt&gt;R3&lt;/tt&gt; wins, &lt;tt&gt;R1&lt;/tt&gt; is stuck forever).&lt;/p&gt;

&lt;p&gt;I think it is possible to solve this by requiring that CREAT and UNLNK records are matched with CREAT-TO and UNLNK-TO records emitted by the MDT of the affected entry&apos;s parent. And for renames, that an additional RENME-FROM record is emitted on the source directory&apos;s MDT. &lt;span class=&quot;error&quot;&gt;&amp;#91;**&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;**&amp;#93;&lt;/span&gt; I leave it to someone else to find good names for these record types. =)&lt;/p&gt;

&lt;p&gt;This would allow a changelog consumer to process any changelog record it sees without the need to synchronize with any other consumer.&lt;/p&gt;

&lt;p&gt;In the previous example: &lt;tt&gt;R1&lt;/tt&gt; can still see its RENME record first but not before it sees the RENME-FROM (emitted by &lt;tt&gt;A/file&lt;/tt&gt; --&amp;gt;&#160;&lt;tt&gt;B/file&lt;/tt&gt;). In this case, it will necessarily delete&#160;&lt;tt&gt;A/file&lt;/tt&gt; before re-inserting it. R2 does something similar in that it will necessarily insert &lt;tt&gt;B/file&lt;/tt&gt; before it deletes it. The processing becomes rather simple: on RENME create the target, on RENME-FROM delete the source.&lt;/p&gt;</description>
                <environment></environment>
        <key id="56464">LU-12574</key>
            <summary>Replicating lustre&apos;s metadata only with changelog records + DNEv2</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="adilger">Andreas Dilger</assignee>
                                    <reporter username="cealustre">CEA</reporter>
                        <labels>
                    </labels>
                <created>Mon, 22 Jul 2019 15:55:02 +0000</created>
                <updated>Wed, 5 Aug 2020 13:50:23 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="251845" author="adilger" created="Mon, 22 Jul 2019 22:24:38 +0000"  >&lt;p&gt;My understanding of the &lt;tt&gt;RENME&lt;/tt&gt; record is that if the &lt;tt&gt;A/file -&amp;gt; B/file&lt;/tt&gt; rename deleted an existing target, then a &lt;tt&gt;t=&lt;span class=&quot;error&quot;&gt;&amp;#91;FID&amp;#93;&lt;/span&gt;&lt;/tt&gt; (target) field is included in that record, like the following record for a (local) rename &quot;&lt;tt&gt;mv list list.old&lt;/tt&gt;&quot; in directory &lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x2000061c1:0x87d:0x0&amp;#93;&lt;/span&gt;&lt;/tt&gt; that deletes the &lt;tt&gt;list.old&lt;/tt&gt; target with FID &lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0x20001ffd1:0x3a4:0x0&amp;#93;&lt;/span&gt;&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;66701284 08RENME 10:28:34.579612216 2019.07.22 0x1 t=[0x20001ffd1:0x3a4:0x0] j=mv.500 p=[0x2000061c1:0x87d:0x0] list.old s=[0x20001ffd1:0x3c5:0x0] sp=[0x2000061c1:0x87d:0x0] list
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Otherwise, the rename did not delete a target file, and if one is found then the resync operation should be treated with suspicion. &lt;/p&gt;

&lt;p&gt;In your example above, the &lt;tt&gt;R1&lt;/tt&gt; consumer knows whether the &lt;tt&gt;A/file -&amp;gt; B/file&lt;/tt&gt; rename (with &lt;tt&gt;B/file&lt;/tt&gt; overwrite) is valid because it knows the FID of the target file being unlinked.  If &lt;tt&gt;B/file&lt;/tt&gt; does not match the expected FID, then it is not the right target to be removing.  Likewise, the &lt;tt&gt;R2&lt;/tt&gt; consumer should know that the &lt;tt&gt;B/file&lt;/tt&gt; source is not the correct source FID to be renaming.&lt;/p&gt;

&lt;p&gt;I&apos;m not fundamentally &lt;em&gt;against&lt;/em&gt; adding a changelog record on the source MDT, but there are definitely real implementation complexities associated with this.  Firstly, the MDT of the source directory doesn&apos;t really see the rename &quot;operation&quot; at all, since this is handled by the MDT of the target directory.  The source MDT only sees an OSP &quot;update&quot; request to remove the name entry from the source directory, possibly with a decref on the source directory if it is a directory being renamed vs. a regular file.  Such an update record might also be generated in case of an unlink, so it isn&apos;t possible to determine on the source MDT side what needs to be added to the changelog &lt;em&gt;just&lt;/em&gt; from this update request.  It &lt;em&gt;might&lt;/em&gt; be enough to insert a changelog record from the OSP update that serves as a &quot;resync point&quot; if it can contain enough information about the originating MDT operation to link the two records.  This may be complicated by ordering constraints, because I don&apos;t &lt;em&gt;think&lt;/em&gt; the MDT transno is assigned at the time that the changelog record is written, otherwise it would have been included in the record itself already.&lt;/p&gt;

&lt;p&gt;The DNE distributed transaction mechanism has distributed transaction recovery logs written to all the involved remote MDTs (the &quot;source directory MDT&quot; in this case) from the master MDT (the &quot;target directory MDT&quot;), but those transaction logs are &lt;b&gt;not&lt;/b&gt; interpreted by the remote MDT; only &quot;blobs&quot; are written by the master MDT for &lt;b&gt;its own&lt;/b&gt; eventual use in case distributed recovery is needed after the master MDT crashes.  It would add implementation and recovery complexity if the master MDT were also injecting records into the remote MDT&apos;s changelog during its distributed transactions, since this introduces further ordering constraints during transaction replay.&lt;/p&gt;

&lt;p&gt;Since this issue only exists in the case of distributed transactions, there would likely need to be some resync between the changelog consumers at the point of distributed operations (at least those involved in a particular operation), even if they are not coordinated during other activity. Otherwise, any number of problems could be introduced in the resync activities, as is shown in the above examples.  This is one reason why we generally try to avoid distributed operations if possible, so file creation and such are typically only done in the local parent directory.&lt;/p&gt;</comment>
                            <comment id="251861" author="bougetq" created="Tue, 23 Jul 2019 09:38:48 +0000"  >&lt;p&gt;I think you misunderstood my example, I am not trying to handle concurrent renames. When I wrote &lt;tt&gt;A/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;B/file&lt;/tt&gt;, &lt;tt&gt;B/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;A/file&lt;/tt&gt;, I meant two &lt;em&gt;sequential&lt;/em&gt; renames, that are totally ordered on the FS:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
$&amp;gt; mv A/file B/file
$&amp;gt; mv B/file A/file
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;My point is that although they are ordered on the FS, it is not possible to infer the order from changelog records only.&lt;/p&gt;

&lt;p&gt;I think it might help to use a concrete example, so I will go with RobinHood.&lt;/p&gt;

&lt;p&gt;In RobinHood (v3) the namespace is maintained in a table named &lt;tt&gt;NAMES&lt;/tt&gt; that matches every (Parent FID, Name) with a FID. When a RENME occurs (one that does not overwrite the destination file), RobinHood issues two SQL requests:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sql&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt;
&lt;span class=&quot;code-keyword&quot;&gt;WHERE&lt;/span&gt; parent_id = old_parent &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt; = old_name &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; id = fid;
&lt;span class=&quot;code-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt; (parent_id, &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt;, id)
&lt;span class=&quot;code-keyword&quot;&gt;VALUES&lt;/span&gt; (new_parent, new_name, fid)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And this only works if there is never more than one process issuing this kind of request for a given FID.&lt;/p&gt;

&lt;p&gt;With the example in the description, and this initial state:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;parent_id&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;name&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;id&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;A-FID&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file-FID&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;The correct final state should be:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;parent_id&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;name&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;id&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;A-FID&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file-FID&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;But the following can happen:&lt;/p&gt;

&lt;p&gt;R1 processes &lt;tt&gt;B/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;A/file&lt;/tt&gt; first (without synchronizing with R2) and issues:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sql&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt;
&lt;span class=&quot;code-keyword&quot;&gt;WHERE&lt;/span&gt; parent_id = B-FID &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt; = &lt;span class=&quot;code-keyword&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; id = fid;
&lt;span class=&quot;code-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt; (parent_id, &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt;, id)
&lt;span class=&quot;code-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;code-keyword&quot;&gt;A&lt;/span&gt;-FID, &lt;span class=&quot;code-keyword&quot;&gt;file&lt;/span&gt;, fid)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And the result is:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;parent_id&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;name&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;id&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;A-FID&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file-FID&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;(No record is actually inserted because there is a unique constraint on (parent_id, name))&lt;/p&gt;

&lt;p&gt;R2 then processes &lt;tt&gt;A/file&lt;/tt&gt; --&amp;gt; &lt;tt&gt;B/file&lt;/tt&gt; and issues:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-sql&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;DELETE&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt;
&lt;span class=&quot;code-keyword&quot;&gt;WHERE&lt;/span&gt; parent_id = &lt;span class=&quot;code-keyword&quot;&gt;A&lt;/span&gt;-FID &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt; = &lt;span class=&quot;code-keyword&quot;&gt;file&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;AND&lt;/span&gt; id = fid;
&lt;span class=&quot;code-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;INTO&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;NAMES&lt;/span&gt; (parent_id, &lt;span class=&quot;code-keyword&quot;&gt;name&lt;/span&gt;, id)
&lt;span class=&quot;code-keyword&quot;&gt;VALUES&lt;/span&gt; (B-FID, &lt;span class=&quot;code-keyword&quot;&gt;file&lt;/span&gt;, fid)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The final state is:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;parent_id&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;name&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;id&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;B-FID&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;file-FID&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
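
&lt;p&gt;The sequence above can be replayed with a small script. This is only a sketch under stated assumptions: an in-memory SQLite table stands in for the &lt;tt&gt;NAMES&lt;/tt&gt; table, &lt;tt&gt;INSERT OR IGNORE&lt;/tt&gt; models the insert being rejected by the unique constraint on (parent_id, name), and &lt;tt&gt;apply_rename&lt;/tt&gt; and &lt;tt&gt;replay&lt;/tt&gt; are hypothetical helper names, not RobinHood code.&lt;/p&gt;

```python
import sqlite3


def apply_rename(db, fid, old_parent, old_name, new_parent, new_name):
    # The two statements issued for a RENME that does not overwrite its target.
    db.execute(
        "DELETE FROM names WHERE parent_id = ? AND name = ? AND id = ?",
        (old_parent, old_name, fid),
    )
    # Assumption: a rejected insert is silently dropped (unique constraint).
    db.execute(
        "INSERT OR IGNORE INTO names (parent_id, name, id) VALUES (?, ?, ?)",
        (new_parent, new_name, fid),
    )


def replay(renames):
    """Apply the renames in the given order and return the final table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE names (parent_id TEXT, name TEXT, id TEXT,"
        " UNIQUE (parent_id, name))"
    )
    # Initial state: A/file exists.
    db.execute("INSERT INTO names VALUES ('A-FID', 'file', 'file-FID')")
    for rename in renames:
        apply_rename(db, "file-FID", *rename)
    return db.execute("SELECT parent_id, name, id FROM names").fetchall()


a_to_b = ("A-FID", "file", "B-FID", "file")  # mv A/file B/file
b_to_a = ("B-FID", "file", "A-FID", "file")  # mv B/file A/file

in_order = replay([a_to_b, b_to_a])      # order on the filesystem
out_of_order = replay([b_to_a, a_to_b])  # R1 runs before R2
print(in_order)
print(out_of_order)
```

&lt;p&gt;Replaying the renames in filesystem order leaves the entry under &lt;tt&gt;A-FID&lt;/tt&gt;; replaying them in the order R1 then R2 leaves it under &lt;tt&gt;B-FID&lt;/tt&gt;, as in the table above.&lt;/p&gt;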


&lt;p&gt;&#160;-----&lt;/p&gt;

&lt;p&gt;Of course, the fact remains that it would be &lt;b&gt;hard&lt;/b&gt; to add the changelog records I described.&lt;/p&gt;

&lt;p&gt;&amp;gt; Such an update record might also be generated in case of an unlink&lt;/p&gt;

&lt;p&gt;This does not need to be an issue. A changelog consumer does not need to know whether the file was unlinked or renamed. It only needs to know &quot;&lt;em&gt;this path&lt;/em&gt; for &lt;em&gt;this FID&lt;/em&gt; is no longer valid&quot;.&lt;/p&gt;

&lt;p&gt;&amp;gt; there would likely need to be some resync between the changelog consumers at the point of distributed operations&lt;/p&gt;

&lt;p&gt;I think it can be proven that this is not always possible. If someone can suggest something that always works (and is moderately performant), I will happily take it.&lt;/p&gt;

&lt;p&gt;&amp;gt; so file creation and such are typically only done in the local parent directory&lt;/p&gt;

&lt;p&gt;Except when the parent directory is striped over several MDTs, in which case, the CREAT record is emitted on the created entry&apos;s MDT, not the parent directory&apos;s.&lt;/p&gt;</comment>
                            <comment id="251983" author="adilger" created="Wed, 24 Jul 2019 20:44:11 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I think you misunderstood my example, I am not trying to handle concurrent renames. When I wrote &lt;tt&gt;A/file --&amp;gt; B/file&lt;/tt&gt;, &lt;tt&gt;B/file --&amp;gt; A/file&lt;/tt&gt;, I meant two &lt;em&gt;sequential&lt;/em&gt; renames, that are totally ordered on the FS&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes, this was understood.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A changelog consumer does not need to know whether the file was unlinked or renamed. It only needs to know &quot;this path for this FID is no longer valid&quot;.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;That might be a possible solution.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Except when the parent directory is striped over several MDTs, in which case, the CREAT record is emitted on the created entry&apos;s MDT, not the parent directory&apos;s.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Even in the striped directory case, the default is that the create and unlink are done on a single MDT because the client hashes the filename and selects which MDT shard to create the inode on (there is no mechanism to &lt;em&gt;create&lt;/em&gt; a regular file with a remote name at this time).  The only time remote operations are needed is if the file has been renamed/hard linked to a different MDT (hopefully relatively rare), or for remote directories.&lt;/p&gt;</comment>
                            <comment id="252013" author="bougetq" created="Thu, 25 Jul 2019 12:31:50 +0000"  >&lt;blockquote&gt;&lt;p&gt;the default is that the create and unlink are done on a single MDT&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I did not know that... But I think this is still workable.&lt;/p&gt;

&lt;p&gt;We are re-architecting how Robinhood processes changelogs, and we defined five new types of changelog records for metadata mirroring:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;CREATE&lt;/li&gt;
	&lt;li&gt;LINK&lt;/li&gt;
	&lt;li&gt;UNLINK&lt;/li&gt;
	&lt;li&gt;DELETE&lt;/li&gt;
	&lt;li&gt;UPDATE&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;where CREATE and UPDATE are pretty much the same thing (we use upsert on the backend in both cases).&lt;/p&gt;

&lt;p&gt;CREATE, UPDATE and DELETE maintain the metadata of a given inode; LINK and UNLINK take care of the namespace.&lt;/p&gt;

&lt;p&gt;The mapping from Lustre&apos;s records to Robinhood&apos;s looks like:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Lustre&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;Robinhood&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;CREAT&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;CREATE + LINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;HLINK&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;LINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLNK&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLNK (last)&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLINK + DELETE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;RENME (DNEv1)&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLINK + LINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;RENME (DNEv2)&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;LINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;RENME-FROM (DNEv2)&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;UNLINK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&lt;em&gt;TBD&lt;/em&gt;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&lt;em&gt;TBD&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
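
&lt;p&gt;One way to check the two DNEv2 rows of this mapping is to replay the renames from the description under every interleaving of the two per-MDT changelogs. The sketch below assumes the proposed RENME-FROM record exists (it does not in Lustre today), models the namespace as a plain Python set, and uses hypothetical names (&lt;tt&gt;MDT_A&lt;/tt&gt;, &lt;tt&gt;MDT_B&lt;/tt&gt;, &lt;tt&gt;apply_event&lt;/tt&gt;, &lt;tt&gt;interleavings&lt;/tt&gt;).&lt;/p&gt;

```python
from itertools import combinations

# Per-MDT event streams for: mv A/file B/file; mv B/file A/file
# RENME maps to LINK on the target MDT; the proposed (hypothetical)
# RENME-FROM maps to UNLINK on the source MDT.
MDT_A = [("UNLINK", "A-FID", "file"), ("LINK", "A-FID", "file")]
MDT_B = [("LINK", "B-FID", "file"), ("UNLINK", "B-FID", "file")]


def apply_event(names, event):
    """Apply one LINK or UNLINK event to the namespace set."""
    action, parent, name = event
    if action == "LINK":
        names.add((parent, name, "file-FID"))
    else:
        names.discard((parent, name, "file-FID"))


def interleavings(first, second):
    """Yield every merge of two logs that keeps each log's own order."""
    total = len(first) + len(second)
    for slots in combinations(range(total), len(first)):
        a, b = iter(first), iter(second)
        yield [next(a) if i in slots else next(b) for i in range(total)]


def final_states():
    """Collect the distinct final namespaces over all interleavings."""
    states = set()
    for order in interleavings(MDT_A, MDT_B):
        names = {("A-FID", "file", "file-FID")}  # initial state: A/file
        for event in order:
            apply_event(names, event)
        states.add(frozenset(names))
    return states


print(final_states())
```

&lt;p&gt;All six interleavings end with only &lt;tt&gt;A/file&lt;/tt&gt; in the namespace, which matches the state on the filesystem, so the two consumers never need to synchronize for this sequence.&lt;/p&gt;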


&lt;p&gt;If I understand correctly, with striped directories, it is possible to see:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;RENME (&amp;lt;=&amp;gt; LINK) before CREAT (&amp;lt;=&amp;gt; CREATE);&lt;/li&gt;
	&lt;li&gt;RENME-FROM / RENME (&amp;lt;=&amp;gt; UNLINK / LINK) after UNLNK (last) / RENME (last) (&amp;lt;=&amp;gt; DELETE).&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;1. Works well by default (there will be entries with only namespace metadata for a while, but that is eventually consistent).&lt;br/&gt;
 2. On DELETE, record the index of the last changelog record on each MDT and defer the actual deletion until after the last of those records is processed. This is not ideal, but it should still perform quite well as Robinhood can keep processing any other record.&lt;/p&gt;
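
&lt;p&gt;The deferral in point 2 could look roughly like the following. This is a sketch, not Robinhood code: &lt;tt&gt;DeferredDeletes&lt;/tt&gt; and its methods are hypothetical names, and the caller is assumed to know the index of the last record of each MDT changelog at the time the DELETE is seen.&lt;/p&gt;

```python
class DeferredDeletes:
    """Hold DELETE events until every MDT changelog has caught up."""

    def __init__(self):
        self.processed = {}  # mdt index -> last changelog record processed
        self.pending = []    # (fid, {mdt index: record index to wait for})

    def defer(self, fid, last_index_per_mdt):
        # last_index_per_mdt is a snapshot taken when the DELETE is seen.
        self.pending.append((fid, dict(last_index_per_mdt)))

    def record_processed(self, mdt, index):
        """Note progress on one MDT; return the FIDs now safe to delete."""
        self.processed[mdt] = index
        ready = [fid for fid, barrier in self.pending if self._reached(barrier)]
        self.pending = [p for p in self.pending if p[0] not in ready]
        return ready

    def _reached(self, barrier):
        # True once every MDT has processed the record it was snapshotted at.
        return all(self.processed.get(mdt, -1) >= idx
                   for mdt, idx in barrier.items())
```

&lt;p&gt;Records that are not DELETEs are processed immediately and simply reported through &lt;tt&gt;record_processed()&lt;/tt&gt;, so only the deletions themselves wait.&lt;/p&gt;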

&lt;p&gt;One area I left out is cross-MDT migration. It might require a bit of work, but I think it is manageable without adding any new changelog record.&lt;/p&gt;</comment>
                            <comment id="255389" author="bougetq" created="Wed, 25 Sep 2019 18:15:06 +0000"  >&lt;p&gt;We discussed at LAD&apos;19 the need for a way to extract the mdt-index out of a FID. It turns out this is already implemented in the &lt;tt&gt;llapi&lt;/tt&gt; by &lt;tt&gt;llapi_get_mdt_index_by_fid()&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="255418" author="bougetq" created="Thu, 26 Sep 2019 12:13:34 +0000"  >&lt;p&gt;The whole LU was discussed at LAD&apos;19. For now, RobinHood (v4) will assume that at some point Lustre will support some sort of RENAME-FROM event on the source MDT.&lt;/p&gt;

&lt;p&gt;If it becomes clear this is not the way to go forward, please let me know.&lt;/p&gt;</comment>
                            <comment id="255579" author="olaf" created="Mon, 30 Sep 2019 09:41:52 +0000"  >&lt;p&gt;For context, DMF7 uses a number of tables to track namespace state. For this LU, of interest are the &lt;em&gt;inode&lt;/em&gt;&#160;and &lt;em&gt;name&lt;/em&gt;&#160;tables. If I understand Quentin&apos;s update above correctly, CREATE/UPDATE/DELETE would apply to the DMF7&#160;&lt;em&gt;inode&lt;/em&gt;&#160;table, reflecting when a new inode is created, updated, and destroyed. The other operations, LINK/UNLINK, update the parent+name to fid mapping maintained in the &lt;em&gt;name&lt;/em&gt;&#160;table.&lt;/p&gt;

&lt;p&gt;So a Lustre &lt;tt&gt;RENME&lt;/tt&gt;&#160;record can expand to an UNLINK, a LINK, and if there is a victim, also a DELETE.&lt;/p&gt;

&lt;p&gt;The &lt;tt&gt;MIGRT&lt;/tt&gt;&#160;record expands to a DELETE for the old fid, CREATE for the new one, matching UNLINK and LINK for the name, plus maybe some magic to trace file history across MIGRT.&lt;/p&gt;

&lt;p&gt;In either case it would be useful if the UNLINK action were visible in the changelog on the source MDT. If you want to synchronize changelog readers across MDTs, it tells you there should be a matching &lt;tt&gt;RENME&lt;/tt&gt; or &lt;tt&gt;MIGRT&lt;/tt&gt; on the other MDT. Without synchronization across MDTs it at least allows for the name changes within a directory to be correctly ordered because they can all be handled using the changelog for just that MDT. If the &quot;RENAME-FROM&quot; contains the source dir fid, source name, and fid of the inode being renamed then I think that would be sufficient for both DMF and RobinHood to work with.&lt;/p&gt;</comment>
                            <comment id="275522" author="adilger" created="Thu, 16 Jul 2020 03:10:29 +0000"  >&lt;p&gt;Would it be sufficient to add some kind of unique identifier for the split records (e.g. distributed transaction ID) so that they can be linked between the two Changelogs?  I wouldn&apos;t want to impose performance slowdowns due to synchronization between MDTs to ensure that they are always committed in-order to disk.  &lt;/p&gt;</comment>
                            <comment id="275535" author="bougetq" created="Thu, 16 Jul 2020 07:06:51 +0000"  >&lt;p&gt;Yes, but I would rather avoid it as well. It just moves the synchronization issue further down the stack (i.e. to changelog consumers), and:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;I assume there is already some kind of synchronization between MDTs, to ensure a consistent view of the filesystem;&lt;/li&gt;
	&lt;li&gt;there isn&apos;t any sort of synchronization right now in my implementation of changelog consumers.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;As a matter of fact, it does not matter much to me that both records are committed &lt;b&gt;in-order&lt;/b&gt;. What really matters is that either both are committed, or neither.&lt;br/&gt;
 I suppose it incurs the same amount of complexity though.&lt;/p&gt;</comment>
                            <comment id="275538" author="olaf" created="Thu, 16 Jul 2020 08:32:32 +0000"  >&lt;p&gt;A unique ID would be helpful for the DMF implementation - we have to do similar synchronization for reasons not related to Lustre, and could use such an identifier. Apart from that I agree with his points.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="55184">LU-12084</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="49443">LU-10283</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00k0f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>