Details
-
Question/Request
-
Resolution: Unresolved
-
Minor
Description
I recently noticed that RENME changelog records are emitted by the MDT of the target directory (and only this MDT):
#> MDSCOUNT=2 lustre/tests/llmount.sh
#> lctl --device lustre-MDT0000 changelog_register
#> lctl --device lustre-MDT0001 changelog_register
#>
#> lfs mkdir -i 0 /mnt/lustre/mdt-0
#> lfs mkdir -i 1 /mnt/lustre/mdt-1
#> touch /mnt/lustre/mdt-0/file-0
#> mv /mnt/lustre/mdt-0/file-0 /mnt/lustre/mdt-1
#>
#> lfs changelog lustre-MDT0000
1 02MKDIR 14:28:45.489241745 2019.07.22 0x0 t=[0x200000402:0x1:0x0] j=lt-lfs.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000007:0x1:0x0] mdt-0
2 01CREAT 14:28:54.679826073 2019.07.22 0x0 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000402:0x1:0x0] file-0
3 11CLOSE 14:28:54.717755997 2019.07.22 0x42 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=10.200.0.1@tcp
#> lfs changelog lustre-MDT0001
1 02MKDIR 14:28:48.788225263 2019.07.22 0x0 t=[0x240000402:0x1:0x0] j=lt-lfs.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x200000007:0x1:0x0] mdt-1
2 08RENME 14:29:03.315736883 2019.07.22 0x0 t=[0:0x0:0x0] j=mv.0 ef=0xf u=0:0 nid=10.200.0.1@tcp p=[0x240000402:0x1:0x0] file-0 s=[0x200000402:0x2:0x0] sp=[0x200000402:0x1:0x0] file-0
The HLINK changelog record behaves similarly. But CREAT and UNLNK are still emitted by the MDT that is in charge of them.
Now, I may be wrong, but I think that for an application that relies on changelog records to mirror a filesystem's metadata (e.g. RobinHood), it is not always possible to order those records. At the very least, I would argue that performance would take a serious hit, as changelog consumers would have to synchronize with one another.
As an example, consider the following series of renames: A/file --> B/file, B/file --> A/file; where A and B are directories that live on different MDTs. Let R1 and R2 be the changelog consumers for A's MDT and B's MDT, respectively.
Without any synchronization between R1 and R2, R1 may process its RENME record first: it will try to delete B/file from whatever backend it uses, and then insert (or create) A/file. If R1 chooses to fail the transaction because B/file does not exist, it effectively waits for R2 [*]. Otherwise, some time later, R2 discovers its own RENME record: it will delete A/file from its backend and add B/file to it, leaving the mirror with B/file even though the file ends up at A/file on the filesystem.
[*] And even then, if R1 waits for B/file to exist before deleting it, it is possible that the original series of renames is followed by: A/file --> C/file; in which case R3, the changelog consumer for C's MDT, will compete with R2 when trying to delete A/file (if R3 wins, R1 is stuck forever).
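To make the ordering problem concrete, here is a minimal sketch in Python. It is not RobinHood code: the mirror is modelled as a plain set of paths, each record as a hypothetical (source, target) tuple, and the consumer is the lenient kind that does not fail when the source is missing. All names are made up for illustration.

```python
# Toy model (assumption): one log per MDT, each record is (source, target).
# As observed above, RENME is logged only on the MDT of the target directory.
log_mdt_a = [("B/file", "A/file")]  # 2nd rename, target directory is A
log_mdt_b = [("A/file", "B/file")]  # 1st rename, target directory is B


def apply_rename(mirror, source, target):
    """Lenient consumer: delete the source if present, then insert the target."""
    mirror.discard(source)
    mirror.add(target)


def replay(order):
    """Replay whole logs one after the other, starting from a mirror holding A/file."""
    mirror = {"A/file"}
    for log in order:
        for source, target in log:
            apply_rename(mirror, source, target)
    return mirror


in_order = replay([log_mdt_b, log_mdt_a])      # R2 first, then R1
out_of_order = replay([log_mdt_a, log_mdt_b])  # R1 first, then R2

print(in_order)      # {'A/file'}
print(out_of_order)  # {'B/file'}
```

In this model the mirror only ends up correct when R2 happens to run before R1, which is exactly the cross-consumer ordering that nothing guarantees today.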
I think it is possible to solve this by requiring that CREAT and UNLNK records are matched with CREAT-TO and UNLNK-TO records emitted by the MDT of the affected entry's parent. And for renames, that an additional RENME-FROM record is emitted on the source directory's MDT. [**]
[**] I leave it to someone else to find good names for these record types. =)
This would allow a changelog consumer to process any changelog record it sees without the need to synchronize with any other consumer.
In the previous example: R1 can still see its RENME record first, but not before it sees the RENME-FROM record (emitted by A/file --> B/file). In this case, it will necessarily delete A/file before re-inserting it. R2 does something similar in that it will necessarily insert B/file before it deletes it. The processing becomes rather simple: on RENME create the target, on RENME-FROM delete the source.
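The same toy model can be used to check this claim. The sketch below assumes the proposed RENME-FROM record exists (it does not in Lustre today; the name is the placeholder used above) and represents each record as a hypothetical (kind, path) tuple. It then tries every interleaving of the two consumers that preserves the order within each MDT's log.

```python
from itertools import combinations

# Hypothetical per-MDT logs for A/file --> B/file followed by B/file --> A/file.
# RENME is logged on the target directory's MDT, RENME-FROM on the source's.
log_mdt_a = [("RENME-FROM", "A/file"), ("RENME", "A/file")]
log_mdt_b = [("RENME", "B/file"), ("RENME-FROM", "B/file")]


def apply(mirror, record):
    """On RENME create the target, on RENME-FROM delete the source."""
    kind, path = record
    if kind == "RENME":
        mirror.add(path)
    else:
        mirror.discard(path)


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slots else next(it_b) for i in range(n)]


final_states = set()
for schedule in interleavings(log_mdt_a, log_mdt_b):
    mirror = {"A/file"}
    for record in schedule:
        apply(mirror, record)
    final_states.add(frozenset(mirror))

print(final_states)  # {frozenset({'A/file'})}
```

Each consumer only ever touches entries under its own MDT's directory, so the two logs commute and every schedule converges to the same final state. Intermediate states can still be transiently off (for instance A/file and B/file both present), so this sketch only illustrates eventual consistency of the mirror, not atomicity of each rename.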