[LU-10283] changelog entries for creates in striped directories use stripe FID as pfid Created: 27/Nov/17  Updated: 13/Jan/24  Resolved: 13/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Nikitas Angelinas
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-10265 lustre_rsync DNE support Open
is related to LU-12574 Replicating lustre's metadata only wi... Open
Severity: 3

 Description   

When we create files in striped directories, the changelog entries emitted use the parent stripe FID (instead of the parent directory FID) as the pfid for the create:

m:lustre# lfs mkdir -c2 d0
m:lustre# lfs path2fid d0
[0x200000402:0xf9f:0x0]
m:lustre# lfs getdirstripe d0
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx         FID[seq:oid:ver]
     0         [0x200000400:0x3d3:0x0]        
     1         [0x240000401:0x3d3:0x0]        
m:lustre# touch d0/f{0,1}
m:lustre# lfs changelog lustre-MDT0000
10753 02MKDIR 14:19:40.273195957 2017.11.27 0x0 t=[0x200000402:0xf9f:0x0] j=lfs.0 p=[0x200000007:0x1:0x0] d0
10754 01CREAT 14:20:08.243388795 2017.11.27 0x0 t=[0x200000402:0xfa0:0x0] j=touch.0 p=[0x200000400:0x3d3:0x0] f1
10755 11CLOSE 14:20:08.245569226 2017.11.27 0x42 t=[0x200000402:0xfa0:0x0] j=touch.0
m:lustre# lfs changelog lustre-MDT0001
11883 01CREAT 14:20:08.240982376 2017.11.27 0x0 t=[0x240000402:0x111f:0x0] j=touch.0 p=[0x240000401:0x3d3:0x0] f0
11884 11CLOSE 14:20:08.242496774 2017.11.27 0x42 t=[0x240000402:0x111f:0x0] j=touch.0

This confuses lustre_rsync. I wonder if we should fix this.
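
To make the symptom concrete, here is a minimal Python sketch that groups the two CREAT records above by their pfid. The regular expression and the helper name `creates_by_parent` are illustrative assumptions, not part of any Lustre tool; the record lines are copied from the output above.

```python
import re
from collections import defaultdict

# The two CREAT records from the changelog output above (MDT0000 and MDT0001).
RECORDS = """\
10754 01CREAT 14:20:08.243388795 2017.11.27 0x0 t=[0x200000402:0xfa0:0x0] j=touch.0 p=[0x200000400:0x3d3:0x0] f1
11883 01CREAT 14:20:08.240982376 2017.11.27 0x0 t=[0x240000402:0x111f:0x0] j=touch.0 p=[0x240000401:0x3d3:0x0] f0
"""

# FID as printed by lfs: [seq:oid:ver], all hexadecimal.
FID = r"\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]"
CREAT_RE = re.compile(
    rf"^\d+ \d\dCREAT .* t=(?P<tfid>{FID}) .*p=(?P<pfid>{FID}) (?P<name>\S+)$"
)


def creates_by_parent(text):
    """Group created file names by the pfid recorded in each CREAT entry."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = CREAT_RE.match(line.strip())
        if match:
            groups[match.group("pfid")].append(match.group("name"))
    return dict(groups)


DIR_FID = "[0x200000402:0xf9f:0x0]"  # from "lfs path2fid d0" above
groups = creates_by_parent(RECORDS)
for pfid, names in groups.items():
    print(pfid, names, "(directory FID)" if pfid == DIR_FID else "(not the directory FID)")
```

A consumer that keys on pfid sees f0 and f1 under two different parents, neither of which is the FID of d0.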



 Comments   
Comment by John Hammond [ 27/Nov/17 ]

Thomas, Henri, Quentin,

Does robinhood handle this correctly?

Comment by Andreas Dilger [ 28/Nov/17 ]

It seems to me that returning the FID of the shard is not the best for the ChangeLog, because the details of the directory striping should not be exposed in this way. The striping of the directory may change over time, and the target directory may not have the same striping either.

If “lfs fid2path [shard FID]” returns the same parent path for all of the shards, then this detail should not be totally evident to lustre_rsync, but any tools that are comparing the parent directories by FID may think that these two files were created in different directories.

Thoughts on how to fix this? Since each shard stores the LMV EA with the parent FID, it should be possible to log the proper parent FID into the ChangeLog, but I’m wondering if we might lose something else if we do that?
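
As an illustration of the shard-to-directory translation discussed here, the following Python sketch builds the mapping in userspace from the `lfs getdirstripe` output shown in the description. This is only a model of the idea: a real fix would live on the MDT side, and the helper names `shard_to_dir_map` and `normalize_pfid` are hypothetical.

```python
import re

FID_RE = re.compile(r"\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]")

# Output of "lfs getdirstripe d0" as shown in the description.
GETDIRSTRIPE_OUTPUT = """\
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx         FID[seq:oid:ver]
     0         [0x200000400:0x3d3:0x0]
     1         [0x240000401:0x3d3:0x0]
"""

DIR_FID = "[0x200000402:0xf9f:0x0]"  # output of "lfs path2fid d0"


def shard_to_dir_map(dir_fid, getdirstripe_output):
    """Map every shard FID listed for a striped directory to the directory FID."""
    return {shard: dir_fid for shard in FID_RE.findall(getdirstripe_output)}


def normalize_pfid(pfid, shard_map):
    """Return the directory FID for a known shard FID, else the pfid unchanged."""
    return shard_map.get(pfid, pfid)


shard_map = shard_to_dir_map(DIR_FID, GETDIRSTRIPE_OUTPUT)
print(normalize_pfid("[0x200000400:0x3d3:0x0]", shard_map))
print(normalize_pfid("[0x240000401:0x3d3:0x0]", shard_map))
```

Such a map goes stale if the directory is restriped, which is one reason to prefer logging the directory FID at the source.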

Comment by John Hammond [ 28/Nov/17 ]

> If “lfs fid2path [shard FID]” returns the same parent path for all of the shards, then this detail should not be totally evident to lustre_rsync, but any tools that are comparing the parent directories by FID may think that these two files were created in different directories.

Yes, "lfs fid2path [shard FID]" does return the parent path. However there are some cases in lustre_rsync where the parent path does not exist in the archive, so we create the file in .lustrerepl and store the tfid, pfid, and name in the status log. Then if lustre_rsync later sees a rename on the pfid then it moves all saved files with matching pfid from the .lustrerepl directory to the rename destination in the target archive.
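
The failure mode can be modelled with a short Python sketch. It is not lustre_rsync's code: `PendingEntry` and `release_on_rename` are hypothetical names, and it assumes the later rename record identifies the directory by its real FID.

```python
from dataclasses import dataclass


@dataclass
class PendingEntry:
    """A file parked in .lustrerepl, as recorded in the status log."""
    tfid: str
    pfid: str
    name: str


def release_on_rename(pending, renamed_fid):
    """Split entries into (moved, still_pending) when renamed_fid is renamed."""
    moved = [e for e in pending if e.pfid == renamed_fid]
    remaining = [e for e in pending if e.pfid != renamed_fid]
    return moved, remaining


DIR_FID = "[0x200000402:0xf9f:0x0]"  # FID of d0 from the description

# pfid values as logged today: the shard FIDs.
pending = [
    PendingEntry("[0x200000402:0xfa0:0x0]", "[0x200000400:0x3d3:0x0]", "f1"),
    PendingEntry("[0x240000402:0x111f:0x0]", "[0x240000401:0x3d3:0x0]", "f0"),
]
moved, remaining = release_on_rename(pending, DIR_FID)
print("shard pfid:", [e.name for e in moved], [e.name for e in remaining])

# The same entries if the changelog had logged the directory FID instead.
fixed = [PendingEntry(e.tfid, DIR_FID, e.name) for e in pending]
moved_fixed, remaining_fixed = release_on_rename(fixed, DIR_FID)
print("directory pfid:", [e.name for e in moved_fixed], [e.name for e in remaining_fixed])
```

With shard FIDs as pfid, no saved entry matches the renamed directory, so both files stay behind in .lustrerepl.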

Comment by John Hammond [ 30/Nov/17 ]

Hello? Any comment from the RBH developers?

Comment by Thomas Leibovici [ 01/Dec/17 ]

The current rbh implementation expects the directory FID, not the shard FID, which is more of a Lustre-internal detail.

One could say that the shard fid could indicate the MDT where entries are located, but this information is already given by the MDT stream that has the log record.

Comment by Olaf Weber [ 16/Aug/18 ]

From a DMF perspective we'd also expect the directory fid to be reported.

Comment by Andreas Dilger [ 25/Sep/19 ]

It was discussed at LAD'19 that the ChangeLog could store the actual directory FID rather than the shard FID. In general, the shard FID is not very useful to userspace, since the directory striping should be transparent to users, and if the directory is restriped the shards could change anyway. On the MDTs where the operation is being done, it should be possible to know that the operation is done in a striped directory and what the actual directory FID is, so this should be possible to implement. It shouldn't cause problems for existing ChangeLog consumers, since it wouldn't be any different from operations within a local directory.

Comment by Olaf Weber [ 02/Feb/22 ]

We are now encountering this in the field. Are there any plans to address this?

Comment by Gerrit Updater [ 14/Jun/23 ]

"Nikitas Angelinas <nikitas.angelinas@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51322
Subject: LU-10283 mdd: fix parent FID in changelog of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 80258995e44b9911deb54e9d914443a98a680020

Comment by Nikitas Angelinas [ 14/Jun/23 ]

I have submitted a patch from Dmitry Ivanov that seems to address this issue, by detecting whether a directory is striped using XATTR_NAME_LMV and if so, using mdd_parent_fid() to obtain the real parent FID for use in the generated changelog record:

# git describe
v2_15_56-1-g80258995e4
# lfs mkdir -i -1 -c 2 /mnt/lustre/testdir0
# lctl get_param mdd.*.changelog_striped_dir_real_pfid
mdd.lustre-MDT0000.changelog_striped_dir_real_pfid=0
mdd.lustre-MDT0001.changelog_striped_dir_real_pfid=0
# lfs getdirstripe /mnt/lustre/testdir0
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: crush
mdtidx         FID[seq:oid:ver]
     0         [0x200000400:0x2:0x0]
     1         [0x240000401:0x2:0x0]
# lfs path2fid /mnt/lustre/testdir0
[0x200000402:0x1:0x0]
# touch /mnt/lustre/testdir0/testfile0
# lfs changelog lustre-MDT0000; lfs changelog lustre-MDT0001
...
2 01CREAT 21:46:35.984819711 2023.06.14 0x0 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo p=[0x200000400:0x2:0x0] testfile0
3 11CLOSE 21:46:36.028827790 2023.06.14 0x42 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo
# lctl set_param mdd.*.changelog_striped_dir_real_pfid=1
mdd.lustre-MDT0000.changelog_striped_dir_real_pfid=1
mdd.lustre-MDT0001.changelog_striped_dir_real_pfid=1
# lctl get_param mdd.*.changelog_striped_dir_real_pfid
mdd.lustre-MDT0000.changelog_striped_dir_real_pfid=1
mdd.lustre-MDT0001.changelog_striped_dir_real_pfid=1
# touch /mnt/lustre/testdir0/testfile1
# lfs changelog lustre-MDT0000; lfs changelog lustre-MDT0001
...
2 01CREAT 21:46:35.984819711 2023.06.14 0x0 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo p=[0x200000400:0x2:0x0] testfile0
3 11CLOSE 21:46:36.028827790 2023.06.14 0x42 t=[0x200000402:0x2:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo
4 01CREAT 21:47:08.772277807 2023.06.14 0x0 t=[0x200000402:0x3:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo p=[0x200000402:0x1:0x0] testfile1
5 11CLOSE 21:47:08.831376478 2023.06.14 0x42 t=[0x200000402:0x3:0x0] j=touch.0 ef=0xf u=0:0 nid=0@lo
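
The difference between the two CREAT records above can be checked mechanically. The Python sketch below splits a record into its key=value fields and compares `p=` against the directory FID reported by `lfs path2fid`. The `parse_record` helper is a hypothetical, simplified parser that assumes file names without spaces or `=` characters.

```python
def parse_record(line):
    """Split one changelog line into index, type, and its key=value fields."""
    fields = line.split()
    record = {"index": int(fields[0]), "type": fields[1]}
    for token in fields[2:]:
        key, sep, value = token.partition("=")
        if sep and key.isalpha():
            record[key] = value
    return record


DIR_FID = "[0x200000402:0x1:0x0]"  # "lfs path2fid /mnt/lustre/testdir0" above

# Record 2 was logged with the tunable set to 0, record 4 with it set to 1.
before = parse_record(
    "2 01CREAT 21:46:35.984819711 2023.06.14 0x0 t=[0x200000402:0x2:0x0] "
    "j=touch.0 ef=0xf u=0:0 nid=0@lo p=[0x200000400:0x2:0x0] testfile0"
)
after = parse_record(
    "4 01CREAT 21:47:08.772277807 2023.06.14 0x0 t=[0x200000402:0x3:0x0] "
    "j=touch.0 ef=0xf u=0:0 nid=0@lo p=[0x200000402:0x1:0x0] testfile1"
)

for record in (before, after):
    kind = "directory FID" if record["p"] == DIR_FID else "shard FID"
    print(record["index"], record["p"], kind)
```

Record 2 carries the MDT0000 shard FID as pfid, while record 4 carries the FID of testdir0 itself.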

Sergey Cheremencev has shown that this patch can result in an increased number of cross-MDT RPCs, so the added functionality is disabled by default and needs to be explicitly enabled by setting the changelog_striped_dir_real_pfid tunable. There have been some discussions about the possibility of avoiding the extra cross-MDT RPCs by obtaining the real parent FID from the linkEA of the parent's REMOTE_PARENT_DIR entry, but Vitaly reckoned this would still require some RPCs in cases where the parent's FID is on a different MDT. I am not sure whether this is accurate, or whether we could add any additional information to the REMOTE_PARENT_DIR entries so that they can be used to avoid the extra RPCs in this case.

Comment by Olaf Weber [ 11/Jul/23 ]

In his review comments Andreas worries about compatibility with tools that rely on the stripe FID being returned in the changelog records. Does anyone know whether such tools actually exist?

Comment by Guillaume Courrier [ 11/Jul/23 ]

As far as robinhood is concerned, it assumes that the pfid in the changelog record is the FID of the parent directory. We didn't catch this issue in the first implementation of the new changelog reader of Robinhood 4. Robinhood doesn't manipulate shard FIDs. So from its perspective, this would result in a bug. The fix in patch 51322 would work for us. A tunable might be useful to be able to at least know which version of the changelog we are reading (to know whether the pfid is the actual FID of the directory or not). A new record in the changelog would be fine as well.

Comment by Andreas Dilger [ 31/Jul/23 ]

If everyone considers this a bug, I'd be fine to fix the bug by default, and just have a tunable to revert to the previous behavior in the field if some customer specifically needs it. I suspect there will be few users for this, and the tunable can be marked for removal in some future release.

Comment by Olaf Weber [ 01/Aug/23 ]

My vote is "bug to be fixed by default".

Comment by Gerrit Updater [ 13/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51322/
Subject: LU-10283 mdd: fix parent FID in changelog of striped directory
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3554923af9e3260235865d90949ecd2924bbbc0e

Comment by Peter Jones [ 13/Dec/23 ]

Landed for 2.16

Generated at Sat Feb 10 02:33:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.