[LU-2677] Adding LMA to OST object Created: 25/Jan/13  Updated: 29/Nov/17  Resolved: 11/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Di Wang Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Related
is related to LU-3128 filter_fid on OST not updated during ... Closed
is related to LU-10248 Need to update PFID of OST objects af... Resolved
Severity: 3
Rank (Obsolete): 6250

 Description   

Per Discussion with Alex, LMA should be added to OST object EA as well.



 Comments   
Comment by nasf (Inactive) [ 25/Jan/13 ]

Add FID-in-LMA for OST object means you need to check FLD during OI scrub, but when OI scrub run, the FLD or related init work may be not ready yet. So please make sure your patch will not break something.

Comment by Andreas Dilger [ 25/Jan/13 ]

Pasting thread from email thread about this topic recently:

On 2012-12-19, at 4:59, "Pershin, Mike" <mike.pershin@intel.com> wrote:
> in context of LU-838 I did small change resulting all files to have LMA attribute, which is just self-fid now (not sure that makes sense to still call it 'mdt_attr', btw).

Is that required for 2.4, or is it already done?

> This cause test 27z in sanity.sh to fail.
> The reason was filter_fid is not fit into inode body anymore and debugfs can't find it. I have few things to discuss in view of that.
>
> 1. do we really need ff_seq and ff_id in filter_fid? If all files on OST will have normal FID does that make sense to add lma and remove id/seq > from filter_fid, making it just 'parent_fid'?

Changing structure of filter_fid will require lustre/utils/ll_recover_lost_found_objs to be changed as well, but it would be possible to do that with the Lustre release. It may also cause problems for upgraded filesystems which cannot be fixed so easily, without some compatibility support.

Is there not enough room in the inode even with the smaller LMA size?

> 2. Debugfs complains about both filter_fid and lma if they are smaller than it expects. It is OK for filter_fid so far but new small LMA was landed in master already, is that bug in debugfs or that is fixed already?

That is not fixed. It was never expected that LMA would get smaller, only larger.

> If we will make filter_fid just parent fid and store LMA too then both fits into inode body on OST.

What is the size of these structs, and how much space is left in the inode?

My real preference would be to have only the filter_fid, but I understand this is a layering problem. I also don't like the duplication of storing the FID in the filter_fid, LMA, LOV, etc.

If we are changing the on-disk layout and wanted to be more consistent with the MDT we might consider storing the parent FID in the linkEA and removing the filter_fid entirely? We could use the stripe index for the object name? It is not fatal to be missing the filter_fid on case of downgrade, and this could be fixed up as part of LFSCK 2.

I'm not against changing this, but it would need a quick effort to get it completed in time for 2.4.

Comment by Andreas Dilger [ 25/Jan/13 ]

Alex later replied:

On Dec 20, 2012, at 11:27 AM, "Pershin, Mike" <mike.pershin@intel.com>
wrote:
> > My real preference would be to have only the filter_fid, but I understand this is a layering problem. I also don't like the duplication of storing the FID in the filter_fid, LMA, LOV, etc.
> as do I, ideally this should be unified in LMA which is exactly just self-fid and flags.

in LMA ? it's two different fids with different purposes: one is used by Lustre (mds fid referencing this object) and another is self fid
which Lustre does not need, but some specific in-osd features like OI scrub do use.

> I think this is not urgent just an observation while fixing some issues with LU-838. I have partial solution which works - add LMA just to local files in addition to normal files but don't add to the OST objects

what about OST objects in a normal sequences introduced with DNE?

I consider LMA (new one storing just a fid) as a internal to OSD mechanism to make OI more reliable.
it's not mandatory and basically it's not required for Lustre to work. especially if there is another mechanism
making OI robust.

Comment by Andreas Dilger [ 25/Jan/13 ]

Alex later replied again:

On Dec 20, 2012, at 5:37 PM, "Pershin, Mike" <mike.pershin@intel.com>
wrote:
> Alex, I meant only self-fid parts in both EA, filter_fid contains parent FID plus own id/seq, which is not used anywhere except utils. If we would have LMA for every file then it is not needed in filter_fid IMHO

this is what I thought as well. sorry for noise parent fid will be used by LFSCK.

Comment by Andreas Dilger [ 25/Jan/13 ]

So, in conclusion, if this is going to be done for 2.4 (which would be the right place for it), then the "filter_fid" needs to be replaced with a simple xattr that just contains the OST object parent FID + stripe index for LFSCK Phase 2 and "ll_recover_lost_found_objs". It needs to fit within the 256-byte inode size used on the OSTs along with the LMA.

  • debugfs needs a patch to handle the smaller "lma" xattr and the new "opf" (OST/Object parent FID) xattr (or whatever it is called) in addition to the old "fid" xattr
  • ll_recover_lost_found_objs needs a patch to handle the new "opf" xattr + LMA in addition to the old "fid" xattr
  • ofd_object_ff_check() needs to be able to handle both "opf" and "fid" xattr
  • should LFSCK Phase 1.5 convert "fid" to "opf"? Just adding "lma" will spill over to an external xattr block and hurt OST performance for "stat" of objects significantly (this happened when very old filesystems were upgraded to a version of Lustre that stored "fid" on 128-bit inodes).
Comment by Alex Zhuravlev [ 31/Jan/13 ]

Andreas, what's the deadline for these changes? I'm fine to work on this, if needed, but the inspections are supposed to be the top priority at the moment.

Comment by Andreas Dilger [ 31/Jan/13 ]

Alex, this can reasonably be considered a bug fix since it has a noticeable impact on performance, so it can be landed after the feature freeze.

Comment by Andreas Dilger [ 05/Mar/13 ]

Alex,
to clarify, does the current code store an LMA on the OST objects or not? If it does not, then this could be removed from the blocker list. If it does, then this needs to be fixed because the LMA + filter_fid overflow the OST in-inode xattr size.

Comment by Alex Zhuravlev [ 06/Mar/13 ]

osd-ldiskfs does not store LMA on OST objects, osd-zfs does. but I'd like to fix this in 2.4, if possible. I'm fine to do this myself, immediately.

Comment by Andreas Dilger [ 07/Mar/13 ]

I notice that if we shrink lustre_mdt_attrs by another 8 bytes (removing "lma_flags", which is currently unused) then it would be possible to store both lustre_mdt_attrs and filter_fid in the same 256-byte ldiskfs inode. This is fine to change within the 2.4 release, since it was already modified in 2.4 due to removal of HSM and SOM xattrs, and 2.1 already has compatibility code for the smaller LMA xattr. Since LMA already has __u32 lma_compat and __u32 lma_incompat flags, we don't really need lma_flags very much.

If you are planning to store both LMA and FF on ldiskfs OST objects, then it would also be possible to remove ff_objid and ff_seq from struct filter_fid, making it only 16 bytes in size + 20 byte header. That would need the code in ll_recover_lost_found_objs to be able to handle both the old and new filter_fid structure size, and check for FF (to determine if this is an OFD object) + LMA (for self FID) when rebuilding the object directories.

With the smaller LMA and FF, there would even be 16 bytes free for future use, or in case more static fields are added to struct ext4_inode.

Comment by Andreas Dilger [ 08/Mar/13 ]

Moved this over to Mike, since he has fewer blocker bugs than Di or Alex.

Alex, if you have the bandwidth and desire to do this yourself, please discuss with Mike and assign to yourself.

Comment by Mikhail Pershin [ 28/Mar/13 ]

http://review.whamcloud.com/#change,5838

patch removes lma_flags from lma and ff_seq, ff_objid from filter_fid. Tools are changed to handle old filter_fid and new changes. I had to change sanity.sh test_27z to don't use debugfs until it will understand new filter_fid.

Comment by Andreas Dilger [ 06/Apr/13 ]

http://review.whamcloud.com/5964 is patch for e2fsprogs to allow using smaller FF and LMA structures.

Comment by John Hammond [ 08/Apr/13 ]

What depends on this xattr being correct? I ask because I was poking at layout swap this morning and noticed that it was not being updated accordingly.

# llmount.sh 
...
# cd /mnt/lustre
# lfs setstripe -c2 f0
# dd if=/dev/zero bs=1M count=2 | tr '\0' 'X' > f0
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.0201865 s, 104 MB/s
# lfs getstripe f0
f0
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	             1	          0x1	             0
	     1	             1	          0x1	             0

# touch f1
# ls -lh
total 2.0M
-rw-r--r-- 1 root root 2.0M Apr  8 13:54 f0
-rw-r--r-- 1 root root    0 Apr  8 13:54 f1
# lfs path2fid f0
[0x200000400:0x1:0x0]
# lfs path2fid f1
[0x200000400:0x3:0x0]
# umount /mnt/ost1
# mount /tmp/lustre-ost1 /mnt/ost1 -t ldiskfs -o loop
# ls -lh /mnt/ost1/O/0/d1/1
-rw-rw-rw- 1 root root 1.0M Apr  8 13:54 /mnt/ost1/O/0/d1/1
# ll_decode_filter_fid /mnt/ost1/O/0/d1/1
/mnt/ost1/O/0/d1/1: objid=4294967296 seq=1 parent=[0x200000400:0x1:0x0]
# umount /mnt/ost1
# mount /tmp/lustre-ost1  /mnt/ost1 -t lustre -o loop
# lfs swap_layouts f0 f1
# ls -lh
total 2.0M
-rw-r--r-- 1 root root    0 Apr  8 13:57 f0
-rw-r--r-- 1 root root 2.0M Apr  8 13:57 f1
# umount /mnt/ost1
# mount /tmp/lustre-ost1 /mnt/ost1 -t ldiskfs -o loop
# ls -lh /mnt/ost1/O/0/d1/1
-rw-rw-rw- 1 root root 1.0M Apr  8 13:57 /mnt/ost1/O/0/d1/1
# ll_decode_filter_fid /mnt/ost1/O/0/d1/1
/mnt/ost1/O/0/d1/1: objid=4294967296 seq=1 parent=[0x200000400:0x1:0x0]
Comment by Andreas Dilger [ 08/Apr/13 ]

John, could you please file the layout swap as a separate bug. That is a known issue with the current layout swap implementation, and while the solution was discussed at one point (IIRC, sending setattr to the OSTs to update the parents) I guess that this was not implemented yet. It isn't a totally critical failure, since the parent FID is only used in case of MDT corruption, and it will eventually be fixed by LFSCK Phase 2, it would still be better to get it correct during the swap.

Comment by John Hammond [ 08/Apr/13 ]

I have created LU-3128 to track the layout swap interaction bug.

Comment by Andreas Dilger [ 09/Apr/13 ]

There is a prerequisite patch also: http://review.whamcloud.com/5967

Comment by Jodi Levi (Inactive) [ 11/Apr/13 ]

Landed for 2.4

Generated at Sat Feb 10 01:27:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.