[LU-8916] ods-zfs doesn't manage ZAP sizes correctly Created: 07/Dec/16  Updated: 23/Jan/18  Resolved: 22/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Giuseppe Di Natale (Inactive) Assignee: Alex Zhuravlev
Resolution: Not a Bug Votes: 0
Labels: llnl, patch

Issue Links:
Related
is related to LU-8753 Recovery already passed deadline with... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I mounted a ZFS backed lustre target thru the ZPL and it appears that ZAP sizes are not being adjusted as child objects are added or removed. Examples below.

Looking at update_log_dir, performing a ls -l yields the following:

drw-r--r--. 2 root root    2 Dec 31  1969 update_log_dir

But, that's wrong. We should be seeing a size of 3 as an ls shows us there are 3 children.

-bash-4.1# ls -a update_log_dir
.  ..  [0x200000400:0x1:0x0]

Looking at oi.10, ls -l displays a size of 0:

drwxr-xr-x. 0 root root    0 Dec 31  1969 oi.10

But, the size we should be seeing is 12 as per the ls below.

-bash-4.1# ls -a oi.10
.   0x20000000a:0x0:0x0  0xa:0x3:0x0  0xa:0x5:0x0  0xa:0x7:0x0  0xa:0x9:0x0
..  0xa:0x0:0x0          0xa:0x4:0x0  0xa:0x6:0x0  0xa:0x8:0x0  0xa:0xa:0x0

In the ZPL, the size of a directory should be the number of objects contained within it. The osd-zfs is clearly breaking that.

As a practical use case, in LU-8753, we couldn't remove update_log_dir since the directory didn't qualify as empty with a size of 1. The ZPL expects an empty directory to have a size of 2 (for '.' and '..').



 Comments   
Comment by Giuseppe Di Natale (Inactive) [ 07/Dec/16 ]

I'll be submitting a preliminary patch shortly. I also have a few questions I need help with.

Comment by Gerrit Updater [ 07/Dec/16 ]

Giuseppe Di Natale (dinatale2@llnl.gov) uploaded a new patch: https://review.whamcloud.com/24207
Subject: LU-8916 osd-zfs: Correct ZAP size accounting
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7f6def52b783a18225183e6aa5fdf7d3f5112295

Comment by Olaf Faaland [ 07/Dec/16 ]

Note that the patch does not fix all cases. Joe will submit questions as well.

Comment by Peter Jones [ 07/Dec/16 ]

Nathaniel

Could you please assist with this one?

Thanks

Peter

Comment by Andreas Dilger [ 07/Dec/16 ]

Note that the OIs are ZAPs, but are not strictly real directories because entries there do not increase the nlink count on the files/dirs that they are referencing. I definitely agree that being able to view and possibly remove entries from the OI but if that also decrements the dnode link count then this could cause potential issues if unlinking FID files from the OI ZAP via ZPL mountpoint.

Comment by Giuseppe Di Natale (Inactive) [ 07/Dec/16 ]

I've not been looking at the link count too closely yet. I was going to look at that bit later. I'm strictly talking about the size of a directory as reported by ZPL. It is quite possible that I'm missing part of the picture and I should start paying attention to the link count.

Comment by Andreas Dilger [ 08/Dec/16 ]

The OI files explicitly do NOT track the link count themselves (for subdirectories) or on the referenced inodes (for regular files) in order to not upset the visible nlink counter on the inode under normal usage.

Comment by Alex Zhuravlev [ 11/Apr/17 ]

I measured the patch against zfs performance branch:
before:
mdt 1 file 3000000 dir 16 thr 16 create 306471.11 [ 240979.28, 357970.29]
after:
mdt 1 file 3000000 dir 16 thr 16 create 267600.94 [ 999.97, 347099.55]

going to present the result to ZFS guys ...

Comment by Alex Zhuravlev [ 20/Apr/17 ]

this is what Matt suggested:
1. Make Lustre not update SA_ZPL_SIZE.
2. Make the ZPL accept an incorrect SA_ZPL_SIZE without panicking. The ZPL will set z_size based on zap_count(), rather than on SA_ZPL_SIZE.
3. To maintain backwards compatibility, make the ZPL continue to write an accurate SA_ZPL_SIZE (based on the z_size which will still be accurate due to (2).

I've been working on the patch to ZFS..

Comment by Alex Zhuravlev [ 21/Apr/17 ]

also, can you please clarify the last statement in the description:
As a practical use case, in LU-8753, we couldn't remove update_log_dir since the directory didn't qualify as empty with a size of 1. The ZPL expects an empty directory to have a size of 2 (for '.' and '..').

but we do set size to 2 for all regular directories (though don't update it):
Object lvl iblk dblk dsize lsize %full type
3121 2 128K 16K 11.0K 32K 100.00 ZFS directory
192 bonus System attributes
path /ROOT/TOOR
uid 0
gid 0
mode 40755
size 2

so you should be possible to remove with ZPL? but check for emptiness won't work correctly and I'm going to fix this in ZPL as Matt suggested.

Comment by Peter Jones [ 21/Aug/17 ]

AFAIK Alex is the one working on thos

Comment by Alex Zhuravlev [ 04/Sep/17 ]

a short update.. so with the uptodate ZFS (0.7) it's possible to remove such a directory. actually it's possible to remove non-empty directory as well (like CONFIGS/), which I guess is not very nice. as mentioned before, the suggested solution is to let ZFS check a counter in ZAP structure.
two options have been considered:
a) check ZAP's internal counter in zfs_dirempty() – Matt was fine with that, but that would be extra cost to access that structure (dnode# -> dnode_t structure translation) or we'd need to keep a reference to dnode_t in znode structure
b) wait until SA becomes less expensive, I believe there is a common wish to going this way

Comment by Olaf Faaland [ 18/Sep/17 ]

Olaf to correspond with Alex to understand options better and decide what further work is necessary.

Comment by Olaf Faaland [ 16/Nov/17 ]

Alex,

a) check ZAP's internal counter in zfs_dirempty() – Matt was fine with that, but that would be extra cost to access that structure (dnode# -> dnode_t structure translation) or we'd need to keep a reference to dnode_t in znode structure

If the ZAP's internal counter is only consulted when dzp->z_size != 2 then the cost is only paid when the rmdir would normally fail anyway, right? Or are you thinking of always checking both dzp->z_size and checking the ZAP's internal counter?

Can you expand upon option (b) which I take it you prefer? It sounds like you have reason to think SAs will become cheaper to update and access. Is there in-progress work that will make that so? Thanks.

Comment by Alex Zhuravlev [ 27/Nov/17 ]

sorry for the late reply.
we can't consult with dzp->z_size if we want to keep performance as cost of z_size maintainance is very high (due to SA expenses).
unfortunately to access ZAP's internal counter we have to lookup dnode_p* structure by dnode number (stored in znode), which isn't free. the good thing is that rmdir isn't that frequent operation. Ideally we should store dnode_p* in znode structure, I'll talk to Matt about this possibility. in the mean time I'm going to have a patch adding that check for the internal counter. this should fix your issue with silent removal of non-empty directories, correct me If I'm wrong please.

Comment by Peter Jones [ 02/Jan/18 ]

I believe that bzzz has pushed a proposed change to ZoL - https://github.com/zfsonlinux/zfs/pull/6987

 

Comment by Olaf Faaland [ 22/Jan/18 ]

ZoL PR 6987 was replaced with
https://github.com/zfsonlinux/zfs/pull/7019

which was landed to ZoL master and queued for inclusion in zfs-0.7.6 when it is tagged.

Comment by Peter Jones [ 22/Jan/18 ]

Fixed in ZFS

Generated at Sat Feb 10 02:21:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.