[LU-2435] inode accounting in osd-zfs is racy Created: 07/May/12  Updated: 15/Apr/18  Resolved: 30/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Brian Behlendorf Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: RZ_LS, llnl, quota

Issue Links:
Blocker
is blocking LU-7895 zfs metadata performance improvements Resolved
Duplicate
Related
is related to LU-7991 Add project quota for ZFS Resolved
is related to LU-8326 sanity-quota test_21 test failed to r... Resolved
is related to LU-6965 osd_object.c:928:osd_attr_set Resolved
is related to LU-8927 osp-syn processes contending for osq_... Resolved
is related to LU-9192 lfs quota reports wrong file count wh... Resolved
is related to LU-2619 Bogus value of dqb_curinodes returned... Resolved
is related to LU-5638 sanity-quota test_33 for ZFS-based ba... Closed
Severity: 3
Epic: server
Rank (Obsolete): 3010

 Description   

Observed on the MDT during weekend testing with a mixed I/O workload. I agree failure here shouldn't be fatal, however I'm not sure we should even bother telling the administrator. If it's true that there's nothing that needs to be (or can be) done, then there's no point in logging this to the console. Changing it to a CDEBUG() would be preferable. If however the administrator needs to do something (which I don't believe is the case), that should be part of the message.

2012-05-04 17:12:17 LustreError: 14242:0:(osd_handler.c:997:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000023a1:0xf3ed:0x0] from accounting ZAP for grp 36427 (-2)
2012-05-04 17:12:17 LustreError: 14242:0:(osd_handler.c:997:osd_object_destroy()) Skipped 1 previous similar message
2012-05-04 17:16:22 LustreError: 14244:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x20000239a:0x991a:0x0] from accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:30 LustreError: 5618:0:(osd_handler.c:2003:osd_object_create()) lcz-MDT0000: failed to add [0x20000239a:0x9d46:0x0] to accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:34 LustreError: 14241:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000013ca:0x1952b:0x0] from accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:53 LustreError: 6778:0:(osd_handler.c:997:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000013ca:0x19e3f:0x0] from accounting ZAP for grp 36427 (-2)
2012-05-04 17:31:21 LustreError: 6778:0:(osd_handler.c:2003:osd_object_create()) lcz-MDT0000: failed to add [0x200002393:0x1ee18:0x0] to accounting ZAP for usr 36427 (-2)
2012-05-04 17:31:32 LustreError: 6770:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x200002395:0x167e0:0x0] from accounting ZAP for usr 36427 (-2)


 Comments   
Comment by Johann Lombardi (Inactive) [ 15/May/12 ]

Brian, I guess you have all those messages because you have an "old" filesystem which didn't have the accounting ZAPs when it was created.
I'm fine with "hiding" those errors, but the problem is that accounting will be broken and the administrator won't be notified ...

Comment by Brian Behlendorf [ 15/May/12 ]

Actually, this was on a newly formatted filesystem. While I haven't looked closely, I was attributing this to perhaps another concurrent destroy operation or some other race. Unfortunately I don't have full logs; when I see it again I'll try and grab some debugging.

Comment by Johann Lombardi (Inactive) [ 21/Jun/12 ]

The problem is that zap_increment() has no locking (it first looks up the value and then updates it), so inode accounting is currently racy with concurrent destroy/create. You are right.
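
For reference, a simplified sketch of that lookup-then-update pattern (paraphrased from the ZFS zap_increment() code, not an exact copy). Two concurrent callers can both read the same old value, so one delta is lost, and a racing decrement can end up removing or missing the entry and returning ENOENT, which would match the -2 in the console messages above:

int
zap_increment(objset_t *os, uint64_t obj, const char *name, int64_t delta,
    dmu_tx_t *tx)
{
        uint64_t value = 0;
        int err;

        err = zap_lookup(os, obj, name, 8, 1, &value);   /* read current count */
        if (err != 0 && err != ENOENT)
                return (err);
        value += delta;                                   /* modify, with no lock held */
        if (value == 0)
                err = zap_remove(os, obj, name, tx);      /* count reached zero, drop the entry */
        else
                err = zap_update(os, obj, name, 8, 1, &value, tx);  /* write back */
        return (err);
}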

Comment by Johann Lombardi (Inactive) [ 19/Dec/12 ]

Brian, do you still see those warnings?

Comment by Johann Lombardi (Inactive) [ 07/Jul/14 ]

The race should be addressed by http://review.whamcloud.com/#/c/7157 which got reverted but will hopefully be re-landed soon.
Unfortunately, we have no way to fix on-disk inode accounting (ZFS has no fsck), but an inode estimate can be enabled via:

# lctl set_param osd-zfs.*-MDT*.quota_iused_estimate=1

Comment by Johann Lombardi (Inactive) [ 10/Jul/14 ]

Brian, Alex, in your opinion, what is the best way to fix on-disk inode accounting? Is there an easy way in ZFS to iterate over all dnodes? Maybe we could implement a quiescent quota check in the ZFS OSD to fix accounting when it goes wrong, as in this case.

Comment by Brian Behlendorf [ 14/Jul/14 ]

Johann, you can use the dmu_object_next() function to walk all of the dnodes in an object set. It will be a little tricky to do this online because, as Alex noticed in another issue, objects being actively allocated will be skipped. I wouldn't be opposed to us adding a little debug patch or writing a utility to do this. But I think the take-away from this is that we need to be more ruthless about not letting these kinds of defects in.
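
As a rough illustration of that walk (a minimal sketch only: it assumes dmu_object_next() advances the object number to the next allocated dnode and returns ESRCH at the end of the objset; the helper name and the accounting callback are hypothetical):

static int
walk_all_dnodes(objset_t *os, int (*account_cb)(objset_t *os, uint64_t obj))
{
        uint64_t obj = 0;
        int err;

        /* visit each allocated dnode in turn; txg 0 means "any txg" */
        while ((err = dmu_object_next(os, &obj, B_FALSE, 0)) == 0) {
                err = account_cb(os, obj);      /* e.g. recount this dnode's owner */
                if (err != 0)
                        return (err);
        }
        return (err == ESRCH ? 0 : err);        /* ESRCH just means "no more objects" */
}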

Comment by Brian Behlendorf [ 14/Jul/14 ]

Alternately, since the Lustre code uses the DMU_GROUPUSED_OBJECT and the DMU_USERUSED_OBJECT, we could probably fix this with a file-level backup. By mounting each server through the POSIX layer and rsync'ing it to a new dataset we'd effectively regenerate the correct quota accounting. This would strip off all the FIDs in the directories, but a subsequent lfsck should fix that. That would be a nice way to verify lfsck actually works correctly.

There are other ways to tackle this as well, such as extending 'zpool scrub' or 'zfs send/recv', but I really don't like any of them. Better to just fix the root cause and ensure it never comes back than to add ugly code to handle this case and have to live with it forever.

Comment by Johann Lombardi (Inactive) [ 29/Jul/14 ]

Johann, you can use the dmu_object_next() function to walk all of the dnodes in an object set. It will be a little tricky to do this online because, as Alex noticed in another issue, objects being actively allocated will be skipped. I wouldn't be opposed to us adding a little debug patch or writing a utility to do this.

OK, I will have a look. Thanks.

But I think the take-away from this is that we need to be more ruthless about not letting these kinds of defects in.

Agreed, although I have always been opposed to maintaining space accounting inside the OSD. It would have been simpler if ZFS was doing inode accounting in the same way as block accounting.

Alternately, since the Lustre code uses the DMU_GROUPUSED_OBJECT and the DMU_USERUSED_OBJECT we could probably fix this with a file-level backup. By mounting each server through the Posix layer and rsync'ing it to a new dataset we'd effectively regenerate the correct quota accounting.

DMU_{USER,GROUP}USED_OBJECT are used for block accounting, which works fine. The issue here is with inode accounting, which is maintained by osd-zfs since ZFS does not support it.

Thanks.

Comment by Alex Zhuravlev [ 29/Jul/14 ]

Johann, the patch does much the same as ZFS: an in-core structure to track the delta, then apply the delta at sync.
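
A hypothetical shape for such an in-core delta, just to illustrate the idea (field and type names here are illustrative, not taken from the actual patch):

struct osd_acct_delta {
        list_node_t     oad_node;       /* linked into a per-objset list of pending deltas */
        uint64_t        oad_id;         /* uid or gid this delta applies to */
        int64_t         oad_delta;      /* +N dnodes created, -N destroyed since last sync */
};

/*
 * At sync time each pending delta is folded into the accounting ZAP once,
 * under the objset's own serialization, instead of doing an unserialized
 * lookup/update per create/destroy.
 */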

Comment by Brian Behlendorf [ 29/Jul/14 ]

> It would have been simpler if ZFS was doing inode accounting in the same way as block accounting.

I'm not completely opposed to adding inode accounting into the ZFS code. That would allow us to integrate better with the existing Linux quota utilities, which would be nice.

Comment by Johann Lombardi (Inactive) [ 31/Jul/14 ]

I'm not completely opposed to adding inode accounting into the ZFS code. That would allow us to integrate better with the existing Linux quota utilities, which would be nice.

Great, I will provide a patch against the ZFS tree and then change the ZFS OSD to use ZFS inode accounting when available. Thanks.

Comment by Johann Lombardi (Inactive) [ 07/Aug/14 ]

ZFS patch pending review:
https://github.com/zfsonlinux/zfs/issues/2576

Comment by Johann Lombardi (Inactive) [ 29/Oct/14 ]

Reassign to Isaac as per our discussion by email

Comment by Andreas Dilger [ 25/Nov/14 ]

Johann, Isaac, it looks like the patch got a review from Richard Yao on the pull request https://github.com/zfsonlinux/zfs/pull/2577:

I do not have time to do a full review, but I can give you some comments. First, there are some style issues with this:

./lib/libzfs/libzfs_dataset.c: 2606: continuation line not indented by 4 spaces
./lib/libzfs/libzfs_dataset.c: 2608: line > 80 characters
./lib/libzfs/libzfs_dataset.c: 2608: continuation line not indented by 4 spaces
./lib/libzfs/libzfs_dataset.c: 2742: spaces instead of tabs
./module/zfs/zfs_vfsops.c: 473: spaces instead of tabs
./module/zfs/zfs_vfsops.c: 548: spaces instead of tabs
./module/zfs/zfeature_common.c: 223: indent by spaces instead of tabs
./module/zfs/zfeature_common.c: 224: indent by spaces instead of tabs
./module/zfs/zfeature_common.c: 225: indent by spaces instead of tabs
./module/zfs/dnode_sync.c: 582: line > 80 characters
./module/zfs/dmu_objset.c: 1136: spaces instead of tabs
./module/zfs/dmu_objset.c: 1137: spaces instead of tabs
./module/zfs/dmu_objset.c: 1163: continuation line not indented by 4 spaces
./module/zfs/dmu_objset.c: 1165: continuation line not indented by 4 spaces
./module/zfs/dmu_objset.c: 1167: line > 80 characters
./module/zfs/dmu_objset.c: 1171: line > 80 characters
./module/zfs/dmu_objset.c: 1215: line > 80 characters
./cmd/zfs/zfs_main.c: 2807: spaces instead of tabs
./cmd/zfs/zfs_main.c: 2810: line > 80 characters
./cmd/zfs/zfs_main.c: 2810: spaces instead of tabs

Second, I see that you bumped the ZFS version instead of using a feature flag. That is likely the logical choice, but unfortunately, Solaris has already used that version and reusing it will make an unfortunate situation with respect to platform incompatibility worse. This is why feature flags were designed, but so far, they have only been implemented for the zpool version because we had nothing that needed it for the ZFS version. This would merit feature flags on the ZFS version, so that would need to be done before it could be merged, provided that there are no other potential issues and this is portable enough that other OpenZFS implementations could merge it so that we remain compatible with them.


Can you update the patch to move this code forward?

Comment by Johann Lombardi (Inactive) [ 26/Nov/14 ]

Yup, I am aware and discussed this a bit with Brian some time ago. Isaac has agreed to take over the patch and will be updating it.

Comment by Brian Behlendorf [ 01/Dec/14 ]

During the OpenZFS summit Isaac and I had a chance to discuss this. We came up with a nice clean solution based on Johann's initial patch. If Isaac has the time to do the heavy lifting on this, I'm happy to work with him to get the patch reviewed and into a form where it can be merged. There's a fair bit of work remaining to get it where it needs to be, but it should be pretty straightforward:

Required functionality:

  • Add a feature flag.
  • Rework the code so this feature can be enabled without rewriting every dnode.
  • Rework the code to use sync tasks so it can be resumed across pool import/exports.
  • Update the zpool status command to provide status information while the feature is being enabled.
  • Add the ioctl() handlers so utilities such as repquota work.
  • Update send/recv to handle this change.
  • Update the zfs(8) and zpool-features(5) man pages.

Comment by Isaac Huang (Inactive) [ 02/Dec/14 ]

Two notes about the implementation:

  • May add two new special ZAP objects for the dnode accounting, so that the feature can be easily disabled by removing the objects.
  • One way to enable it without rewriting every dnode is to take a snapshot and iterate over all objects in the snapshot while doing incremental accounting for changes after the snapshot was created:
    1. Return -EAGAIN to any query before objects in the snapshot are all counted
    2. Remove the snapshot when counting is done

I'll be able to begin the work later this week.

Comment by Andreas Dilger [ 02/Dec/14 ]

It isn't clear how the snapshot will help. There will continue to be changes beyond the snapshot that would need to be accounted for, so I don't think that will help.

There is already a process for iterating all the dnodes to enable regular quota, so hopefully the same process can be used for inode accounting. As long as accounting is done incrementally and (IIRC) inodes are flagged so they are not counted twice, then this can proceed while the filesystem is in use. Perhaps an auxiliary bitmap could be used while the initial accounting is enabled, and then deleted when finished?

Comment by Brian Behlendorf [ 02/Dec/14 ]

> One way to enable it without rewriting every dnode is to take a snapshot

This is a good idea but we can do better. If we were to use full snapshots there are some significant downsides.

1. We'd need to take a snapshot per dataset, and it's not uncommon for pools to have 1000s, 10,000s, or 100,000s of datasets. Doubling this just to enable the feature is a bit heavy-handed.
2. While the snapshots exist we can't free any data in the pool since it will be referenced by a snapshot. This would be problematic for pools which are already near capacity.
3. I suspect cleanly handling all the possible failure modes will be fairly complicated. You'll need to do the creation of all the snapshots in a sync task and be able to unwind them all in the event of a failure. You'll also want to do it in a single tx so that either all the snapshots exist or none of them do. When we're talking about a large number of snapshots this may take a significant amount of time (several seconds).
4. Doing any operation which spans datasets complicates things considerably. If this could be avoided it would greatly simplify the problem.

Luckily, for this specific case a full snapshot isn't needed. Storing the TXG number in which the feature was enabled and the per-dataset dnode number for the traversal is enough. This is possible because every dnode already stores the TXG number it was originally allocated in (dn->dn_allocated_txg). We can also leverage the fact that the traversal will strictly happen from lowest to highest numbered dnodes, which means we can split the problem up like this:

1. Newly allocated dnodes always update the quota ZAP

2. Freed dnodes update the quota ZAP as follows

  • if (dn->dn_object < dataset->scan_object): dnode has been traversed by the scan, update the quota ZAP
  • if (dn->dn_object >= dataset->scan_object): dnode has NOT been traversed by the scan, no update needed

3. Dnode traversal scan; this can be done with the existing dnode iterator

  • if (dn->dn_allocated_txg < feature_txg): dnode is not accounted for, update the quota ZAP
  • if (dn->dn_allocated_txg >= feature_txg): dnode is new and has been accounted for during create, no update needed

The traversal part of this would need to be done in small batches in a sync task. This would allow us to transactionally update the dataset->scan_object on disk so the traversal can be resumed, and it simplifies concurrency concerns. Doing it this way addresses my concerns above and nicely simplifies the logic and the amount of code needed. For example, all that's needed to abort the entire operation or disable the feature is to stop the traversal sync task (if running) and remove the two new ZAPs.
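
A minimal sketch of the two decisions above, assuming hypothetical per-dataset scan_object and feature_txg values (dn_object and dn_allocated_txg are the existing dnode fields mentioned in the comment):

/* On destroy: only dnodes the scan has already visited are in the quota ZAP. */
static boolean_t
freed_dnode_in_zap(const dnode_t *dn, uint64_t scan_object)
{
        return (dn->dn_object < scan_object);
}

/*
 * During the scan: dnodes allocated after the feature was enabled were
 * already counted at create time and must not be counted again.
 */
static boolean_t
scan_should_count(const dnode_t *dn, uint64_t feature_txg)
{
        return (dn->dn_allocated_txg < feature_txg);
}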

Comment by Gerrit Updater [ 15/Jun/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15294
Subject: LU-2435 osd-zfs: use zfs native dnode accounting
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a44dfc5b64383dddf5b18a683ab98ce5b7cd4da

Comment by Jinshan Xiong (Inactive) [ 15/Jun/15 ]

I pushed patch 15294 to gerrit to use ZFS dnode accounting for Lustre inode accounting. This is phase one of the work. The second phase is to launch a thread to iterate all dnodes in the MDT objset to repair accounting on legacy file systems.

Comment by Jinshan Xiong (Inactive) [ 16/Jun/15 ]

I'm going to start the 2nd phase of this work. At the time the dnode accounting feature is enabled, it will check whether the file system is a legacy one. In that case, it will launch a thread to iterate all dnodes in all datasets and repair dnode accounting. I have two concerns about this task:

1. this piece of code will become dead once the scanning task is done, and a new FS won't need it at all. Will the ZFS upstream team accept this code? Based on the same concern, I will make an independent patch based on https://github.com/zfsonlinux/zfs/pull/2577;
2. does it need to do anything special for snapshots and clones?

Thanks.

Comment by Andreas Dilger [ 17/Jun/15 ]

There is already existing code to do the dnode iteration to enable user/group block quota on an existing filesystem, and I suspect that this is still part of the ZFS code and may never be deleted. Hopefully there shouldn't be a need for much new code beyond the existing block quota iterator.

Comment by Brian Behlendorf [ 17/Jun/15 ]

To my knowledge there is no existing code for dnode accounting in ZFS. How to implement it in a reasonable way was described above but no one has yet done that work. However once it is done and merged we'll have to maintain it forever since old filesystems may always need to be upgraded.

Comment by Jinshan Xiong (Inactive) [ 25/Jun/15 ]

The proposal made by Andreas worked. This is what I have done:

0. prepare an 'old' lustre zfs backend and copy some files into the file system with user 'tstusr';
1. apply patch http://review.whamcloud.com/15180 to zfs repo and patch 15294 to lustre;
2. recompile and install zfs and lustre;
3. upgrade pool by 'zpool upgrade lustre-mdt1';
4. mount Lustre;
5. upgrade mdt to use native dnode accounting by 'lctl set_param osd-zfs.lustre-MDT0000.quota_native_dnused_upgrade';
6. and it worked. I can get correct dnode use accounting by 'lfs quota -u tstusr /mnt/lustre';

Comment by Peter Jones [ 13/Jul/15 ]

Jinshan

The link #15180 does not seem to work. Could you please confirm whether that is correct?

Peter

Comment by Jinshan Xiong (Inactive) [ 13/Jul/15 ]

I will work on this.

Comment by Andreas Dilger [ 11/Jan/16 ]

The latest version of the patch is actually https://github.com/zfsonlinux/zfs/pull/3983 and not 3723.

Comment by Alex Zhuravlev [ 27/Mar/16 ]

Each ZAP declaration adds ~0.5MB to the space/memory reservation, so dnode accounting adds ~1MB, which is 20% of a single-stripe object creation.

Comment by Jinshan Xiong (Inactive) [ 28/Mar/16 ]

Is ~0.5MB the worst case or the average? Can you please describe this in more detail?

Comment by Alex Zhuravlev [ 28/Mar/16 ]

It depends on how exactly we call dmu_tx_hold_zap(): the more specific the call, the fewer credits it needs. Unfortunately, when the key is specified, the declaration becomes very expensive due to the hash lock/lookup. Also, the final calculation depends on the inflation size:

asize = spa_get_asize(tx->tx_pool->dp_spa, towrite + tooverwrite);

uint64_t
spa_get_asize(spa_t *spa, uint64_t lsize)
{
        return (lsize * spa_asize_inflation);
}

int spa_asize_inflation = 24;
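
To show where those reservations come from, a hypothetical osd-zfs declare helper might hold the two accounting ZAPs like this (the helper and its arguments are illustrative and not the actual Lustre code; each dmu_tx_hold_zap() call contributes the per-ZAP credits discussed above):

static void
osd_declare_inode_accounting(dmu_tx_t *tx, uint64_t usr_zapobj,
    uint64_t grp_zapobj, const char *uid_key, const char *gid_key)
{
        /* one hold per accounting ZAP touched by the create/destroy */
        dmu_tx_hold_zap(tx, usr_zapobj, TRUE, uid_key); /* per-uid dnode count */
        dmu_tx_hold_zap(tx, grp_zapobj, TRUE, gid_key); /* per-gid dnode count */
}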

Comment by Gerrit Updater [ 20/Mar/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26090
Subject: LU-2435 osd-zfs: for test only
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eac70d59649d2ab3bd5b7d0acc94cbf67a430251

Comment by Gerrit Updater [ 30/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/15294/
Subject: LU-2435 osd-zfs: use zfs native dnode accounting
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 11afef00b6af407b8987076bd4f1ec9bc77eb75e

Comment by Peter Jones [ 30/Mar/17 ]

Landed for 2.10
