[LU-2435] inode accounting in osd-zfs is racy Created: 07/May/12 Updated: 15/Apr/18 Resolved: 30/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Brian Behlendorf | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | RZ_LS, llnl, quota | ||
| Issue Links: |
|
| Severity: | 3 |
| Epic: | server |
| Rank (Obsolete): | 3010 |
| Description |
|
Observed on the MDT during weekend testing with a mixed I/O workload. I agree failure here shouldn't be fatal; however, I'm not sure we should even bother telling the administrator. If it's true that there's nothing that needs to be (or can be) done, then there's no point in logging this to the console. Changing it to a CDEBUG() would be preferable. If, however, the administrator needs to do something (which I don't believe is the case), that should be part of the message.

2012-05-04 17:12:17 LustreError: 14242:0:(osd_handler.c:997:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000023a1:0xf3ed:0x0] from accounting ZAP for grp 36427 (-2)
2012-05-04 17:12:17 LustreError: 14242:0:(osd_handler.c:997:osd_object_destroy()) Skipped 1 previous similar message
2012-05-04 17:16:22 LustreError: 14244:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x20000239a:0x991a:0x0] from accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:30 LustreError: 5618:0:(osd_handler.c:2003:osd_object_create()) lcz-MDT0000: failed to add [0x20000239a:0x9d46:0x0] to accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:34 LustreError: 14241:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000013ca:0x1952b:0x0] from accounting ZAP for usr 36427 (-2)
2012-05-04 17:16:53 LustreError: 6778:0:(osd_handler.c:997:osd_object_destroy()) lcz-MDT0000: failed to remove [0x2000013ca:0x19e3f:0x0] from accounting ZAP for grp 36427 (-2)
2012-05-04 17:31:21 LustreError: 6778:0:(osd_handler.c:2003:osd_object_create()) lcz-MDT0000: failed to add [0x200002393:0x1ee18:0x0] to accounting ZAP for usr 36427 (-2)
2012-05-04 17:31:32 LustreError: 6770:0:(osd_handler.c:991:osd_object_destroy()) lcz-MDT0000: failed to remove [0x200002395:0x167e0:0x0] from accounting ZAP for usr 36427 (-2) |
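A minimal sketch of the suggested change, assuming the message format from the logs above (the variable names here are illustrative, not taken from osd_handler.c):

```c
/* Before: every failure is logged to the console. */
CERROR("%s: failed to remove "DFID" from accounting ZAP for usr %d (%d)\n",
       osd->od_svname, PFID(fid), uid, rc);

/* After: only visible with quota debugging enabled
 * (lctl set_param debug=+quota). */
CDEBUG(D_QUOTA, "%s: failed to remove "DFID" from accounting ZAP "
       "for usr %d (%d)\n", osd->od_svname, PFID(fid), uid, rc);
```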
| Comments |
| Comment by Johann Lombardi (Inactive) [ 15/May/12 ] |
|
Brian, i guess you have all those messages because you have an "old" filesystem which didn't have the accounting ZAPs when it was created. |
| Comment by Brian Behlendorf [ 15/May/12 ] |
|
Actually this was on a newly formatted filesystem. While I haven't looked closely, I was attributing this to perhaps another concurrent destroy operation or some other race. Unfortunately I don't have full logs; when I see it again I'll try and grab some debugging. |
| Comment by Johann Lombardi (Inactive) [ 21/Jun/12 ] |
|
You are right: the problem is that zap_increment() has no locking (it first looks up the value and then updates it), so inode accounting is currently racy with concurrent destroy/create. |
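For reference, a paraphrased sketch of zap_increment() from the ZFS sources of that era, with the race window marked (error handling trimmed):

```c
int
zap_increment(objset_t *os, uint64_t obj, const char *name, int64_t delta,
    dmu_tx_t *tx)
{
	uint64_t value = 0;
	int err;

	if (delta == 0)
		return (0);

	/* Step 1: read the current value... */
	err = zap_lookup(os, obj, name, 8, 1, &value);
	if (err != 0 && err != ENOENT)
		return (err);

	/*
	 * Race window: a concurrent create/destroy can run its own
	 * lookup+update here, so one of the two deltas is lost, and the
	 * entry may even be removed underneath us (hence the -2/ENOENT
	 * seen in the logs above).
	 */
	value += delta;

	/* Step 2: ...then write it back non-atomically. */
	if (value == 0)
		err = zap_remove(os, obj, name, tx);
	else
		err = zap_update(os, obj, name, 8, 1, &value, tx);
	return (err);
}
```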
| Comment by Johann Lombardi (Inactive) [ 19/Dec/12 ] |
|
Brian, do you still see those warnings? |
| Comment by Johann Lombardi (Inactive) [ 07/Jul/14 ] |
|
The race should be addressed by http://review.whamcloud.com/#/c/7157, which got reverted but will hopefully be re-landed soon.

# lctl set_param osd-zfs.*-MDT*.quota_iused_estimate=1 |
| Comment by Johann Lombardi (Inactive) [ 10/Jul/14 ] |
|
Brian, Alex, in your opinion, what is the best way to fix on-disk inode accounting? Is there an easy way in ZFS to iterate over all dnodes? Maybe we could implement a quiescent quota check in the ZFS OSD to fix accounting when it goes wrong, as in this case. |
| Comment by Brian Behlendorf [ 14/Jul/14 ] |
|
Johann, you can use the dmu_object_next() function to walk all of the dnodes in an object set. It will be a little tricky to do this online because, as Alex noticed in another issue, objects being actively allocated will be skipped. I wouldn't be opposed to adding a little debug patch or writing a utility to do this. But I think the take-away from this is that we need to be more ruthless about not letting these kinds of defects in. |
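A hedged sketch of such a walk using dmu_object_next(), assuming the standard ZFS signatures of the time (untested):

```c
/*
 * Walk every allocated dnode in an objset. Objects allocated while
 * the walk is running may be skipped, as noted above.
 */
uint64_t obj = 0;
int err;

while ((err = dmu_object_next(os, &obj, B_FALSE, 0)) == 0) {
	dmu_object_info_t doi;

	if (dmu_object_info(os, obj, &doi) != 0)
		continue;

	/* Tally per-uid/gid inode counts from the dnode here. */
}
/* err == ESRCH means the walk reached the end of the objset. */
```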
| Comment by Brian Behlendorf [ 14/Jul/14 ] |
|
Alternatively, since the Lustre code uses the DMU_GROUPUSED_OBJECT and the DMU_USERUSED_OBJECT, we could probably fix this with a file-level backup. By mounting each server through the POSIX layer and rsync'ing it to a new dataset, we'd effectively regenerate the correct quota accounting. This would strip off all the FIDs in the directories, but a subsequent lfsck should fix that. It would also be a nice way to verify that lfsck actually works correctly. There are other ways to tackle this as well, such as extending 'zpool scrub' or 'zfs send/recv', but I really don't like any of them. Better to just fix the root cause and ensure it never comes back than to add ugly code to handle this case and have to live with it forever. |
| Comment by Johann Lombardi (Inactive) [ 29/Jul/14 ] |
ok, i will have a look. Thanks.
Agreed, although i have always been opposed to maintaining space accounting inside the OSD. It would have been simpler if ZFS did inode accounting in the same way as block accounting.
DMU_{USER,GROUP}USED_OBJECT are used for block accounting, which works fine. The issue here is with inode accounting, which is maintained by osd-zfs since ZFS does not support it. Thanks. |
| Comment by Alex Zhuravlev [ 29/Jul/14 ] |
|
Johann, the patch does much the same as ZFS: an in-core structure to track the delta, then apply the delta at sync. |
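A hedged sketch of that approach; all names below are illustrative, not from the actual patch:

```c
/* In-core per-uid/gid delta accumulated at create/destroy time. */
typedef struct acct_delta {
	avl_node_t	ad_node;	/* keyed by ad_id */
	uint64_t	ad_id;		/* uid or gid */
	int64_t		ad_delta;	/* net inode count change */
} acct_delta_t;

/*
 * At sync time the deltas are folded into the accounting ZAP once,
 * from the single-threaded syncing context, so the lookup+update in
 * zap_increment() can no longer race.
 */
static void
acct_sync_one(objset_t *os, uint64_t zapobj, acct_delta_t *ad, dmu_tx_t *tx)
{
	char name[32];

	(void) snprintf(name, sizeof (name), "%llx", (u_longlong_t)ad->ad_id);
	VERIFY0(zap_increment(os, zapobj, name, ad->ad_delta, tx));
}
```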
| Comment by Brian Behlendorf [ 29/Jul/14 ] |
|
> It would have been simpler if ZFS did inode accounting in the same way as block accounting.

I'm not completely opposed to adding inode accounting into the ZFS code. That would allow us to integrate better with the existing Linux quota utilities, which would be nice. |
| Comment by Johann Lombardi (Inactive) [ 31/Jul/14 ] |
Great, i will provide a patch against the ZFS tree and then change the ZFS OSD to use ZFS inode accounting when available. Thanks. |
| Comment by Johann Lombardi (Inactive) [ 07/Aug/14 ] |
|
ZFS patch pending review: |
| Comment by Johann Lombardi (Inactive) [ 29/Oct/14 ] |
|
Reassign to Isaac as per our discussion by email |
| Comment by Andreas Dilger [ 25/Nov/14 ] |
|
Johann, Isaac, it looks like the patch got a review from Richard Yao on the pull request https://github.com/zfsonlinux/zfs/pull/2577:

I do not have time to do a full review, but I can give you some comments. First, there are some style issues with this:

./lib/libzfs/libzfs_dataset.c: 2606: continuation line not indented by 4 spaces

Second, I see that you bumped the ZFS version instead of using a feature flag. That is likely the logical choice, but unfortunately, Solaris has already used that version and reusing it will make an unfortunate situation with respect to platform incompatibility worse. This is why feature flags were designed, but so far, they have only been implemented for the zpool version because we had nothing that needed it for the ZFS version. This would merit feature flags on the ZFS version, so that would need to be done before it could be merged, provided that there are no other potential issues and this is portable enough that other OpenZFS implementations could merge it so that we remain compatible with them.

Can you update the patch to move this code forward? |
| Comment by Johann Lombardi (Inactive) [ 26/Nov/14 ] |
|
Yup, i am aware and discussed this a bit with Brian some time ago. Isaac has agreed to take over the patch and will be updating it. |
| Comment by Brian Behlendorf [ 01/Dec/14 ] |
|
During the OpenZFS summit, Isaac and I had a chance to discuss this. We came up with a nice clean solution based on Johann's initial patch. If Isaac has the time to do the heavy lifting on this, I'm happy to work with him to get the patch reviewed and into a form where it can be merged. There's a fair bit of work remaining to get it where it needs to be, but it should be pretty straightforward. Required functionality: |
|
| Comment by Isaac Huang (Inactive) [ 02/Dec/14 ] |
|
Two notes about the implementation:
I'll be able to begin the work later this week. |
| Comment by Andreas Dilger [ 02/Dec/14 ] |
|
It isn't clear how the snapshot will help. There will continue to be changes beyond the snapshot that would need to be accounted, so I don't think that will help. There is already a process for iterating all the dnodes to enable regular quota, so hopefully the same process can be used for inode accounting? As long as accounting is done incrementally and (IIRC) inodes are flagged so they are not counted twice, then this can proceed while the filesystem is in use. Perhaps an auxiliary bitmap could be kept while the initial accounting is enabled, to be deleted when finished? |
| Comment by Brian Behlendorf [ 02/Dec/14 ] |
|
> One way to enable it without rewriting every dnode is to take a snapshot

This is a good idea, but we can do better. If we were to use full snapshots, there is a significant downside: we'd need to take a snapshot per dataset, and it's not uncommon for pools to have 1000s, 10,000s, or 100,000s of datasets. Doubling this just to enable the feature is a bit heavy handed.

Luckily, for this specific case a full snapshot isn't needed. Storing the TXG number in which the feature was enabled and the per-dataset dnode number for the traversal is enough. This is possible because every dnode already stores the TXG number it was originally allocated in (dn->dn_allocated_txg). We can also leverage the fact that the traversal will strictly happen from lowest to highest numbered dnodes. That means we can split the problem up like this:

1. Newly allocated dnodes always update the quota ZAP.
2. Freed dnodes update the quota ZAP only if they have already been accounted (see the sketch after this list).
3. Dnode traversal scan: this can be done with the existing dnode iterator.

The traversal part of this would need to be done in small batches in a sync task. This would allow us to transactionally update the dataset->scan_object on disk so the traversal can be resumed, and it simplifies concurrency concerns. Doing it this way addresses my concerns above and nicely simplifies the logic and the amount of code needed. For example, all that's needed to abort the entire operation or disable the feature is to stop the traversal sync task (if running) and remove the two new ZAPs. |
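A hedged sketch of the decision described in step 2 above; quota_enable_txg and scan_object are hypothetical names, not fields from any actual patch:

```c
/*
 * Has this dnode already been counted in the quota ZAP? Freed dnodes
 * only subtract from the ZAP when this returns B_TRUE.
 */
static boolean_t
dnode_is_accounted(dnode_t *dn, uint64_t quota_enable_txg,
    uint64_t scan_object)
{
	/* Allocated after the feature was enabled: always accounted. */
	if (dn->dn_allocated_txg >= quota_enable_txg)
		return (B_TRUE);

	/* Pre-existing dnode: accounted only once the scan passed it. */
	return (dn->dn_object < scan_object);
}
```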
| Comment by Gerrit Updater [ 15/Jun/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15294 |
| Comment by Jinshan Xiong (Inactive) [ 15/Jun/15 ] |
|
I pushed patch 15294 to Gerrit to use ZFS dnode accounting for Lustre inode accounting. This is phase one of the work. The second phase is to launch a thread to iterate all dnodes in the objset of the MDT to repair the accounting of legacy file systems. |
| Comment by Jinshan Xiong (Inactive) [ 16/Jun/15 ] |
|
I'm going to start the second phase of this work. When the dnode accounting feature is enabled, it will check whether the file system is a legacy one. In that case, it will launch a thread to iterate all dnodes in all datasets and repair the dnode accounting. I have two concerns for this task: 1. this piece of code will become dead once the scanning task is done, and new file systems won't need it at all. Will the upstream ZFS team accept this code? Based on the same concern, I will make an independent patch based on https://github.com/zfsonlinux/zfs/pull/2577. Thanks. |
| Comment by Andreas Dilger [ 17/Jun/15 ] |
|
There is already existing code to do the dnode iteration to enable user/group block quota on an existing filesystem, and I suspect that this is still part of the ZFS code, and may never be deleted. There hopefully shouldn't be the need for much new code beyond the existing block quota iterator. |
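For reference, the existing block-accounting upgrade path in ZFS looks roughly like this (the function names are from the ZFS sources of the time; the surrounding logic is a sketch):

```c
/*
 * dmu_objset_userspace_upgrade() walks every object in the objset and
 * regenerates DMU_USERUSED_OBJECT / DMU_GROUPUSED_OBJECT, so an
 * analogous walk could seed the inode accounting ZAPs.
 */
if (!dmu_objset_userspace_present(os)) {
	int err = dmu_objset_userspace_upgrade(os);

	if (err != 0)
		return (err);	/* e.g. EINTR if the upgrade was interrupted */
}
```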
| Comment by Brian Behlendorf [ 17/Jun/15 ] |
|
To my knowledge there is no existing code for dnode accounting in ZFS. How to implement it in a reasonable way was described above but no one has yet done that work. However once it is done and merged we'll have to maintain it forever since old filesystems may always need to be upgraded. |
| Comment by Jinshan Xiong (Inactive) [ 25/Jun/15 ] |
|
The proposal made by Andreas worked. This is what I have done:

0. prepare an 'old' Lustre ZFS backend and copy some files into the file system with user 'tstusr'; |
| Comment by Peter Jones [ 13/Jul/15 ] |
|
Jinshan, the link #15180 does not seem to work. Could you please confirm whether that is correct? Peter |
| Comment by Jinshan Xiong (Inactive) [ 13/Jul/15 ] |
|
I will work on this. |
| Comment by Andreas Dilger [ 11/Jan/16 ] |
|
The latest version of the patch is actually https://github.com/zfsonlinux/zfs/pull/3983 and not 3723. |
| Comment by Alex Zhuravlev [ 27/Mar/16 ] |
|
Each ZAP declaration adds ~0.5MB to the space/memory reservation, so dnode accounting adds ~1MB, which is 20% of a single-stripe object creation. |
| Comment by Jinshan Xiong (Inactive) [ 28/Mar/16 ] |
|
Is ~0.5MB the worst case or the average? Can you please describe this in more detail? |
| Comment by Alex Zhuravlev [ 28/Mar/16 ] |
|
It depends on how exactly we call dmu_tx_hold_zap(): the more specific the hold, the fewer the credits. Unfortunately, when the key is specified, the declaration becomes very expensive due to the hash lock/lookup. Also, the final calculation depends on the inflation size; see spa_get_asize(spa_t *spa, uint64_t lsize) and int spa_asize_inflation = 24; |
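The inflation Alex refers to, paraphrased from the ZFS sources of the time (the comment is a summary, not the original):

```c
/*
 * Worst-case allocation-size inflation: every logical byte held in a
 * transaction is assumed to cost up to 24x on disk (copies, indirect
 * blocks, parity/metadata overhead, etc.).
 */
int spa_asize_inflation = 24;

uint64_t
spa_get_asize(spa_t *spa, uint64_t lsize)
{
	return (lsize * spa_asize_inflation);
}
```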
| Comment by Gerrit Updater [ 20/Mar/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26090 |
| Comment by Gerrit Updater [ 30/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/15294/ |
| Comment by Peter Jones [ 30/Mar/17 ] |
|
Landed for 2.10 |