Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0
    • Labels: None

    Description

      Unlanded patches exist in upstream ZFS to increase the dnode size. These need to be evaluated for their impact (hopefully an improvement) on Lustre metadata performance on ZFS MDTs:

      https://github.com/zfsonlinux/zfs/pull/3542

    Activity

            [LU-8068] Large ZFS Dnode support
            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20367/
            Subject: LU-8068 osd-zfs: large dnode support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9765c6174ef580fb4deef4e7faea6d5ed634b00f

            gerrit Gerrit Updater added a comment -

            Ned Bass (bass6@llnl.gov) uploaded a new patch: http://review.whamcloud.com/20367
            Subject: LU-8068 osd-zfs: large dnode compatibility
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f0d8afaec213a7f471c3b22b9940de5c5cd192e3

            nedbass Ned Bass (Inactive) added a comment -

            We're currently testing with the following patch to mitigate the performance impact of metadnode backfilling. It uses a naive heuristic (rescan after 4096 unlinks, at most once per txg), but it is simple and probably achieves 99% of the performance to be gained here.

            https://github.com/LLNL/zfs/commit/050b0e69
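A minimal sketch of that heuristic, assuming hypothetical names and a simplified single-threaded form (the actual change is in the LLNL/zfs commit linked above):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch only: rescan the metadnode for holes after 4096 unlinks,
 * at most once per txg. Names and layout are illustrative, not the
 * real patch. */
#define RESCAN_UNLINK_THRESHOLD 4096

struct objset_hint {
    uint64_t os_unlinked;     /* unlinks since the last rescan */
    uint64_t os_rescan_txg;   /* txg in which we last rescanned */
    uint64_t os_alloc_cursor; /* next object number to try */
};

/* Called whenever a dnode is freed. */
static void hint_note_unlink(struct objset_hint *h)
{
    h->os_unlinked++;
}

/* Called before allocating a new dnode: decide whether to restart the
 * metadnode scan from the beginning to reclaim freed slots. */
static bool hint_should_rescan(struct objset_hint *h, uint64_t cur_txg)
{
    if (h->os_unlinked >= RESCAN_UNLINK_THRESHOLD &&
        cur_txg > h->os_rescan_txg) {
        h->os_unlinked = 0;
        h->os_rescan_txg = cur_txg;
        h->os_alloc_cursor = 0; /* restart the scan from object 0 */
        return true;
    }
    return false; /* keep scanning forward from the cursor */
}
```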
            adilger Andreas Dilger added a comment - - edited

            The large dnode patch is blocked behind https://github.com/zfsonlinux/zfs/pull/4460 which is the performance problem that Alex and Ned identified, but currently that patch is only a workaround and needs to be improved before landing. I've described in that ticket what seems to be a reasonable approach for making a production-ready solution, but to summarize:

            • by default the dnode allocator should just use a counter that continues at the next file offset (as in the existing 4460 patch)
            • if dnodes are being unlinked, a (per-cpu?) counter of unlinked dnodes and the minimum unlinked dnode number should be tracked (these values could be racy since it isn't critical that their values be 100% accurate)
            • when the unlinked dnode counter exceeds some threshold (e.g. 4x number of inodes created in previous TXG, or 64x the number of dnodes that fit into a leaf block, or some tunable number of unlinked dnodes specified by userspace) then scanning should restart at the minimum unlinked dnode number instead of "0" to avoid scanning a large number of already-allocated dnode blocks
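The three points above could be sketched roughly as follows. This is a hypothetical, simplified single-threaded form: the names are invented, the counter is not per-cpu, and only the "4x inodes created in the previous TXG" threshold is shown.

```c
#include <stdint.h>

/* Illustrative allocator state, not the actual ZFS pull-request code. */
struct dn_alloc_state {
    uint64_t next_obj;        /* default: keep allocating forward */
    uint64_t unlinked_cnt;    /* approximate; may be racy in reality */
    uint64_t min_unlinked;    /* lowest object number freed so far */
    uint64_t prev_txg_allocs; /* dnodes created in the previous txg */
};

/* Threshold choice per the summary; could instead be 64x the dnodes
 * per leaf block, or a userspace tunable. */
static uint64_t rescan_threshold(const struct dn_alloc_state *s)
{
    return 4 * s->prev_txg_allocs;
}

/* Track unlinks and the minimum unlinked object number. */
static void dn_note_free(struct dn_alloc_state *s, uint64_t obj)
{
    s->unlinked_cnt++;
    if (s->min_unlinked == 0 || obj < s->min_unlinked)
        s->min_unlinked = obj;
}

/* Pick where the metadnode scan should start for the next allocation:
 * restart at the minimum unlinked object (not 0) once enough dnodes
 * have been freed, otherwise continue forward. */
static uint64_t dn_scan_start(struct dn_alloc_state *s)
{
    if (s->unlinked_cnt > rescan_threshold(s)) {
        uint64_t start = s->min_unlinked; /* skip full blocks below it */
        s->unlinked_cnt = 0;
        s->min_unlinked = 0;
        return start;
    }
    return s->next_obj; /* common case: next file offset */
}
```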

            Alex, in order to move the large dnode patch forward, could you or Nathaniel work on an updated 4460 patch so that we can get on with landing the large dnode patch?


            bzzz Alex Zhuravlev added a comment -

            Ned refreshed the patch to address that performance issue, and now it's doing much better.
            First of all, I'm now able to complete tests where I was previously getting OOM (because of huge memory consumption by the 8K spill, I guess).
            Now it makes sense to benchmark on real storage, as the amount of I/O with this patch is a few times less:
            1K vs (512-byte dnode + 8K spill) per dnode, or 976MB vs 8300MB per 1M dnodes.
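A quick back-of-the-envelope check of those totals; the 976MB and 8300MB figures follow directly from the per-dnode sizes:

```c
#include <stdint.h>

/* Megabytes (MiB) consumed on disk by n dnodes of per_dnode bytes each. */
static double footprint_mb(uint64_t n, uint64_t per_dnode)
{
    return (double)(n * per_dnode) / (1024.0 * 1024.0);
}

/* footprint_mb(1000000, 1024)       ~= 976.6 MB  (one 1K dnode, no spill)
 * footprint_mb(1000000, 512 + 8192) ~= 8300.8 MB (512B dnode + 8K spill) */
```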

            bzzz Alex Zhuravlev added a comment -

            Tried Lustre with that patch (on top of master ZFS):
            before:
            mdt 1 file 500000 dir 1 thr 1 create 21162.48 [ 18998.73, 22999.10]
            after:
            mdt 1 file 500000 dir 1 thr 1 create 18019.70 [ 15999.09, 19999.20]

            osd-zfs was modified to ask for 1K dnodes, verified with zdb:
            Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
             10000    1   16K   512      0      1K    512   0.00  ZFS plain file

            Notice the zero dsize, meaning no spill block was allocated.

            bzzz Alex Zhuravlev added a comment -

            clean zfs/master:
            Created 1000000 in 29414ms in 1 threads - 33997/sec
            Created 1000000 in 20045ms in 2 threads - 49887/sec
            Created 1000000 in 19259ms in 4 threads - 51923/sec
            Created 1000000 in 17284ms in 8 threads - 57856/sec

            zfs/master + large dnodes:
            Created 1000000 in 40618ms in 1 threads - 24619/sec
            Created 1000000 in 28142ms in 2 threads - 35534/sec
            Created 1000000 in 25731ms in 4 threads - 38863/sec
            Created 1000000 in 25244ms in 8 threads - 39613/sec
            bzzz Alex Zhuravlev added a comment - - edited

            Hmm, I was using the old version; let me try the new one. This will take some time, as the patch doesn't apply to 0.6.5.


            bzzz Alex Zhuravlev added a comment -

            Yes, I was about to play with the code, but got confused by that performance issue. And yes, 1K should be more than enough: the LinkEA would be 48+ bytes, the LOVEA something like 56+, then LMA and VBR (which I'd hope we can put into the ZPL dnode, but in the worst case that's another 24+8 bytes).
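A rough budget check of those EA sizes against a 1K dnode. The 64-byte dnode core and one 128-byte block pointer reflect the ZFS on-disk dnode layout as I understand it; the EA sizes are the estimates from the comment, and real SA storage adds layout overhead not counted here:

```c
#include <stdint.h>

/* Approximate bonus space in a dnode of dnsize bytes: total size minus
 * the 64-byte core and one 128-byte block pointer (sketch, not the
 * exact ZFS macro). */
#define DNODE_CORE_SIZE 64
#define BLKPTR_SIZE     128
#define DN_BONUS_SIZE(dnsize) ((dnsize) - DNODE_CORE_SIZE - BLKPTR_SIZE)

/* Do the estimated Lustre EAs fit in the bonus area of this dnode size?
 * LinkEA 48 + LOVEA 56 + LMA 24 + VBR 8 = 136 bytes (lower bounds). */
static int eas_fit(uint64_t dnsize)
{
    uint64_t need = 48 + 56 + 24 + 8;
    return need <= DN_BONUS_SIZE(dnsize);
}
```

Under these assumptions a 1K dnode leaves roughly 832 bytes of bonus space, with comfortable headroom over the ~136-byte estimate, consistent with "1K should be more than enough".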

            People

              bzzz Alex Zhuravlev
              adilger Andreas Dilger
              Votes: 0
              Watchers: 19