Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.9.0

    Description

      Unlanded patches exist in upstream ZFS to increase the dnode size; they need to be evaluated for their impact (hopefully an improvement) on Lustre metadata performance on ZFS MDTs:

      https://github.com/zfsonlinux/zfs/pull/3542
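      (For context: as the feature eventually landed in ZFS 0.7, it is gated by the feature@large_dnode pool feature, the default size is chosen per dataset via the dnodesize property, and callers can also request a dnode size per object; the patch revisions discussed in the comments below were still evolving toward that form.)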

    Activity

            [LU-8068] Large ZFS Dnode support

            Ned refreshed the patch to address that performance issue and now it's doing much better.
            First of all, I'm now able to complete some tests where I was previously getting OOM (because of the huge memory consumption by the 8K spill blocks, I guess).
            Now it makes sense to benchmark on real storage, as the amount of I/O with this patch is a few times smaller:
            1K per dnode vs (512-byte dnode + 8K spill), or 976MB vs 8300MB per 1M dnodes.

            bzzz Alex Zhuravlev added a comment
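            (Checking the arithmetic: 1,000,000 × 1 KiB ≈ 976.6 MiB, while 1,000,000 × (512 B + 8 KiB) = 1,000,000 × 8,704 B ≈ 8,300.8 MiB, matching the figures above.)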

            tried Lustre with that patch (on top of master ZFS):
            before:
            mdt 1 file 500000 dir 1 thr 1 create 21162.48 [ 18998.73, 22999.10]
            after:
            mdt 1 file 500000 dir 1 thr 1 create 18019.70 [ 15999.09, 19999.20]

            osd-zfs was modified to ask for 1K dnodes, verified with zdb:
            Object lvl iblk dblk dsize dnsize lsize %full type
            10000 1 16K 512 0 1K 512 0.00 ZFS plain file

            Notice the zero dsize, meaning no spill block was allocated.

            bzzz Alex Zhuravlev added a comment

            clean zfs/master:
            Created 1000000 in 29414ms in 1 threads - 33997/sec
            Created 1000000 in 20045ms in 2 threads - 49887/sec
            Created 1000000 in 19259ms in 4 threads - 51923/sec
            Created 1000000 in 17284ms in 8 threads - 57856/sec

            zfs/master + large dnodes:
            Created 1000000 in 40618ms in 1 threads - 24619/sec
            Created 1000000 in 28142ms in 2 threads - 35534/sec
            Created 1000000 in 25731ms in 4 threads - 38863/sec
            Created 1000000 in 25244ms in 8 threads - 39613/sec

            bzzz Alex Zhuravlev added a comment

            Hmm, I was using an old version; let me try the new one. This will take some time - the patch doesn't apply to ZFS 0.6.5.

            bzzz Alex Zhuravlev added a comment (edited)

            Yes, I was about to play with the code, but got confused by that performance issue. And yes, 1K should be more than enough: the LinkEA would be 48+ bytes, the LOVEA is something like 56+ bytes, then LMA and VBR (which I'd hope we can put into the ZPL dnode, but in the worst case it's another 24+8 bytes).

            bzzz Alex Zhuravlev added a comment
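            (Summing the sizes quoted above: 48 + 56 + 24 + 8 ≈ 136 bytes of Lustre xattr data plus some SA layout overhead, which leaves comfortable headroom in a 1K dnode's bonus area compared with the roughly 320 bytes available in a legacy 512-byte dnode.)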

            Alex, your comment is on an old version of the patch, and not on the main pull request (https://github.com/zfsonlinux/zfs/pull/3542), so I don't think Ned will be looking there? Also, hopefully you are not using this old version of the patch (8f9fdb228), but rather the newest patch (ba39766)?

            adilger Andreas Dilger added a comment

            Rereading the large dnode patch, it seems that the caller can specify the dnode size on a per-dnode basis, so ideally we can add support for this to the osd-zfs code, but if not specified it will take the dataset property. Is 1KB large enough to hold the dnode + LOVEA + linkea + FID?

            adilger Andreas Dilger added a comment
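            (For reference, a minimal sketch of the per-object interface described above, using the names the feature eventually landed under in ZFS; the exact signature in the patch revisions discussed in this ticket may have differed, and the helper name is made up for illustration. DN_BONUS_SIZE() is assumed to give the usable bonus space of a dnode of the given size.)

                /* Sketch: allocate a new object with an explicit 1K dnode
                 * instead of the 512-byte default. Needs <sys/dmu.h> and
                 * <sys/dnode.h> from the ZFS kernel headers; transaction
                 * setup and error handling are elided. */
                static uint64_t
                osd_alloc_object_1k(objset_t *os, dmu_tx_t *tx)
                {
                        int dnodesize = 1024;

                        /* blocksize 0 lets the DMU choose; a DMU_OT_SA bonus
                         * lets xattrs stored as SAs use the enlarged bonus
                         * area instead of a separate spill block. */
                        return (dmu_object_alloc_dnsize(os,
                            DMU_OT_PLAIN_FILE_CONTENTS, 0, DMU_OT_SA,
                            DN_BONUS_SIZE(dnodesize), dnodesize, tx));
                }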

            Andreas, I've already made a comment on GitHub, no reply so far. Hope Ned has seen it.

            So far I've tested large dnodes with ZPL only and noticed significant degradation, so I took a timeout hoping to see comments from Ned.
            I haven't tested Lustre with large dnodes yet.

            The patch allows asking for a dnode of a specific size, and I think we can do this given that we declare everything (including a LOVEA of known size) ahead of time.
            We can easily track this in the OSD.

            bzzz Alex Zhuravlev added a comment
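            (Purely illustrative of the point above about declaring everything ahead of time: a hypothetical helper the OSD could use to round the declared SA/xattr bytes up to a dnode size whose bonus area can hold them. None of these names are real osd-zfs symbols.)

                /* Hypothetical: pick the smallest dnode size whose bonus area
                 * fits everything declared for this object (LMA + LOVEA +
                 * linkEA + ..., including any SA layout overhead the caller
                 * accounts for). */
                static int
                osd_pick_dnodesize(int declared_sa_bytes)
                {
                        int dnsize;

                        /* walk the power-of-two dnode sizes up to the 16K
                         * maximum; fall through to 16K if nothing smaller
                         * is big enough */
                        for (dnsize = 512; dnsize < 16384; dnsize <<= 1) {
                                if (DN_BONUS_SIZE(dnsize) >= declared_sa_bytes)
                                        break;
                        }
                        return (dnsize);
                }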

            Alex, could you please post on the patch in GitHub so the LLNL folks can see it?

            Also, it isn't clear what the difference is between your two tests. In the first case you wrote that the create rate is down from 29k to 20k - is that the ZPL create rate? I don't expect this feature to help the non-Lustre case, since ZPL doesn't use SAs that can fit into the large dnode space, so it is just overhead.

            In the second case you wrote the create rate is up from 13k to 20k when you shrink the LOVEA, so presumably this is Lustre, but without the large dnode patch?

            What is the performance with Lustre with normal LOVEA size (1-4 stripes) + large dnodes? Presumably that would be 13k +/- some amount, not 29k +/- some amount?

            Also, my (vague) understanding of this patch is that it dynamically allocates space for the dnode, possibly using up space for dnode numbers following it? Does this fail if the dnode is not declared large enough for all future SAs during the initial allocation? IIRC, the osd-zfs code stores the layout and link xattrs to the dnode in a separate operation, which may make the large dnode patch ineffective. It may also have problems with multiple threads allocating dnodes from the same block in parallel, since it doesn't know at dnode allocation time how large the SA space Lustre eventually needs. Maybe my understanding of how this feature was implemented is wrong?

            adilger Andreas Dilger added a comment

            I tried this patch with createmany on directly mounted ZFS. It degrades create performance from ~29K/sec to ~20K/sec. I'm not sure how quickly this degradation can be addressed, but in general the large dnode patch looks very important. To simulate it I tweaked the code to shrink the LOVEA to just a few bytes so that we fit in the bonus buffer, and this brought the creation rate from ~13K to ~20K in mds-survey.

            bzzz Alex Zhuravlev added a comment
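            (The gain comes from keeping all of the SAs inside the dnode's bonus buffer: once they fit, ZFS does not need to allocate and write a separate spill block per file, which is the extra I/O the large dnode feature is meant to eliminate for Lustre.)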

            It looks like this feature has some conflicts with storing xattrs as system attributes, which blocks further performance benchmarking. I'm waiting for the author's response to move forward.

            jay Jinshan Xiong (Inactive) added a comment

            People

              bzzz Alex Zhuravlev
              adilger Andreas Dilger
              Votes: 0
              Watchers: 19
