[LU-5391] osd-zfs: ZAP objects use 4K blocks for both indirect and leaf blocks Created: 22/Jul/14 Updated: 26/Sep/14 Resolved: 26/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Isaac Huang (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prz, zfs |
| Severity: | 3 |
| Rank (Obsolete): | 15006 |
| Description |
|
For example, on an MDS the ZAP object backing an oi.xx directory uses 4K blocks for both its indirect and leaf blocks, because __osd_zap_create() creates ZAP objects with a 4K block shift for both. This seemed inefficient: the default leaf block size for a fat ZAP is 16K. I changed the block sizes to 16K indirect and 128K leaf and saw a 10% increase in mds-survey creation rates. |
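A minimal sketch of the kind of ZAP creation call involved (not the verbatim __osd_zap_create() body; the object type, ZAP flags, and bonus settings below are illustrative assumptions), showing how the block shifts map to the 4K sizes described above:

    /* Sketch only: zap_create_flags() takes the leaf and indirect block
     * shifts directly; a shift of 12 gives 4K (1 << 12) blocks. */
    static uint64_t osd_zap_create_sketch(objset_t *os, dmu_tx_t *tx)
    {
            return zap_create_flags(os, 0 /* normflags */, ZAP_FLAG_HASH64,
                                    DMU_OT_DIRECTORY_CONTENTS,
                                    12 /* leaf_blockshift: 4K leaves */,
                                    12 /* indirect_blockshift: 4K indirect blocks */,
                                    DMU_OT_SA, 0 /* bonuslen */, tx);
    }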
| Comments |
| Comment by Isaac Huang (Inactive) [ 22/Jul/14 ] |
|
Brian or Alex, can you comment? |
| Comment by Alex Zhuravlev [ 22/Jul/14 ] |
|
4K does seem small, of course, but I don't think 128K is good either. It might be OK for relatively small directories, but when a directory is big, an evenly distributed load will touch many different blocks, leading to very low write density: we'd have to do 128K of I/O to modify only a few entries. I'd suggest staying with the default 16K. |
| Comment by Isaac Huang (Inactive) [ 22/Jul/14 ] |
|
I agree that 128K was too aggressive. I changed to 16K/16K and still got 7.4% and 14% increases (over the current 4K/4K) in mds-survey creation and destroy rates respectively. I'd suggest increasing the indirect block size to 16K, which is the default indirect block size used by ZPL directories (e.g. / on an MDT), to reduce the levels of indirection, and increasing the leaf block size to 16K, which matches fzap_default_block_shift. |
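In terms of the sketch in the description, the 16K/16K proposal amounts to passing a block shift of 14 for both arguments (a hedged illustration with the same caveats as before; 1 << 14 = 16384 bytes, matching fzap_default_block_shift = 14 in the ZFS ZAP code):

    /* Proposed: 16K leaf and indirect blocks instead of 4K/4K. */
    return zap_create_flags(os, 0, ZAP_FLAG_HASH64, DMU_OT_DIRECTORY_CONTENTS,
                            14 /* leaf_blockshift: 16K leaves */,
                            14 /* indirect_blockshift: 16K indirect blocks */,
                            DMU_OT_SA, 0, tx);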
| Comment by Isaac Huang (Inactive) [ 22/Jul/14 ] |
|
Patch pushed to http://review.whamcloud.com/#/c/11182/ |
| Comment by Brian Behlendorf [ 22/Jul/14 ] |
|
This is going to be a trade-off between memory usage, bandwidth, and IO/s. I think adopting the ZPL default of 16K strikes a good balance, but I'd be careful about drawing any performance conclusions. This may help creation performance but it will hurt other workloads. For example, I'd expect that increasing the OI leaf block size on the MDS would improve performance as long as the entire OI can be cached (small filesystems). But once the OI size is significantly larger than memory (large filesystems) I see two downsides: 1) pulling in a larger block takes slightly longer, and 2) because the FIDs are hashed uniformly over the leaves, this effectively reduces by 4x the cached working set size for a given set of FIDs. |
| Comment by Isaac Huang (Inactive) [ 22/Jul/14 ] |
|
Brian, thanks for the comment: |
| Comment by Brian Behlendorf [ 22/Jul/14 ] |
|
> the FIDs in the working set are so sparse that each leaf block holds only one...

Yes, exactly. From everything I've seen the FIDs are distributed very uniformly over the leaves. This is a good thing and a bad thing. So let's say you have a set of 1024 files, all of which are being frequently accessed, so the OI leaf blocks all end up on the ARC's MFU list. If the OI FatZAP has enough total entries, my expectation would be that each entry would hash to a different leaf. So with 4K leaf blocks you'll consume roughly 4M of memory; with 16K leaf blocks you're looking at 16M for the same workload. This may still be a reasonable trade-off to make, but it's something which should be considered. |
| Comment by Peter Jones [ 26/Aug/14 ] |
|
Landed for 2.7 |