[LU-822] allow multiple Object Index files to be created Created: 03/Nov/11 Updated: 28/Aug/15 Resolved: 05/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Story Points: | 2 | ||||||||
| Rank (Obsolete): | 4764 | ||||||||
| Description |
|
Per discussion in email, having multiple Object Index files improves performance of the MDS significantly. It appears from initial discussion that this problem is more directly correlated to concurrent access to the OI, and not as much to the number of entries in the OI. While I agree it is a good idea to also investigate why the single OI has a concurrency problem, it also makes sense to be able to add multiple OIs for testing, and potentially for production use.

Care must be taken to ensure that there are no compatibility issues introduced if the new OSD can handle multiple OIs, but the filesystem was formatted with only a single OI and upgraded, or if it was upgraded and then downgraded. Lustre does not support formatting the filesystem with a new version of Lustre and then downgrading to an older version.

For OSDs with multiple OI files it will use filenames oi.[0..OSD_OI_FID_NR-1]. We can start with OSD_OI_FID_NR=32 for now, but this should be flexible. The oi.N used for (SEQ % OSD_OI_FID_NR) will put FIDs from a single client in a single OI (for locality during single-threaded creates) and distribute multiple clients across multiple OIs fairly uniformly to minimize contention, because the FID SEQ values are allocated sequentially.

I am aware that "oi.16" previously had a different meaning, namely that oi.16 was the size of the FID, not an OI index. However, the previous use of "oi.5" was a benchmark hack that only worked for the first 32k

In order to address these issues, I would recommend proceeding as follows:
As part of lfsck Phase I, if any oi.N is missing (except in the special case of only oi.16 existing) it should be recreated and lfsck triggered to do a full OI scrub/rebuild (the OI count may be completely transparent to lfsck, I'm not sure yet). |
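The SEQ-to-OI mapping described above can be sketched in user-space C. This is only an illustration of the (SEQ % OSD_OI_FID_NR) rule and the oi.N naming from the description; the helper names oi_index_for_seq() and oi_name_for_seq() are hypothetical and are not taken from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 32 OI files, the starting value proposed in the description. */
#define OSD_OI_FID_NR 32

/* Hypothetical helper: index of the OI that holds FIDs from sequence `seq`. */
static unsigned int oi_index_for_seq(uint64_t seq)
{
        return (unsigned int)(seq % OSD_OI_FID_NR);
}

/*
 * Hypothetical helper: write the OI file name "oi.N" for `seq` into buf.
 * Returns the snprintf() result, i.e. the length of the formatted name.
 */
static int oi_name_for_seq(uint64_t seq, char *buf, size_t len)
{
        assert(buf != NULL && len > 0);
        return snprintf(buf, len, "oi.%u", oi_index_for_seq(seq));
}
```

Because every FID allocated from one sequence shares the same SEQ, all of them map to the same oi.N, while sequentially allocated SEQ values from different clients step through the OI files in turn.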
| Comments |
| Comment by Liang Zhen (Inactive) [ 04/Nov/11 ] |
|
more test results for multiple OIs http://jira.whamcloud.com/secure/attachment/10580/multiple_OIs.pdf
So I think we have solid evidence that multiple OIs is helpful for performance. |
| Comment by Liang Zhen (Inactive) [ 08/Nov/11 ] |
|
I still can't fully understand the implementation of IAM, but I think at least one scalability issue of a single OI is that dynlock can't scale to hundreds or thousands of threads: all locks are linked on a single list, and dynlock_lock() searches the list twice and performs one memory allocation, whereas htree_lock uses a skiplist and searches only once. This can be optimized, but then we will have another complex lock implementation. I'm not sure whether we can use htree_lock to replace dynlock; I think dynlock allows nested locking. Protection of splitting could be another issue, but I'm not quite sure. Anyway, I think the first issue we need to sort out is which part of those functions can be put in 2.2 and which part should be put in Fanyong's OI scrub/rebuild. Liang |
| Comment by Andreas Dilger [ 08/Nov/11 ] |
|
I'm happy to add the code to allow multiple OIs, and for new filesystems it can use them. Old filesystems can continue to use a single OI, until such a time as OI scrub can repair this automatically, as described in my previous posting. Given that we know multiple OIs scale well, I think the simple effort there is better spent than optimizing the dynlock locking (which is not going to be used anywhere else, AFAIK). |
| Comment by Liang Zhen (Inactive) [ 24/Nov/11 ] |
|
Andreas, I actually have a question about your first comment: I'm not quite sure what "flexible" means here. Do you mean we want to increase the number of OI containers dynamically? Could you explain a little about this? |
| Comment by Andreas Dilger [ 25/Nov/11 ] |
|
By "flexible" I mean that the code should be able to handle a reasonably arbitrary number of OI containers (e.g. 1, 32, 64, etc.) found at mount time, even if the code initially creates new filesystems with 32 OIs. This will allow the code to adapt in case the OI count on new filesystems changes again for some reason. There is no need to dynamically balance the OI count at runtime, just to allow the same code to detect the number of OIs at mount time on different filesystems which might have different OI counts for whatever reason (different default value in the future, reformatted code, etc.). |
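One way to picture mount-time detection is a probe that counts consecutive oi.N files. The sketch below assumes a hypothetical lookup callback (oi_exists_fn) and a hypothetical probe limit (OI_PROBE_MAX); it does not handle the legacy single "oi.16" file and does not reflect how the landed patch actually scans the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define OI_PROBE_MAX 64   /* hypothetical upper bound on OI files to probe */

/* Hypothetical callback: returns true if a file called `name` exists. */
typedef bool (*oi_exists_fn)(const char *name, void *ctx);

/*
 * Count consecutive oi.0, oi.1, ... files, stopping at the first missing
 * one. Returns 0 when not even oi.0 is present.
 */
static unsigned int oi_detect_count(oi_exists_fn exists, void *ctx)
{
        char name[16];
        unsigned int n;

        assert(exists != NULL);
        for (n = 0; n < OI_PROBE_MAX; n++) {
                snprintf(name, sizeof(name), "oi.%u", n);
                if (!exists(name, ctx))
                        break;
        }
        return n;
}

/*
 * Stand-in for a real directory lookup: pretends that oi.0 .. oi.(N-1)
 * exist, where N is the unsigned int that ctx points to.
 */
static bool oi_exists_below(const char *name, void *ctx)
{
        unsigned int idx;

        if (sscanf(name, "oi.%u", &idx) != 1)
                return false;
        return idx < *(const unsigned int *)ctx;
}
```

With this shape the same code reports 1, 32, or 64 depending only on what is found on disk, which is the behaviour the comment asks for.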
| Comment by Liang Zhen (Inactive) [ 08/Dec/11 ] |
|
patch is here: http://review.whamcloud.com/#change,1822 |
| Comment by Build Master (Inactive) [ 08/Jan/12 ] |
|
Integrated in Result = FAILURE
|
| Comment by Build Master (Inactive) [ 08/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Liang Zhen (Inactive) [ 09/Jan/12 ] |
|
patch landed |
| Comment by Mikhail Pershin [ 10/Jan/12 ] |
|
I am re-opening this bug because it introduced an issue on 32-bit architectures. osd_fid2oi():
return &osd->od_oi_table[fid->f_seq % osd->od_oi_count];
That works only on 64-bit and complains about a missing __umoddi3 symbol on 32-bit. In Lustre we usually use do_div() for such divisions, but it is better to use div_u64_rem(), as it returns both the result and the remainder without changing the dividend in place. Maybe this should just be a new bug; this is up to you. |
| Comment by Andreas Dilger [ 10/Jan/12 ] |
|
If we limit od_oi_count to a power-of-two value (which is reasonable and can easily be checked in the one location that it can be set) then this can mask off the low 32 bits and then do modulus, or even just mask: return &osd->od_oi_table[fid->f_seq & (osd->od_oi_count - 1)]; |
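A small user-space sketch of the power-of-two approach: validate the count once, then select the OI with a mask instead of a 64-bit modulus. The names oi_count_is_valid() and oi_index_masked() are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when count is a non-zero power of two (1, 2, 4, ..., 32, 64, ...). */
static bool oi_count_is_valid(uint32_t count)
{
        return count != 0 && (count & (count - 1)) == 0;
}

/*
 * Select an OI index by masking. For a power-of-two count this gives the
 * same result as seq % count, but only the low 32 bits of seq are used,
 * so no 64-bit division is required.
 */
static uint32_t oi_index_masked(uint64_t seq, uint32_t count)
{
        assert(oi_count_is_valid(count));
        return (uint32_t)seq & (count - 1);
}
```

The equivalence holds only for power-of-two counts, which is why the check belongs at the single place where the OI count is set.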
| Comment by Mikhail Pershin [ 10/Jan/12 ] |
|
I've ended up with this patch; if nobody objects, I can post it for review.
diff --git a/lustre/osd-ldiskfs/osd_internal.h b/lustre/osd-ldiskfs/osd_internal.h
index a9450a4..0a8419f 100644
--- a/lustre/osd-ldiskfs/osd_internal.h
+++ b/lustre/osd-ldiskfs/osd_internal.h
@@ -398,11 +398,14 @@ static inline int osd_fid_is_igif(const struct lu_fid *fid)
 static inline struct osd_oi *
 osd_fid2oi(struct osd_device *osd, const struct lu_fid *fid)
 {
+        __u32 idx;
+
         if (!fid_is_norm(fid))
                 return NULL;

         LASSERT(osd->od_oi_table != NULL && osd->od_oi_count >= 1);
-        return &osd->od_oi_table[fid->f_seq % osd->od_oi_count];
+        div_u64_rem(fid->f_seq, osd->od_oi_count, &idx);
+        return &osd->od_oi_table[idx];
 }

 #endif /* __KERNEL__ */ |
| Comment by Mikhail Pershin [ 10/Jan/12 ] |
|
Andreas, I've just seen your comment, yes, that would be even better |
| Comment by Liang Zhen (Inactive) [ 10/Jan/12 ] |
|
right... I will post a patch for this |
| Comment by Peter Jones [ 16/Jan/12 ] |
|
Aren't we deprecating 32 bit arch for master? |
| Comment by Liang Zhen (Inactive) [ 16/Jan/12 ] |
|
Peter, it's a very small fix anyway; it's reviewed and tested, so I think it can be landed very soon |
| Comment by Andreas Dilger [ 16/Jan/12 ] |
|
Liang, I saw in the lfsck test output that it is creating 64 OI files, when I thought only 32 were being created: https://maloo.whamcloud.com/test_logs/cbd4e1ce-3f50-11e1-990e-5254004bbbd3
09:47:16:MDT: dirfid [0x2:0x0:0x0] child [0xd:0x120d447a:0x0] file oi.16.0
09:47:20:MDT: dirfid [0x2:0x0:0x0] child [0x4a:0x120d44b7:0x0] file oi.16.61 |
| Comment by Andreas Dilger [ 19/Jan/12 ] |
|
Add performance test results. |
| Comment by Build Master (Inactive) [ 19/Jan/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Andreas Dilger [ 27/Jan/12 ] |
|
Liang, should OSD_OI_FID_OID_BITS be changed to have only 32 OI files? I thought this is what had been agreed upon, and the test results in the mds-survey show no improvement of 64 OIs over 32 OIs. |
| Comment by Liang Zhen (Inactive) [ 31/Jan/12 ] |
|
OK, I can change this. I'm using 64 just because it's the initial value I used in my first patch. |
| Comment by Andreas Dilger [ 22/Feb/12 ] |
|
Reopen issue to track changing OSD_OI_FID_OID_BITS to 32, which has shown optimum performance in earlier testing, and again in ORI-507 on ZFS. |
| Comment by Andreas Dilger [ 05/Apr/12 ] |
|
Since 2.2 was released with 64 OIs, we're pretty much stuck with this for now. |