[LU-3124] bad interaction between layout lock and wide striping Created: 08/Apr/13 Updated: 19/Apr/13 Resolved: 19/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Johann Lombardi (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7589 |
| Description |
|
To implement the layout lock, Fanyong and Jinshan added an LVB to the ibits lock so that the LOV EA can be packed in completion ASTs. However, for wide-striped files, the client does not expect such a big LVB buffer in an AST. |
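To make the size mismatch concrete, here is a minimal stand-alone sketch; the header and per-stripe entry sizes, the 4-stripe default, and the buffer name are illustrative assumptions, not the real lov_mds_md layout:

```c
#include <stdio.h>

/* Illustrative sizes only; the real LOV EA layout differs. */
#define LOV_EA_HDR      32  /* assumed fixed header size in bytes */
#define LOV_EA_PER_OST  24  /* assumed per-stripe object entry size */

/* Assumed client reply buffer sized for a modest default stripe count. */
#define CLIENT_LVB_BUFSIZE (LOV_EA_HDR + 4 * LOV_EA_PER_OST)

static int lov_ea_size(int stripe_count)
{
	return LOV_EA_HDR + stripe_count * LOV_EA_PER_OST;
}

int main(void)
{
	int stripes;

	/* 160 is the classic maximum stripe count for a Lustre file. */
	for (stripes = 1; stripes <= 160; stripes *= 2) {
		int need = lov_ea_size(stripes);

		printf("%3d stripes: %4d-byte LOV EA -> %s\n",
		       stripes, need,
		       need <= CLIENT_LVB_BUFSIZE ?
		       "fits in the AST LVB buffer" :
		       "overflows the AST LVB buffer");
	}
	return 0;
}
```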
| Comments |
| Comment by Johann Lombardi (Inactive) [ 08/Apr/13 ] |
|
Jinshan, Fanyong, any thoughts on this one? |
| Comment by Alex Zhuravlev [ 08/Apr/13 ] |
|
For a number of reasons I'd suggest not using the LVB for this purpose, if possible. I'd think one extra RPC to fetch the updated LOV EA after a restore (or new striping) isn't that bad. |
| Comment by Johann Lombardi (Inactive) [ 08/Apr/13 ] |
|
Alex, could you please advise what the other reasons are? Why not pack it in the LVB if it fits? |
| Comment by Alex Zhuravlev [ 08/Apr/13 ] |
|
1) taking objects and locks in a different order leads to issues like |
| Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ] |
If we did this, we would have to follow the same model not only for restore, but for all layout lock enqueues. This would produce some confusion on the client side, i.e. even if you have layout lock caching, the layout may still be obsolete.
No, this is not about the ordering of object and lock. Rather, I realize that any attempt to find an object in a DLM thread would have the same problem. This is a BUG; let's solve it.
Not sure I understand this part.
Yes, before the IO starts and when the IO is finished. However, the overhead shouldn't be as much as you think, because we usually carry a lock handle to help find the lock quickly. But I don't know how this is related. |
| Comment by Alex Zhuravlev [ 08/Apr/13 ] |
|
An obsolete layout just means it needs to be refetched (in case you really let it become obsolete). The DLM should not be dealing with storage in the first place; it is a component to help with in-core coherency. As for lock revalidation, I think this was the original reason why we started to look up pages first and trust that; otherwise an LDLM match took a long time with 10-byte writes, for example. |
| Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ] |
|
As for this issue itself, I don't quite understand it, because we set the LVB size to max_mds_easize - if the layout can't fit in the LVB, it may not fit into getattr either. |
| Comment by Johann Lombardi (Inactive) [ 09/Apr/13 ] |
Intents and the OST LVB are just other examples where the DLM is integrated with "storage".
It shouldn't be a lock_match() call in this case, since we have the lock handle - just a hash lookup.
Eric said that there have been issues reported lately related to the request size, which was increased because of wide striping support. He does not want us to use the maximum LOV EA size by default, and would rather use a buffer of a reasonable size by default and fetch the layout in a second RPC if it does not fit. In any case, I think we need a regression test to exercise this code path, for instance:
Adding it as a 2.4 blocker to get feedback from Oleg and Andreas. |
| Comment by Alex Zhuravlev [ 09/Apr/13 ] |
|
> Intents and the OST LVB are just other examples where the DLM is integrated with "storage".
I'd say the same applies to the OST LVB ... intents don't actually mix locks with storage, as the lock structure is created but not enqueued. It's especially not very good that we're taking these "resources" in a different order: most paths grab the object first,
> It shouldn't be a lock_match() call in this case, since we have the lock handle - just a hash lookup.
Well, it is now, and still a hash lookup isn't that free, especially on fat nodes? Then we've got two different paths for getting data on the MDS: for the layout it's the LVB, for the others it's not. And the purpose is to save one RPC in a very rare case? We can discuss this in detail in a week. |
| Comment by Andreas Dilger [ 09/Apr/13 ] |
|
Wasn't there already a patch to fetch the layout with a separate RPC if the layout was too large to fit in the reply buffer? If this needs a protocol change to fix, it needs to be finished before the 2.4 release. |
| Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ] |
We can't do it in the reverse order, because right now it waits for dead objects in mdt_object_find(). To be honest, I don't understand why it needs to wait for objects being destroyed when FIDs are unique - that means once they are destroyed, they never come back again. |
| Comment by Alex Zhuravlev [ 09/Apr/13 ] |
|
You're confusing "destroy" with "leaving the cache". We did have plans to drop objects actively in some cases (like lfsck scanning). Actually, we can do it in any order as long as it's consistent everywhere. |
| Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ] |
Wouldn't it be better to use bulk to transfer a large layout? |
| Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ] |
From what I have seen, only objects being destroyed can be set with BANSHEE (local objects are special cases) for now. That being said, an object can leave the cache in two ways: explicitly, because it's being destroyed, or via the LRU shrinker when memory is under pressure. In the latter case, the object must not have any references, and BANSHEE won't be used at all. Taking one step back, BANSHEE is just a flag marking that an object is dying while it still has references - only an object being destroyed should have this attribute. Let me know if I missed something, thanks. |
| Comment by Alex Zhuravlev [ 09/Apr/13 ] |
|
There MUST be no special cases in this core component; it is way too risky to rely on that. We need a way to force an object to leave the cache, and BANSHEE is a way to do so (currently used for a few local objects and OSP objects). That said, I guess we can't change the ordering to the correct one now, as it's very late in the release cycle. But then notice |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
I just remembered that we're doing a separate enqueue to read directories, which happens much more often than layout changes. |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
|
Well, we must do that in readdir because for most directories it's impossible to fit them into one RPC. I tend to think the current implementation of lu_object_find() is incorrect, because finding an object again can deadlock by itself. We should unhash objects when they are being destroyed; I've already done this on the client side for layout change and it's working so far. As for `the problem' of duplicate objects, it's actually okay, because we may already have duplicate objects at object creation time. Even if what you said is true, this is a system problem, not just the layout lock's problem, because the LVB code takes the same path as well. |
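A stand-alone sketch of the hazard Jinshan describes, using simplified stand-ins for the lu_object header and the LU_OBJECT_HEARD_BANSHEE flag; the real lu_object_find() is considerably more involved, and wait_until_freed() is a stub for the kernel waitqueue sleep:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct lu_object_header. */
struct obj {
	int  refcount;
	bool dying;	/* stand-in for LU_OBJECT_HEARD_BANSHEE */
};

/* Stub: in the kernel this sleeps until the object's last reference
 * is dropped and the object leaves the cache. */
static void wait_until_freed(struct obj *o)
{
	(void)o;
}

/*
 * Stand-in for the current lookup behaviour: a find that hits a dying
 * object blocks until it is gone.  If the blocked thread is the one
 * that (directly or indirectly) must release the last reference -
 * e.g. a DLM service thread - the wait never completes.  Unhashing
 * the object at destroy time instead, as proposed above, lets the
 * lookup miss and instantiate a fresh object immediately, at the cost
 * of briefly having two in-memory objects for one FID.
 */
static struct obj *obj_find(struct obj *cached)
{
	if (cached != NULL && cached->dying) {
		wait_until_freed(cached);	/* potential deadlock here */
		cached = NULL;
	}
	if (cached == NULL) {
		/* ... allocate and hash a fresh object for this FID ... */
	}
	return cached;
}

int main(void)
{
	struct obj dying = { .refcount = 1, .dying = true };

	(void)obj_find(&dying);
	return 0;
}
```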
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
|
If we had to fix this, I would say we went the wrong way from the start. The variable LVB type patch would not be needed at all, and the client-side code would have to be reworked significantly, because the layout would have to be returned some other way. We would also have to rework getattr: right now we piggyback the layout in getattr, and if we hold the layout lock on the client and fetch the layout with getattr, a deadlock will surely happen. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
Please have a look at I_FREEING and how it's used. As for readdir, we could easily save one RPC by putting 4-8K in the first reply to the ENQUEUE request; in very many cases that would be a whole directory. |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
I thought an inode can't be freed unless its count drops to zero? That means the I_FREEING state actually covers cleanup work. Has this changed in the latest kernel to allow setting I_FREEING on busy inodes? Do you have another good example?
Come on, everybody knows the performance bottleneck of `ls -l' is the glimpse call, not this readdir request. I don't think that additional RPC is rare, BTW. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
> I thought an inode can't be freed unless its count drops to zero? That means the I_FREEING state actually covers cleanup work. Has this changed in the latest kernel to allow setting I_FREEING on busy inodes?
How can one schedule an object for removal from the cache unless a reference is held? The point was that no duplication is allowed.
> Come on, everybody knows the performance bottleneck of `ls -l' is the glimpse call, not this readdir request. I don't think that additional RPC is rare, BTW.
Very good point - one more RPC isn't considered a performance killer even for a very frequent operation like readdir. The layout is not going to change often; in the majority of cases we would still be packing it into the GETATTR reply like we do for regular attributes. ENQUEUE + a lockless GETATTR should be needed only when the layout has changed. This is very similar to the readdir case: we can get UPDATE with getattr, and we can enqueue it within readdir. If we start doing it like this, the problem disappears immediately, as grabbing resources in the same order everywhere is The Right Thing. In contrast, if we put this workaround into the lu infrastructure we still have the issue with wide stripes. |
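A hedged sketch of the flow Alex proposes; the helper names (enqueue_layout_lock, lockless_getattr_layout, client_refresh_layout) are hypothetical, not real Lustre API, and only the control flow is meant to be illustrative:

```c
#include <stdbool.h>

/* Hypothetical helpers standing in for real RPC paths; each would
 * cost one round trip to the MDS. */
static int enqueue_layout_lock(void)     { return 0; }
static int lockless_getattr_layout(void) { return 0; }

/*
 * Proposed client flow: on first open the layout arrives piggybacked
 * in the open/getattr reply as today, so this path is only entered
 * when the cached layout lock has been cancelled, i.e. the layout may
 * have changed.  The extra RPC is then confined to that rare path.
 */
static int client_refresh_layout(bool lock_still_cached)
{
	if (lock_still_cached)
		return 0;	/* cached layout is still valid */

	if (enqueue_layout_lock() != 0)		/* RPC 1: re-take lock */
		return -1;
	return lockless_getattr_layout();	/* RPC 2: fetch layout */
}

int main(void)
{
	(void)client_refresh_layout(true);	/* common case: no RPC */
	return client_refresh_layout(false);	/* layout-change path */
}
```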
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
Not sure if we're on the same page. An inode can be set with I_FREEING only when its reference count drops to zero, and then cleanup work starts immediately, which means a waiting process will be woken in a bounded time; but for a lu_object, we can set BANSHEE on it no matter how many references there are, and new finders have to be blocked for an unknown time. As for duplication, are you talking about duplication in the cache or in memory? We already have duplication in memory at object creation time.
I agree with this; it is actually not in the performance-critical path, so performance is not my concern here. I said that to challenge your point of `nobody is complaining a lot about this'. BTW, since the layout lock can share the same DLM lock with the lookup and update locks, we can get here even when the layout has not changed.
This sounds true. |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
There are two problems here: 1. a live lock; 2. the layout being too large. So we still need the fix in the lu infrastructure, plus an extra RPC to fetch the layout when it is too large to fit in the completion AST buffer. |
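A hedged sketch of the second half of that approach on the client side; apply_layout() and fetch_layout_rpc() are assumed names for illustration, not the actual code that landed:

```c
#include <stddef.h>

/* Hypothetical helpers; the real client code differs. */
static int apply_layout(const void *lmm, size_t len)
{
	(void)lmm; (void)len;
	return 0;
}

static int fetch_layout_rpc(void *buf, size_t buflen)
{
	/* Separate RPC, sized from the server-reported layout size. */
	(void)buf; (void)buflen;
	return 0;
}

/*
 * Called on layout lock completion: lvb/lvb_len is what arrived in
 * the AST reply buffer, layout_size is the real size reported by the
 * server.  The common case applies the piggybacked layout directly;
 * only a wide-striped file needs the extra fetch.
 */
static int client_layout_completion(const void *lvb, size_t lvb_len,
				    size_t layout_size, void *big_buf)
{
	if (layout_size <= lvb_len)
		return apply_layout(lvb, layout_size);

	if (fetch_layout_rpc(big_buf, layout_size) != 0)
		return -1;
	return apply_layout(big_buf, layout_size);
}

int main(void)
{
	char lvb[64], big[4096];

	/* small layout: fits in the AST buffer */
	(void)client_layout_completion(lvb, sizeof(lvb), 48, big);
	/* wide-striped layout: needs the second RPC */
	return client_layout_completion(lvb, sizeof(lvb), 2048, big);
}
```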
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
What kind of live lock? As for "too large", I don't quite see the problem yet: 1) the client sends enqueue/getattr |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ] |
|
No, I mean that if the layout can fit in the completion buffer, we still need to piggyback it in the completion AST. This is the most common case. |
| Comment by Alex Zhuravlev [ 11/Apr/13 ] |
|
Not sure why we do. I'm quite sure the most frequent case is that we get the layout with enqueue/open+create, and it stays around for a very long time, like the open lock does. |
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/13 ] |
|
I pushed a patch for this problem to http://review.whamcloud.com/6042 - please take a look. |
| Comment by Andreas Dilger [ 12/Apr/13 ] |
|
This is closely related to |
| Comment by Peter Jones [ 19/Apr/13 ] |
|
Landed for 2.4 |