[LU-3124] bad interaction between layout lock and wide striping Created: 08/Apr/13  Updated: 19/Apr/13  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Johann Lombardi (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Related
is related to LU-2807 lockup in server completion ast -> lu... Resolved
Severity: 3
Rank (Obsolete): 7589

 Description   

To implement the layout lock, Fanyong and Jinshan added an LVB to the ibits lock so that the LOV EA can be packed in completion ASTs. However, for wide-striped files, the client is not expecting such a big LVB buffer in an AST.
I think the MDT should have an upper LVB limit beyond which it only packs the "header" (the most important field being the layout generation), and the OST FID list would then have to be fetched by the client via another RPC.
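
A minimal sketch of what that packing decision could look like, assuming a hypothetical header structure and size limit (lvb_hdr, LVB_SIZE_LIMIT and lvb_pack() are made-up names, not anything in the Lustre tree):

/* Sketch only: hypothetical names, not Lustre code. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LVB_SIZE_LIMIT 4096  /* hypothetical cap on the LVB payload */

struct lvb_hdr {
	unsigned int lvb_layout_gen;  /* layout generation, always packed */
	unsigned int lvb_ea_size;     /* real LOV EA size held by the MDT */
	bool         lvb_ea_inline;   /* true when the full EA follows */
};

/* Decide what goes into the completion-AST LVB buffer. */
static size_t lvb_pack(void *buf, size_t buflen, unsigned int layout_gen,
		       const void *lov_ea, size_t ea_size)
{
	struct lvb_hdr *hdr = buf;

	hdr->lvb_layout_gen = layout_gen;
	hdr->lvb_ea_size    = (unsigned int)ea_size;
	hdr->lvb_ea_inline  = ea_size <= LVB_SIZE_LIMIT &&
			      sizeof(*hdr) + ea_size <= buflen;
	if (hdr->lvb_ea_inline)
		memcpy(hdr + 1, lov_ea, ea_size);

	/* otherwise the client fetches the OST FID list in a second RPC */
	return hdr->lvb_ea_inline ? sizeof(*hdr) + ea_size : sizeof(*hdr);
}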



 Comments   
Comment by Johann Lombardi (Inactive) [ 08/Apr/13 ]

Jinshan, Fanyong, any thought on this one?

Comment by Alex Zhuravlev [ 08/Apr/13 ]

For a number of reasons I'd suggest thinking about not using the LVB for this purpose, if possible. I'd think one extra RPC to fetch the updated LOV EA after a restore (or new striping) isn't that bad.

Comment by Johann Lombardi (Inactive) [ 08/Apr/13 ]

Alex, could you please explain what the other reasons are? Why not pack it in the LVB if it fits?

Comment by Alex Zhuravlev [ 08/Apr/13 ]

1) taking objects and locks in different orders leads to issues like LU-2807
2) the layout can't easily be generated on the client side, even if nobody can see it until it has landed on the MDS
3) [related issue] having to match the lock on every IO (twice, if I understand right) will lead to a significant performance penalty on workloads with tiny IOs

Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ]

For a number of reasons I'd suggest thinking about not using the LVB for this purpose, if possible. I'd think one extra RPC to fetch the updated LOV EA after a restore (or new striping) isn't that bad.

If we did this, we would have to follow the same model not only for restore, but for all layout lock enqueues. This will produce some confusion on the client side, i.e. even if you have layout lock caching, the layout may still be obsolete.

1) taking objects and locks in different orders leads to issues like LU-2807

No, this is not about the order of object and lock. Instead, I realize that any attempt to find an object in a DLM thread would have the same problem. This is a BUG, let's solve it.

2) the layout can't easily be generated on the client side, even if nobody can see it until it has landed on the MDS

Not sure I understand this part.

3) [related issue] having to match the lock on every IO (twice, if I understand right) will lead to a significant performance penalty on workloads with tiny IOs

Yes, before the IO starts and when the IO finishes. However, the overhead shouldn't be as high as you think, because we usually carry a lock handle to help find the lock quickly. But I don't see how this is related.
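
For illustration, a toy sketch of that lock-handle point, assuming a made-up my_lock structure and handle table (this is not ldlm code): the owner keeps an opaque cookie and finds its lock with a single hash lookup, rather than a match that walks the resource's lock lists.

/* Illustration only; not ldlm code. */
#include <stdint.h>
#include <stddef.h>

#define HANDLE_HASH_SIZE 256

struct my_lock {
	uint64_t        lk_cookie;  /* handle value returned to the owner */
	struct my_lock *lk_next;    /* hash-chain linkage */
};

static struct my_lock *handle_hash[HANDLE_HASH_SIZE];

/* O(1) on average: hash the cookie and walk one short chain */
static struct my_lock *lock_by_handle(uint64_t cookie)
{
	struct my_lock *lk = handle_hash[cookie % HANDLE_HASH_SIZE];

	while (lk != NULL && lk->lk_cookie != cookie)
		lk = lk->lk_next;
	return lk;
}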

Comment by Alex Zhuravlev [ 08/Apr/13 ]

An obsolete layout just means it needs to be refetched (in case you really let it become obsolete).

The DLM should not be dealing with storage in the first place. The DLM is a component that helps with in-core coherency.

As for lock revalidation, I think this was the original reason why we started to look up pages first and trust that; otherwise LDLM match took a long time with 10-byte writes, for example.

Comment by Jinshan Xiong (Inactive) [ 08/Apr/13 ]

As for this issue itself, I don't quite understand it, because we have set the LVB size to max_mds_easize - if it can't fit in the LVB, it may not fit into getattr either.

Comment by Johann Lombardi (Inactive) [ 09/Apr/13 ]

The DLM should not be dealing with storage in the first place. The DLM is a component that helps with in-core coherency.

Intents and the OST LVB are just other examples where the DLM is integrated with "storage".

As for lock revalidation, I think this was the original reason why we started to look up pages first and trust that; otherwise LDLM match took a long time with 10-byte writes, for example.

It shouldn't be a lock_match() call in this case since we have the lock handle, just a hash lookup.

As for this issue itself, I don't quite understand it, because we have set the LVB size to max_mds_easize - if it can't fit in the LVB, it may not fit into getattr either.

Eric said that there have been issues reported lately related to the request size, which was increased because of wide striping support. He does not want us to use the maximum LOV EA size by default, and would rather use a buffer of a reasonable size by default and fetch the layout in a second RPC if it does not fit. In any case, I think we need a regression test to exercise this code path, for instance:

  • create 2 wide-striped files
  • swap the layout of those 2 files
  • sleep in mdd_swap_layout() via a fail_loc
  • try to read/write to one of those files
  • the writer should be blocked on the layout lock enqueue
  • when the sleep times out, the layout should be granted back to the reader/writer via a CP AST

Adding it as a 2.4 blocker to get feedback from Oleg and Andreas.

Comment by Alex Zhuravlev [ 09/Apr/13 ]

> Intents and the OST LVB are just other examples where the DLM is integrated with "storage".

I'd say the same applies to the OST LVB... intents don't actually mix locks with storage, as the lock structure is created but not enqueued.
The problem is that even lock cancellation may involve another dive into ldlm, then storage again.
So, in the end the chain will be: ldlm (enqueue) -> storage (mds) -> ldlm (cancel) -> storage.
I don't think this is that good.

It's especially not good that we're taking these "resources" in different orders: most paths grab the object first, then enqueue the lock. In the case of layout we do it in the reverse order. The latter is actually better, as we need it for DNE anyway.

> It shouldn't be a lock_match() call in this case since we have the lock handle, just a hash lookup.

Well, it is one now, and even a hash lookup isn't that cheap, especially on fat nodes.

Then we've got two different paths for getting data from the MDS: for the layout it's the LVB, for everything else it's not.

And the purpose is to save one RPC in a very rare case?

We can discuss this in detail in a week.

Comment by Andreas Dilger [ 09/Apr/13 ]

Wasn't there already a patch to fetch the layout with a separate RPC if the layout was too large to fit in the reply buffer?

If this needs a protocol change to fix, it needs to be finished before the 2.4 release.

Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ]

It's especially not good that we're taking these "resources" in different orders: most paths grab the object first, then enqueue the lock. In the case of layout we do it in the reverse order. The latter is actually better, as we need it for DNE anyway.

We can't do it in the reverse order because right now it waits for dying objects in mdt_object_find(). To be honest, I don't understand why it needs to wait for objects that are being destroyed when FIDs are unique, which means that once they are destroyed, they never come back again.
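
To make the point concrete, here is a simplified, hypothetical sketch of the behaviour under discussion (this is not the real lu_object_find()/mdt_object_find() code): a lookup that hits a dying cached entry blocks until that entry is gone, even though its FID will never legitimately be looked up again.

/* Simplified illustration only; not Lustre's object cache. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct cached_obj {
	unsigned long long  co_fid;    /* unique, never reused */
	int                 co_refs;
	bool                co_dying;  /* set while being destroyed */
	struct cached_obj  *co_next;
};

struct obj_cache {
	pthread_mutex_t    oc_lock;
	pthread_cond_t     oc_freed;   /* signalled when an object leaves */
	struct cached_obj *oc_head;
};

static struct cached_obj *cache_lookup(struct obj_cache *c,
				       unsigned long long fid)
{
	struct cached_obj *o;

	pthread_mutex_lock(&c->oc_lock);
retry:
	for (o = c->oc_head; o != NULL; o = o->co_next) {
		if (o->co_fid != fid)
			continue;
		if (o->co_dying) {
			/*
			 * The contentious wait: since FIDs are never reused,
			 * the lookup could instead return NULL (or allocate a
			 * fresh object) without blocking here.
			 */
			pthread_cond_wait(&c->oc_freed, &c->oc_lock);
			goto retry;
		}
		o->co_refs++;
		break;
	}
	pthread_mutex_unlock(&c->oc_lock);
	return o;
}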

Comment by Alex Zhuravlev [ 09/Apr/13 ]

You're confusing "destroy" with "leaving the cache". We did have plans to drop objects actively in some cases (like lfsck scanning), and BANSHEE does exactly this. I find the current approach safe and very robust, in the sense that no duplicate objects are possible.

Actually we can do it in any order as long as it's consistent everywhere.

Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ]

Wasn't there already a patch to fetch the layout with a separate RPC if the layout was too large to fit in the reply buffer?

Wouldn't it be better to use bulk to transfer a large layout?

Comment by Jinshan Xiong (Inactive) [ 09/Apr/13 ]

You're confusing "destroy" with "leaving the cache". We did have plans to drop objects actively in some cases (like lfsck scanning), and BANSHEE does exactly this. I find the current approach safe and very robust, in the sense that no duplicate objects are possible.

From what I have seen, only objects being destroyed can have BANSHEE set (local objects are a special case) for now. That being said, an object can leave the cache in two ways: explicitly, because it's being destroyed; or via the LRU shrinker if memory is under pressure. In the latter case, the object must not have any references and BANSHEE won't be used at all.

Taking one step back, BANSHEE is just a flag to mark that an object is dying but still has references - only an object being destroyed should have this attribute.

Let me know if I missed something, thanks.

Comment by Alex Zhuravlev [ 09/Apr/13 ]

There MUST be no special cases in this core component; this is way too risky to rely on.

We need a way to force an object to leave the cache. BANSHEE is a way to do so (currently used for a few local objects and OSP objects). I planned to use this to basically invalidate the DNE/WBC-related cache (when a remote object is cached and we are about to lose the LDLM lock).

Having said that, I guess we can't change the ordering to the correct one now as it's very late in the release cycle. But then notice that we've got this problem with wide striping, and with the current approach we'd have to fetch this large LOV EA and then attach it in the form of an LVB to a lock? Sounds like another "special" case, eh?

Comment by Alex Zhuravlev [ 11/Apr/13 ]

I just remembered that we're doing a separate enqueue to read directories, which happens much more often than a layout change.

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

Well, we must do that in readdir because for most directories it's impossible to return the whole directory in one RPC.

I tend to think the current implementation of lu_object_find() is incorrect, because finding the object again would deadlock by itself. We should unhash an object when it's being destroyed. I've already done this on the client side for layout change and it's working so far.

About `the problem' of duplicate objects, it's actually okay because we may already have duplicate objects at object creation time.

Even if what you said is true, this is a system problem, not just the layout lock's problem, because LVB handling works the same way.

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

If we had to fix it this way, I would say we went the wrong way from the start. The variable LVB type patch would not be needed at all, and the client side code would have to be reworked significantly because the layout would have to be returned another way. We would also have to rework getattr, because right now we piggyback the layout in getattr, and if we hold the layout lock on the client and fetch the layout with getattr, a deadlock will surely happen.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

Please have a look at I_FREEING and how it's used.

As for readdir, we could easily save one RPC by putting 4-8K in the first reply to the ENQUEUE request; in very many cases that would be the whole directory. We don't do this and nobody is complaining much about it. Why would a very rare additional RPC to fetch the LOV EA be that bad?

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

Please have a look at I_FREEING and how it's used.

I thought an inode can't be freed unless its reference count drops to zero? That means the I_FREEING state actually covers the cleanup work. Has this changed in the latest kernel to allow setting I_FREEING on busy inodes?

Do you have another good example?

As for readdir, we could easily save one RPC by putting 4-8K in the first reply to the ENQUEUE request; in very many cases that would be the whole directory. We don't do this and nobody is complaining much about it. Why would a very rare additional RPC to fetch the LOV EA be that bad?

Come on, everybody knows the performance bottleneck of `ls -l' is the glimpse call, not this readdir request. I don't think that additional RPC is rare, BTW.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

> I thought an inode can't be freed unless its reference count drops to zero? That means the I_FREEING state actually covers the cleanup work. Has this changed in the latest kernel to allow setting I_FREEING on busy inodes?

How can one schedule an object for removal from the cache unless a reference is held? The point was that no duplication is allowed.

> Come on, everybody knows the performance bottleneck of `ls -l' is the glimpse call, not this readdir request. I don't think that additional RPC is rare, BTW.

Very good point - one more RPC isn't considered a performance killer even for a very frequent operation like readdir. The layout is not going to change often; in the majority of cases we would still be packing it into the GETATTR reply like we do for regular attributes. ENQUEUE + lockless GETATTR should be needed only when the layout has changed. This is very similar to the readdir case: we can get UPDATE with getattr, and we can enqueue it within readdir.

If we start to do it like this, the problem disappears immediately, as grabbing resources in the same order everywhere is The Right Thing.

In contrast, if we put this workaround into the lu infrastructure, we still have the issue with wide stripes.

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

How can one schedule an object for removal from the cache unless a reference is held? The point was that no duplication is allowed.

Not sure we're on the same page. An inode can have I_FREEING set only once its reference count drops to zero, and then cleanup starts immediately, which means a waiting process can be woken up within a bounded time; but for a lu_object, we can set BANSHEE no matter how many references there are, and new finders have to be blocked for an unknown time.

As for duplication, are you talking about duplication in the cache or in memory? We already have duplication in memory at object creation time.

Very good point - one more RPC isn't considered a performance killer even for a very frequent operation like readdir. The layout is not going to change often; in the majority of cases we would still be packing it into the GETATTR reply like we do for regular attributes. ENQUEUE + lockless GETATTR should be needed only when the layout has changed. This is very similar to the readdir case: we can get UPDATE with getattr, and we can enqueue it within readdir.

I agree with this; it is actually not in the performance-critical path, so performance is not my concern here. I said that to challenge your point about `nobody is complaining a lot about this'.

BTW, since the layout lock can share the same DLM lock with the lookup and update locks, we can get here even when the layout has not changed.

If we start to do it like this, the problem disappears immediately, as grabbing resources in the same order everywhere is The Right Thing.

this sounds true.

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

If we start to do it like this, the problem disappears immediately, as grabbing resources in the same order everywhere is The Right Thing.

There are two problems here: 1. livelock; 2. the layout is too large. So we still need the fix in the lu infrastructure, plus an extra RPC to fetch the layout if it is too large to fit in the completion AST buffer.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

What kind of livelock? As for "too large", I don't quite see the problem yet:

1) the client sends the enqueue/getattr
2) the LOV EA doesn't fit - the MDS returns the layout lock + eadatalen
3) the client fetches the LOV EA with an appropriately sized buffer, holding the layout lock from (2)
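
A hedged client-side sketch of this flow, assuming hypothetical reply fields and RPC helpers (enqueue_reply, layout_enqueue() and layout_getattr() are illustrative names only, and the helpers are left as prototypes):

/* Sketch with hypothetical reply fields and helpers; not actual client code. */
#include <stdlib.h>
#include <string.h>

struct enqueue_reply {
	size_t er_eadatalen;   /* real LOV EA size on the MDS */
	size_t er_ea_copied;   /* bytes returned inline in this reply */
	char   er_ea[];        /* inline EA data, if it fit */
};

/* assumed RPC helpers, prototypes only */
struct enqueue_reply *layout_enqueue(size_t ea_bufsize);
int layout_getattr(void *buf, size_t buflen);  /* lockless EA fetch */

static void *fetch_layout(size_t *ea_size)
{
	struct enqueue_reply *rep;
	void *ea;

	/* 1) enqueue/getattr with a reasonably sized default buffer */
	rep = layout_enqueue(4096);

	*ea_size = rep->er_eadatalen;
	ea = malloc(*ea_size);

	if (rep->er_ea_copied == *ea_size) {
		/* common case: the whole EA fit in the enqueue reply */
		memcpy(ea, rep->er_ea, *ea_size);
	} else {
		/* 2)+3) it did not fit: re-fetch with a right-sized buffer
		 * while the layout lock granted by the enqueue is held */
		layout_getattr(ea, *ea_size);
	}
	free(rep);
	return ea;
}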

Comment by Jinshan Xiong (Inactive) [ 11/Apr/13 ]

No, I mean that if the layout can fit in the completion buffer, we still need to piggyback it in the completion AST. That is the most common case.

Comment by Alex Zhuravlev [ 11/Apr/13 ]

Not sure why we do. I'm quite sure the most frequent case is when we get the layout with enqueue/open+create, and it will stay around for a very long time like the open lock does.

Comment by Jinshan Xiong (Inactive) [ 12/Apr/13 ]

I pushed a patch for this problem to http://review.whamcloud.com/6042, please take a look.

Comment by Andreas Dilger [ 12/Apr/13 ]

This is closely related to LU-2807, and should probably be fixed by the same patch.

Comment by Peter Jones [ 19/Apr/13 ]

Landed for 2.4
