[LU-2877] sanity test_34h failed Multiop blocked on ftruncate, pid= Created: 26/Feb/13  Updated: 01/May/13  Resolved: 01/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: mq313

Severity: 3
Rank (Obsolete): 6951

 Description   

Happens pretty frequently in maloo since April 2012.
Sample failures:
https://maloo.whamcloud.com/test_sets/33b06c22-8065-11e2-9b82-52540035b04c

https://maloo.whamcloud.com/sub_tests/9ccebafe-807b-11e2-b777-52540035b04c

I think the underlying problem is that multiop as written can block not only on the truncate lock, but also while acquiring the group lock, where it can block on a cache flush.

        # write 10MB of dirty data, then have multiop open the file, take
        # group lock $gid, ftruncate to $sz, drop the group lock and close:
        dd if=/dev/zero of=$DIR/$tfile bs=1M count=10 || error
        $MULTIOP $DIR/$tfile OG${gid}T${sz}g${gid}c &
        MULTIPID=$!
        sleep 2

So perhaps we need to add a sync, or some other way of flushing the cache, after the dd and before running multiop.
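A minimal sketch of that suggestion (illustration only, not the landed patch): flush the data written by dd before starting multiop, so that taking the group lock does not have to wait for the cache flush.

        dd if=/dev/zero of=$DIR/$tfile bs=1M count=10 || error
        # flush dd's dirty pages first so the group lock request in multiop
        # is not stuck behind the cache flush
        sync
        $MULTIOP $DIR/$tfile OG${gid}T${sz}g${gid}c &
        MULTIPID=$!
        sleep 2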



 Comments   
Comment by Oleg Drokin [ 26/Feb/13 ]

My idea of a patch is http://review.whamcloud.com/#change,5541

Comment by Peter Jones [ 05/Mar/13 ]

Landed for 2.4

Comment by Andreas Dilger [ 11/Mar/13 ]

My test at https://maloo.whamcloud.com/test_sets/0be54d54-8626-11e2-b472-52540035b04c still fails with Oleg's patch landed.

This is from http://review.whamcloud.com/5470, which has parent commit b7f949e04bbd4533316f0ca09b4b7d4f1765eca1, one commit after Oleg's patch (commit 8fcac3f7a25c2d97974d292830dabe1611274085).

Comment by Oleg Drokin [ 12/Mar/13 ]

There was another patch for a similar issue from Johann that also changed multiop, but I can no longer find it. Johann, do you remember?

Also, it's strange that sync was not enough, since there should have been nothing left to flush.

Comment by Johann Lombardi (Inactive) [ 14/Mar/13 ]

It was http://review.whamcloud.com/#change,5558 which also uses multiop_bg_pause. Not sure it really makes a difference ...

That said, this latest report is with ldiskfs, while all the previous ones I looked at were with ZFS ...
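For reference, the multiop_bg_pause pattern from test-framework.sh looks roughly like this (a sketch only, assuming the helper starts multiop in the background, waits until it reaches the '_' wait-for-signal op, and exports the PID in MULTIOP_PID; the operation string below is illustrative):

        # start multiop and return only once it is open and paused at '_',
        # which avoids the arbitrary "sleep 2" after launching it
        multiop_bg_pause $DIR/$tfile OG${gid}_T${sz}g${gid}c ||
                error "multiop_bg_pause failed"
        MULTIPID=$MULTIOP_PID
        # ... run the checks, then let multiop continue and finish
        kill -USR1 $MULTIPID
        wait $MULTIPID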

Comment by Johann Lombardi (Inactive) [ 14/Mar/13 ]

On second thought, I think multiop_bg_pause is not suitable here. We probably need to look at the logs to understand where we are slow or stuck.

Comment by Oleg Drokin [ 19/Mar/13 ]

I wonder if there is a timing issue left: multiop starts late and has no chance to terminate by the time we check for it? Or I guess the truncate might be taking longer on the server side under disk load?
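If it is just timing, one illustrative option (not a proposed patch) would be to replace the fixed sleep with a bounded poll that gives the background multiop more time under load, along these lines:

        $MULTIOP $DIR/$tfile OG${gid}T${sz}g${gid}c &
        MULTIPID=$!
        # poll for up to ~30s until the background multiop exists and has
        # gone to sleep, rather than assuming 2 seconds is always enough
        for i in $(seq 30); do
                state=$(ps -o stat= -p $MULTIPID) || break
                [[ "$state" =~ [DS] ]] && break
                sleep 1
        done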

Comment by Oleg Drokin [ 29/Mar/13 ]

Ok, another attempt at fixing: http://review.whamcloud.com/5882

Comment by Andreas Dilger [ 01/May/13 ]

Sanity test_34h hasn't failed in over a month, even before the most recent patch landed...
