[LU-6570] sanityn test_40a: create is blocked Created: 06/May/15  Updated: 09/Sep/16  Resolved: 09/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.5.5
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne2

Issue Links:
Duplicate
is duplicated by LU-6672 sanityn test_40b: create is blocked Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0859146e-f35b-11e4-9186-5254006e85c2.

The sub-test test_40a failed with the following error:

create is blocked


Info required for matching: sanityn 40a



 Comments   
Comment by Andreas Dilger [ 06/May/15 ]

It looks like this test may be racy under load:

# check that pid exists hence second operation wasn't blocked by first one
# if it is so then there is no conflict, return 0
# else second operation is conflicting with first one, return 1
check_pdo_conflict() {
        local pid=$1
        local conflict=0
        sleep 1 # to ensure OP1 is finished on client if OP2 is blocked by OP1
        if [[ `ps --pid $pid | wc -l` == 1 ]]; then
                conflict=1
                echo "Conflict"
        else
                echo "No conflict"
        fi
        return $conflict
}

If a process is around for 1s, then would it cause a failure? In any case, this has only ever happened once, so it isn't clear whether it was just a very small race condition or something else. I looked through the other sanityn tests for the patch http://review.whamcloud.com/14100 where this was hit, but found no other cases of this failure.
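
For illustration, the liveness check quoted above could also be written with kill -0 instead of counting ps output lines. This is only a sketch: check_pdo_conflict_sketch is a hypothetical name, it is not the code in sanityn.sh, and it keeps the same 1-second grace period, so it does not remove any race.

```shell
# Hypothetical variant of check_pdo_conflict; NOT the code in sanityn.sh.
# Returns 0 (no conflict) if the process is still running after 1s,
# and 1 (conflict) if it has already exited.
check_pdo_conflict_sketch() {
        local pid=$1
        sleep 1 # same grace period as the original function
        if kill -0 "$pid" 2>/dev/null; then
                # Signal 0 sends nothing; it only tests that the pid exists.
                echo "No conflict"
                return 0
        fi
        echo "Conflict"
        return 1
}
```

Like the ps-based version, this reports "No conflict" for any process that still exists, including one that has exited but has not yet been reaped by its parent.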

Comment by Jian Yu [ 11/Jun/15 ]

On master branch, this is a regression failure introduced by patch http://review.whamcloud.com/14495.

Comment by Di Wang [ 19/Jun/15 ]

No, there are no protocol changes in the patch. The failures on 2.5.5 are because of mistakes made when porting the patch http://review.whamcloud.com/#/c/14495/ from master to b2_5. I am not sure about the failure on master. It is not even about DNE, because all of the operations in sanityn 40a are done under a non-striped directory, so I am not sure why it would be related to 14495.

Comment by Di Wang [ 20/Jun/15 ]

Just checked the debug log; the failure seems to be caused by the test script itself.

On client1, it does mkdir $DIR1/$tfile:

0000080:00200000:1.0:1432173251.411997:0:5365:0:(namei.c:1153:ll_mkdir()) VFS Op:name=f40a.sanityn, dir=[0x200000007:0x1:0x0](ffff88006b8c0a98)

On the server side, because of OBD_FAIL_MDS_PDO_LOCK, the request handler is put to sleep with a 15-second timeout. Note: during the timeout, it actually holds the UPDATE lock of DIR1 (root).

Then client2 does touch $DIR2/$tfile-2. First it tries to revalidate DIR2; usually the client only requires a LOOKUP lock for this revalidation.

00000002:00010000:0.0:1432173252.415097:0:5367:0:(mdc_locks.c:1148:mdc_intent_lock()) (name: ,[0x200000007:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: lookup flags 00

But it seems that on the MDT side, the UPDATE lock is added back because of:

                        /* If the file has not been changed for some time, we
                         * return not only a LOOKUP lock, but also an UPDATE
                         * lock and this might save us RPC on later STAT. For
                         * directories, it also let negative dentry cache start
                         * working for this dir. */
                        if (ma->ma_valid & MA_INODE &&
                            ma->ma_attr.la_valid & LA_CTIME &&
                            info->mti_mdt->mdt_namespace->ns_ctime_age_limit +
                                ma->ma_attr.la_ctime < cfs_time_current_sec())
                                child_bits |= MDS_INODELOCK_UPDATE;

But unfortunately, this is a FID revalidate (no name in the request), so there is no PDO lock. It is then blocked by the lock above (held by the client1 request handler).
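
The age condition in the snippet above can be modelled in a few lines of shell. This is an illustration only: wants_update_bit and its three arguments are hypothetical names, and the numbers in the usage note are made-up example values, not the server's actual settings.

```shell
# Model of the quoted server-side condition (illustration only).
# Usage: wants_update_bit CTIME NOW AGE_LIMIT   (all in seconds)
# Succeeds when AGE_LIMIT + CTIME < NOW, i.e. when the object has not
# changed for longer than the age limit and UPDATE would be added.
wants_update_bit() {
        local ctime=$1 now=$2 limit=$3
        [ $((limit + ctime)) -lt "$now" ]
}
```

With an assumed limit of 10, "wants_update_bit 100 200 10" succeeds (the directory is old, so UPDATE is added to LOOKUP), while "wants_update_bit 195 200 10" fails (the directory changed recently, so only LOOKUP is requested).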

Hmm, I guess we can just add "touch $DIR2" before client1's "mkdir $DIR1/$tfile". Then the ctime_age_limit check will not be triggered, i.e. the UPDATE lock will not be required when revalidating /, and the problem should be fixed.
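
A rough way to see why the extra touch helps, using a temporary directory as a stand-in for $DIR2 (an assumption; on a real run this would be the Lustre mount point). It relies on GNU stat, where %Z prints the ctime in seconds. This is a sketch of the idea, not the landed patch.

```shell
# Stand-in for $DIR2; a real run would use the second Lustre mount point.
dir2=$(mktemp -d)
before=$(stat -c %Z "$dir2")   # ctime before the extra step
sleep 1                        # let the clock move past the old ctime
touch "$dir2"                  # proposed extra step: refreshes the directory's ctime
after=$(stat -c %Z "$dir2")    # ctime after the touch
age=$(( $(date +%s) - after )) # age of the directory as seen right after the touch
echo "ctime moved from $before to $after, age is now ${age}s"
rmdir "$dir2"
```

Since the directory's ctime is reset to the current time, the age check stays false until the age limit has passed again, which should leave enough time for the mkdir and the touch on the other client.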

Comment by Gerrit Updater [ 20/Jun/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15366
Subject: LU-6570 tests: fix 40a in sanityn.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 459bb83d92ddc50aa5caa71b629ba29e53266120

Comment by Gerrit Updater [ 09/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15366/
Subject: LU-6570 tests: fix 40a in sanityn.sh
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2083f8a34b556c9e64e09a4529df7d9bdbdd7532

Comment by Peter Jones [ 09/Jul/15 ]

Landed for 2.8
