[LU-6570] sanityn test_40a: create is blocked Created: 06/May/15 Updated: 09/Sep/16 Resolved: 09/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.5.5 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0859146e-f35b-11e4-9186-5254006e85c2.

The sub-test test_40a failed with the following error:

create is blocked

Please provide additional information about the failure here.

Info required for matching: sanityn 40a |
| Comments |
| Comment by Andreas Dilger [ 06/May/15 ] |
|
It looks like this test may be racy under load:

# check that pid exists hence second operation wasn't blocked by first one
# if it is so then there is no conflict, return 0
# else second operation is conflicting with first one, return 1
check_pdo_conflict() {
	local pid=$1
	local conflict=0
	sleep 1 # to ensure OP1 is finished on client if OP2 is blocked by OP1
	if [[ `ps --pid $pid | wc -l` == 1 ]]; then
		conflict=1
		echo "Conflict"
	else
		echo "No conflict"
	fi
	return $conflict
}

If a process is around for 1s then it would cause a failure? In any case, this has only ever happened once, so it isn't clear whether it was just a very small race condition or something else. I looked through the other sanityn tests for the patch http://review.whamcloud.com/14100 where this was hit, but found no other cases of this failure. |
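To make the logic of the quoted helper easier to follow, here is a minimal stand-alone sketch that can be run without Lustre. It is not the sanityn code: the function name `check_pid_alive_conflict` is hypothetical, `kill -0` is swapped in for `ps --pid ... | wc -l`, and plain `sleep`/`true` background jobs stand in for OP1. Read literally, the helper reports "No conflict" while the OP1 pid still exists and "Conflict" once it is gone.

```shell
# Hypothetical stand-in for check_pdo_conflict (not the sanityn helper).
# pid still exists -> OP2 was not blocked by OP1 -> "No conflict", return 0
# pid is gone      -> OP2 had to wait for OP1    -> "Conflict", return 1
check_pid_alive_conflict() {
	local pid=$1
	sleep 1 # same grace period as the original helper
	# kill -0 delivers no signal; it only tests whether the pid exists
	if kill -0 "$pid" 2>/dev/null; then
		echo "No conflict"
		return 0
	fi
	echo "Conflict"
	return 1
}

# Case 1: OP1 stand-in is still running when the check is made.
sleep 5 &
op1=$!
alive_result=$(check_pid_alive_conflict "$op1") || true
kill "$op1" 2>/dev/null || true
wait "$op1" 2>/dev/null || true

# Case 2: OP1 stand-in has already exited and been reaped.
true &
op1=$!
wait "$op1" || true
gone_result=$(check_pid_alive_conflict "$op1") || true

echo "OP1 still running: $alive_result"
echo "OP1 already gone:  $gone_result"
```

In the real test the pid belongs to the first operation, which is held on the server by the fail_loc delay, so whether it is still alive after the second operation returns is what the check relies on.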
| Comment by Jian Yu [ 11/Jun/15 ] |
|
On master branch, this is a regression failure introduced by patch http://review.whamcloud.com/14495. |
| Comment by Di Wang [ 19/Jun/15 ] |
|
No, there are no protocol changes in the patch. The failures on 2.5.5 are caused by mistakes made when porting the patch http://review.whamcloud.com/#/c/14495/ from master to b2_5. I am not sure about the failure on master. It is not even about DNE, because all of the operations in sanityn 40a are done under a non-striped directory, so I am not sure why it would be related to 14495. |
| Comment by Di Wang [ 20/Jun/15 ] |
|
Just checked the debug log; the failure seems to be caused by the test script itself.

On client1, it does mkdir $DIR1/$tfile:

0000080:00200000:1.0:1432173251.411997:0:5365:0:(namei.c:1153:ll_mkdir()) VFS Op:name=f40a.sanityn, dir=[0x200000007:0x1:0x0](ffff88006b8c0a98)

On the server side, because of OBD_FAIL_MDS_PDO_LOCK, the request handler goes into a scheduled timeout for 15 seconds. Note: during the timeout, it actually holds the update lock of DIR1 (root).

Then client2 does touch $DIR2/$tfile-2. First it tries to revalidate DIR2:

00000002:00010000:0.0:1432173252.415097:0:5367:0:(mdc_locks.c:1148:mdc_intent_lock()) (name: ,[0x200000007:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: lookup flags 00

But it seems that on the MDT side, the UPDATE lock is added back because of:

	/* If the file has not been changed for some time, we
	 * return not only a LOOKUP lock, but also an UPDATE
	 * lock and this might save us RPC on later STAT. For
	 * directories, it also let negative dentry cache start
	 * working for this dir. */
	if (ma->ma_valid & MA_INODE &&
	    ma->ma_attr.la_valid & LA_CTIME &&
	    info->mti_mdt->mdt_namespace->ns_ctime_age_limit +
		ma->ma_attr.la_ctime < cfs_time_current_sec())
		child_bits |= MDS_INODELOCK_UPDATE;

Unfortunately, this is a FID revalidate (no name in the request), so there is no PDO lock, and it is blocked by the lock above (held by the client1 request handler).

Hmm, I guess we can just add "touch $DIR2" before client1's "mkdir $DIR1/$tfile". Then the ctime_age_limit check will not be triggered, i.e. the UPDATE lock will not be required during revalidate of /, and the problem should be fixed. |
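The age check in the quoted MDT code can be worked through with plain arithmetic. The sketch below only mirrors the comparison `ns_ctime_age_limit + la_ctime < now`; the function name `update_lock_added` and the age limit of 10 seconds are illustrative assumptions, not values read from a real MDT, and the "now" timestamp is simply taken from the client2 log line.

```shell
# Hypothetical helper mirroring the quoted condition:
#   ns_ctime_age_limit + la_ctime < current time  ->  UPDATE bit is added
update_lock_added() {
	local ctime=$1 age_limit=$2 now=$3
	[ $((age_limit + ctime)) -lt "$now" ]
}

now=1432173252   # seconds, from the mdc_intent_lock() log line
age_limit=10     # illustrative assumption, not read from a real MDT

# Directory last changed 60 s ago: older than the limit, UPDATE is added.
if update_lock_added $((now - 60)) "$age_limit" "$now"; then
	stale=yes
else
	stale=no
fi

# Directory changed 1 s ago (for example by a preceding touch): not added.
if update_lock_added $((now - 1)) "$age_limit" "$now"; then
	fresh=yes
else
	fresh=no
fi

echo "directory unchanged for 60s: UPDATE added? $stale"
echo "directory changed 1s ago:    UPDATE added? $fresh"
```

This is the effect the proposed "touch $DIR2" relies on: refreshing the directory ctime just before the test keeps the comparison false, so the revalidate does not request the UPDATE lock.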
| Comment by Gerrit Updater [ 20/Jun/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15366 |
| Comment by Gerrit Updater [ 09/Jul/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15366/ |
| Comment by Peter Jones [ 09/Jul/15 ] |
|
Landed for 2.8 |