[LU-14383] race between lookup and migrate Created: 29/Jan/21  Updated: 12/Jan/22  Resolved: 12/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following race between mkdir and migrate may end with failure of mkdir:

       mkdir $DIR1/$tdir

       mkdir $DIR2/$tdir/dir2 &
       $LFS migrate -m 1 $DIR1/$tdir

Please confirm whether this is expected behavior or has to be fixed.



 Comments   
Comment by Gerrit Updater [ 29/Jan/21 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/41364
Subject: LU-14383 tests: lookup and migrate race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9a235febb1ab331929ef57c84e95cbe479596e1

Comment by Lai Siyao [ 05/Feb/21 ]

This looks to be expected in the current implementation: "mkdir $DIR2/$tdir/dir2" tries to create "dir2" under the current layout, but if the server finds that the layout has changed, it may return an error. To solve this race, the client needs to refresh the layout of $tdir and try again.
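
Until the client itself refreshes the layout and retries, the closest approximation from a script is to rerun the failed command. The following is only a minimal sketch of that idea: retry_cmd is a hypothetical helper, not part of Lustre or its test framework, and the attempt count and delay are arbitrary assumptions.

```shell
# Hypothetical helper, not part of Lustre or its test framework.
# Usage: retry_cmd ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_cmd() {
    _retry_attempts=$1
    _retry_delay=$2
    shift 2
    _retry_i=1
    while [ "$_retry_i" -le "$_retry_attempts" ]; do
        # Success on any attempt ends the loop immediately.
        "$@" && return 0
        # Pause before the next attempt, but not after the last one.
        if [ "$_retry_i" -lt "$_retry_attempts" ]; then
            sleep "$_retry_delay"
        fi
        _retry_i=$((_retry_i + 1))
    done
    # Every attempt failed.
    return 1
}
```

It could be invoked as: retry_cmd 3 1 mkdir "$DIR2/$tdir/dir2". A blind retry like this cannot distinguish a lost race from a genuine failure, so it is only reasonable where the caller already knows the operation should succeed.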

Comment by Vladimir Saveliev [ 09/Feb/21 ]

This looks to be expected in current implementation: "mkdir $DIR2/$tdir/dir2" tried to create "dir2" under current layout, however if server found the layout is changed, it may return error.

ok

Btw, the same applies to stat:

@@ -4631,12 +4631,13 @@ test_80c() {
        [ $MDSCOUNT -lt 2 ] && skip "needs >= 2 MDTs" && return

        mkdir $DIR1/$tdir
+       touch $DIR1/$tdir/file
        #define OBD_FAIL_MDS_OBJECT_LOCK_DELAY  0x16b
        do_facet mds1 $LCTL set_param fail_loc=0x8000016b
-       mkdir $DIR2/$tdir/dir2 &
-       MKDIRPID=$!
+       stat $DIR2/$tdir/file &
+       STATPID=$!
        $LFS migrate -m 1 $DIR1/$tdir
-       wait $MKDIRPID
+       wait $STATPID
        [ $? -eq 0 ] || error "stat failed"
 }

This results in:

== sanityn test 80c: Lookup and migrate race ========================================================= 18:04:30 (1612796670)
fail_loc=0x8000016b
stat: cannot stat '/mnt/lustre2/d80c.sanityn/file': No such file or directory

If this is ok as well, then sanityn.sh:test_80b() should not break its "accessing the migrating directory" loop with:

                stat $migrate_dir2/file5 > /dev/null || {
                        echo "stat file5 fails"
                        break
                }

To solve this race, client needs to refresh layout of $tdir and try again.

With -ENOENT, clients cannot tell whether they should try again. Would it be possible to return -EAGAIN in case of a race with migrate?
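
The ambiguity can be illustrated with a small decision table. This is a sketch only: retry_policy is a hypothetical function, and returning -EAGAIN on a race with migrate is the behavior proposed in this comment, not something the server is known to do.

```shell
# Hypothetical policy table: given an errno name, decide whether a caller
# that may be racing with "lfs migrate" should retry the operation.
retry_policy() {
    case "$1" in
    EAGAIN) echo retry ;;      # transient by definition: safe to try again
    ENOENT) echo ambiguous ;;  # entry really missing, or race with migrate?
    *)      echo fail ;;       # any other error is treated as final
    esac
}
```

With a distinct error code the caller gets a definite answer; with ENOENT it has to guess.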

Comment by Lai Siyao [ 09/Feb/21 ]

This issue should be fixed by the FID map, which is tracked under LU-7607. The patch is at https://review.whamcloud.com/#/c/38233/, but I haven't had time to update it recently.
