Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.5.5
    • Severity: 3

    Description

      This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/0859146e-f35b-11e4-9186-5254006e85c2.

      The sub-test test_40a failed with the following error:

      create is blocked
      

      Please provide additional information about the failure here.

      Info required for matching: sanityn 40a

    Activity

            [LU-6570] sanityn test_40a: create is blocked
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15366/
            Subject: LU-6570 tests: fix 40a in sanityn.sh
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2083f8a34b556c9e64e09a4529df7d9bdbdd7532


            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15366
            Subject: LU-6570 tests: fix 40a in sanityn.sh
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 459bb83d92ddc50aa5caa71b629ba29e53266120

            di.wang Di Wang added a comment -

            Just checked the debug log; the failure seems to be caused by the test script itself.

            On client1, the test runs mkdir $DIR1/$tfile:

            0000080:00200000:1.0:1432173251.411997:0:5365:0:(namei.c:1153:ll_mkdir()) VFS Op:name=f40a.sanityn, dir=[0x200000007:0x1:0x0](ffff88006b8c0a98)

            On the server side, because of OBD_FAIL_MDS_PDO_LOCK, the request handler is delayed for 15 seconds. Note: during this timeout it still holds the UPDATE lock on DIR1 (the root directory).
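            For context, a minimal sketch of how this kind of delay is injected via the fail_loc mechanism (the 0x145 value for OBD_FAIL_MDS_PDO_LOCK and the exact test lines are assumptions here, not quoted from this ticket):

                #define OBD_FAIL_MDS_PDO_LOCK 0x145
                # while fail_loc is armed, the MDT request handler sleeps
                # ~15s in the PDO lock path, still holding the parent
                # directory's UPDATE lock
                do_facet $SINGLEMDS lctl set_param fail_loc=0x145
                mkdir $DIR1/$tfile &    # client1: create stalls on the MDS
                PID1=$!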

            Then client2 runs touch $DIR2/$tfile-2; it first tries to revalidate DIR2, and usually the client only requires a LOOKUP lock for this revalidation:

            00000002:00010000:0.0:1432173252.415097:0:5367:0:(mdc_locks.c:1148:mdc_intent_lock()) (name: ,[0x200000007:0x1:0x0]) in obj [0x200000007:0x1:0x0], intent: lookup flags 00
            

            But it seems that on the MDT side the UPDATE lock is added back because of:

                /* If the file has not been changed for some time, we
                 * return not only a LOOKUP lock, but also an UPDATE
                 * lock and this might save us RPC on later STAT. For
                 * directories, it also let negative dentry cache start
                 * working for this dir. */
                if (ma->ma_valid & MA_INODE &&
                    ma->ma_attr.la_valid & LA_CTIME &&
                    info->mti_mdt->mdt_namespace->ns_ctime_age_limit +
                        ma->ma_attr.la_ctime < cfs_time_current_sec())
                        child_bits |= MDS_INODELOCK_UPDATE;
            

            But unfortunately, this is a FID revalidation (no name in the request), so no PDO lock is taken; it is therefore blocked by the lock above (held by the client1 request handler).
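
            To make the condition above concrete, here is the same age check transcribed into shell arithmetic (a hypothetical illustration; the 10-second age_limit is an assumed default for ns_ctime_age_limit, which is a tunable):

                # if the directory ctime is older than ns_ctime_age_limit,
                # the MDT widens the reply lock from LOOKUP to LOOKUP|UPDATE
                ctime=$(stat -c %Z $DIR2)    # directory change time (epoch seconds)
                age_limit=10                 # assumed ns_ctime_age_limit default
                now=$(date +%s)
                if (( ctime + age_limit < now )); then
                        echo "UPDATE lock would be added to the reply"
                fi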

            Hmm, I guess we can just add "touch $DIR2" before client1's "mkdir $DIR1/$tfile"; then the ctime_age_limit check will not be triggered, i.e. the UPDATE lock will not be requested while revalidating /, and the problem should be fixed. See the sketch below.
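
            A minimal sketch of that workaround in the style of the test script (the surrounding test lines are assumptions; the actual change landed as http://review.whamcloud.com/15366):

                # refresh the root directory ctime just before the race so
                # the MDT's ctime-age check does not widen client2's
                # revalidation reply to an UPDATE lock
                touch $DIR2
                mkdir $DIR1/$tfile &     # client1's delayed create
                PID1=$!
                touch $DIR2/$tfile-2     # client2: should no longer block
                check_pdo_conflict $PID1 || error "create is blocked"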

            di.wang Di Wang added a comment -

            No, there are no protocol changes in the patch. The failures on 2.5.5 are because of mistakes made when porting patch http://review.whamcloud.com/#/c/14495/ from master to b2_5. I am not sure about the failure on master; it is not even about DNE, because all operations in sanityn 40a are done under a non-striped directory, so it is unclear why it would be related to 14495.

            yujian Jian Yu added a comment -

            On the master branch, this is a regression introduced by patch http://review.whamcloud.com/14495.


            adilger Andreas Dilger added a comment -

            It looks like this test may be racy under load:

            # check that pid exists hence second operation wasn't blocked by first one
            # if it is so then there is no conflict, return 0
            # else second operation is conflicting with first one, return 1
            check_pdo_conflict() {
                    local pid=$1
                    local conflict=0
                    sleep 1 # to ensure OP1 is finished on client if OP2 is blocked by OP1
                    if [[ `ps --pid $pid | wc -l` == 1 ]]; then
                            conflict=1
                            echo "Conflict"
                    else
                            echo "No conflict"
                    fi
                    return $conflict
            }
            

            If a process hangs around for 1s, would that cause a failure? In any case, this has only ever happened once, so it isn't clear whether it was just a very narrow race window or something else. I looked through the other sanityn tests for patch http://review.whamcloud.com/14100, where this was hit, but found no other cases of this failure.
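
            For reference, the "ps --pid $pid | wc -l" test counts the ps header line: 1 line means the pid is gone, 2 lines means it is still running. A hypothetical, slightly more direct equivalent of the helper (not from this ticket) using kill -0:

                # kill -0 sends no signal; it only tests whether the
                # process still exists
                check_pdo_conflict_alt() {
                        local pid=$1
                        sleep 1   # same settle time as the original helper
                        if kill -0 $pid 2>/dev/null; then
                                echo "No conflict"   # OP1 still running
                                return 0
                        fi
                        echo "Conflict"              # OP1 already exited
                        return 1
                }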


            People

              Assignee: di.wang Di Wang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
