
a race between glimpse and lock cancel is not handled correctly

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0

    Description

      A customer reported that their applications see a file size of zero for a file that another client has just written.

      Test details: one client application (a bash script) prepares temporary files that other applications (also bash scripts) then stat and read. Each temporary file is read by many computing threads, and one or two of them fail to get the correct file size.

      The issue was root-caused to a race between glimpse and lock cancel: the glimpse fails to get the file attributes from the client, but the OST object has not been written to the OST yet, so the server cannot fetch the correct file (OST object) size from disk.

      The problem can be illustrated by the following reproducer:

      diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
      index a728bef..7d19efc 100644
      --- a/lustre/include/obd_support.h
      +++ b/lustre/include/obd_support.h
      @@ -416,6 +416,7 @@ extern char obd_jobid_var[];
       #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM 0x413
       #define OBD_FAIL_OSC_DELAY_IO            0x414
       #define OBD_FAIL_OSC_NO_SIZE_DATA        0x415
      +#define OBD_FAIL_OSC_DELAY_CANCEL        0x416
       
       #define OBD_FAIL_PTLRPC                  0x500
       #define OBD_FAIL_PTLRPC_ACK              0x501
      diff --git a/lustre/osc/osc_lock.c b/lustre/osc/osc_lock.c
      index 85ab132..e1c1f1a 100644
      --- a/lustre/osc/osc_lock.c
      +++ b/lustre/osc/osc_lock.c
      @@ -430,6 +430,8 @@ static int osc_dlm_blocking_ast0(const struct lu_env *env,
       
              unlock_res_and_lock(dlmlock);
       
      +       OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_CANCEL, 5);
      +
              /* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
               * the object has been destroyed. */
              if (obj != NULL) {
      diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
      index db37781..649e172 100755
      --- a/lustre/tests/sanityn.sh
      +++ b/lustre/tests/sanityn.sh
      @@ -4875,6 +4875,17 @@ test_104() {
       }
       run_test 104 "Verify that MDS stores atime/mtime/ctime during close"
       
      +test_105() {
      +       test_mkdir -p $DIR/$tdir
      +       echo test > $DIR/$tdir/$tfile
      +       $LCTL set_param fail_loc=0x416
      +       cancel_lru_locks osc & sleep 1
      +       stat $DIR2/$tdir/$tfile
      +       wait
      +       stat $DIR2/$tdir/$tfile
      +}
      +run_test 105 "Test size correctness"
      +
       log "cleanup: ======================================================"
       
       # kill and wait in each test only guarentee script finish, but command in script
      

      test output:

      ...
      == sanityn test 105: Test size correctness =========================================================== 13:42:25 (1578912145)
      striped dir -i1 -c2 /mnt/lustre/d105.sanityn
      fail_loc=0x416
        File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2020-01-13 13:42:26.000000000 +0300
      Modify: 2020-01-13 13:42:26.000000000 +0300
      Change: 2020-01-13 13:42:26.000000000 +0300
       Birth: -
        File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
        Size: 5         	Blocks: 8          IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2020-01-13 13:42:26.000000000 +0300
      Modify: 2020-01-13 13:42:26.000000000 +0300
      Change: 2020-01-13 13:42:26.000000000 +0300
       Birth: -
      Resetting fail_loc on all nodes...done.
      PASS 105 (6s)
      cleanup: ======================================================
      ...
      

      Please ignore the PASS mark: the first stat command returns "Size: 0 Blocks: 0", which should not happen because a client has already written the file.
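
      Because the reproducer only prints the stat output, the test passes even when the size is wrong. A minimal sketch of a size check that would turn the bad result into a failure is shown below. The helper name check_size is hypothetical and not part of sanityn.sh, and it assumes GNU stat (the -c %s format). The expected size of 5 bytes comes from "echo test", which writes four characters plus a newline.

```shell
#!/bin/bash
# Hypothetical helper (not part of sanityn.sh): succeed only if the size
# reported by stat matches the expected byte count.
check_size() {
	local file=$1
	local expected=$2
	local actual

	# Assumes GNU stat: %s prints the total size in bytes.
	actual=$(stat -c %s "$file") || return 2

	if [ "$actual" -ne "$expected" ]; then
		echo "size mismatch on $file: got $actual, expected $expected" >&2
		return 1
	fi
	return 0
}
```

      In the reproducer, each "stat $DIR2/$tdir/$tfile" call could then be followed by something like: check_size $DIR2/$tdir/$tfile 5 || error "wrong size seen from second mount", where error is the test framework's failure helper.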

          Activity

            [LU-13128] a race between glimpse and lock cancel is not handled correctly
            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37215/
            Subject: LU-13128 osc: glimpse and lock cancel race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7c99f67d9d39e8a037e830cf08a9df305e6d8da2

            zamcray Alexander Zarochentsev (Inactive) added a comment -

            A run of the same reproducer with https://review.whamcloud.com/37215 applied:

            == sanityn test 105: Test size correctness =========================================================== 14:14:20 (1578914060)
            striped dir -i1 -c2 /mnt/lustre/d105.sanityn
            fail_loc=0x416
              File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
              Size: 5         	Blocks: 1          IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115373044662275  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2020-01-13 14:14:20.000000000 +0300
            Modify: 2020-01-13 14:14:20.000000000 +0300
            Change: 2020-01-13 14:14:20.000000000 +0300
             Birth: -
              File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
              Size: 5         	Blocks: 8          IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115373044662275  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2020-01-13 14:14:20.000000000 +0300
            Modify: 2020-01-13 14:14:20.000000000 +0300
            Change: 2020-01-13 14:14:20.000000000 +0300
             Birth: -
            Resetting fail_loc on all nodes...done.
            PASS 105 (5s)
            cleanup: ======================================================
            
            gerrit Gerrit Updater added a comment -

            Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/37215
            Subject: LU-13128 osc: glimpse and lock cancel race
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a01e5519b65a3cde9c00ecdea60bec164163573b

            People

              zam Alexander Zarochentsev
              zamcray Alexander Zarochentsev (Inactive)