  Lustre / LU-13128

a race between glimpse and lock cancel is not handled correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3

      Description

      A customer reported that their applications see a file size of zero for a file that was just written by another client.

      Test details: one client application (a bash script) prepares temporary files, which other applications (also bash scripts) then stat and read. Each temporary file is read by many computing threads, and one or two of them fail to get the correct file size.

      The issue was root-caused as a race between glimpse and lock cancel: the glimpse fails to get the file attributes from the client, but the OST object has not yet been written to the OST, so the server cannot fetch the correct file (OST object) size from disk.

      The problem can be illustrated by the following reproducer:

      diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
      index a728bef..7d19efc 100644
      --- a/lustre/include/obd_support.h
      +++ b/lustre/include/obd_support.h
      @@ -416,6 +416,7 @@ extern char obd_jobid_var[];
       #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM 0x413
       #define OBD_FAIL_OSC_DELAY_IO            0x414
       #define OBD_FAIL_OSC_NO_SIZE_DATA        0x415
      +#define OBD_FAIL_OSC_DELAY_CANCEL        0x416
       
       #define OBD_FAIL_PTLRPC                  0x500
       #define OBD_FAIL_PTLRPC_ACK              0x501
      diff --git a/lustre/osc/osc_lock.c b/lustre/osc/osc_lock.c
      index 85ab132..e1c1f1a 100644
      --- a/lustre/osc/osc_lock.c
      +++ b/lustre/osc/osc_lock.c
      @@ -430,6 +430,8 @@ static int osc_dlm_blocking_ast0(const struct lu_env *env,
       
              unlock_res_and_lock(dlmlock);
       
      +       OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_CANCEL, 5);
      +
              /* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
               * the object has been destroyed. */
              if (obj != NULL) {
      diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
      index db37781..649e172 100755
      --- a/lustre/tests/sanityn.sh
      +++ b/lustre/tests/sanityn.sh
      @@ -4875,6 +4875,17 @@ test_104() {
       }
       run_test 104 "Verify that MDS stores atime/mtime/ctime during close"
       
      +test_105() {
      +       test_mkdir -p $DIR/$tdir
      +       echo test > $DIR/$tdir/$tfile
      +       $LCTL set_param fail_loc=0x416
      +       cancel_lru_locks osc & sleep 1
      +       stat $DIR2/$tdir/$tfile
      +       wait
      +       stat $DIR2/$tdir/$tfile
      +}
      +run_test 105 "Test size correctness"
      +
       log "cleanup: ======================================================"
       
       # kill and wait in each test only guarentee script finish, but command in script
      

      Test output:

      ...
      == sanityn test 105: Test size correctness =========================================================== 13:42:25 (1578912145)
      striped dir -i1 -c2 /mnt/lustre/d105.sanityn
      fail_loc=0x416
        File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2020-01-13 13:42:26.000000000 +0300
      Modify: 2020-01-13 13:42:26.000000000 +0300
      Change: 2020-01-13 13:42:26.000000000 +0300
       Birth: -
        File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
        Size: 5         	Blocks: 8          IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2020-01-13 13:42:26.000000000 +0300
      Modify: 2020-01-13 13:42:26.000000000 +0300
      Change: 2020-01-13 13:42:26.000000000 +0300
       Birth: -
      Resetting fail_loc on all nodes...done.
      PASS 105 (6s)
      cleanup: ======================================================
      ...
      

      Please ignore the PASS mark. The first stat command returns "Size: 0 Blocks: 0", which should not happen because a client has already written the file.
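
Because the reproducer only prints the stat output, it passes even when the size is wrong. A minimal sketch of a size check that would make the test fail instead is shown here. The helper name check_size is hypothetical and not part of the Lustre test framework; it assumes GNU stat (the -c %s format) and the 5-byte size produced by "echo test".

```shell
#!/bin/bash
# check_size FILE EXPECTED
# Hypothetical helper (not part of the Lustre test framework).
# Returns 0 if FILE has exactly EXPECTED bytes, 1 on a size mismatch,
# and 2 if stat itself fails.
check_size() {
	local file=$1
	local expected=$2
	local actual

	# GNU stat: %s prints the file size in bytes
	actual=$(stat -c %s "$file") || return 2

	if [ "$actual" -ne "$expected" ]; then
		echo "size mismatch on $file: got $actual, expected $expected" >&2
		return 1
	fi
	return 0
}
```

In the reproducer, each "stat $DIR2/$tdir/$tfile" line could then be replaced by something like "check_size $DIR2/$tdir/$tfile 5 || error "wrong size seen during lock cancel"", assuming the test framework's usual error helper is available. With the race present, the first check would be expected to fail rather than report PASS.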

            People

            • Assignee:
              zam Alexander Zarochentsev
              Reporter:
              zamcray Alexander Zarochentsev