[LU-13128] a race between glimpse and lock cancel is not handled correctly Created: 13/Jan/20  Updated: 23/Nov/20  Resolved: 08/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Critical
Reporter: Alexander Zarochentsev (Inactive) Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13645 Various data corruptions possible in ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A customer reported that their applications see a file size of zero for a file that another client has just written.

Test details: one client application (a bash script) prepares temporary files, which other applications (also bash scripts) then stat and read. Each temporary file is read by many compute threads, and one or two of them fail to get the correct file size.

The issue was root-caused as a race between glimpse and lock cancel: the glimpse fails to get the file attributes from the client, but the OST object has not yet been written to the OST, so the server cannot fetch the correct file (OST object) size from disk.

The problem can be illustrated by the following reproducer:

diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index a728bef..7d19efc 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -416,6 +416,7 @@ extern char obd_jobid_var[];
 #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM 0x413
 #define OBD_FAIL_OSC_DELAY_IO            0x414
 #define OBD_FAIL_OSC_NO_SIZE_DATA        0x415
+#define OBD_FAIL_OSC_DELAY_CANCEL        0x416
 
 #define OBD_FAIL_PTLRPC                  0x500
 #define OBD_FAIL_PTLRPC_ACK              0x501
diff --git a/lustre/osc/osc_lock.c b/lustre/osc/osc_lock.c
index 85ab132..e1c1f1a 100644
--- a/lustre/osc/osc_lock.c
+++ b/lustre/osc/osc_lock.c
@@ -430,6 +430,8 @@ static int osc_dlm_blocking_ast0(const struct lu_env *env,
 
        unlock_res_and_lock(dlmlock);
 
+       OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_CANCEL, 5);
+
        /* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
         * the object has been destroyed. */
        if (obj != NULL) {
diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
index db37781..649e172 100755
--- a/lustre/tests/sanityn.sh
+++ b/lustre/tests/sanityn.sh
@@ -4875,6 +4875,17 @@ test_104() {
 }
 run_test 104 "Verify that MDS stores atime/mtime/ctime during close"
 
+test_105() {
+       test_mkdir -p $DIR/$tdir
+       echo test > $DIR/$tdir/$tfile
+       $LCTL set_param fail_loc=0x416
+       cancel_lru_locks osc & sleep 1
+       stat $DIR2/$tdir/$tfile
+       wait
+       stat $DIR2/$tdir/$tfile
+}
+run_test 105 "Test size correctness"
+
 log "cleanup: ======================================================"
 
 # kill and wait in each test only guarentee script finish, but command in script

test output:

...
== sanityn test 105: Test size correctness =========================================================== 13:42:25 (1578912145)
striped dir -i1 -c2 /mnt/lustre/d105.sanityn
fail_loc=0x416
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-13 13:42:26.000000000 +0300
Modify: 2020-01-13 13:42:26.000000000 +0300
Change: 2020-01-13 13:42:26.000000000 +0300
 Birth: -
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 5         	Blocks: 8          IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115339490230275  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-13 13:42:26.000000000 +0300
Modify: 2020-01-13 13:42:26.000000000 +0300
Change: 2020-01-13 13:42:26.000000000 +0300
 Birth: -
Resetting fail_loc on all nodes...done.
PASS 105 (6s)
cleanup: ======================================================
...

Please ignore the PASS mark: the test only prints the stat output and does not check the size. The first stat command returns "Size: 0 Blocks: 0", which should not happen because a client has already written the file.
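
Because the reproducer only prints the stat output, it reports PASS even when the size is wrong. A minimal sketch of how the size could be checked explicitly is shown below. The helper name check_size is hypothetical and is not part of the Lustre test framework; the sketch assumes GNU stat (the -c %s format) and is not necessarily how the landed patch verifies the size.

```shell
# Hypothetical helper (not part of test-framework.sh): compare the size
# reported by stat with an expected value.
# Returns 0 on match, 1 on mismatch, 2 if stat itself fails.
check_size() {
	local file=$1
	local expected=$2
	local actual

	# %s prints the total size in bytes (GNU coreutils stat)
	actual=$(stat -c %s "$file") || return 2

	if [ "$actual" -ne "$expected" ]; then
		echo "size mismatch on $file: got $actual, expected $expected" >&2
		return 1
	fi
	return 0
}
```

In the reproducer, each bare stat call on $DIR2/$tdir/$tfile could then be replaced by something like check_size $DIR2/$tdir/$tfile 5, followed by the framework's error call on failure, since "echo test" writes 5 bytes. With such a check, the first stat in the unpatched run above would fail the test instead of passing.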



 Comments   
Comment by Gerrit Updater [ 13/Jan/20 ]

Alexander Zarochentsev (c17826@cray.com) uploaded a new patch: https://review.whamcloud.com/37215
Subject: LU-13128 osc: glimpse and lock cancel race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a01e5519b65a3cde9c00ecdea60bec164163573b

Comment by Alexander Zarochentsev (Inactive) [ 13/Jan/20 ]

A run of the same reproducer with https://review.whamcloud.com/37215 applied:

== sanityn test 105: Test size correctness =========================================================== 14:14:20 (1578914060)
striped dir -i1 -c2 /mnt/lustre/d105.sanityn
fail_loc=0x416
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 5         	Blocks: 1          IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115373044662275  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-13 14:14:20.000000000 +0300
Modify: 2020-01-13 14:14:20.000000000 +0300
Change: 2020-01-13 14:14:20.000000000 +0300
 Birth: -
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 5         	Blocks: 8          IO Block: 4194304 regular file
Device: 2c54f966h/743766374d	Inode: 144115373044662275  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-13 14:14:20.000000000 +0300
Modify: 2020-01-13 14:14:20.000000000 +0300
Change: 2020-01-13 14:14:20.000000000 +0300
 Birth: -
Resetting fail_loc on all nodes...done.
PASS 105 (5s)
cleanup: ======================================================
Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37215/
Subject: LU-13128 osc: glimpse and lock cancel race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c99f67d9d39e8a037e830cf08a9df305e6d8da2

Comment by Peter Jones [ 08/Feb/20 ]

Landed for 2.14
