Details
- Bug
- Resolution: Fixed
- Critical
Description
A customer reported that their applications see a file size of zero for a file that has just been written by another client.
Test details: one client application (a bash script) prepares temporary files, which other applications (also bash scripts) then stat and read. Each temporary file is read by many computing threads, and one or two of them fail to get the correct file size.
The issue was root-caused to a race between glimpse and lock cancel: the glimpse fails to get the file attributes from the client, but the OST object has not yet been written to the OST, so the server cannot fetch the correct file (OST object) size from disk.
The problem can be illustrated by the following reproducer:
diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index a728bef..7d19efc 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -416,6 +416,7 @@ extern char obd_jobid_var[];
 #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM 0x413
 #define OBD_FAIL_OSC_DELAY_IO 0x414
 #define OBD_FAIL_OSC_NO_SIZE_DATA 0x415
+#define OBD_FAIL_OSC_DELAY_CANCEL 0x416
 
 #define OBD_FAIL_PTLRPC 0x500
 #define OBD_FAIL_PTLRPC_ACK 0x501
diff --git a/lustre/osc/osc_lock.c b/lustre/osc/osc_lock.c
index 85ab132..e1c1f1a 100644
--- a/lustre/osc/osc_lock.c
+++ b/lustre/osc/osc_lock.c
@@ -430,6 +430,8 @@ static int osc_dlm_blocking_ast0(const struct lu_env *env,
 
 	unlock_res_and_lock(dlmlock);
 
+	OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_CANCEL, 5);
+
 	/* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
 	 * the object has been destroyed. */
 	if (obj != NULL) {
diff --git a/lustre/tests/sanityn.sh b/lustre/tests/sanityn.sh
index db37781..649e172 100755
--- a/lustre/tests/sanityn.sh
+++ b/lustre/tests/sanityn.sh
@@ -4875,6 +4875,17 @@ test_104() {
 }
 run_test 104 "Verify that MDS stores atime/mtime/ctime during close"
 
+test_105() {
+	test_mkdir -p $DIR/$tdir
+	echo test > $DIR/$tdir/$tfile
+	$LCTL set_param fail_loc=0x416
+	cancel_lru_locks osc & sleep 1
+	stat $DIR2/$tdir/$tfile
+	wait
+	stat $DIR2/$tdir/$tfile
+}
+run_test 105 "Test size correctness"
+
 log "cleanup: ======================================================"
 
 # kill and wait in each test only guarentee script finish, but command in script
Test output:
...
== sanityn test 105: Test size correctness =========================================================== 13:42:25 (1578912145)
striped dir -i1 -c2 /mnt/lustre/d105.sanityn
fail_loc=0x416
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d Inode: 144115339490230275 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-01-13 13:42:26.000000000 +0300
Modify: 2020-01-13 13:42:26.000000000 +0300
Change: 2020-01-13 13:42:26.000000000 +0300
 Birth: -
  File: '/mnt/lustre2/d105.sanityn/f105.sanityn'
  Size: 5 Blocks: 8 IO Block: 4194304 regular file
Device: 2c54f966h/743766374d Inode: 144115339490230275 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-01-13 13:42:26.000000000 +0300
Modify: 2020-01-13 13:42:26.000000000 +0300
Change: 2020-01-13 13:42:26.000000000 +0300
 Birth: -
Resetting fail_loc on all nodes...done.
PASS 105 (6s)
cleanup: ======================================================
...
Please ignore the PASS mark: the first stat command returns "Size: 0 Blocks: 0", which should not happen because a client has already written the file.
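The reproducer only prints the stat output, so the test passes even when the size is wrong. A minimal sketch of a size check that would make such a run fail is shown here. It is not part of the reproducer patch: the helper name check_size is hypothetical, and it assumes GNU stat, where `-c %s` prints the size in bytes.

```shell
# Hypothetical helper, not part of the original patch.
# Compares the size reported by stat with the expected size in bytes.
# Assumes GNU coreutils stat ("-c %s" prints the file size in bytes).
check_size() {
	local file=$1
	local expected=$2
	local actual

	# Return 2 if stat itself fails (for example, the file is missing).
	actual=$(stat -c %s "$file") || return 2

	if [ "$actual" -ne "$expected" ]; then
		echo "size mismatch on $file: expected $expected, got $actual" >&2
		return 1
	fi
	return 0
}
```

In the reproducer, `echo test` writes 5 bytes, so a call such as `check_size $DIR2/$tdir/$tfile 5` in place of the first stat would return non-zero in the run shown above instead of silently printing "Size: 0".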
Issue Links
- is related to
- LU-13645 Various data corruptions possible in lustre. (Resolved)