[LU-8284] i_size updates from BRW writes are not atomic Created: 15/Jun/16 Updated: 13/Sep/16 Resolved: 13/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andrew Perepechko | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
There is a race window in osd_write_commit() between the i_size_read() check and the subsequent i_size_write() call. A test case demonstrating the issue and a fix will be uploaded shortly. |
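To illustrate the kind of check-then-update race described above, here is a minimal userspace sketch. It is not the Lustre code or the landed patch. The names `sim_inode`, `sim_extend_size` and `sim_run_two_writers` are hypothetical, and a pthread mutex stands in for whatever per-inode lock the real fix uses. The assumption is that the size should only grow when a write ends past the current size, so the comparison and the store must happen under one lock. Otherwise a writer that compared against a stale size can overwrite a larger value stored in the meantime.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for an inode; not a Lustre or kernel type. */
struct sim_inode {
    pthread_mutex_t lock;  /* stands in for a per-inode lock */
    uint64_t size;         /* stands in for i_size */
};

/*
 * Grow the size to new_end if new_end is larger. The comparison and the
 * store are done under the same lock, so no other writer can slip in
 * between them and have its larger size overwritten.
 */
static void sim_extend_size(struct sim_inode *inode, uint64_t new_end)
{
    pthread_mutex_lock(&inode->lock);
    if (new_end > inode->size)
        inode->size = new_end;
    pthread_mutex_unlock(&inode->lock);
}

struct writer_arg {
    struct sim_inode *inode;
    uint64_t end;          /* offset just past the last byte written */
};

static void *writer_thread(void *p)
{
    struct writer_arg *arg = p;

    sim_extend_size(arg->inode, arg->end);
    return NULL;
}

/* Run two concurrent writers and return the resulting size. */
static uint64_t sim_run_two_writers(uint64_t end_a, uint64_t end_b)
{
    struct sim_inode inode = { .size = 0 };
    struct writer_arg a = { &inode, end_a };
    struct writer_arg b = { &inode, end_b };
    pthread_t ta, tb;
    int rc;

    pthread_mutex_init(&inode.lock, NULL);
    rc = pthread_create(&ta, NULL, writer_thread, &a);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, writer_thread, &b);
    assert(rc == 0);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    pthread_mutex_destroy(&inode.lock);
    return inode.size;
}
```

With the lock in place, `sim_run_two_writers(1048576, 2097152)` yields the larger end offset whichever thread runs first. Without it, the thread with the smaller end could pass the comparison against the old size and then store its value after the other thread, leaving the size too small.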
| Comments |
| Comment by Gerrit Updater [ 15/Jun/16 ] |
|
Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/20815 |
| Comment by Andrew Perepechko [ 15/Jun/16 ] |
|
The test case cannot be properly integrated with the fix, because that would require sleeping under a spinlock or some other meaningless operation simply for the sake of having a test case. The test case is uploaded only to show that there is an issue with the current code; it is not intended to be landed. |
| Comment by Gerrit Updater [ 15/Jun/16 ] |
|
Andrew Perepechko (andrew.perepechko@seagate.com) uploaded a new patch: http://review.whamcloud.com/20816 |
| Comment by Andrew Perepechko [ 15/Jun/16 ] |
|
That's how the issue is reproduced locally with the above test case:
[root@panda-testbox tests]# REFORMAT=yes ONLY=258 bash sanity.sh
Logging to shared log directory: /tmp/test_logs/1466023647
Client: Lustre version: 2.8.54_60_g2a55f34
MDS: Lustre version: 2.8.54_60_g2a55f34
OSS: Lustre version: 2.8.54_60_g2a55f34
Stopping clients: panda-testbox /mnt/lustre (opts:)
Stopping clients: panda-testbox /mnt/lustre2 (opts:)
Loading modules from /mnt/nfs/xyratex/lustre-release/lustre
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
subsystem_debug=all -lnet -lnd -pinger
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mds1: /dev/mapper/vg_livecd-mdt
Format ost1: /dev/mapper/vg_livecd-ost1
Format ost2: /dev/mapper/vg_livecd-ost2
Format ost3: /dev/mapper/vg_livecd-ost3
Checking servers environments
Checking clients panda-testbox environments
Loading modules from /mnt/nfs/xyratex/lustre-release/lustre
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck
subsystem_debug=all -lnet -lnd -pinger
gss/krb5 is not supported
Setup mgs, mdt, osts
Starting mds1: /dev/mapper/vg_livecd-mdt /mnt/lustre-mds1
Started lustre-MDT0000
Starting ost1: /dev/mapper/vg_livecd-ost1 /mnt/lustre-ost1
Started lustre-OST0000
Starting ost2: /dev/mapper/vg_livecd-ost2 /mnt/lustre-ost2
Started lustre-OST0001
Starting ost3: /dev/mapper/vg_livecd-ost3 /mnt/lustre-ost3
Started lustre-OST0002
Starting client: panda-testbox: -o user_xattr,flock panda-testbox@tcp:/lustre /mnt/lustre
Starting client panda-testbox: -o user_xattr,flock panda-testbox@tcp:/lustre /mnt/lustre
Started clients panda-testbox: panda-testbox@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,flock)
Using TIMEOUT=20
seting jobstats to procname_uid
Setting lustre.sys.jobid_var from disable to procname_uid
warning: 'lctl conf_param' is deprecated, use 'lctl set_param -P' instead
Waiting 90 secs for update
Updated after 3s: wanted 'procname_uid' got 'procname_uid'
disable quota as required
osd-ldiskfs.track_declares_assert=1
running as uid/gid/euid/egid 500/500/500/500, groups: [touch] [/mnt/lustre/d0_runas_test/f10215]
excepting tests: 76 101g 42a 42b 42c 42d 45 68b
skipping tests SLOW=no: 24D 27m 64b 68 71 115 300o
preparing for tests involving mounts
mke2fs 1.42.13.wc4 (28-Nov-2015)
debug=-1
resend_count is set to 4 4 4
resend_count is set to 4 4 4
resend_count is set to 4 4 4
resend_count is set to 4 4 4
resend_count is set to 4 4 4
== sanity test 258: i_size updates from BRW should be atomic == 00:48:16 (1466023696)
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 1.09159 s, 1.9 MB/s
fail_loc=0x80000237
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.387469 s, 5.4 MB/s
cmp: EOF on /mnt/lustre/f258.sanity
sanity test_258: @@@@@@ FAIL: files differ
Trace dump:
= /mnt/nfs/xyratex/lustre-release/lustre/tests/test-framework.sh:4780:error()
= sanity.sh:13995:test_258()
= /mnt/nfs/xyratex/lustre-release/lustre/tests/test-framework.sh:5045:run_one()
= /mnt/nfs/xyratex/lustre-release/lustre/tests/test-framework.sh:5084:run_one_logged()
= /mnt/nfs/xyratex/lustre-release/lustre/tests/test-framework.sh:4882:run_test()
= sanity.sh:14000:main()
Dumping lctl log to /tmp/test_logs/1466023647/sanity.test_258.*.1466023700.log
Dumping logs only on local client.
Resetting fail_loc on all nodes...done.
FAIL 258 (6s)
== sanity test complete, duration 55 sec == 00:48:22 (1466023702)
sanity: FAIL: test_258 files differ
Stopping clients: panda-testbox /mnt/lustre (opts:-f)
Stopping client panda-testbox /mnt/lustre opts:-f
Stopping clients: panda-testbox /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-mds1 (opts:-f) on panda-testbox
Stopping /mnt/lustre-ost1 (opts:-f) on panda-testbox
Stopping /mnt/lustre-ost2 (opts:-f) on panda-testbox
Stopping /mnt/lustre-ost3 (opts:-f) on panda-testbox
waited 0 for 10 ST ost OSS OSS_uuid 0
modules unloaded. |
| Comment by Gerrit Updater [ 11/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20816/ |
| Comment by Joseph Gmitter (Inactive) [ 13/Jul/16 ] |
|
Patch landed to master for 2.9.0 |
| Comment by Gerrit Updater [ 24/Aug/16 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/22103 |
| Comment by Gerrit Updater [ 13/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22103/ |
| Comment by Peter Jones [ 13/Sep/16 ] |
|
Landed for 2.9 |