[LU-7956] Over softquota file corruption Created: 30/Mar/16 Updated: 14/Jul/16 Resolved: 14/Jul/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 4 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
We have a confirmed case of over soft quota causing file corruption. We had other reports in the past but couldn't confirm them until now. The test job writes 418 HDF4 files and about 10-15 of them will be corrupted. We confirmed this on 2 different filesystems. The files are single-striped and are written from a single MPI rank. In the attached debug file one of the corrupted files was written to nocache-OST0047 at 1459304302 |
| Comments |
| Comment by Niu Yawei (Inactive) [ 30/Mar/16 ] |
|
Could you verify whether the grace period expired and some writes failed with -EDQUOT? I didn't find anything abnormal in the provided log; could you collect logs on the client, the MDT, and the other OSS nodes as well? Thanks. |
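For reference, a minimal sketch of how per-node Lustre debug logs can be collected for this kind of reproduction run (the debug mask and output path here are illustrative, not from the ticket):

# run on the client, MDS, and each OSS before the test
lctl set_param debug=-1
lctl clear
# ... run the reproducer job ...
# dump the kernel debug buffer to a file on each node
lctl dk /tmp/lustre-debug.$(hostname).log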
| Comment by Mahmoud Hanafi [ 30/Mar/16 ] |
|
The grace period had not expired; during my testing I set the soft quota and it had 2 weeks of grace left. The code doesn't get an -EDQUOT. I know this to be the case because when I tested lowering my hard quota the code did fail to write and exited with an error. |
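For context, a minimal sketch of the kind of quota setup being described (user name, limits, and mount point are illustrative; block limits are in KB):

# 1 GB soft / 2 GB hard block limit for a test user
lfs setquota -u testuser -b 1048576 -B 2097152 /fsnocache
# block and inode grace periods of 2 weeks (1209600 seconds)
lfs setquota -t -u -b 1209600 -i 1209600 /fsnocache
# check usage, limits, and remaining grace
lfs quota -u testuser /fsnocache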
| Comment by Mahmoud Hanafi [ 30/Mar/16 ] |
|
These tests were run on 2.5.3 clients and a 2.7.1 server. Attaching logs from all job clients, the MDS, and the OSS. There is only one OSS with 6 OSTs. The corrupted file is called vr014.hdf.
[mhanafi@pfe24:/fsnocache/mhanafi/mas_fstest_template]$ lfs getstripe vr014.hdf
[mhanafi@pfe24:/fsnocache/mhanafi/mas_fstest_template]$ lfs path2fid vr014.hdf |
| Comment by Niu Yawei (Inactive) [ 31/Mar/16 ] |
|
When over the soft quota, the client may switch to sync writes. I suspect there could be a defect in the sync write path (either in Lustre or the application) that causes the corruption. Could you unset the soft limit but set fail_loc to 0x411 (which forces sync writes on the client) on all clients, to see if the problem can still be reproduced? Thanks. |
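For reference, a sketch of the suggested test (run on every client; per the comment above, 0x411 forces the client into sync writes without any quota involvement):

# force sync writes on this client
lctl set_param fail_loc=0x411
# ... rerun the job and check the output files ...
# restore normal behaviour
lctl set_param fail_loc=0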
| Comment by Mahmoud Hanafi [ 31/Mar/16 ] |
|
Setting fail_loc didn't produce any corrupted files. We have narrowed down the write that is not making it to disk. From the attached strace it would be the second-to-last write:
53385 lseek(14, -4194300, SEEK_CUR) = 4
53385 write(14, "\0\20\0\0\0\0\0\36\0\1\0\0\0\312\0\0\0\\\2\276\0\2\0\0\1&\3\230\207P\0j"..., 198) = 198 <<<<<<------ THIS ONE
53385 lseek(14, 58720256, SEEK_SET) = 58720256
The code tries to re-write/update the data header in the file, and that write is not making it to disk. From the attached debug log it is this section:
00000080:00200000:0.0:1459365869.218185:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365869.218189:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365869.218190:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365869.218191:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365869.218192:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365869.218198:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365869.218201:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365869.218202:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365869.218205:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365869.218206:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365869.218207:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365869.218209:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write |
| Comment by Niu Yawei (Inactive) [ 01/Apr/16 ] |
|
From the trace we can see the header range [4, 202) was updated twice: the first time it was written into cache (shown in the log you posted above), the second time it was written back synchronously. Perhaps the second write overwrote the first one? Is there anything suggesting that it is the first write that is not making it to disk? |
| Comment by Mahmoud Hanafi [ 01/Apr/16 ] |
|
The file is created, an empty header is written, and the file is closed. The file is reopened, the data is written, and finally the header is updated with pointers to the written data. But we end up with an empty header and the data in the file. I don't see a second write when the file is re-opened the second time. |
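To make the pattern concrete, here is a minimal C sketch of the write sequence described above and seen in the strace (file name, offsets, and sizes are illustrative and pwrite() is used for brevity; this is not the exact HDF4 I/O):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char hdr[198], data[4096];
        int fd;

        /* 1. create the file, write an empty header at offset 4, close */
        memset(hdr, 0, sizeof(hdr));
        fd = open("vr014.hdf", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0 || pwrite(fd, hdr, sizeof(hdr), 4) != (ssize_t)sizeof(hdr))
                return 1;
        close(fd);

        /* 2. reopen and write the bulk data further into the file */
        memset(data, 'D', sizeof(data));
        fd = open("vr014.hdf", O_WRONLY);
        if (fd < 0 || pwrite(fd, data, sizeof(data), 4194304) != (ssize_t)sizeof(data))
                return 1;

        /* 3. rewrite the header at offset 4 with pointers to the data;
         *    this is the update that was observed not to reach disk   */
        memset(hdr, 'H', sizeof(hdr));
        if (pwrite(fd, hdr, sizeof(hdr), 4) != (ssize_t)sizeof(hdr))
                return 1;
        close(fd);
        return 0;
}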
| Comment by Niu Yawei (Inactive) [ 01/Apr/16 ] |
|
This is the first write to [4, 202):
00000080:00200000:0.0:1459365868.977428:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365868.977431:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365868.977433:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 3, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365868.977433:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365868.977434:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365868.977439:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365868.977441:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365868.977442:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365868.977446:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365868.977446:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365868.977447:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365868.977449:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365868.977452:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 3, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365868.977452:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write
00000008:00000002:0.0:1459365868.977453:0:62879:0:(osc_io.c:211:osc_page_touch_at()) stripe KMS not increasing 294->202 294
00000080:00200000:0.0:1459365868.977459:0:62879:0:(vvp_io.c:133:vvp_io_fini()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00000080:00200000:0.0:1459365868.977460:0:62879:0:(file.c:1242:ll_file_io_generic()) iotype: 1, result: 198
We can see it's a cached write.
This is the second write to [4, 202):
00000080:00200000:0.0:1459365869.218185:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365869.218189:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365869.218190:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365869.218191:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365869.218192:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365869.218198:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365869.218201:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365869.218202:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365869.218205:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365869.218206:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365869.218207:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365869.218209:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write
00020000:00000002:0.0:1459365869.218214:0:62879:0:(lov_merge.c:75:lov_merge_lvb_kms()) MDT ID 0x4df:3025 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0
00020000:00000002:0.0:1459365869.218215:0:62879:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT ID 0x4df:3025 on OST[47]: s=60436399 m=1459365869 a=1459365869 c=1459365869 b=88072
00000001:04000000:0.0:1459365869.218216:0:62879:0:(osc_quota.c:64:osc_quota_chkdq()) chkdq found noquota for user 11312
00000008:04000000:20.0:1459365869.219182:0:5838:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1100
00000008:00000002:20.0:1459365869.219185:0:5838:0:(osc_request.c:1960:brw_interpret()) request ffff880f230c2400 aa ffff880f230c26e0 rc 0
00000020:00000002:20.0:1459365869.219335:0:5838:0:(obdo.c:70:obdo_from_inode()) valid 70c, new time 1459365869/1459365869
00000008:00000002:20.0:1459365869.219346:0:5838:0:(osc_request.c:2225:osc_build_rpc()) @@@ 1 pages, aa ffff880bdb641ae0.
now 0r/9w in flight req@ffff880bdb641800 x1528598526163360/t0(0) o4->nocache-OST002f-osc-ffff88082f125000@10.151.26.123@o2ib:6/4 lens 488/448 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00100000:20.0:1459365869.219356:0:5838:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_29:09656603-58f6-edb5-effe-53c13e5a5741:5838:1528598526163252:10.151.26.123@o2ib:4
00000100:00100000:24.0:1459365869.219357:0:5829:0:(client.c:1489:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_20:09656603-58f6-edb5-effe-53c13e5a5741:5829:1528598526163360:10.151.26.123@o2ib:4
00000008:04000000:37.0:1459365869.219641:0:5837:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000001:04000000:37.0:1459365869.219644:0:5837:0:(osc_quota.c:133:osc_quota_setdq()) nocache-OST002f-osc-ffff88082f125000: setdq to remove for user 11312 (00000000deadbeef)
00000008:00000002:37.0:1459365869.219646:0:5837:0:(osc_request.c:1960:brw_interpret()) request ffff881031f65000 aa ffff881031f652e0 rc 0
00000020:00000002:37.0:1459365869.220048:0:5837:0:(obdo.c:70:obdo_from_inode()) valid 70c, new time 1459365869/1459365869
00000008:04000000:22.0:1459365869.220489:0:5830:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:22.0:1459365869.220493:0:5830:0:(osc_request.c:1960:brw_interpret()) request ffff8810461abc00 aa ffff8810461abee0 rc 0
00000100:00100000:22.0:1459365869.220696:0:5830:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_21:09656603-58f6-edb5-effe-53c13e5a5741:5830:1528598526163268:10.151.26.123@o2ib:4
00000008:04000000:22.0:1459365869.220702:0:5830:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:22.0:1459365869.220704:0:5830:0:(osc_request.c:1960:brw_interpret()) request ffff8810461ab400 aa ffff8810461ab6e0 rc 0
00000008:00000002:37.0:1459365869.220712:0:5837:0:(osc_request.c:2225:osc_build_rpc()) @@@ 254 pages, aa ffff880f230c26e0.
now 0r/8w in flight req@ffff880f230c2400 x1528598526163364/t0(0) o4->nocache-OST002f-osc-ffff88082f125000@10.151.26.123@o2ib:6/4 lens 504/448 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00100000:33.0:1459365869.220722:0:5846:0:(client.c:1489:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_37:09656603-58f6-edb5-effe-53c13e5a5741:5846:1528598526163364:10.151.26.123@o2ib:4
00000100:00100000:24.0:1459365869.220734:0:5829:0:(client.c:2523:ptlrpc_free_committed()) nocache-OST002f-osc-ffff88082f125000: committing for last_committed 8589936700 gen 1
00000008:04000000:24.0:1459365869.220751:0:5829:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:24.0:1459365869.220754:0:5829:0:(osc_request.c:1960:brw_interpret()) request ffff880bdb641800 aa ffff880bdb641ae0 rc 0
00000008:00000002:0.0:1459365869.220764:0:62879:0:(osc_io.c:211:osc_page_touch_at()) stripe KMS not increasing 60436399->202 60436399
00000100:00100000:24.0:1459365869.220765:0:5829:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_20:09656603-58f6-edb5-effe-53c13e5a5741:5829:1528598526163360:10.151.26.123@o2ib:4
00000080:00200000:0.0:1459365869.220777:0:62879:0:(vvp_io.c:133:vvp_io_fini()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00000080:00200000:0.0:1459365869.220778:0:62879:0:(file.c:1242:ll_file_io_generic()) iotype: 1, result: 198
We can see the NOQUOTA flag is set and the client switched to sync write (see the 1-page request "req@ffff880bdb641800"). I didn't see any error in either the client or the OST logs. |
| Comment by Mahmoud Hanafi [ 01/Apr/16 ] |
|
You can see that after the first write the file is closed at:
00800000:00000002:0.0:1459365868.977548:0:62879:0:(lmv_obd.c:1725:lmv_close()) CLOSE [0x200000bd1:0x4df:0x0]
then reopened at:
00000080:00200000:0.0:1459365868.977899:0:62879:0:(namei.c:527:ll_lookup_it()) VFS Op:name=vr014.hdf,dir=144115238826934470/33554443(ffff88061af76678),intent=open
If the file is closed after the first write, I wouldn't think the two writes could clobber each other. |
| Comment by Mahmoud Hanafi [ 06/Apr/16 ] |
|
It was pointed out to me that when the file is closed it doesn't necessarily mean the file is synced, so the data could still be sitting in cache. Maybe the second sync write is not invalidating the cached data, and when the file is closed it overwrites the header from cache. |
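One way to test that hypothesis (a suggestion, not something from the ticket): force writeback before close and see whether the corruption still occurs. A minimal helper in C, purely as a diagnostic sketch:

#include <stdio.h>
#include <unistd.h>

/* hypothetical diagnostic: flush dirty pages to the OST before close,
 * so no cached copy of the header can be written back afterwards */
static int close_synced(int fd)
{
        if (fsync(fd) < 0)
                perror("fsync");
        return close(fd);
}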
| Comment by Niu Yawei (Inactive) [ 07/Apr/16 ] |
The second sync write will update the page cache first and then write back, so the second write won't be overwritten in the way you described. From reading the code, I suspect that the second sync write didn't write the correct range of the page, but I failed to reproduce it. I posted a tentative patch and asked the clio experts to take a look. Thank you. |
| Comment by Jay Lan (Inactive) [ 08/Apr/16 ] |
|
This note is for documentation purposes. Dug up from an email from Peter Jones pointing to a potential fix: |
| Comment by Jinshan Xiong (Inactive) [ 10/Apr/16 ] |
|
Hi Mahmoud, when you saw an empty file header, is the file range [4, 202) empty, or [0, 4095)? Now that you have a reproducer, I would suggest dumping the file header, say a few bytes of the object range [4, 202), in the ptlrpc layer to make sure the correct content is in the page. |
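As a quick userspace cross-check of the first question (not the in-kernel ptlrpc dump being suggested), the suspect ranges of a corrupted file can be inspected directly, e.g.:

# dump the header range [4, 202) of the corrupted file
od -A d -t x1 -j 4 -N 198 vr014.hdf
# dump the whole first page [0, 4096)
od -A d -t x1 -N 4096 vr014.hdf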
| Comment by Niu Yawei (Inactive) [ 11/Apr/16 ] |
|
I reproduced the problem locally, and the fix on http://review.whamcloud.com/#/c/19329/ has been updated according to Jinshan's suggestion. |
| Comment by Mahmoud Hanafi [ 11/Apr/16 ] |
|
Is there a possibility of this corruption happening outside of the soft quota limit? For example, if the user does a sync write in their code. |
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ] |
|
That wouldn't be possible from what I have seen. The bug is that once a sync write has occurred due to running out of quota, a later cached write on the same page would use stale page information, and therefore partial data could be written. |
| Comment by Niu Yawei (Inactive) [ 12/Apr/16 ] |
Mahmoud, you needn't worry about sync writes from applications; those go through a different code path. This kind of corruption can happen when over quota or when the client runs out of grant, and the I/O pattern that triggers the corruption is (we can see it exactly in the log):
|
| Comment by Jay Lan (Inactive) [ 19/Apr/16 ] |
|
Niu, http://review.whamcloud.com/#/c/19329/ |
| Comment by Niu Yawei (Inactive) [ 22/Apr/16 ] |
|
The problem is in b2_5_fe only. That part of the code has been heavily changed by |
| Comment by Ann Koehler (Inactive) [ 29/Apr/16 ] |
|
We may have hit this same corruption on 2.5.2. Can http://review.whamcloud.com/#/c/19329 be made public so I can verify? Thanks. |
| Comment by Mahmoud Hanafi [ 14/Jul/16 ] |
|
You can close this ticket |
| Comment by Peter Jones [ 14/Jul/16 ] |
|
Thanks Mahmoud |