[LU-7956] Over softquota file corruption Created: 30/Mar/16  Updated: 14/Jul/16  Resolved: 14/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File debug.corruption.1.found.gz     File strace.corruptedfile     Text File vr014.corruption.tgz    
Issue Links:
Related
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

We have a confirmed case of exceeding the soft quota causing file corruption. We had other reports in the past but couldn't confirm them until now. The test job writes 418 HDF4 files, and about 10-15 of them end up corrupted. We confirmed this on 2 different filesystems. The files are single-striped and are written from a single MPI rank.

In the attached debug file, one of the corrupted files was written to nocache-OST0047 at 1459304302.
Also attaching an strace of the corrupted file's IO.



 Comments   
Comment by Niu Yawei (Inactive) [ 30/Mar/16 ]

Could you verify whether the grace period expired and some writes failed with -EDQUOT? I didn't find anything abnormal in the provided log; could you collect logs on the client, MDT, and the other OSSes as well? Thanks.

Comment by Mahmoud Hanafi [ 30/Mar/16 ]

The grace period is not expired; during my testing I was setting the soft quota and it had 2 weeks left. The code doesn't get a -EDQUOT. I know this to be the case because when I tested lowering my hard quota, the code did fail to write and exited with an error.

Comment by Mahmoud Hanafi [ 30/Mar/16 ]

These tests were run on 2.5.3 clients and a 2.7.1 server.

Attaching logs from all job clients, the MDS, and the OSS. There is only one OSS with 6 OSTs. The corrupted file is called vr014.hdf.

[mhanafi@pfe24:/fsnocache/mhanafi/mas_fstest_template]$ lfs getstripe vr014.hdf
vr014.hdf
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 47
obdidx objid objid group
47 71081 0x115a9 0

[mhanafi@pfe24:/fsnocache/mhanafi/mas_fstest_template]$ lfs path2fid vr014.hdf
[0x200000bd1:0x4df:0x0]

Comment by Niu Yawei (Inactive) [ 31/Mar/16 ]

When over the soft quota, the client may turn to sync writes. I suspect there could be some defect in the sync write path (either in Lustre or in the app) that causes the corruption.

Could you unset the soft limit but set fail_loc to 0x411 on all clients (e.g. lctl set_param fail_loc=0x411, which forces sync writes on the client) to see if the problem can be reproduced that way? Thanks.

Comment by Mahmoud Hanafi [ 31/Mar/16 ]

Setting fail_loc didn't produce any corrupted files. We have narrowed down the write that is not making it to disk. From the attached strace it would be the second-to-last write.

53385 lseek(14, -4194300, SEEK_CUR)     = 4
53385 write(14, "\0\20\0\0\0\0\0\36\0\1\0\0\0\312\0\0\0\\\2\276\0\2\0\0\1&\3\230\207P\0j"..., 198) = 198   <<<<<<------ THIS ONE
53385 lseek(14, 58720256, SEEK_SET)     = 58720256

The code tries to re-write/update the data header in the file, and that write is not making it to disk.

And from the attached debug log, it is this section:

00000080:00200000:0.0:1459365869.218185:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365869.218189:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365869.218190:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365869.218191:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365869.218192:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365869.218198:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365869.218201:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365869.218202:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365869.218205:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365869.218206:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365869.218207:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365869.218209:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write
Comment by Niu Yawei (Inactive) [ 01/Apr/16 ]

From the trace we can see the header [4, 202) was updated twice: the first time it's written into cache (shown in the log you posted above), the second time it's written back synchronously. Perhaps the second write overwrote the first one? Or is something wrong with the first write not making it to disk?

Comment by Mahmoud Hanafi [ 01/Apr/16 ]

The file is created, an empty header is written, and the file is closed. The file is reopened, the data is written, and finally the header is updated with pointers to the written data. But we end up with an empty header and the data in the file. I don't see a second write when the file is reopened the second time.
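
For reference, a minimal C sketch of that write pattern, assuming the offsets and sizes from the strace above (a 198-byte header at offset 4 and a ~56 MiB data body); the payload bytes are placeholders, not real HDF4 content:

/* repro.c - sketch of the failing write pattern; build with: cc -o repro repro.c */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char data[1 << 20];          /* 1 MiB of dummy payload */
    char header[198];
    int i, fd;

    /* 1. Create the file, write an empty (zeroed) header, close. */
    memset(header, 0, sizeof(header));
    fd = open("vr014.hdf", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0 || pwrite(fd, header, sizeof(header), 4) != (ssize_t)sizeof(header))
        exit(1);
    close(fd);

    /* 2. Reopen and write the data body (~56 MiB in the real job). */
    fd = open("vr014.hdf", O_WRONLY);
    if (fd < 0 || lseek(fd, 4 + (off_t)sizeof(header), SEEK_SET) < 0)
        exit(1);
    for (i = 0; i < 56; i++)
        if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data))
            exit(1);

    /* 3. Seek back and rewrite the header, as in the strace; this is the
     * write that never made it to disk when the job was over soft quota. */
    memset(header, 0xca, sizeof(header));   /* placeholder "real" header */
    if (lseek(fd, 4, SEEK_SET) < 0 ||
        write(fd, header, sizeof(header)) != (ssize_t)sizeof(header))
        exit(1);
    close(fd);
    return 0;
}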

Comment by Niu Yawei (Inactive) [ 01/Apr/16 ]

This is the first write to [4, 202):

00000080:00200000:0.0:1459365868.977428:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365868.977431:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365868.977433:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 3, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365868.977433:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365868.977434:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365868.977439:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365868.977441:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365868.977442:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365868.977446:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365868.977446:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365868.977447:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365868.977449:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365868.977452:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 3, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365868.977452:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write
00000008:00000002:0.0:1459365868.977453:0:62879:0:(osc_io.c:211:osc_page_touch_at()) stripe KMS not increasing 294->202 294
00000080:00200000:0.0:1459365868.977459:0:62879:0:(vvp_io.c:133:vvp_io_fini()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00000080:00200000:0.0:1459365868.977460:0:62879:0:(file.c:1242:ll_file_io_generic()) iotype: 1, result: 198

We can see it's a cached write.

This is the second write to [4, 202):

00000080:00200000:0.0:1459365869.218185:0:62879:0:(file.c:2754:ll_file_seek()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), to=4=0x4(1)
00000080:00200000:0.0:1459365869.218189:0:62879:0:(file.c:1142:ll_file_io_generic()) file: vr014.hdf, type: 1 ppos: 4, count: 198
00000020:00200000:0.0:1459365869.218190:0:62879:0:(cl_io.c:236:cl_io_rw_init()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000020:00200000:0.0:1459365869.218191:0:62879:0:(cl_io.c:236:cl_io_rw_init()) io range: 1 [4, 202) 0 0
00000080:00200000:0.0:1459365869.218192:0:62879:0:(vvp_io.c:1177:vvp_io_init()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00000002:0.0:1459365869.218198:0:62879:0:(lov_offset.c:233:lov_stripe_intersects()) [4->201] -> [(0) 4->201 (0)]
00020000:00200000:0.0:1459365869.218201:0:62879:0:(lov_io.c:429:lov_io_iter_init()) shrink: 0 [4, 202)
00000080:00200000:0.0:1459365869.218202:0:62879:0:(lcommon_cl.c:733:ccc_io_one_lock_index()) lock: 2 [0, 0]
00000080:00200000:0.0:1459365869.218205:0:62879:0:(vvp_io.c:615:vvp_io_write_start()) write: [4, 202)
00000080:00200000:0.0:1459365869.218206:0:62879:0:(xattr.c:461:ll_getxattr()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138), xattr security.capability
00000080:00200000:0.0:1459365869.218207:0:62879:0:(xattr.c:302:ll_getxattr_common()) VFS Op:inode=144115238826935519/33554443(ffff880707b8c138)
00000080:00200000:0.0:1459365869.218209:0:62879:0:(rw.c:202:ll_cl_init()) 0@[0x200000bd1:0x4df:0x0] -> 0 ffff880207b47530 ffff88080f594768
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) header@ffff8802aa599070[0x0, 14766, [0x200000bd1:0x4df:0x0] hash]
00000080:00000002:0.0:1459365869.218212:0:62879:0:(vvp_io.c:1015:vvp_io_commit_write()) commiting page write
00020000:00000002:0.0:1459365869.218214:0:62879:0:(lov_merge.c:75:lov_merge_lvb_kms()) MDT ID 0x4df:3025 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0
00020000:00000002:0.0:1459365869.218215:0:62879:0:(lov_merge.c:109:lov_merge_lvb_kms()) MDT ID 0x4df:3025 on OST[47]: s=60436399 m=1459365869 a=1459365869 c=1459365869 b=88072
00000001:04000000:0.0:1459365869.218216:0:62879:0:(osc_quota.c:64:osc_quota_chkdq()) chkdq found noquota for user 11312
00000008:04000000:20.0:1459365869.219182:0:5838:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1100
00000008:00000002:20.0:1459365869.219185:0:5838:0:(osc_request.c:1960:brw_interpret()) request ffff880f230c2400 aa ffff880f230c26e0 rc 0
00000020:00000002:20.0:1459365869.219335:0:5838:0:(obdo.c:70:obdo_from_inode()) valid 70c, new time 1459365869/1459365869
00000008:00000002:20.0:1459365869.219346:0:5838:0:(osc_request.c:2225:osc_build_rpc()) @@@ 1 pages, aa ffff880bdb641ae0. now 0r/9w in flight  req@ffff880bdb641800 x1528598526163360/t0(0) o4->nocache-OST002f-osc-ffff88082f125000@10.151.26.123@o2ib:6/4 lens 488/448 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00100000:20.0:1459365869.219356:0:5838:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_29:09656603-58f6-edb5-effe-53c13e5a5741:5838:1528598526163252:10.151.26.123@o2ib:4
00000100:00100000:24.0:1459365869.219357:0:5829:0:(client.c:1489:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_20:09656603-58f6-edb5-effe-53c13e5a5741:5829:1528598526163360:10.151.26.123@o2ib:4
00000008:04000000:37.0:1459365869.219641:0:5837:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000001:04000000:37.0:1459365869.219644:0:5837:0:(osc_quota.c:133:osc_quota_setdq()) nocache-OST002f-osc-ffff88082f125000: setdq to remove for user 11312 (00000000deadbeef)
00000008:00000002:37.0:1459365869.219646:0:5837:0:(osc_request.c:1960:brw_interpret()) request ffff881031f65000 aa ffff881031f652e0 rc 0
00000020:00000002:37.0:1459365869.220048:0:5837:0:(obdo.c:70:obdo_from_inode()) valid 70c, new time 1459365869/1459365869
00000008:04000000:22.0:1459365869.220489:0:5830:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:22.0:1459365869.220493:0:5830:0:(osc_request.c:1960:brw_interpret()) request ffff8810461abc00 aa ffff8810461abee0 rc 0
00000100:00100000:22.0:1459365869.220696:0:5830:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_21:09656603-58f6-edb5-effe-53c13e5a5741:5830:1528598526163268:10.151.26.123@o2ib:4
00000008:04000000:22.0:1459365869.220702:0:5830:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:22.0:1459365869.220704:0:5830:0:(osc_request.c:1960:brw_interpret()) request ffff8810461ab400 aa ffff8810461ab6e0 rc 0
00000008:00000002:37.0:1459365869.220712:0:5837:0:(osc_request.c:2225:osc_build_rpc()) @@@ 254 pages, aa ffff880f230c26e0. now 0r/8w in flight  req@ffff880f230c2400 x1528598526163364/t0(0) o4->nocache-OST002f-osc-ffff88082f125000@10.151.26.123@o2ib:6/4 lens 504/448 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00100000:33.0:1459365869.220722:0:5846:0:(client.c:1489:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_37:09656603-58f6-edb5-effe-53c13e5a5741:5846:1528598526163364:10.151.26.123@o2ib:4
00000100:00100000:24.0:1459365869.220734:0:5829:0:(client.c:2523:ptlrpc_free_committed()) nocache-OST002f-osc-ffff88082f125000: committing for last_committed 8589936700 gen 1
00000008:04000000:24.0:1459365869.220751:0:5829:0:(osc_request.c:1533:osc_brw_fini_request()) setdq for [11312 1125] with valid 0x6f184fb9, flags 1000
00000008:00000002:24.0:1459365869.220754:0:5829:0:(osc_request.c:1960:brw_interpret()) request ffff880bdb641800 aa ffff880bdb641ae0 rc 0
00000008:00000002:0.0:1459365869.220764:0:62879:0:(osc_io.c:211:osc_page_touch_at()) stripe KMS not increasing 60436399->202 60436399
00000100:00100000:24.0:1459365869.220765:0:5829:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_20:09656603-58f6-edb5-effe-53c13e5a5741:5829:1528598526163360:10.151.26.123@o2ib:4
00000080:00200000:0.0:1459365869.220777:0:62879:0:(vvp_io.c:133:vvp_io_fini()) [0x200000bd1:0x4df:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00000080:00200000:0.0:1459365869.220778:0:62879:0:(file.c:1242:ll_file_io_generic()) iotype: 1, result: 198

We can see the NOQUOTA flag is set and the client turned to sync writes (see the 1-page request "req@ffff880bdb641800"). I didn't see any errors in either the client or OST logs.

Comment by Mahmoud Hanafi [ 01/Apr/16 ]

You can see that after the first write the file is closed at

00800000:00000002:0.0:1459365868.977548:0:62879:0:(lmv_obd.c:1725:lmv_close()) CLOSE [0x200000bd1:0x4df:0x0]

then reopened at

00000080:00200000:0.0:1459365868.977899:0:62879:0:(namei.c:527:ll_lookup_it()) VFS Op:name=vr014.hdf,dir=144115238826934470/33554443(ffff88061af76678),intent=open

If the file is closed after the first write, I wouldn't think the 2 writes could clobber each other.

Comment by Mahmoud Hanafi [ 06/Apr/16 ]

It was pointed out to me that when the file is closed it doesn't necessarily mean the file is synced, so the data could still be sitting in cache. Maybe the second sync write is not invalidating the cached data, and when the file is closed it overwrites the header from cache.

Comment by Niu Yawei (Inactive) [ 07/Apr/16 ]

It was pointed out to me that when the file is closed it doesn't necessarily mean the file is synced, so the data could still be sitting in cache. Maybe the second sync write is not invalidating the cached data, and when the file is closed it overwrites the header from cache.

The second sync write will update the page cache first, then write back, so the second write won't be overwritten like you described.

From reading the code, I suspect that the second sync write didn't write the correct range of the page, but I failed to reproduce it. I posted a tentative patch and asked the clio experts to take a look. Thank you.

Comment by Jay Lan (Inactive) [ 08/Apr/16 ]

This note is for documentation purposes.

Dug this up from an email from Peter Jones pointing to a potential fix:
" there is a candidate fix (http://review.whamcloud.com/#/c/19329/) undergoing testing and reviews"

Comment by Jinshan Xiong (Inactive) [ 10/Apr/16 ]

Hi Mahmoud,

When you saw an empty file header, is the file range [4, 202), or [0, 4095) empty?

Now that you have a reproducer, I would suggest dumping the file header, say a few bytes from the object range [4, 202), in the ptlrpc layer to make sure the correct content is in the page.
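
A quick userspace check along the same lines (a sketch only; it inspects the suspect range [4, 202) of a finished file on disk rather than the page content in the ptlrpc layer as suggested above):

/* chkhdr.c - report whether the header range [4, 202) is still all zeros,
 * i.e. whether the final header update ever reached disk.
 * Usage: ./chkhdr vr014.hdf */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    unsigned char buf[198];
    int i, fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;
    if (pread(fd, buf, sizeof(buf), 4) != (ssize_t)sizeof(buf))
        return 1;
    close(fd);
    for (i = 0; i < (int)sizeof(buf); i++)
        if (buf[i]) {
            printf("non-zero byte at offset %d: header was written\n", 4 + i);
            return 0;
        }
    printf("range [4, 202) is all zeros: header update never hit disk\n");
    return 0;
}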

Comment by Niu Yawei (Inactive) [ 11/Apr/16 ]

I reproduced the problem locally, and the fix on http://review.whamcloud.com/#/c/19329/ has been updated according to Jinshan's suggestion.

Comment by Mahmoud Hanafi [ 11/Apr/16 ]

Is there a possibility of this corruption happening outside of the soft quota limit? For example, if the user does a sync write in their code.

Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ]

That wouldn't be possible from what I have seen. The bug is that once a sync write has occurred due to running out of quota, a later cached write on the same page will use stale page information, and therefore partial data can be written.

Comment by Niu Yawei (Inactive) [ 12/Apr/16 ]

Is there a possibility of this corruption happening outside of the soft quota limit? For example, if the user does a sync write in their code.

Mahmoud, you needn't worry about sync writes from applications; those go through a different code path. This kind of corruption can happen when over quota or out of grant on the client, and the IO pattern that triggers the corruption is (we can see it exactly in the log; see also the sketch after this list):

  • A write to extend the file; because the client is over quota (or out of grant), the write is turned into a 'sync' write internally;
  • More writes to bump the file size further (no matter whether sync or cached);
  • A write to the same page as the first write, which is turned into a 'sync' write too. Because of the bug, the stale page offset from the first write is used, and corruption happens.
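
A simplified userspace model of that failure mode (an illustration only, not the actual clio code; the per-page from/to bookkeeping below is a stand-in for the page state that goes stale):

/* stale_range.c - toy model: the per-page pending-write range goes stale
 * after the first sync write, so the second sync write to the same page
 * transfers nothing and the OST keeps the old (empty) header bytes. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page_state {
    char data[PAGE_SIZE];   /* client page-cache copy */
    int from, to;           /* pending intra-page range to transfer */
};

static char ost_page[PAGE_SIZE];    /* what actually lands on the OST */

static void sync_flush(struct page_state *pg)
{
    memcpy(ost_page + pg->from, pg->data + pg->from, pg->to - pg->from);
    pg->from = pg->to = 0;          /* range consumed by the transfer */
}

int main(void)
{
    struct page_state pg = { .from = 0, .to = 0 };

    /* Step 1: first sync write of [4, 202), the empty header. */
    memset(pg.data + 4, 0x00, 198);
    pg.from = 4; pg.to = 202;
    sync_flush(&pg);

    /* (Step 2, the file-extending writes, is omitted from this toy.) */

    /* Step 3: second sync write puts the real header into the page cache,
     * but the pending range is not re-established (stale state left over
     * from step 1), so the flush transfers zero bytes. */
    memset(pg.data + 4, 0xca, 198);
    sync_flush(&pg);                /* sends [0, 0): nothing */

    printf("OST byte at offset 4: 0x%02x (expected 0xca)\n",
           (unsigned char)ost_page[4]);
    return 0;
}
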
Comment by Jay Lan (Inactive) [ 19/Apr/16 ]

Niu,

http://review.whamcloud.com/#/c/19329/
was written for b2_5_fe. The mod affected only two lines, but the function the changes fell in was vvp_io_commit_write(), and that function does not exist in b2_7_fe. We are moving to b2_7_fe next month. Could you provide a b2_7_fe backport? Thanks!

Comment by Niu Yawei (Inactive) [ 22/Apr/16 ]

The problem is in b2_5_fe only. That part of the code was heavily changed by LU-3321, and the bug was fixed by those changes. The bug doesn't exist in b2_7_fe, EE2.4, EE3.0, or master.

Comment by Ann Koehler (Inactive) [ 29/Apr/16 ]

We may have hit this same corruption on 2.5.2. Can http://review.whamcloud.com/#/c/19329 be made public so I can verify? Thanks.

Comment by Mahmoud Hanafi [ 14/Jul/16 ]

You can close this ticket.

Comment by Peter Jones [ 14/Jul/16 ]

Thanks Mahmoud
