[LU-13899] IOR data corruption detected during automated lnet fofb testing Created: 10/Aug/20 Updated: 01/Sep/20 Resolved: 01/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Seeing data corruption in two IOR jobs while running the lnet failover test "Regression write/verify DNE2 PFL". Both failures show data from the immediately previous run, so this is likely a cache-related issue. fsync-related warnings are also logged.

CL_IOR_all_wr_20iter_1666K_rand
access = file-per-process
pattern = strided (33 segments)
ordering in a file = random offsets
ordering inter file = constant task offsets = 1
clients = 48 (4 per node)
repetitions = 20
xfersize = 1.63 MiB
blocksize = 27.66 MiB
aggregate filesize = 42.78 GiB

read 1316.54 28322 1666.00 0.001729 33.27 0.010429 33.28 16 XXCEL
Using Time Stamp 1594461626 (0x5f098dba) for Data Signature
Commencing write performance test. Sat Jul 11 05:00:26 2020
read 1852.87 28322 1666.00 0.001450 23.64 0.004490 23.64 17 XXCEL
Using Time Stamp 1594461701 (0x5f098e05) for Data Signature
Commencing write performance test. Sat Jul 11 05:01:41 2020
WARNING: cannot perform fsync on file.
WARNING: cannot perform fsync on file.
write 197.50 28322 1666.00 0.001447 221.82 0.421875 221.82 18 XXCEL
Verifying contents of the file(s) just written. Sat Jul 11 05:05:23 2020
[6] At transfer buffer #3, index #0 (file byte offset 235425792):
[6] Expected: 0x0000000e5f098e05
[6] Actual: 0x0000000e5f098dba
[6] At transfer buffer #3, index #2 (file byte offset 235425808):
[6] Expected: 0x0000000e5f098e05
[6] Actual: 0x0000000e5f098dba

IOR job: -a POSIX -i 20 -w -r -W -t 1666K -b 28322K -C -e -k -vv -E -F -q -s 33 -x -z
File in question: /lus/snx11205/ostest.vers/alsorun.20200711032404.2543.pollux-p4/CL_IOR_all_wr_20iter_1666K_rand.2.9t3jzC.1594459731/CL_IOR_all_wr_20iter_1666K_rand/IORfile.00000004 |
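For context, IOR stamps every transfer buffer with a run-specific data signature (the run's start timestamp, visible in the "Using Time Stamp" lines) and compares it word by word during the verify pass. The Expected/Actual pairs above show that blocks written with signature 0x5f098e05 (iteration 18) read back the previous iteration's signature 0x5f098dba, i.e. the reads were served stale data. A minimal sketch of this style of write/verify signature check; the word layout (index in the high 32 bits, timestamp in the low 32 bits) and the names are illustrative assumptions, not IOR's actual code:

#include <stdint.h>
#include <stdio.h>

/* Fill a transfer buffer with a signature derived from the run's start
 * timestamp, in the spirit of IOR's data-signature stamping. */
static void stamp_buffer(uint64_t *buf, size_t nwords, uint32_t timestamp)
{
	for (size_t i = 0; i < nwords; i++)
		buf[i] = ((uint64_t)i << 32) | timestamp;
}

/* Re-read verification: compare each word against the expected signature
 * and report mismatches the way the verify pass above does. */
static int verify_buffer(const uint64_t *buf, size_t nwords,
			 uint32_t timestamp, long long file_offset)
{
	int errors = 0;

	for (size_t i = 0; i < nwords; i++) {
		uint64_t expected = ((uint64_t)i << 32) | timestamp;

		if (buf[i] != expected) {
			printf("At index #%zu (file byte offset %lld):\n"
			       "  Expected: 0x%016llx\n"
			       "  Actual:   0x%016llx\n",
			       i, file_offset + (long long)(i * sizeof(*buf)),
			       (unsigned long long)expected,
			       (unsigned long long)buf[i]);
			errors++;
		}
	}
	return errors;
}

A stale-cache failure shows up exactly as in the report: the Actual word still carries the previous run's timestamp in its low 32 bits.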
| Comments |
| Comment by Alexander Zarochentsev [ 10/Aug/20 ] |
|
Further debugging showed the following:

1. In a crash dump, a few hundred of the ~233k cached pages associated with the file's mapping (ffff8803f2c10c50) carry the error flag, yet all of them are still uptodate:

crash-7.2.8> kmem -p | grep ffff8803f2c10c50 | awk '{print $7}' | sort | uniq -c
    491 error,referenced,uptodate,lru,private
      1 error,uptodate,lru,active,private
 232875 referenced,uptodate,lru,private
     11 referenced,uptodate,private
    279 uptodate,lru,active,private
crash-7.2.8>
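The error+uptodate combination is the key observation: the page-cache read path decides purely on the uptodate flag, so such a page keeps satisfying reads from its stale cached contents instead of forcing a refetch from the OST. A simplified sketch of that read-side behaviour (illustrative; not the actual Lustre code):

#include <linux/pagemap.h>

/* Simplified cached-read lookup: a page that is uptodate is returned
 * as-is, even when PG_error is also set.  A failed write that leaves
 * the page uptodate therefore keeps serving old data to readers. */
static struct page *lookup_cached_page(struct address_space *mapping,
				       pgoff_t index)
{
	struct page *page = find_get_page(mapping, index);

	if (page && PageUptodate(page))
		return page;	/* served from cache: stale contents win */

	if (page)
		put_page(page);
	return NULL;		/* caller must read the page from the server */
}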
2. A debug patch gave us the error code, -ESTALE (-116), in the write completion path:

18282:0:3564:0:(vvp_page.c:304:vvp_page_completion_write()) page@ffff88017556f200[2 ffff8801f86630e8 0 1 (null)]
00000020:00000001:28.0:1595864091.118282:0:3566:0:(cl_page.c:840:cl_req_type_state()) Process leaving (rc=2 : 2 : 2)
00000080:00008000:25.0:1595864091.118283:0:3564:0:(vvp_page.c:304:vvp_page_completion_write()) completing WRITE with -116
00000020:00000001:28.0:1595864091.118283:0:3566:0:(cl_page.c:925:cl_page_completion()) Process entered
00000080:00040000:25.0:1595864091.118284:0:3564:0:(vvp_page.c:246:vvp_vmpage_error()) LBUG
00000020:00000001:28.0:1595864091.118285:0:3566:0:(cl_page.c:926:cl_page_completion()) page@ffff8801ff17fc00[2 ffff8801f86630e8 2 1 (null)]
00000020:00000001:28.0:1595864091.118286:0:3566:0:(cl_page.c:926:cl_page_completion()) 1 0
00000020:00000001:28.0:1595864091.118287:0:3566:0:(cl_page.c:341:cl_page_state_set0()) Process entered
00000020:00000001:28.0:1595864091.118288:0:3566:0:(cl_page.c:344:cl_page_state_set0()) page@ffff8801ff17fc00[2 ffff8801f86630e8 2 1 (null)]
3. And finally, the -ESTALE comes from process_req_last_xid() on the server:

00000100:00000200:15.0:1596013191.408113:0:11072:0:(service.c:2233:ptlrpc_server_handle_request()) got req 1673498761985920
00000020:00000001:15.0:1596013191.408114:0:11072:0:(tgt_handler.c:706:tgt_request_handle()) Process entered
00000020:00000001:15.0:1596013191.408114:0:11072:0:(tgt_handler.c:637:process_req_last_xid()) Process entered
00000020:00000001:15.0:1596013191.408115:0:11072:0:(tgt_handler.c:679:process_req_last_xid()) Process leaving via out (rc=18446744073709551500 : -116 : 0xffffffffffffff8c)
00000020:00000001:15.0:1596013191.408116:0:11072:0:(tgt_handler.c:691:process_req_last_xid()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)

It appears the -ESTALE is not handled properly on the client. |
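Putting the three steps together: the server's process_req_last_xid() check rejects the write RPC with -ESTALE, vvp_page_completion_write() sees -116, and the client leaves the affected pages cached as uptodate, so later reads return the prior run's data. A hedged sketch of the kind of completion handling that would avoid serving the stale pages; this illustrates the idea only and is not the actual LU-13899 patch (see the Gerrit change below):

#include <linux/pagemap.h>

/* Illustrative async-write completion: on success the page may remain
 * uptodate, but on failure (e.g. -ESTALE from the server's last_xid
 * check) its stale contents must not stay visible to readers. */
static void write_completion(struct page *page, int ioret)
{
	if (ioret == 0) {
		SetPageUptodate(page);
		return;
	}

	/* Record the error on the mapping so a later fsync() fails ... */
	mapping_set_error(page->mapping, ioret);

	/* ... and force the next reader to refetch from the server. */
	ClearPageUptodate(page);
	SetPageError(page);
}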
| Comment by Gerrit Updater [ 10/Aug/20 ] |
|
Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/39612 |
| Comment by Cory Spitz [ 12/Aug/20 ] |
|
zam, shouldn't this data integrity issue carry a higher priority? I think we ought to target landing for 2.14.0. |
| Comment by Etienne Aujames [ 20/Aug/20 ] |
|
Hello, |
| Comment by Alexander Zarochentsev [ 20/Aug/20 ] |
|
> shouldn't this data integrity issue carry a higher priority? I think we ought to target landing for 2.14.0. |
| Comment by Gerrit Updater [ 01/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39612/ |
| Comment by Peter Jones [ 01/Sep/20 ] |
|
Landed for 2.14 |