Lustre / LU-1442

File corrupt with 1MiB-aligned 4k regions of zeros


    Description

      A data integrity test run periodically by our storage group found two occurrences of corrupt files written to Lustre. The original files contain 300 MB of random data. The corrupt copies contain several 4096-byte regions of zeros aligned on 1 MiB boundaries. The two corrupt files were written to the same filesystem from two different login nodes on the same cluster within five minutes of each other. The stripe count is 100.

      The client application is a parallel ftp client reading data out of our storage archive into Lustre. The test checks for differences between the restored files and the original copies. For a 300 MB file it uses 4 threads, which issue four 64 MB pwrite() calls and one 44 MB pwrite(). It is possible that a pwrite() gets restarted due to SIGUSR2 from a master process, though we don't know whether this occurred in the cases that produced corruption. This test has seen years of widespread use on all of our clusters, and this is the first reported incident of this type of corruption, so we can characterize the frequency as rare.
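
      For illustration only, a minimal sketch of that write pattern follows. It is not the actual ftp client; the file name, buffer contents, and one-thread-per-write layout are assumptions, and the wrapper simply restarts pwrite() after EINTR (e.g. from SIGUSR2) and after short writes.

      /* Hypothetical sketch of the restore client's write pattern: a 300 MB
       * file written as four 64 MB pwrite()s plus one 44 MB pwrite().  The
       * description maps these onto 4 threads; one thread per write is used
       * here for brevity.  Compile with -pthread. */
      #include <errno.h>
      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      #define MB (1024 * 1024)

      struct chunk { int fd; const char *buf; size_t len; off_t off; };

      /* Restart on EINTR (e.g. SIGUSR2 from the master process) and on short writes. */
      static int full_pwrite(int fd, const char *buf, size_t len, off_t off)
      {
          size_t done = 0;
          while (done < len) {
              ssize_t n = pwrite(fd, buf + done, len - done, off + done);
              if (n < 0) {
                  if (errno == EINTR)
                      continue;
                  return -1;
              }
              done += n;
          }
          return 0;
      }

      static void *writer(void *arg)
      {
          struct chunk *c = arg;
          if (full_pwrite(c->fd, c->buf, c->len, c->off) < 0)
              perror("pwrite");
          return NULL;
      }

      int main(void)
      {
          static char buf[64 * MB];          /* stand-in for the restored data */
          struct chunk c[5];
          pthread_t t[5];
          int fd = open("restored.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

          if (fd < 0) { perror("open"); return 1; }
          memset(buf, 'x', sizeof(buf));
          for (int i = 0; i < 5; i++) {
              c[i] = (struct chunk){ fd, buf, (i < 4) ? 64 * MB : 44 * MB,
                                     (off_t)i * 64 * MB };
              pthread_create(&t[i], NULL, writer, &c[i]);
          }
          for (int i = 0; i < 5; i++)
              pthread_join(t[i], NULL);
          close(fd);
          return 0;
      }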

      When I examine an OST object containing a corrupt region, I see that no block is allocated for it (in this case, logical block 256 is missing).

      # pigs58 /root > debugfs -c -R "dump_extents /O/0/d$((30205348 % 32))/30205348" /dev/sdb
      debugfs 1.41.12 (17-May-2010)
      /dev/sdb: catastrophic mode - not reading inode or group bitmaps
      Level Entries       Logical              Physical Length Flags
       0/ 0   1/  3     0 -   255 813140480 - 813140735    256
       0/ 0   2/  3   257 -   511 813142528 - 813142782    255
       0/ 0   3/  3   512 -   767 813143040 - 813143295    256
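
      As a client-side cross-check (my suggestion, not part of the original report), the same kind of hole can also be located from userspace with lseek(SEEK_HOLE)/lseek(SEEK_DATA), assuming the client kernel and filesystem support hole reporting; otherwise scanning for all-zero 4 KiB blocks works as well. A minimal sketch:

      /* Hypothetical hole scanner: walk a file with SEEK_HOLE/SEEK_DATA and
       * print each hole's offset and length.  Complements the debugfs extent
       * dump above.  Requires hole-reporting support in the kernel/filesystem. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          if (argc != 2) {
              fprintf(stderr, "usage: %s FILE\n", argv[0]);
              return 1;
          }
          int fd = open(argv[1], O_RDONLY);
          if (fd < 0) { perror("open"); return 1; }

          off_t end = lseek(fd, 0, SEEK_END);
          off_t pos = 0;
          while (pos < end) {
              off_t hole = lseek(fd, pos, SEEK_HOLE);   /* next hole, or EOF */
              if (hole < 0 || hole >= end)
                  break;
              off_t data = lseek(fd, hole, SEEK_DATA);  /* end of that hole */
              if (data < 0)
                  data = end;                           /* hole runs to EOF */
              printf("hole at offset %lld, length %lld\n",
                     (long long)hole, (long long)(data - hole));
              pos = data;
          }
          close(fd);
          return 0;
      }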
      

      Finally, the following server-side console messages appeared at the same time one of the corrupted files was written, and mention the NID of the implicated client. The consoles of the OSTs containing the corrupt objects were quiet at the time.

      May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0
      May 17 01:06:08 pigs-mds1 kernel: Lustre: 20418:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
      May 17 01:06:08 pigs-mds1 kernel: Lustre: All locks stolen from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
      

      Attachments

        Issue Links

          Activity


            Christopher Morrone (Inactive) added a comment:

            It's been on our test systems and hasn't caused any problems that I am aware of. It is not installed in production yet. It might make it into a production release in a couple of weeks.

            We've seen the LU-1680 failures on the orion branch, and just recently pulled this LU-1442 patch into there. We'll keep an eye out for failures on orion when we upgrade to that version.

            Jinshan Xiong (Inactive) added a comment:

            Hi Chris Gearing, sorry, I meant Christopher Morrone, since LLNL is verifying whether the patch fixes the data corruption problem. This problem occurs rarely, so it may take months to verify the fix.

            Chris Gearing (Inactive) added a comment:

            I don't know; any pushes to Gerrit that have rebased on this patch will have been run with it, but I cannot know who has rebased and pushed.

            Jinshan Xiong (Inactive) added a comment:

            Hi Chris, how long have you been running the test with this patch?

            Ian Colle (Inactive) added a comment:

            The master version of the patch has been merged: http://review.whamcloud.com/#change,3447

            Christopher Morrone (Inactive) added a comment:

            Jinshan, I've added http://review.whamcloud.com/3194 to our branch to include in the next round of testing.
            Ned Bass (Inactive) added a comment (edited):

            Attached test script. No cache flush or other operations are done between the write and verification. The entire test is run on the same client. The process is basically:

            1. write random patterns to a file
            2. ftp the file to archival storage
            3. retrieve a copy from archival storage
            4. compare the copy to the original with 'cmp' to check for corruption (see the sketch after this comment)
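
            As a hedged illustration of step 4 above (this is not the attached test script; the block size, alignment check, and file handling are assumptions), a verifier could also report where the corruption sits: 4096-byte regions that differ and are all zeros in the copy, and whether they fall on 1 MiB boundaries.

            /* Hypothetical verifier: compare the original and the restored copy
             * in 4 KiB blocks, and flag blocks that differ and are all zeros in
             * the copy, printing the offset and its 1 MiB alignment. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            #define BLK 4096

            int main(int argc, char **argv)
            {
                if (argc != 3) {
                    fprintf(stderr, "usage: %s ORIGINAL COPY\n", argv[0]);
                    return 2;
                }
                int fd1 = open(argv[1], O_RDONLY);
                int fd2 = open(argv[2], O_RDONLY);
                if (fd1 < 0 || fd2 < 0) { perror("open"); return 2; }

                static char a[BLK], b[BLK], zero[BLK];   /* zero[] stays all zeros */
                off_t off = 0;
                int corrupt = 0;
                for (;;) {
                    ssize_t n1 = pread(fd1, a, BLK, off);
                    ssize_t n2 = pread(fd2, b, BLK, off);
                    if (n1 <= 0 || n2 <= 0)
                        break;
                    if (n1 == n2 && memcmp(a, b, n1) != 0 && memcmp(b, zero, n1) == 0) {
                        printf("zeroed block in copy at %lld (%s)\n", (long long)off,
                               off % (1 << 20) == 0 ? "1MiB aligned" : "not 1MiB aligned");
                        corrupt = 1;
                    }
                    off += n1;
                }
                return corrupt;
            }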

            Jinshan Xiong (Inactive) added a comment:

            Hi Ned, do you know how the application detected the data corruption? Does it just read the data back on the same client, or were some operations, for example flushing cached pages, done between the write and the verification?

            Jinshan Xiong (Inactive) added a comment:

            While trying to reproduce this with fail_loc today I found something new. Although a dirty page will not be added to the osc's cache if osc_page_cache_add() is interrupted by a signal, the page will still be written back by the kernel flush daemon. That said, the I/O pattern can be exactly what you have seen (a one-page gap in the block allocation), but data corruption is unexpected. I still need to investigate this, and I will focus on whether there is a code path that causes the dirty page to be discarded.

            I'm pretty sure the issue you have seen is related to the problem I found, which the patch addresses, so please apply the patch and test intensively. Maybe we can find new clues. Thanks.
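
            For anyone trying to provoke the interrupted-write case without fail_loc, a rough userspace harness along the following lines might help. It is purely a sketch of an assumed approach, not something from this ticket; note that on a local filesystem pwrite() to a regular file is rarely interruptible, so it would need to run against a Lustre mount.

            /* Hypothetical reproducer harness: install a no-op SIGUSR2 handler
             * without SA_RESTART so pwrite() can return EINTR, write a large
             * region from the main thread, and pepper it with SIGUSR2 from a
             * second thread.  The resulting file can then be checked with the
             * verifier or hole scanner sketched earlier.  Compile with -pthread. */
            #include <errno.h>
            #include <fcntl.h>
            #include <pthread.h>
            #include <signal.h>
            #include <stdio.h>
            #include <string.h>
            #include <unistd.h>

            #define LEN (64 * 1024 * 1024)

            static pthread_t writer_tid;
            static volatile sig_atomic_t done;

            static void usr2(int sig) { (void)sig; }   /* just interrupt the syscall */

            static void *pester(void *arg)
            {
                (void)arg;
                while (!done) {
                    pthread_kill(writer_tid, SIGUSR2);
                    usleep(1000);
                }
                return NULL;
            }

            int main(void)
            {
                static char buf[LEN];
                struct sigaction sa = { .sa_handler = usr2 };   /* note: no SA_RESTART */
                sigaction(SIGUSR2, &sa, NULL);
                memset(buf, 'x', sizeof(buf));

                int fd = open("reproducer.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                writer_tid = pthread_self();
                pthread_t t;
                pthread_create(&t, NULL, pester, NULL);

                size_t off = 0;
                while (off < LEN) {                     /* restart on EINTR/short write */
                    ssize_t n = pwrite(fd, buf + off, LEN - off, off);
                    if (n < 0 && errno == EINTR)
                        continue;
                    if (n <= 0)
                        break;
                    off += n;
                }
                done = 1;
                pthread_join(t, NULL);
                close(fd);
                return 0;
            }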

            Ned Bass (Inactive) added a comment:

            Great, we'll give the patch a try. We've had three corruptions in about two months, and we haven't found a way to reproduce the problem easily, so it may take a few months with no new corruptions to gain some confidence in the fix.

            Jinshan Xiong (Inactive) added a comment:

            Hi Ned, will you please try this patch: http://review.whamcloud.com/3194 ? It may fix the corruption issue.

            After the corruption issue is fixed, I'll start working on the wrong-opc issue if it's bothering you.

            People

              Jinshan Xiong (Inactive)
              Ned Bass (Inactive)
              Votes: 0
              Watchers: 17
