LU-1442

File corrupt with 1MiB-aligned 4k regions of zeros


    Description

      A data integrity test run periodically by our storage group found two occurrences of corrupt files written to Lustre. The original files contain 300 MB of random data. The corrupt copies contain several 4096-byte regions of zeros aligned on 1 MiB boundaries. The two corrupt files were written to the same filesystem from two different login nodes on the same cluster within five minutes of each other. The stripe count is 100.
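      As a rough illustration of the observed pattern (not the integrity test itself), a scan like the following sketch reports 4 KiB regions of zeros that start on 1 MiB boundaries; since the test files contain random data, any such region indicates corruption:

      # zero_scan.py - illustrative sketch only, not the actual integrity test
      import os, sys

      MIB = 1 << 20
      PAGE = 4096

      with open(sys.argv[1], 'rb') as f:
          size = os.fstat(f.fileno()).st_size
          # only 1 MiB-aligned offsets need to be checked for this pattern
          for offset in range(0, size, MIB):
              f.seek(offset)
              if f.read(PAGE) == b'\0' * PAGE:
                  print("4 KiB of zeros at offset %d (MiB %d)" % (offset, offset // MIB))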

      The client application is a parallel ftp client reading data out of our storage archive into Lustre. The test checks for differences between the restored files and the original copies. For a 300 MB file it uses four threads which issue four 64 MB pwrite() calls and one 44 MB pwrite(). It is possible that a pwrite() gets restarted due to a SIGUSR2 from a master process, though we don't know whether this occurred in the cases that produced the corruption. This test has seen years of widespread use on all of our clusters, and this is the first reported incident of this type of corruption, so we can characterize the frequency as rare.
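      The real application is a parallel ftp client, so the following is only a sketch of the I/O pattern described above (file names and helper names are illustrative); it also shows why a short or restarted pwrite() has to be retried by the caller:

      import os
      from concurrent.futures import ThreadPoolExecutor

      MB = 1024 * 1024
      # 4 x 64 MB plus 1 x 44 MB = 300 MB, as described above
      CHUNKS = [(0, 64 * MB), (64 * MB, 64 * MB), (128 * MB, 64 * MB),
                (192 * MB, 64 * MB), (256 * MB, 44 * MB)]

      def copy_chunk(src_fd, dst_fd, offset, length):
          data = os.pread(src_fd, length, offset)
          written = 0
          # pwrite() may return a short count (for example if interrupted by
          # the SIGUSR2 mentioned above), so loop until the chunk is complete
          while written < len(data):
              written += os.pwrite(dst_fd, data[written:], offset + written)

      src_fd = os.open("original", os.O_RDONLY)
      dst_fd = os.open("/lustre/copy", os.O_WRONLY | os.O_CREAT, 0o644)
      with ThreadPoolExecutor(max_workers=4) as pool:
          futures = [pool.submit(copy_chunk, src_fd, dst_fd, off, length)
                     for off, length in CHUNKS]
          for fut in futures:
              fut.result()            # surface any I/O errors
      os.close(src_fd)
      os.close(dst_fd)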

      When I examine an OST object containing a corrupt region, I see there is no block allocated for the corrupt region (in this case, logical block 256 is missing).

      # pigs58 /root > debugfs -c -R "dump_extents /O/0/d$((30205348 % 32))/30205348" /dev/sdb
      debugfs 1.41.12 (17-May-2010)
      /dev/sdb: catastrophic mode - not reading inode or group bitmaps
      Level Entries       Logical              Physical Length Flags
       0/ 0   1/  3     0 -   255 813140480 - 813140735    256
       0/ 0   2/  3   257 -   511 813142528 - 813142782    255
       0/ 0   3/  3   512 -   767 813143040 - 813143295    256
      
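      Assuming the default 1 MiB stripe size, logical block 256 is byte offset 1 MiB within the object, i.e. the first 4 KiB page of the second stripe unit written to that OST. The mapping back to a file offset can be sketched as below (the stripe size, count and index values here are assumptions for illustration; the real values come from lfs getstripe):

      # Sketch: map a missing logical block in an OST object to a file offset.
      def file_offset(block, stripe_size=1 << 20, stripe_count=100,
                      stripe_index=0, block_size=4096):
          obj_offset = block * block_size             # offset within the OST object
          stripe_unit = obj_offset // stripe_size     # which stripe unit on this OST
          within = obj_offset % stripe_size           # offset inside that stripe unit
          return (stripe_unit * stripe_count + stripe_index) * stripe_size + within

      # block 256 on stripe index 0 of the 100-stripe file -> file offset 100 MiB
      print(file_offset(256) // (1 << 20))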

      Finally, the following server-side console messages appeared at the same time one of the corrupted files was written, and mention the NID of the implicated client. The consoles of the OSTs containing the corrupt objects were quiet at the time.

      May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0
      May 17 01:06:08 pigs-mds1 kernel: Lustre: 20418:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
      May 17 01:06:08 pigs-mds1 kernel: Lustre: All locks stolen from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
      


          Activity


            Moved unrelated comments to new bug LU-3219.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment (edited)

            I'm lowering the prio to major and will continue working on this after getting new clues.

            jay Jinshan Xiong (Inactive) added a comment

            Hi Jinshan,

            This node has been up for about 70 days and has run many other jobs, so we can't say for sure whether the missing page is reflected in the stats.

            I like the idea of a debug patch to catch this case. We discussed that approach during our last meeting but I couldn't easily figure out where to put the debugging.

            nedbass Ned Bass (Inactive) added a comment

            Have you run any other jobs in this cluster? I ask because I saw some non-1M RPCs in the rpc_stats.

            From the symptom, I guess that somehow the first page of a stripe was not included in the RPC. If that is true, then that RPC must have had only 255 pages, and we should see a tally for it in the stats. Of course, this only works if that job was the only one running in the cluster.

            We'll make a debug patch for this purpose.

            jay Jinshan Xiong (Inactive) added a comment
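            As a rough illustration, write RPCs recorded in any pages-per-rpc bucket other than 256 (which would include a hypothetical 255-page RPC) can be spotted with a sketch like the one below; it assumes the rpc_stats histogram layout shown in the next comment, and the script name and path argument are only examples:

            # check_rpc_stats.py <rpc_stats file>  - illustrative sketch only
            import re, sys

            in_pages = False
            for line in open(sys.argv[1]):
                if line.lstrip().startswith('pages per rpc'):
                    in_pages = True
                    continue
                if in_pages:
                    m = re.match(r'\s*(\d+):\s+(\d+)\s+\d+\s+\d+\s*\|\s*(\d+)', line)
                    if not m:
                        break          # end of the pages-per-rpc histogram
                    pages, read_rpcs, write_rpcs = map(int, m.groups())
                    if pages != 256 and write_rpcs > 0:
                        print("%d write RPCs in the %d pages-per-rpc bucket" % (write_rpcs, pages))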

            The rpc_stats for the involved OST looked like this:

                                        read                    write
            pages per rpc           rpcs   %   cum % |   rpcs  %  cum %
            1:                        0    0     0   |      3  0    0   
            2:                        0    0     0   |      0  0    0   
            4:                        2    0     0   |      0  0    0   
            8:                        0    0     0   |      0  0    0   
            16:                       0    0     0   |      0  0    0   
            32:                       0    0     0   |      1  0    0   
            64:                       0    0     0   |      3  0    0   
            128:                      0    0     0   |      3  0    0   
            256:                   4771   99   100   |   4768 99  100 
            
                                        read                    write
            offset                  rpcs   %   cum % |   rpcs  %  cum %
            1:                     4772   99    99   |   4760 99   99  
            2:                        0    0    99   |      4  0   99  
            4:                        2    0    99   |      0  0   99  
            8:                        0    0    99   |      0  0   99  
            16:                       0    0    99   |      0  0   99  
            32:                       0    0    99   |      1  0   99  
            64:                       1    0   100   |      6  0   99  
            128:                      0    0   100   |      7  0  100 
            
            nedbass Ned Bass (Inactive) added a comment

            No, both client and server are x86_64.

            morrone Christopher Morrone (Inactive) added a comment

            I can't think of what kind of problem could cause this symptom; there isn't even any lock contention. One more question: is the client ppc64?

            jay Jinshan Xiong (Inactive) added a comment

            The stripe patterns are not always the same. The first case involved a 300 MB file with a stripe count of 100. This most recent case involved a 10 GB file with a stripe count of 2. The only consistent pattern so far is that it always seems to be logical block 256 missing from an OST object.

            I'll check the rpc_stats tomorrow.

            nedbass Ned Bass (Inactive) added a comment
            bobijam Zhenyu Xu added a comment

            Ned Bass,

            Would you mind checking whether the stripe patterns are always the same in the error cases? Also, please collect and upload the clients' rpc_stats (e.g. the /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88003de72800/rpc_stats data).

            We'd like to know whether the clients have always written full 1M RPCs to the OST without partial writes, so that we can tell whether the issue is on the client side or the OST side.

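            Gathering those stats from every OSC on a client node could be as simple as the following sketch (the /proc path pattern follows the example above):

            # dump_rpc_stats.py - illustrative sketch for collecting client rpc_stats
            import glob

            for path in sorted(glob.glob('/proc/fs/lustre/osc/*/rpc_stats')):
                print('==> %s' % path)
                with open(path) as f:
                    print(f.read())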

            Looking at the on-disk data, it is again logical block 256 that is unallocated in the OST object.

            nedbass Ned Bass (Inactive) added a comment

            People

              jay Jinshan Xiong (Inactive)
              nedbass Ned Bass (Inactive)