[LU-1442] File corrupt with 1MiB-aligned 4k regions of zeros Created: 25/May/12  Updated: 16/Aug/16  Resolved: 16/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.1
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

https://github.com/chaos/lustre/commits/2.1.1-llnl


Attachments: File LU-1442.lustre.log.sierra654.gz     File LU-1442.lustre.log.sierra972.gz     File qualify.ftp    
Issue Links:
Duplicate
is duplicated by LU-1680 LBUG cl_lock.c:1949:discard_cb()) (OR... Resolved
Related
is related to LU-1703 b2_1 can't pass acc-sm test Resolved
is related to LU-1458 lustre-rsync-test test_2b: old lustre... Resolved
Severity: 3
Rank (Obsolete): 4520

 Description   

A data integrity test run periodically by our storage group found two occurrences of corrupt files written to Lustre. The original files contain 300 MB of random data. The corrupt copies contain several 4096B regions of zeros aligned on 1MiB boundaries. The two corrupt files were written to the same filesystem from two different login nodes on the same cluster within five minutes of each other. The stripe count is 100.

The client application is a parallel ftp client reading data out of our storage archive into Lustre. The test checks for differences between the restored files and the original copies. For a 300MB file it uses 4 threads which issue four 64MB pwrite()s and one 44MB pwrite(). It is possible that the pwrite() gets restarted due to SIGUSR2 from a master process, though we don't know if this occurred in the corrupting case. This test has seen years of widespread use on all of our clusters, and this is the first reported occurrence of this type of corruption, so we can characterize the frequency as rare.
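
For reference, the write pattern amounts to something like the following (a minimal sketch, not the actual pftp source, which we don't have; the helper name and the restart-on-EINTR loop are illustrative):

#define _XOPEN_SOURCE 700       /* for pwrite() */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write one region with pwrite() and simply reissue
 * the call if it is interrupted (e.g. by SIGUSR2 from the parent). */
static ssize_t pwrite_full(int fd, const char *buf, size_t len, off_t off)
{
        size_t done = 0;

        while (done < len) {
                ssize_t rc = pwrite(fd, buf + done, len - done, off + done);

                if (rc < 0 && errno == EINTR)
                        continue;               /* interrupted: reissue */
                if (rc < 0)
                        return -1;
                done += rc;
        }
        return done;
}

Each of the 4 children would make one such call for its own 64MB (or 44MB) region.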

When I examine an OST object containing a corrupt region, I see there is no block allocated for the corrupt region (in this case, logical block 256 is missing).

# pigs58 /root > debugfs -c -R "dump_extents /O/0/d$((30205348 % 32))/30205348" /dev/sdb
debugfs 1.41.12 (17-May-2010)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Level Entries       Logical              Physical Length Flags
 0/ 0   1/  3     0 -   255 813140480 - 813140735    256
 0/ 0   2/  3   257 -   511 813142528 - 813142782    255
 0/ 0   3/  3   512 -   767 813143040 - 813143295    256

Finally, the following server-side console messages appeared at the same time one of the corrupted files was written, and mention the NID of the implicated client. The consoles of the OSTs containing the corrupt objects were quiet at the time.

May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0
May 17 01:06:08 pigs-mds1 kernel: Lustre: 20418:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
May 17 01:06:08 pigs-mds1 kernel: Lustre: All locks stolen from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5


 Comments   
Comment by Ned Bass [ 25/May/12 ]

The client logs were also quiet at the time of the corruption.

I should also mention that the corruption offsets are not at 64MiB boundaries, so it is less likely that the application accidentally wrote a sparse file by miscalculating the offset argument to pwrite(). The first file had corruption at MiB offsets 159, 162, 173, 189. In the second file the MiB offsets were 158 and 190.

Comment by Peter Jones [ 25/May/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 27/May/12 ]

Would you please upload as much of the MDS and client debug logs as possible?

May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0

The opcode of the request from the client is abnormal (its opc was 0). The client opc saved on the MDS is stored in target_send_reply():

target_send_reply()
        /* disable reply scheduling while I'm setting up */
        rs->rs_scheduled = 1;
        rs->rs_on_net    = 1;
        rs->rs_xid       = req->rq_xid;
        rs->rs_transno   = req->rq_transno;
        rs->rs_export    = exp;
        rs->rs_opc       = lustre_msg_get_opc(rs->rs_msg);
lustre_msg_get_opc()
__u32 lustre_msg_get_opc(struct lustre_msg *msg)
{
        switch (msg->lm_magic) {
        case LUSTRE_MSG_MAGIC_V2: {
                struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
                if (!pb) {
                        CERROR("invalid msg %p: no ptlrpc body!\n", msg);
                        return 0;
                }
                return pb->pb_opc;
        }
        default:
                CERROR("incorrect message magic: %08x(msg:%p)\n", msg->lm_magic, msg);
                LBUG();
                return 0;
        }
}
Comment by Ned Bass [ 28/May/12 ]

Hi,

Would you please upload as much of the MDS and client debug logs as possible?

Unfortunately the MDS was rebooted since this event and the debug log was lost. I dumped the debug logs for both clients involved, but I don't think you'll find anything useful in them. I'll attach them just in case.

Comment by Andreas Dilger [ 29/May/12 ]

Ned,
Can you please check the NID of the client that reported this problem and verify its Lustre version? If the client is a bit older, it is possible that it hit LU-721, which was only fixed in Lustre 1.8.7, 2.1.1, and 2.2+. AFAIK, you have 1.8.5 clients still in use.

Failing that, is there any chance that the client was evicted from the OST, but the console messages related to client eviction have been turned off? There were a lot of patches to quiet console errors, so I'm wondering if a client eviction could happen without any visible console error messages.

The write was never sent to the filesystem (evidenced by the unallocated block), so it is unlikely to be corruption at the underlying block layer or in ldiskfs. It is much more likely that the RPC was not handled at all, which could be caused by eviction, or something like LU-721.

Comment by Christopher Morrone [ 29/May/12 ]

Andreas, the clients that are hitting this are running 2.1.1-11chaos, and the servers are all running 2.1.1-4chaos or newer. So we definitely already have the LU-721 patch that landed in 2.1.1 on those clients.

We still have 1.8.5-?chaos on some x86_64 clients (again, all servers are 2.1.1+), but none of them exhibit this problem. These clients will be upgraded to 2.1.1-*chaos in time. But again, only the 2.1.1+ clients are seeing this bug.

Comment by Christopher Morrone [ 30/May/12 ]

We are NOT silencing the eviction messages. We absolutely need those.

But there are no messages about eviction or anything else on the clients in question for hours on either side of the problem.

Comment by Christopher Morrone [ 31/May/12 ]

We have confirmation from the pftp developers that the parent process will send SIGUSR2 to all four of the children (probably when the first child completes). Here is exactly what we've been told:

"Near the end of the transfer, the parent process send SIGUSR2 to all four children. Anywhere from 0 to 3 of the 4 children exit pwrite with a return value of -1 and errno of 4 (interrupted system call). All interrupted pwrites are reissued and then succeed."

Comment by Ned Bass [ 01/Jun/12 ]

Andreas,

The write was never sent to the filesystem (evidenced by the unallocated block), so it is unlikely to be corruption at the underlying block layer or in ldiskfs. It is much more likely that the RPC was not handled at all, which could be caused by eviction, or something like LU-721.

One thing that puzzles me about this explanation is that the size of the hole is only 4k. I would expect the client to have been sending 1MB write RPCs since the pwrite() size was a multiple of 1MB. If the write was never sent or handled, shouldn't the hole be 1MB?

Comment by Jinshan Xiong (Inactive) [ 05/Jun/12 ]

This bug may be related to LU-1299, where sleeping on a cl_lock is interrupted by a signal and the lock is wrongly marked as being in error. As a result, lock cancellation aborts, which leaves some dirty pages unflushed.

There is a patch at http://review.whamcloud.com/2574; can you please try it and see whether it fixes your issue?

Comment by Ned Bass [ 05/Jun/12 ]

Hi Jinshan,

We have patchset 7 of that patch in our current tag. We will update to the latest. However, we don't have a reliable reproducer for this bug, so it will be difficult to say whether it fixes the issue.

Comment by Christopher Morrone [ 05/Jun/12 ]

I've updated the 2.1.1-llnl branch to have patch set 10 from http://review.whamcloud.com/2574.

Comment by Peter Jones [ 11/Jun/12 ]

OK, I will close this as a duplicate of LU-1299, and we can reopen it if that is proven not to be the case.

Comment by Ned Bass [ 15/Jun/12 ]

We had a new occurrence of this bug this morning on a client running patchset 7 of http://review.whamcloud.com/2574. Does this change your view that this is related to LU-1299, or do we need patchset 10 to prevent the corruption?

Comment by Ned Bass [ 15/Jun/12 ]

The new occurrence again coincided with messages on the MDS like these:

Resent req xid ... has mismatched opc: new 101 old 0
Stealing 1 locks from rs ...
All locks stolen from rs ...

Comment by Jinshan Xiong (Inactive) [ 18/Jun/12 ]

Hi Ned, it seems this is not related to LU-1299. Did the processes write their portions of the file exclusively, i.e. was there no extent lock conflict?

Comment by Ned Bass [ 19/Jun/12 ]

Did the processes write their portions of the file exclusively, i.e. was there no extent lock conflict?

I'm not sure how to determine the answer to that. The pwrite() calls from one strace of the four involved processes looked like this:

6621:pwrite(8, "...", 67108864, 0) = 67108864                    
6621:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6621:pwrite(8, "...", 46137344, 268435456) = 46137344                

6622:pwrite(4, "...", 67108864, 67108864) = 67108864
6622:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---

6623:pwrite(4, "...", 67108864, 134217728) = ? ERESTARTSYS (To be restarted)
6623:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6623:pwrite(4, "...", 67108864, 134217728) = 67108864                

6624:pwrite(4, "...", 67108864, 201326592) = ? ERESTARTSYS (To be restarted)
6624:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6624:pwrite(4, "...", 67108864, 201326592) = 67108864   

(This was not a corrupting case, just a sample run of the test program.)

It may be worth noting that in each occurrence we've had so far, the corrupted regions have all been at offsets between 100-200 MB. Since the test uses a stripe count of 100, that's always in the second stripe round of the file.
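
Doing the arithmetic (and assuming the default 1 MiB stripe size, which we haven't confirmed here): file offsets in the 100-200 MiB range fall in the second 1 MiB chunk of each of the 100 stripe objects, i.e. object-local blocks 256-511, which matches the unallocated block 256 seen in the debugfs output above. An illustrative calculation:

#include <stdio.h>

#define STRIPE_SIZE   (1ULL << 20)      /* 1 MiB stripe size (assumed) */
#define STRIPE_COUNT  100ULL
#define BLOCK_SIZE    4096ULL

int main(void)
{
        unsigned long long file_off = 159ULL << 20;     /* one corrupt offset */
        unsigned long long chunk    = file_off / STRIPE_SIZE;
        unsigned long long stripe   = chunk % STRIPE_COUNT;
        unsigned long long obj_off  = (chunk / STRIPE_COUNT) * STRIPE_SIZE +
                                      file_off % STRIPE_SIZE;

        /* prints: stripe 59, object-local block 256 */
        printf("stripe %llu, object-local block %llu\n",
               stripe, obj_off / BLOCK_SIZE);
        return 0;
}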

Comment by Christopher Morrone [ 21/Jun/12 ]

Did the processes write their portions of the file exclusively, i.e. was there no extent lock conflict?

The processes write to their own ranges in the file, and do not overlap with any of the other writers. Furthermore, the writers are all on one node, so there is no LDLM lock contention with any other Lustre clients.

Comment by Peter Jones [ 25/Jun/12 ]

It seems like this ticket should be reopened.

Comment by Andreas Dilger [ 25/Jun/12 ]

Jinshan,
I would investigate the RPC opcode issue. It would be possible to add debugging on the client to verify that the opcode of the RPC is non-zero when it is put into the replay list, print a message at that point if the opcode is bad, and also verify the opcode on the client before the RPC is resent. If only the second check triggers, it means some kind of in-memory corruption is happening on the client.
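
For example, something along these lines (a rough, untested sketch of the kind of check I mean; the call sites and exact fields would need to be adapted to the 2.1 code):

/* Rough sketch only (not a tested patch): call once when the request is
 * queued on the replay list and again just before it is resent.  If only
 * the pre-resend call complains, the opcode was corrupted in client
 * memory after the request was queued. */
static void dbg_check_req_opc(struct ptlrpc_request *req, const char *where)
{
        __u32 opc = lustre_msg_get_opc(req->rq_reqmsg);

        if (opc == 0)
                CERROR("req x%llu has zero opc at %s\n",
                       (unsigned long long)req->rq_xid, where);
}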

What is strange is that there are messages on the MDS, instead of on the OSS where one would expect any errors related to file IO to happen. The only thing I can think of where an MDS error would cause bad data is that some of the client threads are somehow opening the incorrect file and the data is going to the wrong location?

Comment by Jinshan Xiong (Inactive) [ 25/Jun/12 ]

I'm working on this issue.

Verifying the RPC opcode would be the only way to start. The strange thing is that only one page in the middle was missing; I can't connect that with an MDS replay error.

Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ]

Hi Ned, did you notice that when this issue happened, pwrite() kept returning -1 with errno ERESTARTSYS, or that sometimes a short write occurred?

Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ]

just in case, did you use DIRECT_IO to write the file?

Comment by Ned Bass [ 26/Jun/12 ]

did you notice that when this issue happened, pwrite() kept returning -1 with errno ERESTARTSYS, or that sometimes a short write occurred?

I don't have a trace of the process when the issue happened. All I know is that the parent process will signal its children with SIGUSR2, which will cause pwrite() to return -1 (strace shows ERESTARTSYS; the application sees EINTR). I've never seen this happen more than once in my tests, but it's possible this pattern will repeat if pwrite() is slow. I don't have the pftp source code to check, but I can ask our HPSS group to look into it.

just in case, did you use DIRECT_IO to write the file?

No, strace shows pftp just uses O_WRONLY|O_CREAT.

Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ]

Hi Ned, will you please try this patch: http://review.whamcloud.com/3194? It may fix the corruption issue.

After the corruption issue is fixed, I'll start to work on the wrong-opc issue if it's bothering you guys.

Comment by Ned Bass [ 27/Jun/12 ]

Great, we'll give the patch a try. We've had three corruptions in about two months, and we haven't found a way to easily reproduce it. So it may take a few months with no new corruptions to gain some confidence in the fix.

Comment by Jinshan Xiong (Inactive) [ 28/Jun/12 ]

When I was trying to reproduce this with fail_loc today I found something new. Although a dirty page will not be added into the osc's cache if osc_page_cache_add() is interrupted by a signal, the page will still be written back by the kernel flush daemon. That said, the IO pattern can be exactly what you have seen (a one-page gap in block allocation), but data corruption is unexpected. I still need to investigate this problem, and I will focus on whether there exists a code path that causes a dirty page to be discarded.

I'm pretty sure the issue you have seen is related to the problem I found, so please apply this patch and test intensively. Maybe we can find new clues. Thanks,

Comment by Jinshan Xiong (Inactive) [ 28/Jun/12 ]

Hi Ned, do you know how the application detected the data corruption? Does it just read the data back on the same client, or were some operations, for example flushing cached pages, done between the write and the verification?

Comment by Ned Bass [ 28/Jun/12 ]

Attached test script. No cache flush or other operations are done between write and verification. The entire test is run on the same client. The process is basically:

  1. write random patterns to a file
  2. ftp file to archival storage
  3. retrieve copy from archival storage
  4. compare copy to original with 'cmp' to check for corruption
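
(For completeness, a rough sketch of a standalone checker that flags 4 KiB all-zero regions at 1 MiB boundaries; the actual test just uses cmp against the original file, so this is illustrative only.)

#define _XOPEN_SOURCE 700       /* for pread() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static const char zero[4096];
        char buf[4096];
        off_t off;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        for (off = 0; ; off += 1 << 20) {       /* check each 1 MiB boundary */
                ssize_t rc = pread(fd, buf, sizeof(buf), off);

                if (rc <= 0)
                        break;
                if (rc == (ssize_t)sizeof(buf) &&
                    memcmp(buf, zero, sizeof(buf)) == 0)
                        printf("all-zero 4 KiB region at offset %lld\n",
                               (long long)off);
        }
        close(fd);
        return 0;
}
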
Comment by Christopher Morrone [ 06/Jul/12 ]

Jinshan, I've added http://review.whamcloud.com/3194 to our branch to include in the next round of testing.

Comment by Ian Colle (Inactive) [ 31/Jul/12 ]

Master version of patch has been merged - http://review.whamcloud.com/#change,3447

Comment by Jinshan Xiong (Inactive) [ 31/Jul/12 ]

Hi Chris, how long have you been running the test with this patch?

Comment by Chris Gearing (Inactive) [ 02/Aug/12 ]

I don't know; any pushes to Gerrit that have rebased on this patch will have been run with it. I cannot know who has rebased and pushed.

Comment by Jinshan Xiong (Inactive) [ 02/Aug/12 ]

Hi Chris Gearing, sorry, I meant Christopher Morrone, because LLNL is verifying whether the patch fixes the data corruption problem. This problem occurs rarely, so it may take months to verify.

Comment by Christopher Morrone [ 02/Aug/12 ]

It's been on our test systems and hasn't caused any problems that I am aware of. It is not installed in production yet. It might make it into a production release in a couple of weeks.

We've seen the LU-1680 failures on the orion branch, and just recently pulled this LU-1442 patch into there. We'll keep an eye out for failures on orion when we upgrade to that version.

Comment by Christopher Morrone [ 02/Aug/12 ]

Oh, in other words, I don't know if it fixes the problem because we don't know how to reproduce it in our test environment. But it hasn't caused any new problems that I know of.

Comment by Jay Lan (Inactive) [ 02/Aug/12 ]

Are the messages below evidence of data corruption? We have quite a number of these on our 2.1.2 MDS:

LustreError: 4869:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1407363869661366 has mismatched opc: new 101 old 0
Lustre: 4869:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff8802a0930000 x1407363869661366.t210874408121 o0 NID 10.151.26.25@o2ib
Lustre: 4265:0:(service.c:1865:ptlrpc_handle_rs()) All locks stolen from rs ffff8802a0930000 x1407363869661366.t210874408121 o0 NID 10.151.26.25@o2ib

Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ]

Hi Jay Lan,

These error messages are not related, as the data corruption happened on the OST and the messages show something wrong with the MDT.

Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ]

I'm going to mark this issue as fixed; otherwise we can't release 2.3.

Please reopen this issue if it occurs again.

Comment by Jay Lan (Inactive) [ 06/Aug/12 ]

Hi Jinshan, but our console log messages look the same as the server console log messages shown in the Description section of the ticket. Are you suggesting that other problems can produce the same log messages, and thus that the messages alone are not sufficient evidence? Please advise. Thanks!

Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ]

This message may not indicate any problem; it is probably just overly aggressive console output. I think it can be ignored if you didn't see any real problem.

Comment by Jinshan Xiong (Inactive) [ 07/Aug/12 ]

Hi Jay Lan, LU-1717 was just created to address/understand this MDS console error message.

Comment by Jay Lan (Inactive) [ 07/Aug/12 ]

Thanks, Jinshan!

Comment by Yusong Wang [ 12/Sep/12 ]

Hi,

We occasionally hit a similar issue (the written file has a big hole filled with zeros), which caused wrong simulation results.

As we have Lustre 1.8.8, I wonder if our file system could be affected by this bug.

Thanks,
Yusong

Comment by Ned Bass [ 03/Dec/12 ]

Our test program detected this type of corruption again, so we should re-open this issue. Client was running 2.1.2-4chaos, which includes the patch "LU-1442 llite: cleanup if a page failed to add into cache".

https://github.com/chaos/lustre/commits/2.1.2-4chaos

Note that this test runs regularly on our production systems, so this bug is very rare to reproduce.

Comment by Ned Bass [ 03/Dec/12 ]

Looking at the on-disk data, it is again logical block 256 that is unallocated in the OST object.

Comment by Zhenyu Xu [ 05/Dec/12 ]

Ned Bass,

Would you mind checking whether the stripe patterns are always the same for the error cases? Also, please collect and upload the clients' rpc_stats (e.g. /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88003de72800/rpc_stats).

We'd like to know whether the clients always wrote 1M RPCs to the OST without partial writes, so that we'd know whether it is a client issue or an OST issue.

Comment by Ned Bass [ 06/Dec/12 ]

The stripe patterns are not always the same. The first case involved a 300 MB file with a stripe count of 100. This most recent case involved a 10 GB file with a stripe count of 2. The only consistent pattern so far is that it always seems to be logical block 256 missing from an OST object.

I'll check the rpc_stats tomorrow.

Comment by Jinshan Xiong (Inactive) [ 06/Dec/12 ]

I can't think of what kind of problem could cause this symptom; there isn't even any lock contention. One more question: is the client ppc64?

Comment by Christopher Morrone [ 06/Dec/12 ]

No, both client and server are x86_64.

Comment by Ned Bass [ 06/Dec/12 ]

The rpc_stats for the involved OST looked like this:

                            read                    write
pages per rpc           rpcs   %   cum % |   rpcs  %  cum %
1:                        0    0     0   |      3  0    0   
2:                        0    0     0   |      0  0    0   
4:                        2    0     0   |      0  0    0   
8:                        0    0     0   |      0  0    0   
16:                       0    0     0   |      0  0    0   
32:                       0    0     0   |      1  0    0   
64:                       0    0     0   |      3  0    0   
128:                      0    0     0   |      3  0    0   
256:                   4771   99   100   |   4768 99  100 

                            read                    write
offset                  rpcs   %   cum % |   rpcs  %  cum %
1:                     4772   99    99   |   4760 99   99  
2:                        0    0    99   |      4  0   99  
4:                        2    0    99   |      0  0   99  
8:                        0    0    99   |      0  0   99  
16:                       0    0    99   |      0  0   99  
32:                       0    0    99   |      1  0   99  
64:                       1    0   100   |      6  0   99  
128:                      0    0   100   |      7  0  100 
Comment by Jinshan Xiong (Inactive) [ 07/Dec/12 ]

Have you ever run other jobs on this cluster? I ask because I saw some non-1M RPCs in the rpc_stats.

From the symptom, I guess somehow the 1st page of a stripe was not included in the RPC. If that is true, then that RPC must have had only 255 pages, and we should see a tally for it in the stats. Of course, this only works if that job is the only one on the cluster.

We'll make a debug patch for this purpose.

Comment by Ned Bass [ 07/Dec/12 ]

Hi Jinshan,

This node has been up for about 70 days and has run many other jobs, so we can't say for sure whether the missing page is reflected in the stats.

I like the idea of a debug patch to catch this case. We discussed that approach during our last meeting but I couldn't easily figure out where to put the debugging.

Comment by Jinshan Xiong (Inactive) [ 07/Mar/13 ]

I'm lowering the priority to major and will continue working on this after getting new clues.

Comment by Artem Blagodarenko (Inactive) [ 27/Mar/13 ]

Moved unrelated comments to new bug LU-3219.

Comment by Artem Blagodarenko (Inactive) [ 23/Apr/13 ]

.

Comment by Jodi Levi (Inactive) [ 24/Apr/13 ]

Reclosing this ticket. The new issue mentioned is tracked in LU-3219.

Comment by Ned Bass [ 24/Apr/13 ]

Jodi, afaik this issue is not resolved, just rare. Are there still any plans to make a debug patch?

Comment by Peter Jones [ 24/Apr/13 ]

Sorry for the confusion Ned - things just got a little muddled due to two somewhat similar issues becoming intertwined. Yes, this is still on Jinshan's plate to dig into further at some point.

Comment by James A Simmons [ 16/Aug/16 ]

Old ticket for unsupported version
