[LU-1442] File corrupt with 1MiB-aligned 4k regions of zeros Created: 25/May/12 Updated: 16/Aug/16 Resolved: 16/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl |
| Environment: | |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 4520 |
| Description |
|
A data integrity test run periodically by our storage group found two occurrences of corrupt files written to Lustre. The original files contain 300 MB of random data. The corrupt copies contain several 4096-byte regions of zeros aligned on 1 MiB boundaries. The two corrupt files were written to the same filesystem from two different login nodes on the same cluster within five minutes of each other. The stripe count is 100.

The client application is a parallel ftp client reading data out of our storage archive into Lustre. The test checks for differences between the restored files and the original copies. For a 300 MB file it uses 4 threads, which between them issue four 64 MB pwrite()'s and one 44 MB pwrite(). It is possible that a pwrite() gets restarted due to SIGUSR2 from a master process, though we don't know if this occurred in the corrupting case. This test has seen years of widespread use on all of our clusters, and this is the first reported incidence of this type of corruption, so we can characterize the frequency as rare.

When I examine an OST object containing a corrupt region, I see there is no block allocated for the corrupt region (in this case, logical block 256 is missing).

# pigs58 /root > debugfs -c -R "dump_extents /O/0/d$((30205348 % 32))/30205348" /dev/sdb
debugfs 1.41.12 (17-May-2010)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Level Entries       Logical       Physical Length Flags
 0/ 0   1/  3     0 -   255  813140480 - 813140735    256
 0/ 0   2/  3   257 -   511  813142528 - 813142782    255
 0/ 0   3/  3   512 -   767  813143040 - 813143295    256

Finally, the following server-side console messages appeared at the same time one of the corrupted files was written, and mention the NID of the implicated client. The consoles of the OSTs containing the corrupt objects were quiet at the time.

May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0
May 17 01:06:08 pigs-mds1 kernel: Lustre: 20418:0:(mdt_recovery.c:1022:mdt_steal_ack_locks()) Stealing 1 locks from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5
May 17 01:06:08 pigs-mds1 kernel: Lustre: All locks stolen from rs ffff880410f62000 x1402165306385077.t125822723745 o0 NID 192.168.114.155@o2ib5 |
| Comments |
| Comment by Ned Bass [ 25/May/12 ] |
|
The client logs were also quiet at the time of the corruption. I should also mention that the corruption offsets are not at 64 MiB boundaries, so it is less likely that the application accidentally wrote a sparse file by miscalculating the offset argument to pwrite(). The first file had corruption at MiB offsets 159, 162, 173, and 189. In the second file the MiB offsets were 158 and 190. |
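For reference, a check for this corruption pattern that is independent of the test program could look something like the sketch below. This is a hypothetical helper (not our actual test code, which simply compares the restored file against the original); it reads 4096 bytes at every 1 MiB boundary and reports any region that is entirely zero.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MIB    (1024 * 1024)
#define REGION 4096

int main(int argc, char **argv)
{
        static char buf[REGION], zeros[REGION];  /* zeros[] is zero-filled static storage */
        off_t off;
        ssize_t n;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s FILE\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* check the first 4 KiB of every 1 MiB-aligned region */
        for (off = 0; (n = pread(fd, buf, REGION, off)) == REGION; off += MIB) {
                if (memcmp(buf, zeros, REGION) == 0)
                        printf("all-zero 4k region at MiB offset %lld\n",
                               (long long)(off / MIB));
        }
        close(fd);
        return 0;
}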
| Comment by Peter Jones [ 25/May/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 27/May/12 ] |
|
Would you please upload as much of the debug log of the MDS and client as possible?

May 17 01:06:08 pigs-mds1 kernel: LustreError: 20418:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1402165306385077 has mismatched opc: new 101 old 0

The operation request from the client is abnormal (its opc was 0). The client opc saved on the MDS is stored in target_send_reply():

        /* disable reply scheduling while I'm setting up */
        rs->rs_scheduled = 1;
        rs->rs_on_net = 1;
        rs->rs_xid = req->rq_xid;
        rs->rs_transno = req->rq_transno;
        rs->rs_export = exp;
        rs->rs_opc = lustre_msg_get_opc(rs->rs_msg);

and lustre_msg_get_opc() is:

__u32 lustre_msg_get_opc(struct lustre_msg *msg)
{
        switch (msg->lm_magic) {
        case LUSTRE_MSG_MAGIC_V2: {
                struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);

                if (!pb) {
                        CERROR("invalid msg %p: no ptlrpc body!\n", msg);
                        return 0;
                }
                return pb->pb_opc;
        }
        default:
                CERROR("incorrect message magic: %08x(msg:%p)\n",
                       msg->lm_magic, msg);
                LBUG();
                return 0;
        }
}
|
| Comment by Ned Bass [ 28/May/12 ] |
|
Hi,
Unfortunately the MDS was rebooted since this event and the debug log was lost. I dumped the debug logs for both clients involved, but I don't think you'll find anything useful in them. I'll attach them just in case. |
| Comment by Andreas Dilger [ 29/May/12 ] |
|
Ned, Failing that, is there any chance that the client was evicted from the OST, but the console messages related to client eviction have been turned off? There were a lot of patches to quiet console errors, so I'm wondering if a client eviction could happen without any visible console error messages. The write was never sent to the filesystem (evidenced by the unallocated block), so it is unlikely to be corruption at the underlying block layer or in ldiskfs. It is much more likely that the RPC was not handled at all, which could be caused by eviction, or something like |
| Comment by Christopher Morrone [ 29/May/12 ] |
|
Andreas, the clients that are hitting this are running 2.1.1-11chaos, and the servers are all running 2.1.1-4chaos or newer. So we definitely already have the

We still have 1.8.5-?chaos on some x86_64 clients (again, all servers are 2.1.1+), but none of them exhibit this problem. These clients will be upgraded to 2.1.1-*chaos in time. But again, only the 2.1.1+ clients are seeing this bug. |
| Comment by Christopher Morrone [ 30/May/12 ] |
|
We are NOT silencing the eviction messages. We absolutely need those. But there are no messages about eviction or anything else on the clients in question for hours on either side of the problem. |
| Comment by Christopher Morrone [ 31/May/12 ] |
|
We have confirmation from the pftp developers that the parent process will send SIGUSR2 to all four of the children (probably when the first child completes). Here is exactly what we've been told: "Near the end of the transfer, the parent process send SIGUSR2 to all four children. Anywhere from 0 to 3 of the 4 children exit pwrite with a return value of -1 and errno of 4 (interrupted system call). All interrupted pwrites are reissued and then succeed." |
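A minimal sketch of what that retry presumably looks like on the writer side follows. This is an assumption on our part (we do not have the pftp source), and write_region() is a hypothetical name; the key point is that a pwrite() interrupted by SIGUSR2 fails with EINTR and is simply reissued.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

static int write_region(int fd, const char *buf, size_t count, off_t offset)
{
        while (count > 0) {
                ssize_t n = pwrite(fd, buf, count, offset);

                if (n < 0) {
                        if (errno == EINTR)   /* interrupted by SIGUSR2 */
                                continue;     /* reissue the pwrite() */
                        return -1;            /* real error */
                }
                /* also advance past a short write, if one ever occurs */
                buf += n;
                offset += n;
                count -= n;
        }
        return 0;
}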
| Comment by Ned Bass [ 01/Jun/12 ] |
|
Andreas,
One thing that puzzles me about this explanation is that the size of the hole is only 4k. I would expect the client to have been sending 1MB write RPCs since the pwrite() size was a multiple of 1MB. If the write was never sent or handled, shouldn't the hole be 1MB? |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/12 ] |
|
This bug may be related to

There is a patch at http://review.whamcloud.com/2574; can you please try it to see if it fixes your issue? |
| Comment by Ned Bass [ 05/Jun/12 ] |
|
Hi Jinshan, We have patchset 7 of that patch in our current tag. We will update to the latest. However we don't have a reliable reproducer for this bug so it will be difficult to say if it fixes the issue. |
| Comment by Christopher Morrone [ 05/Jun/12 ] |
|
I've updated the 2.1.1-llnl branch to have patch set 10 from http://review.whamcloud.com/2574. |
| Comment by Peter Jones [ 11/Jun/12 ] |
|
ok I will close this as a duplicate of |
| Comment by Ned Bass [ 15/Jun/12 ] |
|
We had a new occurrence of this bug this morning on a client running patchset 7 of http://review.whamcloud.com/2574. Does this change your view that this is related to |
| Comment by Ned Bass [ 15/Jun/12 ] |
|
The new occurrence again coincided with messages on the MDS like these:

Resent req xid ... has mismatched opc: new 101 old 0 |
| Comment by Jinshan Xiong (Inactive) [ 18/Jun/12 ] |
|
Hi Ned, it seems this is not related to |
| Comment by Ned Bass [ 19/Jun/12 ] |
I'm not sure how to determine the answer to that. The pwrite() calls from one strace of the four involved processes looked like this:

6621:pwrite(8, "...", 67108864, 0) = 67108864
6621:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6621:pwrite(8, "...", 46137344, 268435456) = 46137344
6622:pwrite(4, "...", 67108864, 67108864) = 67108864
6622:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6623:pwrite(4, "...", 67108864, 134217728) = ? ERESTARTSYS (To be restarted)
6623:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6623:pwrite(4, "...", 67108864, 134217728) = 67108864
6624:pwrite(4, "...", 67108864, 201326592) = ? ERESTARTSYS (To be restarted)
6624:--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
6624:pwrite(4, "...", 67108864, 201326592) = 67108864

(This was not a corrupting case, just a sample run of the test program.)

It may be worth noting that in each occurrence we've had so far, the corrupted regions have all been at offsets between 100-200 MB. Since the test uses a stripe count of 100 that's always in the second stripe of the file. |
| Comment by Christopher Morrone [ 21/Jun/12 ] |
The writers each write to their own range in the file, and do not overlap with any of the other writers. Furthermore, the writers are all on one node, so there is no ldlm lock contention with any other Lustre clients. |
| Comment by Peter Jones [ 25/Jun/12 ] |
|
It seems like this ticket should be reopened. |
| Comment by Andreas Dilger [ 25/Jun/12 ] |
|
Jinshan, What is strange is that there are messages on the MDS, instead of on the OSS where one would expect any errors related to file IO to happen. The only thing I can think of where an MDS error would cause bad data is that some of the client threads are somehow opening the incorrect file and the data is going to the wrong location? |
| Comment by Jinshan Xiong (Inactive) [ 25/Jun/12 ] |
|
I'm working on this issue. Verifying the RPC opcode would be the only way to start. The strange thing is that only one page in the middle was missing; I can't connect that with an MDS replay error. |
| Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ] |
|
Hi Ned, did you notice whether, when this issue happened, pwrite() kept returning -1 with errno ERESTARTSYS, or whether a short write sometimes occurred? |
| Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ] |
|
just in case, did you use DIRECT_IO to write the file? |
| Comment by Ned Bass [ 26/Jun/12 ] |
I don't have a trace of the process when the issue happened. All I know is that the parent process will signal its children with SIGUSR2 which will cause pwrite() to return -1 with errno ERESTARTSYS. I've never seen this happen more than once in my tests, but it's possible this pattern will repeat if pwrite() is slow. I don't have the pftp source code to check, but I can ask our HPSS group to look into it.
No, strace shows pftp just uses O_WRONLY|O_CREAT. |
| Comment by Jinshan Xiong (Inactive) [ 26/Jun/12 ] |
|
Hi Ned, will you please try this patch: http://review.whamcloud.com/3194? It may fix the corruption issue. After the corruption issue is fixed, I'll start working on the wrong-opc issue if it's bothering you guys. |
| Comment by Ned Bass [ 27/Jun/12 ] |
|
Great, we'll give the patch a try. We've had three corruptions in about two months, and we haven't found a way to easily reproduce it. So it may take a few months with no new corruptions to gain some confidence in the fix. |
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/12 ] |
|
When I was trying to reproduce this with fail_loc today I found something new. Although a dirty page will not be added into the osc's cache if osc_page_cache_add() is interrupted by a signal, the page will still be written back by the kernel flush daemon. That said, the IO pattern can be exactly what you have seen (a one-page gap in block allocation), but the data corruption is unexpected. I still need to investigate this problem, and I will focus on whether there exists a code path that causes the dirty page to be discarded. I'm pretty sure the issue you have seen is related to the problem I found in the patch, so please apply this patch and test intensively. Maybe we can find new clues. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 28/Jun/12 ] |
|
Hi Ned, do you know how the application detected the data corruption? Does it just read the data back on the same client, or were some operations, for example flushing cached pages, done between the write and the verification? |
| Comment by Ned Bass [ 28/Jun/12 ] |
|
Attached test script. No cache flush or other operations are done between write and verification. The entire test is run on the same client. The process is basically:
|
| Comment by Christopher Morrone [ 06/Jul/12 ] |
|
Jinshan, I've added http://review.whamcloud.com/3194 to our branch to include in the next round of testing. |
| Comment by Ian Colle (Inactive) [ 31/Jul/12 ] |
|
Master version of patch has been merged - http://review.whamcloud.com/#change,3447 |
| Comment by Jinshan Xiong (Inactive) [ 31/Jul/12 ] |
|
Hi Chris, how long have you been running the test with this patch? |
| Comment by Chris Gearing (Inactive) [ 02/Aug/12 ] |
|
I don't know. Any pushes to gerrit that have been rebased on this patch will have been run with this patch, but I cannot know who has rebased and pushed. |
| Comment by Jinshan Xiong (Inactive) [ 02/Aug/12 ] |
|
Hi Chris Gearing, sorry, I meant Christopher Morrone, because LLNL is verifying whether the patch can fix the data corruption problem. This problem occurs rarely, so it may take months to verify. |
| Comment by Christopher Morrone [ 02/Aug/12 ] |
|
It's been on our test systems and hasn't caused any problems that I am aware of. It is not installed in production yet. It might make it into a production release in a couple of weeks. We've seen the |
| Comment by Christopher Morrone [ 02/Aug/12 ] |
|
Oh, in other words, I don't know if it fixes the problem because we don't know how to reproduce it in our test environment. But it hasn't caused any new problems that I know of. |
| Comment by Jay Lan (Inactive) [ 02/Aug/12 ] |
|
Are the messages below evidence of data corruption? We have quite a number of these on our 2.1.2 MDS:

LustreError: 4869:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1407363869661366 has mismatched opc: new 101 old 0
| Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ] |
|
Hi Jay Lan, these error messages are not related, as the data corruption happened on an OST and the messages indicate something wrong on the MDT. |
| Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ] |
|
I'm going to set this issue as fixed; otherwise we can't release 2.3. Please reopen this issue if it occurs again. |
| Comment by Jay Lan (Inactive) [ 06/Aug/12 ] |
|
Hi Jinshan, but our console log messages look the same as the server console log messages shown in the Description section of the ticket. Are you suggesting that other problems can produce the same log messages, and thus that the messages alone are not sufficient evidence? Please advise. Thanks! |
| Comment by Jinshan Xiong (Inactive) [ 06/Aug/12 ] |
|
This message may not indicate any problem; it is probably just overly aggressive console output. I think it can be ignored if you didn't see any real problem. |
| Comment by Jinshan Xiong (Inactive) [ 07/Aug/12 ] |
|
Hi Jay Lan, LU-1717 was just created to address/understand this MDS console error message. |
| Comment by Jay Lan (Inactive) [ 07/Aug/12 ] |
|
Thanks, Jinshan! |
| Comment by Yusong Wang [ 12/Sep/12 ] |
|
Hi, we occasionally see a similar issue (the written file has a big hole with a lot of zeros), which has caused wrong simulation results. As we have Lustre 1.8.8, I wonder if our file system could be affected by this bug. Thanks, |
| Comment by Ned Bass [ 03/Dec/12 ] |
|
Our test program detected this type of corruption again, so we should re-open this issue. The client was running 2.1.2-4chaos, which includes the patch (see https://github.com/chaos/lustre/commits/2.1.2-4chaos). Note that this test runs regularly on our production systems, so this bug reproduces very rarely. |
| Comment by Ned Bass [ 03/Dec/12 ] |
|
Looking at the on-disk data, it is again logical block 256 that is unallocated in the OST object. |
| Comment by Zhenyu Xu [ 05/Dec/12 ] |
|
Ned Bass, would you mind checking whether the stripe patterns are always the same for the error cases? Also, please collect and upload the clients' rpc_stats info (e.g. the /proc/fs/lustre/osc/lustre-OST0000-osc-ffff88003de72800/rpc_stats data). We'd like to know whether the clients have always written 1M RPCs to the OST without partial writes, so that we'd know whether it is a client issue or an OST issue. |
| Comment by Ned Bass [ 06/Dec/12 ] |
|
The stripe patterns are not always the same. The first case involved a 300 MB file with a stripe count of 100. This most recent case involved a 10 GB file with a stripe count of 2. The only consistent pattern so far is that it always seems to be logical block 256 missing from an OST object. I'll check the rpc_stats tomorrow. |
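(For reference: with 4096-byte blocks, logical block 256 starts at byte 256 * 4096 = 1048576, i.e. it is the 4 KiB region beginning exactly at the 1 MiB offset within the OST object. Assuming the usual 1 MiB stripe size, that lines up with the 1 MiB-aligned 4k zero regions seen in the files.)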
| Comment by Jinshan Xiong (Inactive) [ 06/Dec/12 ] |
|
I can't think of what kind of problem could cause this symptom; there is not even any lock contention. One more question: is the client ppc64? |
| Comment by Christopher Morrone [ 06/Dec/12 ] |
|
No, both client and server are x86_64. |
| Comment by Ned Bass [ 06/Dec/12 ] |
|
The rpc_stats for the involved OST looked like this:

                        read                    write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 3 0 0
2: 0 0 0 | 0 0 0
4: 2 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 1 0 0
64: 0 0 0 | 3 0 0
128: 0 0 0 | 3 0 0
256: 4771 99 100 | 4768 99 100
read write
offset rpcs % cum % | rpcs % cum %
1: 4772 99 99 | 4760 99 99
2: 0 0 99 | 4 0 99
4: 2 0 99 | 0 0 99
8: 0 0 99 | 0 0 99
16: 0 0 99 | 0 0 99
32: 0 0 99 | 1 0 99
64: 1 0 100 | 6 0 99
128: 0 0 100 | 7 0 100
|
| Comment by Jinshan Xiong (Inactive) [ 07/Dec/12 ] |
|
Have you ever run other jobs on this cluster? I ask because I saw some non-1M RPCs in the rpc_stats. From the symptom, I guess that somehow the 1st page of a stripe was not included in the RPC. If that is true, then that RPC must have had only 255 pages, and we should see a tally for it in the stats. Of course this only works if that job was the only one on the cluster. We'll make a debug patch for this purpose. |
| Comment by Ned Bass [ 07/Dec/12 ] |
|
Hi Jinshan, This node has been up for about 70 days and has run many other jobs, so we can't say for sure whether the missing page is reflected in the stats. I like the idea of a debug patch to catch this case. We discussed that approach during our last meeting but I couldn't easily figure out where to put the debugging. |
| Comment by Jinshan Xiong (Inactive) [ 07/Mar/13 ] |
|
I'm lowering the priority to Major and will continue working on this after getting new clues. |
| Comment by Artem Blagodarenko (Inactive) [ 27/Mar/13 ] |
|
Moved unrelated comments to new bug |
| Comment by Artem Blagodarenko (Inactive) [ 23/Apr/13 ] |
|
. |
| Comment by Jodi Levi (Inactive) [ 24/Apr/13 ] |
|
Reclosing this ticket. The new issue mentioned is tracked in |
| Comment by Ned Bass [ 24/Apr/13 ] |
|
Jodi, afaik this issue is not resolved, just rare. Are there still any plans to make a debug patch? |
| Comment by Peter Jones [ 24/Apr/13 ] |
|
Sorry for the confusion Ned - things just got a little muddled due to two somewhat similar issues becoming intertwined. Yes, this is still on Jinshan's plate to dig into further at some point. |
| Comment by James A Simmons [ 16/Aug/16 ] |
|
Old ticket for unsupported version |