[LU-12786] ior_hard read fails due to data verification errors Created: 19/Sep/19 Updated: 01/Sep/22 Resolved: 13/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara | Assignee: | Dongyang Li |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None |
| Environment: | master |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Got the following error with IOR. This is the ior_hard_read (non-aligned, single-shared-file) workload from multiple clients (10N240P).

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Sep 19 12:27:39 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -r -R -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall
Machine             : Linux c11
TestID              : 0
StartTime           : Thu Sep 19 12:27:39 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 52.4 TiB   Used FS: 22.3%   Inodes: 423.7 Mi   Used Inodes: 14.9%

Options:
api                 : POSIX
apiVersion          :
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB

Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
[14] FAILED comparison of buffer containing 8-byte ints:
[14]   File name = /scratch0/io500.out/ior_hard/IOR_file
[14]   In transfer 0, 5120 errors between buffer indices 328 and 5447.
[14]   File byte offset = 508297412608:
[14]   Expected: 0x000000260000001b 0000000000000a48 000000260000001b 0000000000000a58
[14]   Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000
[27] FAILED comparison of buffer containing 8-byte ints:
[27]   File name = /scratch0/io500.out/ior_hard/IOR_file
[27]   In transfer 0, 5668 errors between buffer indices 0 and 5667.
[27]   File byte offset = 542967361248:
[27]   Expected: 0x000000330000001b 0000000000000008 000000330000001b 0000000000000018
[27]   Actual:   0x000000890000001b 0000000000006ee8 000000890000001b 0000000000006ef8
WARNING: incorrect data on read (10788 errors found).
Used Time Stamp 27 (0x1b) for Data Signature
read 29032 45.91 45.91 0.112576 48.89 0.109625 48.92 0
Max Read: 29031.84 MiB/sec (30442.09 MB/sec)

Even when tested several times using different clients, IOR reports that the same MPI ranks read data that differs from what was expected. Overstriping is enabled on this directory: 240 overstripes set against 8 OSTs.

# lfs getstripe /scratch0/io500.out/ior_hard
/scratch0/io500.out/ior_hard
stripe_count:  240 stripe_size:   16777216 pattern:       raid0,overstriped stripe_offset: -1

/scratch0/io500.out/ior_hard/stonewall
lmm_stripe_count:  240
lmm_stripe_size:   16777216
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 2
        obdidx           objid           objid           group |
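For anyone trying to recreate this layout, the directory above was presumably set up with the overstriping support added in 2.13. The following is a minimal sketch, assuming the -C (overstripe count) option of lfs setstripe and reusing the path, stripe count, and stripe size reported above:

# Minimal sketch: create an overstriped directory like the one in the report.
# Assumes lfs setstripe -C (overstripe count) from the 2.13 overstriping feature.
mkdir -p /scratch0/io500.out/ior_hard
lfs setstripe -C 240 -S 16M /scratch0/io500.out/ior_hard   # 240 stripes spread over the 8 available OSTs
lfs getstripe /scratch0/io500.out/ior_hard                 # should report pattern: raid0,overstriped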
| Comments |
| Comment by Shuichi Ihara [ 19/Sep/19 ] |
|
This happens right after the write as well: I added -W (write data verification) to the write phase, and it detected incorrect data.

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Sep 19 13:03:17 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Thu Sep 19 13:03:17 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 52.4 TiB   Used FS: 19.7%   Inodes: 423.7 Mi   Used Inodes: 14.9%

Options:
api                 : POSIX
apiVersion          :
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
stonewalling pairs accessed min: 70161 max: 132000 -- min data: 3.1 GiB mean data: 4.5 GiB time: 300.0s
write 3194.11 45.91 45.91 0.151870 444.61 0.147815 444.64 0
[15] FAILED comparison of buffer containing 8-byte ints:
[15]   File name = /scratch0/io500.out/ior_hard/IOR_file
[15]   In transfer 28394, 5428 errors between buffer indices 0 and 5427.
[15]   File byte offset = 320330516064:
[15]   Expected: 0x0000003f0000001b 0000000000000008 0000003f0000001b 0000000000000018
[15]   Actual:   0x000000ea0000001b 00000000000fc668 000000ea0000001b 00000000000fc678
[207] FAILED comparison of buffer containing 8-byte ints:
[207]   File name = /scratch0/io500.out/ior_hard/IOR_file
[207]   In transfer 80530, 5632 errors between buffer indices 244 and 5875.
[207]   File byte offset = 908522442752:
[207]   Expected: 0x0000000f0000001b 00000000000007a8 0000000f0000001b 00000000000007b8
[207]   Actual:   0x000000a50000001b 0000000000085008 000000a50000001b 0000000000085018
stonewalling pairs accessed min: 132000 max: 132000 -- min data: 5.8 GiB mean data: 5.8 GiB time: 83.0s
WARNING: incorrect data on write (11060 errors found).
Used Time Stamp 27 (0x1b) for Data Signature
Max Write: 3194.11 MiB/sec (3349.26 MB/sec) |
| Comment by Wang Shilong (Inactive) [ 19/Sep/19 ] |
|
Still not sure whether this is a client- or server-side problem. Since this is a regression between 2.12 and 2.13, if we can reproduce the problem reliably, a git bisect might be a good way to find which patch broke this. |
| Comment by Peter Jones [ 19/Sep/19 ] |
|
Dongyang, could you please investigate? Thanks, Peter |
| Comment by Li Xi [ 19/Sep/19 ] |
|
I agree with Shilong that a git bisect might find the problem quickly, because I suspect the problem was introduced by one of the recent patches. There are a couple of related patches. |
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
Yeah, I think a bisect would be a very good idea here. Even though the problem is reproducible, Ihara said it seems to happen without any error messages on client or server, and it occurs randomly in the middle of a pretty large job, so it's going to be tricky at best to get debug data. A bisect is definitely easier, if it's possible... |
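If someone does attempt the bisect, a rough outline of the workflow would look like the sketch below. It assumes 2.12.0 is a known-good point and current master is bad; the tag name (v2_12_0) is an assumption that should be checked against the local lustre-release checkout, and each step requires rebuilding/reinstalling Lustre and rerunning the ior_hard reproducer.

# Rough git-bisect sketch (the tag name v2_12_0 is an assumption).
git bisect start
git bisect bad master        # current master reproduces the corruption
git bisect good v2_12_0      # assumed-good 2.12.0 release tag
# For each commit git checks out: build, install, rerun the ior_hard workload,
# then mark the result:
#   git bisect good    # no data verification errors
#   git bisect bad     # corruption reproduced
git bisect reset             # when finished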
| Comment by Cory Spitz [ 19/Sep/19 ] |
|
If this test relies on overstriping then it may not be possible to git bisect the range that you need. Is it reproducible without overstriping? |
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
Haven't tried that - The underlying benchmark is a lot slower without overstriping, so it's going to be a bit tricky. |
| Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ] |
|
I wanted to share a little analysis of this...

Conclusion:

Reasoning:

So first off, the bad data is on disk, not just in the client cache, as Ihara confirmed. That's important.

Second, the data on disk varies, but it's mostly garbage, and not zeroes. That's also important - it indicates that either the client wrote bad data (but not zeroes - garbage), or, more likely in my experience, the write from the client did not happen here. Basically, the write was lost, and we are seeing what was already on disk.

Third, the size is always less than a full write. For example: IOR looks at 8 bytes each time, so:

Finally, let's look at the offset of a few of the writes. I would expect the bad data to start or end on a boundary - a stripe boundary, a page boundary, a block boundary, etc. Very strangely, it doesn't seem to. This is really unusual - someone should check the math here.

[14] In transfer 0, 5120 errors between buffer indices 328 and 5447.

Notice the errors start at index 328, so indices 0-327 were fine. If we look at the offset, here is where the corruption ends (note again there are

This isn't on any particular boundary - it's not on a page or a MiB boundary, which means it's not on a stripe boundary.

I'll be curious to see whatever anyone finds here. |
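As a quick way to check the math above, the sketch below redoes the alignment check for the rank-14 failure from the description. It assumes the "File byte offset" IOR prints is the start of the failed transfer (worth confirming against the IOR source), and simply reports the remainders against page, 1 MiB, and 16 MiB (stripe size) boundaries.

# Alignment check for the rank-14 error range (indices 328..5447, 8-byte ints).
# Assumes the reported "File byte offset" is the start of the transfer.
off=508297412608
start=$((off + 328 * 8))        # first corrupted byte
end=$((off + (5447 + 1) * 8))   # first byte after the corrupted range
echo "corrupt range: $start .. $end  ($((end - start)) bytes)"
for bound in 4096 1048576 16777216; do
    echo "start % $bound = $((start % bound)),  end % $bound = $((end % bound))"
done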
| Comment by Alexey Lyashkov [ 20/Sep/19 ] |
|
> Second, the data on disk varies, but it's mostly garbage

It's not garbage - it looks like data from previous runs. So the blocks were allocated, but the data was lost.

As for the rest: let's enable an LDLM trace on the OSTs; you should find LDLM lock prolong messages for such a region. With an additional ptlrpc trace you should be able to trace how the write to this region finished.

@Ihara - what is the backend device for the OSTs? |
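A sketch of the server-side tracing suggested above, using the standard lctl debug knobs; the debug_mb value is only an example and should be sized to the memory available on the OSS:

# On each OSS: enable LDLM and RPC tracing and enlarge the kernel debug buffer.
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace
lctl set_param debug_mb=1024        # example size in MiB
# ... reproduce the failure, then dump the debug buffer to a file:
lctl dk /var/tmp/oss-ldlm-trace.log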
| Comment by Peter Jones [ 17/Oct/19 ] |
|
This no longer reproduces on latest master |
| Comment by Shuichi Ihara [ 23/Oct/19 ] |
|
Reproduced:

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Wed Oct 23 14:10:44 2019
Command line        : /work/tools/bin/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Wed Oct 23 14:10:44 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 29.6 TiB   Used FS: 0.0%   Inodes: 31.7 Mi   Used Inodes: 0.0%

Options:
api                 : POSIX
apiVersion          :
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
stonewalling pairs accessed min: 61292 max: 132000 -- min data: 2.7 GiB mean data: 4.1 GiB time: 300.0s
write 3090.73 45.91 45.91 0.140233 459.46 0.141559 459.51 0
[15] FAILED comparison of buffer containing 8-byte ints:
[15]   File name = /scratch0/io500.out/ior_hard/IOR_file
[15]   In transfer 17219, 4724 errors between buffer indices 0 and 4723.
[15]   File byte offset = 194255060064:
[15]   Expected: 0x0000003f0000001b 0000000000000008 0000003f0000001b 0000000000000018
[15]   Actual:   0x000000b70000001b 0000000000006b08 000000b70000001b 0000000000006b18
[131] FAILED comparison of buffer containing 8-byte ints:
[131]   File name = /scratch0/io500.out/ior_hard/IOR_file
[131]   In transfer 27742, 5876 errors between buffer indices 0 and 5875.
[131]   File byte offset = 312980157152:
[131]   Expected: 0x000000b30000001b 0000000000000008 000000b30000001b 0000000000000018
[131]   Actual:   0x000000170000001b 000000000000b388 000000170000001b 000000000000b398 |
| Comment by Shuichi Ihara [ 23/Oct/19 ] |
|
I haven't seen the problem without overstriping; whenever the problem occurred, overstriping was always enabled.

IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
Began               : Wed Oct 23 20:32:11 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -w -s 10000 -a POSIX -i 3 -C -Q 1 -g -G 27 -k -e -t 49152 -b 49152 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Wed Oct 23 20:32:11 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 29.6 TiB   Used FS: 1.1%   Inodes: 31.7 Mi   Used Inodes: 0.0%

Options:
api                 : POSIX
apiVersion          :
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 10000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 768
clients per node    : 24
repetitions         : 3
xfersize            : 49152 bytes
blocksize           : 49152 bytes
aggregate filesize  : 351.56 GiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 60.0s
write 5948 48.00 48.00 0.141523 60.47 0.137858 60.53 0
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 139.7s
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 68.3s
write 5231 48.00 48.00 0.142380 68.80 0.140756 68.83 1
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 145.4s
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 69.1s
write 5168 48.00 48.00 0.143842 69.62 0.142189 69.66 2
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 142.2s
Max Write: 5947.60 MiB/sec (6236.51 MB/sec) |
| Comment by Patrick Farrell [ 23/Oct/19 ] |
|
Hmm. Glancing at it again, I still can't make any particular sense of the data beyond that - it seems likely to be related to lock cancellation. (This is drawn from the observation that reducing the LRU size makes the problem go away.)

Due to the length/size of the job, it's almost impossible to get debug logs on this, but the desired logs would be dlmtrace. If you can reproduce the problem with dlmtrace enabled on the clients, which is not guaranteed, then you could perhaps set up a periodic log dump (something like lctl dk every 30 seconds, each dump to a different file) and hopefully see whether the failed write (because what seems to be happening here is that part of a write is not reaching disk) is indeed caused by a lock cancellation.

This would just be peeling the onion - it probably wouldn't give a root cause. But it would be a first step. |
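A minimal sketch of the periodic client-side dump described above; the interval, file names, and debug flags are only placeholders:

# On each client: enable dlmtrace, then dump (and clear) the debug buffer
# every 30 seconds into a fresh file so earlier data is not overwritten.
lctl set_param debug=+dlmtrace
while true; do
    lctl dk "/var/tmp/dlmtrace-$(hostname)-$(date +%s).log"
    sleep 30
done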
| Comment by Shuichi Ihara [ 31/Oct/19 ] |
|
There is an update: when each OSS has only a single OST, the problem never reproduces, even with overstriping enabled. |
| Comment by Wang Shilong (Inactive) [ 31/Oct/19 ] |
|
paf0186 So will it make any difference on the LDLM lock side (one OST per OSS vs. multiple OSTs per OSS) with overstriping enabled? I am not aware of any, but this looks strange to me. Also, per Ihara's testing, we could not reproduce the problem with page-aligned I/O either. |
| Comment by Patrick Farrell [ 31/Oct/19 ] |
|
Shilong,

No, it should not make any difference. That's why it's sort of hard to see overstriping as being involved other than by creating more activity. Overstriping is just ... > 1 stripe per OST. Almost all of Lustre doesn't really care where a stripe is, and no layer of Lustre does anything different if two stripes are on the same OST. (Or, they shouldn't.) This is why overstriping was so easy to implement: the vast majority of Lustre code is totally unaffected by it, LDLM included. (Overstriping took a while because of non-core things, such as issues with extended attribute size for very large layouts, but the core is extremely simple.)

With page-aligned testing, you're also going to get far fewer imperative lock cancellations. The same goes for small lru_size limits - you're not going to pile up locks in the same way; they're cleaned up as you go, which makes for a 'cleaner' lock map, so locks can be larger (because there are not as many old locks in the way), so there are fewer total locks. (You can see this by looking at the ldlm_cbd stats and counting how many total cancellations there are on a client - it should be much lower with either lru_size limits or page-aligned I/O.) |
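For what it's worth, the cancellation behaviour described above can be observed with the usual client-side tunables. A sketch, assuming the standard parameter names (the exact stats output varies between versions):

# On a client: inspect granted OSC lock counts and the LDLM callback service
# stats, which reflect blocking-AST-driven cancellations.
lctl get_param ldlm.namespaces.*OST*.lock_count
lctl get_param ldlm.services.ldlm_cbd.stats
# To compare with a capped lock LRU (the 400 here is an arbitrary example):
lctl set_param ldlm.namespaces.*OST*.lru_size=400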
| Comment by Patrick Farrell [ 31/Oct/19 ] |
|
The thing which really bothers me is that part of a write (always part of a single write) is not reaching disk (we know this because we see data from previous runs there).

It's not possible for a single write to use multiple LDLM locks unless it crosses a stripe boundary (and the missing writes are not aligned on a stripe boundary, so that is not the case here). At the OSC/LDLM level (so, after a write has been split between stripes), a write must fit entirely inside a single LDLM lock. If it does not, the match fails and a new LDLM lock is requested. (Even if a client has two adjacent LDLM locks on that stripe which together cover the full region, they will not be used; a new one is requested, which will cancel the existing ones.) This means it's not usually possible for lock cancellation to push out part of a write (except if the write spans multiple stripes, and, again, our bad data isn't aligned with a stripe boundary), because a write is always under a single lock. (The only possible case is where part of the data has already been written out due to normal writeback, so only part of the write is still in cache, but that's a fairly unlikely case. It's not impossible, but it seems unlikely.)

So... that makes it really confusing that we're seeing these odd data sizes. It makes me wonder if maybe, just maybe, there's a hardware problem? But again, the data ranges for the missing data are so strange, and I'm just blaming hardware because I have no ideas, which isn't great. We really need to try to get some debug data if possible. This doesn't match up with any obvious boundaries or patterns I've seen before, so I guess I basically don't think I can get anywhere by just thinking about it. |
| Comment by Patrick Farrell [ 31/Oct/19 ] |
|
If doing the periodic debug dumps doesn't work well or doesn't seem practical, you could, for example, instrument a single dlmtrace call by setting it to a different debug setting (i.e. create a new debug type and switch one of the DLMTRACE prints to it). For example, this one in ldlm_handle_bl_callback might be a decent choice:

And maybe also this one in osc_build_rpc?

That should produce a much smaller amount of debug data. |
| Comment by Gerrit Updater [ 01/Nov/19 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/36638 |
| Comment by Wang Shilong (Inactive) [ 01/Nov/19 ] |
|
Thanks paf0186 for sharing your thoughts here. Yup, agreed that enabling some debugging could help us proceed, so I pushed a debug patch which should collect what we want here while limiting the output. Since we have run out of the 32 debug bits, I reused the snapshot debug flag, since our testing won't trigger snapshots. sihara, would you help collect that? |
| Comment by Patrick Farrell [ 01/Nov/19 ] |
|
Thanks Shilong! Ihara, you may still need to increase the debug mb, even with just those messages. There will be a lot of lock cancels and a lot of RPCs sent. |
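For collecting the data from the debug patch, something along these lines should work on the clients; it assumes the reused bit shows up as "snapshot" in the debug mask (per the patch description above), and the buffer size is only an example:

# On the clients under test, with the debug patch from https://review.whamcloud.com/36638 applied:
lctl set_param debug_mb=2048        # enlarge the debug buffer (example size)
lctl set_param debug=+snapshot      # enable the reused snapshot debug bit
# ... reproduce, then dump:
lctl dk /var/tmp/lu-12786-client.log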
| Comment by Shuichi Ihara [ 13/Nov/19 ] |
|
We have been testing the same workload on the same hardware and software stack at a couple of different sites, but haven't reproduced the problem at this point. The original issue might have been a hardware issue.
|