[LU-12786] ior hard read fails due to data verification failures Created: 19/Sep/19  Updated: 01/Sep/22  Resolved: 13/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Shuichi Ihara Assignee: Dongyang Li
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

master


Issue Links:
Related
is related to LU-12681 Data corruption - due incorrect KMS w... Resolved
is related to LU-12832 soft lockup in ldlm_bl_xx threads at ... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Got the following error with IOR. This is the ior_hard_read (non-aligned, single-shared-file) workload from multiple clients (10N240P: 10 nodes, 240 processes).

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Sep 19 12:27:39 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -r -R -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall
Machine             : Linux c11
TestID              : 0
StartTime           : Thu Sep 19 12:27:39 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 52.4 TiB   Used FS: 22.3%   Inodes: 423.7 Mi   Used Inodes: 14.9%

Options: 
api                 : POSIX
apiVersion          : 
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
[14] FAILED comparison of buffer containing 8-byte ints:
[14]   File name = /scratch0/io500.out/ior_hard/IOR_file
[14]   In transfer 0, 5120 errors between buffer indices 328 and 5447.
[14]   File byte offset = 508297412608:
[14]     Expected: 0x000000260000001b 0000000000000a48 000000260000001b 0000000000000a58 
[14]     Actual:   0x0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[27] FAILED comparison of buffer containing 8-byte ints:
[27]   File name = /scratch0/io500.out/ior_hard/IOR_file
[27]   In transfer 0, 5668 errors between buffer indices 0 and 5667.
[27]   File byte offset = 542967361248:
[27]     Expected: 0x000000330000001b 0000000000000008 000000330000001b 0000000000000018 
[27]     Actual:   0x000000890000001b 0000000000006ee8 000000890000001b 0000000000006ef8 
WARNING: incorrect data on read (10788 errors found).
Used Time Stamp 27 (0x1b) for Data Signature
read      29032      45.91      45.91      0.112576   48.89      0.109625   48.92      0   
Max Read:  29031.84 MiB/sec (30442.09 MB/sec)

Even when tested several times using different clients, IOR reports that the same MPI ranks read data different from what is expected.
It seems the problem happens at write time and incorrect data was stored in the file.

Overstriping is enabled on this directory: a stripe count of 240, overstriped across 8 OSTs.

# lfs getstripe /scratch0/io500.out/ior_hard           
/scratch0/io500.out/ior_hard
stripe_count:  240 stripe_size:   16777216 pattern:       raid0,overstriped stripe_offset: -1

/scratch0/io500.out/ior_hard/stonewall
lmm_stripe_count:  240
lmm_stripe_size:   16777216
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 2
	obdidx		 objid		 objid		 group
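
For reference, an overstriped default layout like the one above can be set on the directory with something along these lines (a sketch; the -C/--overstripe-count option spelling is from memory, and the 16M stripe size is taken from the getstripe output above):

# set a 240-way overstriped default layout with a 16 MiB stripe size
lfs setstripe -C 240 -S 16M /scratch0/io500.out/ior_hard
# confirm the resulting layout
lfs getstripe /scratch0/io500.out/ior_hard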


 Comments   
Comment by Shuichi Ihara [ 19/Sep/19 ]

This happens even right after the write. I added -W (write data verification) to the write run, and it detected incorrect data.

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Sep 19 13:03:17 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Thu Sep 19 13:03:17 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 52.4 TiB   Used FS: 19.7%   Inodes: 423.7 Mi   Used Inodes: 14.9%

Options: 
api                 : POSIX
apiVersion          : 
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
stonewalling pairs accessed min: 70161 max: 132000 -- min data: 3.1 GiB mean data: 4.5 GiB time: 300.0s
write     3194.11    45.91      45.91      0.151870   444.61     0.147815   444.64     0   
[15] FAILED comparison of buffer containing 8-byte ints:
[15]   File name = /scratch0/io500.out/ior_hard/IOR_file
[15]   In transfer 28394, 5428 errors between buffer indices 0 and 5427.
[15]   File byte offset = 320330516064:
[15]     Expected: 0x0000003f0000001b 0000000000000008 0000003f0000001b 0000000000000018 
[15]     Actual:   0x000000ea0000001b 00000000000fc668 000000ea0000001b 00000000000fc678 
[207] FAILED comparison of buffer containing 8-byte ints:
[207]   File name = /scratch0/io500.out/ior_hard/IOR_file
[207]   In transfer 80530, 5632 errors between buffer indices 244 and 5875.
[207]   File byte offset = 908522442752:
[207]     Expected: 0x0000000f0000001b 00000000000007a8 0000000f0000001b 00000000000007b8 
[207]     Actual:   0x000000a50000001b 0000000000085008 000000a50000001b 0000000000085018 
stonewalling pairs accessed min: 132000 max: 132000 -- min data: 5.8 GiB mean data: 5.8 GiB time: 83.0s
WARNING: incorrect data on write (11060 errors found).
Used Time Stamp 27 (0x1b) for Data Signature
Max Write: 3194.11 MiB/sec (3349.26 MB/sec)
Comment by Wang Shilong (Inactive) [ 19/Sep/19 ]

Still not sure whether this is a client- or server-side problem. Since this is a regression between 2.12 and 2.13, if we can reproduce the problem reliably, a git bisect may be a good way to find which patch broke this.

Comment by Peter Jones [ 19/Sep/19 ]

Dongyang

Could you please investigate?

Thanks

Peter

Comment by Li Xi [ 19/Sep/19 ]

I agree with Shilong that git bisect might be able to find the problem quickly, because I feel the problem might have been introduced by one of the recent patches. There are a couple of related patches.

Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ]

Yeah, I think a bisect would be a very good idea here.  Even though the problem is reproducible, Ihara said it seemed to be happening without any error messages on the client or server, and it happens randomly in the middle of a pretty large job, so it's going to be tricky at best to get debug data.

Bisect is definitely easier if possible...
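
For reference, a bisect over that range would look roughly like this (a sketch; the v2_12_0 tag name is an assumption, and each step needs a rebuild/reinstall plus a rerun of the IOR reproducer):

git bisect start
git bisect bad master            # tip that reproduces the corruption
git bisect good v2_12_0          # assumed last-known-good tag in lustre-release
# build + install the bisected tree, rerun the IOR reproducer, then mark it:
git bisect good                  # or: git bisect bad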

Comment by Cory Spitz [ 19/Sep/19 ]

If this test relies on overstriping then it may not be possible to git bisect the range that you need. Is it reproducible without overstriping?

Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ]

Haven't tried that - The underlying benchmark is a lot slower without overstriping, so it's going to be a bit tricky.

Comment by Patrick Farrell (Inactive) [ 19/Sep/19 ]

I wanted to share a little analysis of this...

Conclusion:
We're losing all of a write, except for the part that overlaps adjacent pages.

Reasoning:

So first off, the bad data is on disk, not just in the client cache, as Ihara confirmed.  That's important.

Second, the data on disk varies, but it's mostly garbage, and not zeroes.  That's also important - it indicates that either the client wrote bad data (but not zeroes - garbage) or, more likely in my experience, the write from the client did not happen here.  Basically, the write was lost, and we are seeing what was already on disk.

Third, the size is always less than a full write.  For example:
In transfer 0, 5120 errors between buffer indices 328 and 5447

IOR looks at 8 bytes each time, so:
5120*8 = 40960, and our transfer size is 47008.

Finally, let's look at the offset of a few of the writes.

I would expect the bad data to start or end on a boundary - a stripe boundary, a page boundary, block boundary... etc.

Very strangely, it doesn't seem to be. This is really unusual - Someone should check the math here.

[14] In transfer 0, 5120 errors between buffer indices 328 and 5447.
(Note that 47008/8 = 5876, so there are 5876 indices in each write. So this is part of a write.)
[14] File byte offset = 508297412608

Notice the errors start at index 328, so indices 0-327 were fine.

If we look at the offset, here is where the corruption ends (note again that there are 5876 8-byte indices per write):
508297412608 + 5120*8

This isn't on any particular boundary - It's not on a page or a MiB boundary, which means it's not on a stripe boundary.
To say the least, that's extremely confusing, especially as it also spans several pages.

I'll be curious to see whatever anyone finds here.

Comment by Alexey Lyashkov [ 20/Sep/19 ]

> Second, the data on disk varies, but it's mostly garbage,

It's not garbage; it looks like data from previous runs. So the blocks are allocated but the data is lost. As for the rest: let's enable an ldlm trace on the OST; you should find ldlm lock prolong messages for such a region. With an additional ptlrpc trace you should be able to trace how the write to this region finished.

@Ihara - what is the backend device for the OSTs?

Comment by Peter Jones [ 17/Oct/19 ]

This no longer reproduces on latest master

Comment by Shuichi Ihara [ 23/Oct/19 ]

Reproduced LU-12786 again after preventing issue LU-12832 with a small lru_max_age. I'm trying to narrow it down.

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Wed Oct 23 14:10:44 2019
Command line        : /work/tools/bin/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Wed Oct 23 14:10:44 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 29.6 TiB   Used FS: 0.0%   Inodes: 31.7 Mi   Used Inodes: 0.0%

Options: 
api                 : POSIX
apiVersion          : 
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 132000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 240
clients per node    : 24
repetitions         : 1
xfersize            : 47008 bytes
blocksize           : 47008 bytes
aggregate filesize  : 1.35 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results: 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s) iter
------    ---------  ---------- ---------  --------   --------   --------   -------- ----
stonewalling pairs accessed min: 61292 max: 132000 -- min data: 2.7 GiB mean data: 4.1 GiB time: 300.0s
write     3090.73    45.91      45.91      0.140233   459.46     0.141559   459.51     0   
[15] FAILED comparison of buffer containing 8-byte ints:
[15]   File name = /scratch0/io500.out/ior_hard/IOR_file
[15]   In transfer 17219, 4724 errors between buffer indices 0 and 4723.
[15]   File byte offset = 194255060064:
[15]     Expected: 0x0000003f0000001b 0000000000000008 0000003f0000001b 0000000000000018 
[15]     Actual:   0x000000b70000001b 0000000000006b08 000000b70000001b 0000000000006b18 
[131] FAILED comparison of buffer containing 8-byte ints:
[131]   File name = /scratch0/io500.out/ior_hard/IOR_file
[131]   In transfer 27742, 5876 errors between buffer indices 0 and 5875.
[131]   File byte offset = 312980157152:
[131]     Expected: 0x000000b30000001b 0000000000000008 000000b30000001b 0000000000000018 
[131]     Actual:   0x000000170000001b 000000000000b388 000000170000001b 000000000000b398 
Comment by Shuichi Ihara [ 23/Oct/19 ]

I haven't seen the problem without overstriping; when the problem occurred, overstriping was always enabled.
I also tested overstriping with a page-aligned I/O size (e.g. 49152 bytes for both transfer and block size), but that also didn't reproduce the problem. So it looks like the problem is related to overstriping combined with non-page-aligned I/O.

IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
Began               : Wed Oct 23 20:32:11 2019
Command line        : /work/BMLab/Lustre/io500/io-500-dev/bin/ior -w -s 10000 -a POSIX -i 3 -C -Q 1 -g -G 27 -k -e -t 49152 -b 49152 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
Machine             : Linux c01
TestID              : 0
StartTime           : Wed Oct 23 20:32:11 2019
Path                : /scratch0/io500.out/ior_hard
FS                  : 29.6 TiB   Used FS: 1.1%   Inodes: 31.7 Mi   Used Inodes: 0.0%

Options:
api                 : POSIX
apiVersion          :
test filename       : /scratch0/io500.out/ior_hard/IOR_file
access              : single-shared-file
type                : independent
segments            : 10000
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
tasks               : 768
clients per node    : 24
repetitions         : 3
xfersize            : 49152 bytes
blocksize           : 49152 bytes
aggregate filesize  : 351.56 GiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results:

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 60.0s
write     5948       48.00      48.00      0.141523   60.47      0.137858   60.53      0
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 139.7s
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 68.3s
write     5231       48.00      48.00      0.142380   68.80      0.140756   68.83      1
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 145.4s
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 69.1s
write     5168       48.00      48.00      0.143842   69.62      0.142189   69.66      2
stonewalling pairs accessed min: 10000 max: 10000 -- min data: 0.5 GiB mean data: 0.5 GiB time: 142.2s
Max Write: 5947.60 MiB/sec (6236.51 MB/sec)
Comment by Patrick Farrell [ 23/Oct/19 ]

Hmm.
When I looked at this, I decided it seemed to be related to lock cancellation.

Glancing at it again, I still can't make any particular sense of the data beyond that - it seems likely to be related to lock cancellation.  (This is drawn from the observation that reducing the LRU size makes the problem go away.)

Due to the length/size of the job, it's almost impossible to get full debug logs for this, but the desired debug flag would be dlmtrace.  If you can reproduce the problem with dlmtrace enabled on the clients (which is not guaranteed), you could set up a periodic log dump (something like lctl dk every 30 seconds, each to a different file) and then hopefully see whether the lost write (because what seems to be happening here is that part of a write is not reaching disk) is indeed due to a lock cancellation.
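
A minimal sketch of such a periodic dump, run on each client for the duration of the job (the 30-second interval and file names are arbitrary):

lctl set_param debug=+dlmtrace
while sleep 30; do
    lctl dk /var/tmp/lustre-dk.$(hostname).$(date +%s)
done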

This would just be peeling the onion - It probably wouldn't give a root cause.  But it would be a first step.

Comment by Shuichi Ihara [ 31/Oct/19 ]

There is an update. When each OSS has only a single OST, the problem never reproduced, even with overstriping enabled.
When an OSS has multiple OSTs (e.g. two OSTs per OSS), the problem is easily reproducible. Other than multiple OSTs vs. a single OST per OSS, the configuration is identical.

Comment by Wang Shilong (Inactive) [ 31/Oct/19 ]

paf0186, so will it make any difference on the LDLM lock side (one OST per OSS vs. multiple OSTs per OSS) with overstriping enabled? I am not aware of any, but this looks strange to me.

Also, per Ihara's testing, we could not reproduce the problem with page-aligned I/O either.

Comment by Patrick Farrell [ 31/Oct/19 ]

Shilong,

No, it should not make any difference.  That's why it's sort of hard to see overstriping as being involved other than by creating more activity.  Overstriping is just ... > 1 stripe per OST.  Almost all of Lustre doesn't really care where a stripe is, and no layers of Lustre do anything different if two stripes are on the same OST.  (Or, they shouldn't.)

This is why overstriping was so easy to implement, the vast majority of Lustre code is totally unaffected by it, LDLM included.  (Overstriping took a while because of non-core things, such as issues with extended attribute size for very large layouts, but the core is extremely simple.)

With page aligned testing, you're also going to get far fewer imperative lock cancellations.  The same thing with small lru_size limits - you're not going to pile up locks in the same way, they're going to be cleaned up as you go, which makes a 'cleaner' lock map, so locks can be larger (because there are not as many old locks in the way), so there are fewer total locks.  (You can see this by looking at the ldlm_cbd stats and seeing how many total cancellations there are on a client - It should be much lower with either lru_size limits or page aligned i/o.)
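
For example, something along these lines could be used to compare runs (a sketch; the exact stats parameter path and the lru_size value are assumptions):

# total lock callbacks/cancellations seen by the client callback service
lctl get_param ldlm.services.ldlm_cbd.stats
# cap the client lock LRU, as described above (800 is an arbitrary example)
lctl set_param ldlm.namespaces.*.lru_size=800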

Comment by Patrick Farrell [ 31/Oct/19 ]

The thing which really bothers me is that what is happening is that part of a write (always part of a single write) is not reaching disk (we know this because we see data from previous runs there).  It's not possible for a single write to use multiple LDLM locks unless it crosses a stripe boundary (and the missing writes are not aligned on a stripe boundary, so that is not the case here) - at the OSC/LDLM level (so, after a write has been split between stripes), a write must fit entirely inside a single LDLM lock.  If it does not, the match fails, and a new LDLM lock is requested.  (Even if a client has two adjacent LDLM locks on that stripe which together cover the full region, they will not be used.  A new one is requested, which will cancel the existing ones.)

This means that it's not usually possible for lock cancellation to push out part of a write (except if the write spans multiple stripes, and, again, our bad data isn't aligned with a stripe boundary), because a write is always under a single lock.  (The only possible case is where part of the data has already been written out due to normal writeback, so only part of the write is still in cache, but that's a fairly unlikely case.  It's not impossible, but it seems unlikely.)

So...  that makes it really confusing that we're seeing these odd data sizes.

It makes me wonder if maybe, just maybe, there's a hardware problem?  But again, the data ranges for the missing data are so strange.  And I'm just blaming hardware because I have no ideas, which isn't great.

We really need to try to get some debug data if possible.  This doesn't match up with any obvious boundaries or patterns I've seen before, so...  I guess basically I don't think I can get anywhere by just thinking about it

Comment by Patrick Farrell [ 31/Oct/19 ]

If doing the periodic debug dumps doesn't work well or doesn't seem practical, you could, for example, instrument a single dlmtrace call by moving it to a different debug setting (i.e. create a new debug type and switch one of the DLMTRACE prints to it).

For example, this one in ldlm_handle_bl_callback might be a decent choice:
LDLM_DEBUG(lock, "client blocking callback handler END");

And, maybe also this in osc_build_rpc?
DEBUG_REQ(D_INODE, req, "%d pages, aa %p, now %ur/%uw in flight",
          page_count, aa, cli->cl_r_in_flight,
          cli->cl_w_in_flight);

That should get a much smaller amount of debug data.

Comment by Gerrit Updater [ 01/Nov/19 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/36638
Subject: LU-12786 lustre: debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 706fd845753e2a573ab104623cda05324bc0727a

Comment by Wang Shilong (Inactive) [ 01/Nov/19 ]

Thanks paf0186 for sharing your thoughts here.

Yup, agreed that enabling some debugging could help us proceed, so I pushed a debug patch that collects what we want here while also limiting the output.

Since we have run out of the 32 debug bits, I reused the snapshot debug flag, as our testing won't trigger snapshot.
Just enable it with +snapshot, run the test, and try to collect some data.
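
Roughly, on each client (a sketch; the debug_mb value is just an example, adjust to available memory):

lctl set_param debug=+snapshot
lctl set_param debug_mb=2048
# ... run the IOR reproducer ...
lctl dk /var/tmp/lustre-dk.$(hostname).$(date +%s)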

sihara would you help collect that?

Comment by Patrick Farrell [ 01/Nov/19 ]

Thanks Shilong!

Ihara, you may still need to increase the debug mb, even with just those messages.  There will be a lot of lock cancels and a lot of RPCs sent.

Comment by Shuichi Ihara [ 13/Nov/19 ]

We have been testing the same workload on the same hardware and software stack at a couple of different sites, but haven't reproduced the problem at this point. The original issue might have been a hardware problem.

 
