[LU-6389] read()/write() returning less than available bytes intermittently Created: 19/Mar/15  Updated: 19/Jun/20  Resolved: 18/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2, Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Bill Barth (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

CentOS 6.5 2.6.32-431.17.1.el6.x86_64. 2.5.2 client. 2.5.3 server.


Attachments: Text File lustre_bbarth.log.bz2     Text File lustre_bbarth.log.bz2     File short_io_bug.tar.gz    
Issue Links:
Duplicate
is duplicated by LU-6392 short read/write with stripe count > 1 Resolved
Related
is related to LU-6392 short read/write with stripe count > 1 Resolved
is related to LU-6545 MPIIO short reads Resolved
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

Since March 10, 2015, we have been tracking an increasing number of user reports of intermittent I/O problems with our largest Lustre filesystem on Stampede (SCRATCH). This is affecting dozens of users across multiple jobs per user. The problem was first detected in Fortran programs and reduced to a 10-line reproducer (test_break.f); we have now also generated a C reproducer (rwb.c) that does not depend on a specific Fortran runtime library. This case was designed to mimic the underlying libc calls that the Fortran case was making, without interference from the runtime library. The attached case fails with either icc or gcc on our system.

The basic case involves a long sequence of ~4MB read() or write() calls which should eventually read or write all of a large file. Intermittently, but reproducibly, one of these calls will come back short before getting to the last block of the file. For example, a 4MB read may return only 2.5MB somewhere in the middle of the file. The number of bytes read on the short call and the position in the sequence are apparently random. This issue does not occur if the file has only 1 stripe, but does consistently occur with 2 stripes or more. The problem does not occur on either of our other Lustre filesystems on Stampede, and nothing appears to have changed that is correlated in time with the start of the problems.

The short read/write does not report an error when running the C code, and subsequent reads continue as normal. Writing behaves identically. Some codes, including the Intel Fortran runtime, do not tolerate short reads (though they potentially could), and those codes abort (including the attached one). No codes that I know of are designed to tolerate shorter-than-requested writes in general. We can find no client or server error messages associated with these short read/write events.
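
For illustration only, the failing pattern is roughly the following minimal sketch (hypothetical, simplified code; the actual reproducers test_break.f and rwb.c are attached to this ticket):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    char *buf = malloc(CHUNK);
    off_t off = 0;
    ssize_t n;
    int fd;

    if (argc < 2 || buf == NULL || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    while ((n = read(fd, buf, CHUNK)) > 0) {
        /* A short read is expected only for the final chunk of the file;
         * the bug reported here is a short count in the middle of it. */
        if (n < (ssize_t)CHUNK)
            fprintf(stderr, "short read: %zd of %d bytes at offset %lld\n",
                    n, CHUNK, (long long)off);
        off += n;
    }
    close(fd);
    free(buf);
    return 0;
}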

We would be happy to provide access to Stampede for testing and verification.



 Comments   
Comment by Peter Jones [ 19/Mar/15 ]

Bobijam

Could you please advise on this issue?

Thanks

Peter

Comment by Gerrit Updater [ 20/Mar/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14123
Subject: LU-6389 llite: restart short read/write IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 44fd6ceb9810d1ce6ce15256732e42aea11ddf1c

Comment by Zhenyu Xu [ 20/Mar/15 ]

I cannot reproduce the short read, and could only reproduce the short write when one OST is full. This patch will report the write error.

Can you give it a try?

Comment by Aurelien Degremont (Inactive) [ 20/Mar/15 ]

Hello

At CEA we are facing the same issue and have been working on it for a few days.
We can reproduce short reads or writes on various Lustre filesystems simply by using IOR as a reproducer.

So far, I've tracked this down to can_populate_pages() and a layout lock cancel. We can easily see in the Lustre debug logs that the layout lock is being canceled during I/O, which results in a short write. The file layout has not changed, and when the lock is taken back, layout_gen is still '0'. The file is accessed by only 1 node and 1 IOR task, and has just been created.
In the log, the layout lock has no more readers or writers and so could be canceled. I'm trying to understand why the lock is being cancelled locally so early.

Aurélien

Comment by Bruno Travouillon (Inactive) [ 20/Mar/15 ]

Hi bobijam,

We are currently working on the same kind of issue (short read/write). I am trying to grab relevant information to open a JIRA.

When looking at your patch in lustre/llite/file.c, I don't understand why you changed the behaviour. Is it for testing purposes? Obviously this patch will solve our issue, but it is a U-turn from "return short read/write" to "restart IO for short read/write".

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

Thanks, Bobijam. We can't try this patch against a filesystem that's in production operations. We will try to reproduce this on our test system, but I have my doubts. At best we might be able to patch on 3/31/15 when we have our next maintenance. Meanwhile, I'm interested in anyone's opinions about how this could have started with no changes in the last several months.

Bill.

Comment by Zhenyu Xu [ 20/Mar/15 ]

Bill,

Can you grab -1 logs when you reproduce it on the test system?

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

debug log

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

I have a -1 log from a production client node during a run that exhibits the problem. The file, bzip2'ed, is attached.

Comment by Andreas Dilger [ 20/Mar/15 ]

Bobijam, I agree with Bruno that it would be better to track down the root cause why the IO is being interrupted rather than just retrying at the top level.

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

FYI, I ran this on our smaller, unloaded test rig running identical client and server versions and could not reproduce it.

Comment by Zhenyu Xu [ 20/Mar/15 ]

Bill, I couldn't find the error in the log; I think it does not cover the short read section.

$ grep ll_file_io_generic lustre_bbarth.log.txt
00000080:00000001:15.0:1426859590.895686:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859590.895760:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.062183:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.062260:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.229229:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.229306:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.396594:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.396667:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.563333:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.563417:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.730218:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.730295:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.896834:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.896904:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.063883:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.063961:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.233223:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.233301:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.402279:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.402355:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.571321:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.571398:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.741125:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.741202:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859598.853529:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)

All of them return the expected length (4194308).

Comment by Jinshan Xiong (Inactive) [ 20/Mar/15 ]

Hi Aurelien,

This is because the layout lock can share one DLM lock with other inodebits locks at open. The client may lose the layout lock due to this false sharing.

Jinshan

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

I have uploaded a second log file. I don't know how to produce text from the binary dump, but grepping the binary for that message only shows 4MB prints. Is it possible that the debug message is printing the size it wanted instead of the size it actually returned to the read()? I can guarantee that I got the short read on this run.

Bill.

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

second attempt at a log file

Comment by Zhenyu Xu [ 20/Mar/15 ]

Don't try this patch; it still has issues.

Comment by Bill Barth (Inactive) [ 20/Mar/15 ]

Don't worry, we didn't try it yet. I can't reproduce this in our test environment yet. Only in production.

Comment by Aurelien Degremont (Inactive) [ 21/Mar/15 ]

For your information, I think this issue is really important because some libraries like netCDF (which is widely used) do stupid things like this:
https://github.com/Unidata/netcdf-c/blob/master/libsrc/posixio.c#L318

If it faces a short read, it zeroes the entire unread part of the buffer instead of doing another read, then continues and does not consider this an error.
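
The pattern in question is roughly as follows (a paraphrased sketch for illustration, not the actual netCDF code; see the linked posixio.c):

#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Paraphrased sketch of the problematic pattern (not the actual netCDF
 * code): a short read is silently zero-filled instead of being retried
 * or reported as an error. */
static int fragile_read(int fd, void *buf, size_t count)
{
    ssize_t nread = read(fd, buf, count);

    if (nread < 0)
        return errno;                /* real errors are reported */
    if ((size_t)nread < count)       /* short read: quietly pad with zeroes */
        memset((char *)buf + nread, 0, count - nread);
    return 0;
}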

Some of our netCDF users hit this bug with short reads.
We can easily reproduce short writes on various clusters and different Lustre filesystems with a simple IOR (one thread, POSIX IO) and even dd! The I/O does not need to be unaligned (IOR, like dd, does aligned writes), but the I/O size should be greater than the stripe size, so each write touches at least 2 stripes.

We do not understand why some filesystems are impacted and others are not.
@Bill: This problem appeared suddenly for us too. It was first detected on our cluster recently upgraded to Lustre 2.5.3, but in fact this bug impacts all of our clusters, and they are all now on 2.5.3. We downgraded some clients to 2.4 and the issue is still there (the layout lock had already landed in 2.4). Servers are still on 2.5.3 though.

So far, the smallest filesystem where we can reproduce this is one with 120 OSTs and quite some load. However, some bigger filesystems are not impacted... for unknown reasons.

Comment by Aurelien Degremont (Inactive) [ 21/Mar/15 ]

Hi Jinshan!

To reproduce this bug, it is mostly a timing issue AFAIK. The layout lock has to be canceled by the ldlm_bl thread during the I/O, exactly between 2 calls to vvp_io_write_start(). If the layout lock is dropped between 2 separate writes, it will be enqueued again before the 2nd write and everything is OK. If your I/O does not cover several stripes, it is also fine.

Comment by Bill Barth (Inactive) [ 21/Mar/15 ]

We have also experienced problems with NetCDF-based codes, but I don't have a compact reproducer yet. In our cases, we're seeing large writes from WRF hitting asserts in posixio.c:

assert(*posp == OFF_NONE || *posp == lseek(nciop->fd, 0, SEEK_CUR));

I haven't been able to prove that this follows a short read or write, but I have had complaints of files coming out shorter than expected.

All of this also followed an update to 2.5.3 on our servers, but there are at least 2 months between our update and the first known case. I would say that activity on the filesystems has been much higher this month, but not in a way that substantially increases the load reported by Linux, etc.

Comment by Bill Barth (Inactive) [ 21/Mar/15 ]

Also, for the record, we have not been able to reproduce this problem with the attached Fortran code using gfortran. strace'ing the gfortran version shows, basically, one giant read() call for the whole Fortran record rather than a long sequence of 4MB reads. This may or may not point in the right direction, but block-based reads of sufficient size seem to be required. Also, changing read() to fread() in the attached C program makes the problem go away as well in my experience.

Comment by Jinshan Xiong (Inactive) [ 21/Mar/15 ]

The root cause of this problem is clear and Bobijam is working on a fix, please stay tuned.

Comment by Andreas Dilger [ 22/Mar/15 ]

Aurelien, have you considered filing a bug with the upstream NetCDF to handle short read/write? I don't think it is a replacement for a solution to this bug, but it does seem like a bug in that code that should be fixed.

Comment by Zhenyu Xu [ 23/Mar/15 ]

As Jinshan pointed out, the layout lock could be lost, and that would cause a short read/write. Per "man 2 read": "The system guarantees to read the number of bytes requested if the descriptor references a normal file that has that many bytes left before the end-of-file, but in no other case." So read() should not return less than the available bytes, and I think we need to retry the short read/write.

Comment by Aurelien Degremont (Inactive) [ 23/Mar/15 ]

It seems this note comes from the OSX man page. There is nothing like that in the Linux man page.
I've looked at POSIX.1 quickly and there is nothing like this there either.

I think Lustre should do its best to avoid short reads, but user codes should not assume short reads will never happen. Both should be fixed.

There is nothing like this for write, even in the OSX man page.

Comment by Zhenyu Xu [ 23/Mar/15 ]

I've updated #14123 to retry short reads/writes, in case user code does not cope with them.

Comment by Bill Barth (Inactive) [ 23/Mar/15 ]

Aurelien, I take the following from POSIX to be the exhaustive list of reasons a short read may occur:

The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.

Since this is a regular file, only the first two may occur. Do you know of another interpretation, or am I maybe looking in the wrong place?

Comment by Aurelien Degremont (Inactive) [ 23/Mar/15 ]

I think this is correct. That means:

  • Users can face a short read, mostly due to a signal interrupt if I/O has already started. In this case, they should do another read. If they support this, it also handles the current Lustre behaviour nicely. A defensive read loop along these lines is sketched after this list.
  • In theory, Lustre should return a short read only at EOF or on a signal, so the current behavior should be improved.
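
For illustration, a defensive read loop that copes with short reads and EINTR could look like this sketch (illustration only; not code from the attached reproducers or from Lustre):

#include <errno.h>
#include <unistd.h>

/* Illustrative sketch: keep calling read() until the request is
 * satisfied, EOF is reached, or a real error occurs. */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: just retry */
            return -1;               /* real error */
        }
        if (n == 0)
            break;                   /* EOF */
        done += n;
    }
    return (ssize_t)done;
}

This is essentially what stdio does internally, which would explain why switching from read() to fread() made the problem disappear in Bill's test.
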
Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ]

Hi Aurelien,

In any case, when a read or write encounters an error, such as ENOMEM or EIO, it should return the number of bytes it has already transferred to the application. A robust program must handle short reads and writes. This is just FYI; we'll make a fix for Lustre.

Comment by jacques-charles lafoucriere [ 23/Mar/15 ]

Hi Jinshan

On error, read/write return -1, not the size moved. The only case of partial read/write that developers manage is network file descriptors over protocols like UDP.
Even if it is allowed by POSIX, it is not usual for storage, so maybe we should add a mount option to allow partial read/write (if anyone finds an interest / use case).

Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ]

It returns -1 only if no bytes have been read or written yet.

Comment by Patrick Farrell (Inactive) [ 23/Mar/15 ]

jcf,

Should Lustre then also reset the file pointer to the point it was at before the partial read or write? What about the contents of the file or the buffer? I feel like returning -1 in this case misrepresents what has happened: some data has already been read or written, so the contents of your buffer or the disk have changed. Lustre cannot undo that, and it's an important difference from simply failing to read or write any data.

I think Jinshan is right and we should follow POSIX semantics as he described.
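
To make that distinction concrete, a write loop following the semantics described above might look like this sketch (illustration only, with hypothetical names):

#include <errno.h>
#include <unistd.h>

/* Illustrative sketch: partial progress is reported back to the caller;
 * -1 is returned only when nothing was written at all. */
static ssize_t write_full(int fd, const void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;                      /* interrupted: just retry */
            return done ? (ssize_t)done : -1;  /* report partial progress */
        }
        done += n;
    }
    return (ssize_t)done;
}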

Comment by Stephane Thiell [ 23/Mar/15 ]

Some literature, like Robert Love’s Linux System Programming[1], says that “for regular files, write() is guaranteed to perform the entire requested write, unless an error occurs”.

[1] https://books.google.fr/books?id=K1vXEb1SgawC&lpg=PA37&ots=fdB0D4uUUC&pg=PA37#v=onepage&q&f=false

Comment by Patrick Farrell (Inactive) [ 24/Mar/15 ]

Yes, that's correct - The point is that an error sometimes occurs, and in that case, the number of bytes successfully read or written is returned.

Note that the currently proposed patch will stop Lustre from returning short reads or writes except in the case of interruption/error.

Comment by jacques-charles lafoucriere [ 24/Mar/15 ]

I agree we always have to follow POSIX (I misunderstood Jinshan's use case). My suggestion is to add a mount option to allow partial R/W in case we find some interest in such behavior (e.g. if the patch introduces some performance decrease). The idea is to make it tunable, like is already done for POSIX locking support.

Comment by Bill Barth (Inactive) [ 24/Mar/15 ]

Just FYI, we tried to apply this patch to our 2.5.2 client source in order to test, and it didn't take. Once y'all are happy that it is correct, we're going to need a 2.5.2 applicable version as well.

Comment by Christopher Morrone [ 24/Mar/15 ]

Livermore has hit this problem in production as well. The HPSS movers are hitting this.

This problem crops up every few years, and each time we repeat the same debate about whether or not POSIX allows short reads. Yes, it may technically allow them, but for filesystems no applications expect them.

Our design choice in Lustre has been (for probably well over a decade) that Lustre must not return short reads or writes, except in the case of a fatal error. For fatal errors, all further IO to the file will fail as well, so it should be fairly obvious to the application that something has gone wrong.

If that design choice is not recorded anywhere, it would be very good for someone to write it down this time.

We should be treating the issue in this ticket as a regression.

Comment by Gerrit Updater [ 24/Mar/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14160
Subject: LU-6389 llite: restart short read/write for normal IO
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 08085c551a2202594762ad999d511511be2f1c70

Comment by Jinshan Xiong (Inactive) [ 24/Mar/15 ]

Hi Bill,

Just in case you didn't notice that, Bobijam has backported the patch to b2_5 at http://review.whamcloud.com/14160.

Comment by Bill Barth (Inactive) [ 24/Mar/15 ]

Thanks. We're looking at it.

Comment by Bill Barth (Inactive) [ 24/Mar/15 ]

Testing with the 2.5 patch seems to be going fine. I'll let it run all night, but no failures so far.

Two questions: How do y'all feel about this patch (i.e. are we safe to push it to production on 3/31)? And, does anyone have a clue why this might have become a problem for us recently instead of immediately after a client or server change?

Thanks,
Bill.

Comment by Jinshan Xiong (Inactive) [ 25/Mar/15 ]

Hi Bill,

It should be pretty safe to apply, especially since you will have run it on the test system for several days by 3/31.

Have you upgraded the MDT node or started a new job recently?

Comment by Bill Barth (Inactive) [ 25/Mar/15 ]

We have not updated the MDT recently.

We have thousands of new jobs start every day, so I'm not sure what you mean by a new job.

Comment by Jay Lan (Inactive) [ 25/Mar/15 ]

NASA Ames hit the same problem in production.

Can I take it that the patch would do what Christopher Morrone said: "Our design choice in Lustre has been (for probably well over a decade), that Lustre must not return short reads or writes, except in the cases of a fatal error"?

Comment by Gerrit Updater [ 26/Mar/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14190
Subject: LU-6389 llite: restart short read/write for normal IO
Project: fs/lustre-release
Branch: b2_4
Current Patch Set: 1
Commit: 7f12166c9fedd6d8aba3e59042142935d285d70e

Comment by Zhenyu Xu [ 26/Mar/15 ]

Yes, the patch would restart the IO from the point it has already completed and finish it, until EOF or an error is encountered.

Comment by Bill Barth (Inactive) [ 27/Mar/15 ]

I was slightly wrong. We applied patches from LU-5062 and LU-5726 on Feb 10, 2015. Those are the only changes in our setup recently. Do you think they're related to this issue?

Also, we have been testing patchset 4 from the most recent 2.5 patch with good success. Our current plan is to deploy to production on 3/31/15.

Comment by Patrick Farrell (Inactive) [ 31/Mar/15 ]

Stepping back, is there any intent to address the underlying layout lock revocation? Is that even addressable, or is it a permanent part of the design? I'm curious to try to better understand the underlying cause.

Aurelien - I don't think I understand this comment:
"Layout lock should be canceled by ldlm_bl thread during I/O, exactly between 2 calls to vvp_io_write_start(). If LL is dropped between 2 writes, it will be enqueued again before doing the 2nd write and it will be OK. If your I/O does not cover several stripes, it is also fine. "

What's the actual race condition here? You say if it is dropped between the two writes, it is re-enqueued and all is well. So when exactly does it need to be dropped for this to be a problem? (And shouldn't the code that needs it ensure the lock is taken, rather than return?)

Comment by Jinshan Xiong (Inactive) [ 13/Apr/15 ]

The layout lock can be lost in any of the following situations: false sharing, LRU, or memory pressure; and the tricky thing is that the client doesn't know whether the layout is still the same after it re-enqueues the lock and gets the new layout.

Comment by Gerrit Updater [ 17/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14123/
Subject: LU-6389 llite: restart short read/write for normal IO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8badb39913b5e1c614d2fe410ef7200391099855

Comment by Peter Jones [ 18/May/15 ]

Landed for 2.8

Comment by Gerrit Updater [ 03/Feb/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/18275
Subject: LU-6389 utils: fix lustre_rsync read retry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b0c49e274f75afc4a4e2fcacfc7df9bbf88d5487

Comment by Gerrit Updater [ 20/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18275/
Subject: LU-6389 utils: fix lustre_rsync read retry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7d165f5fe357010c3b41abf1163aacb09a88816f
