[LU-6389] read()/write() returning less than available bytes intermittently - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.5.2, Lustre 2.5.3
Labels:
- llnl
Environment:
CentOS 6.5 2.6.32-431.17.1.el6.x86_64. 2.5.2 client. 2.5.3 server.

Severity:
4
Rank (Obsolete):
9223372036854775807

Description

Since March 10, 2015, we have been tracking an increasing number of user reports of intermittent I/O problems with our largest Lustre filesystem on Stampede (SCRATCH). This is affecting dozens of users on multiple jobs per user. First detected in Fortran programs and reduced to a 10-line reproducer (test_break.f), we have now also generated a C reproducer (rwb.c) that does not depend on a specific Fortran runtime library. This case was designed to mimic the underlying libc calls that the Fortran case was making without the interference from the runtime library. The attached case fails with either icc or gcc on our system.

The basic case involves a long sequence of ~4MB read() or write() calls which eventually should read or write all of a large file. Intermittently, but reproducibly, one of these calls will come back short before getting to the last block of the file. For example, a 4MB read may only read 2.5MB somewhere in the middle of the file. The number of bytes read on the short call and the position in the sequence are apparently random. This issue does not occur if the file has only 1 stripe, but does consistently occur with 2 stripes or more. The problem does not occur on either of our other Lustre filesystems on Stampede, and nothing appears to have changed that is correlated in time with the start of the problems.
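The check described above can be sketched in C as follows. This is an illustrative sketch only, not the attached rwb.c: the function name `count_short_reads` and its interface are assumptions made for the example. It reads a file in fixed-size chunks from the current offset and counts every read() that returns fewer bytes than requested before the end of the file is reached.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the whole file behind fd (assumed to be
 * positioned at offset 0) in `chunk`-byte reads. Returns the number of
 * short reads seen before the final block, or -1 on error.
 */
long count_short_reads(int fd, size_t chunk)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;

    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    off_t pos = 0;
    long short_reads = 0;
    while (pos < st.st_size) {
        ssize_t n = read(fd, buf, chunk);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data: retry */
            free(buf);
            return -1;
        }
        if (n == 0)
            break;                 /* unexpected EOF: stop scanning */
        pos += n;
        /* A short count is only suspicious if more data remains. */
        if ((size_t)n < chunk && pos < st.st_size)
            short_reads++;
    }
    free(buf);
    return short_reads;
}
```

On a filesystem that never returns short reads for regular files, the result should be 0; a non-zero count on a multi-stripe file would match the symptom reported here.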

The short read/write does not report an error when running the C code, and subsequent reads continue as normal. Writing behaves identically. Some codes, including the Intel Fortran runtime, do not tolerate short reads (though they potentially could), and the codes abort (including the attached one). No codes that I know of are designed to tolerate shorter-than-requested writes generally. We can find no client or server error messages associated with these short read/write events.
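For reference, an application that did want to tolerate short counts could wrap the calls in a retry loop. The sketch below is a user-space illustration under that assumption; `read_full` and `write_full` are hypothetical names, and this is not the Lustre fix itself, which restarts short I/O inside the client.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling read() until `count` bytes arrive, EOF, or an error.
 * Returns bytes read (possibly fewer than count at EOF or after a
 * mid-transfer error), or -1 if an error occurred before any data. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;                  /* retry interrupted call */
            return done > 0 ? (ssize_t)done : -1;
        }
        if (n == 0)
            break;                         /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Same idea for write(): resume after each short count. */
ssize_t write_full(int fd, const void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = write(fd, (const char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return done > 0 ? (ssize_t)done : -1;
        }
        if (n == 0)
            break;                         /* no progress: avoid spinning */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With wrappers like these, a short count in the middle of a file is absorbed by the loop, and only end of file or a real error produces a result smaller than the request.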

We would be happy to provide access to Stampede for testing and verification.

Attachments


lustre_bbarth.log.bz2
0.2 kB
20/Mar/15 8:12 PM
lustre_bbarth.log.bz2
0.2 kB
20/Mar/15 1:59 PM
short_io_bug.tar.gz
2 kB
19/Mar/15 11:22 PM

Issue Links

is duplicated by

LU-6392 short read/write with stripe count > 1

Resolved

is related to

LU-6392 short read/write with stripe count > 1

Resolved

LU-6545 MPIIO short reads

Resolved


Activity


Bill Barth (Inactive) added a comment - 24/Mar/15 6:59 PM

Thanks. We're looking at it.


Jinshan Xiong (Inactive) added a comment - 24/Mar/15 6:12 PM

Hi Bill,

Just in case you didn't notice that, Bobijam has backported the patch to b2_5 at http://review.whamcloud.com/14160.


Gerrit Updater added a comment - 24/Mar/15 5:00 PM

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14160
Subject: LU-6389 llite: restart short read/write for normal IO
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 08085c551a2202594762ad999d511511be2f1c70


Christopher Morrone (Inactive) added a comment - 24/Mar/15 4:56 PM - edited

Livermore has hit this problem in production as well. The HPSS movers are hitting this.

This problem crops up every few years, and each time we repeat the same debate about whether or not POSIX allows short reads. Yes, it may technically allow it, but for filesystems no applications expect it.

Our design choice in Lustre has been (for probably well over a decade) that Lustre must not return short reads or writes, except in the case of a fatal error. For fatal errors, all further IO to the file will fail as well, so it should be fairly obvious to the application that something has gone wrong.

If that design choice is not recorded anywhere, it would be very good for someone to write it down this time.

We should be treating the issue in this ticket as a regression.


Bill Barth (Inactive) added a comment - 24/Mar/15 4:32 PM

Just FYI, we tried to apply this patch to our 2.5.2 client source in order to test, and it didn't take. Once y'all are happy that it is correct, we're going to need a 2.5.2 applicable version as well.


jacques-charles lafoucriere added a comment - 24/Mar/15 11:01 AM

I agree we have to always follow POSIX (I misunderstood Jinshan's use case). My suggestion is to add a mount option to allow partial R/W in case we find some interest in such behavior (e.g. if the patch introduces some performance decrease). The idea is to make it tunable, as is already done for POSIX locking support.


Patrick Farrell (Inactive) added a comment - 24/Mar/15 2:03 AM

Yes, that's correct - The point is that an error sometimes occurs, and in that case, the number of bytes successfully read or written is returned.

Note that the patch proposed currently will stop Lustre from returning short reads or writes except in case of interruption/error.


Stephane Thiell added a comment - 23/Mar/15 9:56 PM

Some literature, like Robert Love’s Linux System Programming[1], says that “for regular files, write() is guaranteed to perform the entire requested write, unless an error occurs”.

[1] https://books.google.fr/books?id=K1vXEb1SgawC&lpg=PA37&ots=fdB0D4uUUC&pg=PA37#v=onepage&q&f=false


Patrick Farrell (Inactive) added a comment - 23/Mar/15 5:48 PM

jcf,

Should Lustre then also reset the file pointer to the point it was at before the partial read or write? What about the contents of the file or the buffer? I feel like returning -1 in this case is misrepresenting what has happened: some data has already been read or written, so the contents of your buffer or disk have changed. Lustre cannot undo that, and it's an important difference from simply failing to read or write any data.

I think Jinshan is right and we should follow POSIX semantics as he described.


Jinshan Xiong (Inactive) added a comment - 23/Mar/15 5:33 PM

It returns -1 only if no bytes have been read or written yet.
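From the caller's side, the return-value semantics discussed in this thread can be summarized with a small classifier. This is an illustrative sketch only; the enum and the function name `classify_io` are hypothetical and come from neither Lustre nor libc.

```c
#include <stddef.h>
#include <sys/types.h>

enum io_outcome {
    IO_COMPLETE,  /* all requested bytes were transferred */
    IO_SHORT,     /* some, but not all, bytes were transferred */
    IO_EOF,       /* read() returned 0: nothing left to read */
    IO_ERROR      /* -1: failed before any byte moved, see errno */
};

/* Interpret the return value of a read() or write() of `requested` bytes. */
enum io_outcome classify_io(ssize_t ret, size_t requested)
{
    if (ret < 0)
        return IO_ERROR;
    if (ret == 0 && requested > 0)
        return IO_EOF;
    if ((size_t)ret < requested)
        return IO_SHORT;
    return IO_COMPLETE;
}
```

An error that strikes after part of the data has moved surfaces as `IO_SHORT`, not `IO_ERROR`; the caller typically only sees the failure on the next call.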


jacques-charles lafoucriere added a comment - 23/Mar/15 5:08 PM

Hi Jinshan

On error, read/write return -1, not the size moved. The only case of partial read/write that developers handle is for network file descriptors over protocols like UDP.
Even if it is allowed by POSIX, it is not usual for storage, so maybe we should add a mount option to allow partial read/write (if anyone finds an interest / use case).


People

Assignee: Zhenyu Xu

Reporter: Bill Barth (Inactive)

Votes: 0

Watchers: 30

Dates

Created: 19/Mar/15 11:22 PM

Updated: 19/Jun/20 2:52 PM

Resolved: 18/May/15 2:23 PM