[LU-6389] read()/write() returning less than available bytes intermittently Created: 19/Mar/15 Updated: 19/Jun/20 Resolved: 18/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.2, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Bill Barth (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: |
CentOS 6.5 2.6.32-431.17.1.el6.x86_64. 2.5.2 client. 2.5.3 server. |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since March 10, 2015, we have been tracking an increasing number of user reports of intermittent I/O problems with our largest Lustre filesystem on Stampede (SCRATCH). This is affecting dozens of users, with multiple jobs per user. First detected in Fortran programs and reduced to a 10-line reproducer (test_break.f), we have now also generated a C reproducer (rwb.c) that does not depend on a specific Fortran runtime library. This case was designed to mimic the underlying libc calls that the Fortran case was making, without interference from the runtime library. The attached case fails with either icc or gcc on our system.

The basic case involves a long sequence of ~4MB read() or write() calls which eventually should read or write all of a large file. Intermittently, but reproducibly, one of these calls will come back short before getting to the last block of the file, i.e. a 4MB read may only read 2.5MB somewhere in the middle of the file. The number of bytes read on the short call and the position in the sequence are apparently random. This issue does not occur if the file has only 1 stripe, but does consistently occur with 2 stripes or more. The problem does not occur on either of our other Lustre filesystems on Stampede, and nothing appears to have changed that is correlated in time with the start of the problems.

The short read/write does not report an error when running the C code, and subsequent reads continue as normal. Writing behaves identically. Some codes, including the Intel Fortran runtime, do not tolerate short reads (though they potentially could), and those codes abort (including the attached one). No codes that I know of are designed to tolerate shorter-than-requested writes in general. We can find no client or server error messages associated with these short read/write events. We would be happy to provide access to Stampede for testing and verification. |
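For reference, a minimal sketch of the failing access pattern described above — this is not the attached rwb.c; the file name and the reporting are assumptions for illustration. It reads a large file in fixed 4 MiB blocks and flags any short count; on a POSIX regular file, only the final block should ever come back short.

```c
/* Sketch of the access pattern described above (not the attached rwb.c).
 * Reads a file in 4 MiB blocks; on a regular file only the last block
 * should ever be short, so any other short count reproduces the problem. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (4 * 1024 * 1024)

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "bigfile.dat"; /* assumed name */
	char *buf = malloc(BLOCK);
	int fd = open(path, O_RDONLY);
	off_t off = 0;
	ssize_t n;

	if (fd < 0 || buf == NULL) {
		perror("setup");
		return 1;
	}

	while ((n = read(fd, buf, BLOCK)) > 0) {
		if (n < BLOCK)
			fprintf(stderr,
				"short read: %zd of %d bytes at offset %lld"
				" (expected only at end-of-file)\n",
				n, BLOCK, (long long)off);
		off += n;
	}
	if (n < 0)
		perror("read");

	free(buf);
	close(fd);
	return 0;
}
```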
| Comments |
| Comment by Peter Jones [ 19/Mar/15 ] |
|
Bobijam, could you please advise on this issue? Thanks, Peter |
| Comment by Gerrit Updater [ 20/Mar/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14123 |
| Comment by Zhenyu Xu [ 20/Mar/15 ] |
|
I cannot reproduce it for read, and could only reproduce the short write when one OST is full. This patch will report the write error. Can you give it a try? |
| Comment by Aurelien Degremont (Inactive) [ 20/Mar/15 ] |
|
Hello, At CEA we are facing the same issue and have been working on it for a few days. So far, I've tracked this down to can_populate_pages() and a layout lock cancel. We can easily see in the Lustre debug logs that the layout lock is being canceled during I/O, which results in a short write. The file layout has not changed, and when the lock is taken back, layout_gen is still '0'. The file is accessed by only 1 node and 1 IOR task, and has just been created. Aurélien |
| Comment by Bruno Travouillon (Inactive) [ 20/Mar/15 ] |
|
Hi bobijam, We are currently working on the same kind of issue (short read/write). I am trying to gather relevant information to open a JIRA. Looking at your patch in lustre/llite/file.c, I don't understand why you changed the behaviour. Is it for testing purposes? Obviously this patch will solve our issue, but it is a U-turn from "return short read/write" to "restart I/O for short read/write". |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
Thanks, Bobijam. We can't try this patch against a filesystem that's in production operations. We will try to reproduce this on our test system, but I have my doubts. At best we might be able to patch on 3/31/15 when we have our next maintenance. Meanwhile, I'm interested in anyone's opinions about how this could have started with no changes in the last several months. Bill. |
| Comment by Zhenyu Xu [ 20/Mar/15 ] |
|
Bill, Can you grab -1 logs when you reproduce it on the test system? |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
debug log |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
I have a -1 log from a production client node during a run that exhibits the problem. The file, bzip2'ed, is attached. |
| Comment by Andreas Dilger [ 20/Mar/15 ] |
|
Bobijam, I agree with Bruno that it would be better to track down the root cause why the IO is being interrupted rather than just retrying at the top level. |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
FYI, I ran this on our smaller, unloaded test rig running identical client and server versions and could not reproduce. |
| Comment by Zhenyu Xu [ 20/Mar/15 ] |
|
Bill, I couldn't find the error in the log; I think it does not cover the short read section.

$ grep ll_file_io_generic lustre_bbarth.log.txt
00000080:00000001:15.0:1426859590.895686:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859590.895760:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.062183:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.062260:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.229229:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.229306:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.396594:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.396667:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.563333:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.563417:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.730218:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.730295:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859591.896834:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859591.896904:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.063883:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.063961:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.233223:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.233301:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.402279:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.402355:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.571321:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.571398:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859592.741125:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)
00000080:00000001:15.0:1426859592.741202:0:16140:0:(file.c:1131:ll_file_io_generic()) Process entered
00000080:00000001:15.0:1426859598.853529:0:16140:0:(file.c:1187:ll_file_io_generic()) Process leaving via out (rc=4194308 : 4194308 : 0x400004)

All of them return the expected length (4194308). |
| Comment by Jinshan Xiong (Inactive) [ 20/Mar/15 ] |
|
Hi Aurelien, This is because the layout lock can be shared with other inodebits locks in one DLM lock at open time. The client may lose the layout lock due to this false sharing. Jinshan |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
I have uploaded a second log file. I don't know how to produce text from the binary dump, but grepping the binary for that message only shows 4MB prints. Is it possible that the debug message is printing the size it wanted instead of the size it actually gave to the read()? I can guarantee that I got the short read on this run. Bill. |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
second attempt at a log file |
| Comment by Zhenyu Xu [ 20/Mar/15 ] |
|
Don't try this patch, it still has an issue. |
| Comment by Bill Barth (Inactive) [ 20/Mar/15 ] |
|
Don't worry, we didn't try it yet. I can't reproduce this in our test environment yet. Only in production. |
| Comment by Aurelien Degremont (Inactive) [ 21/Mar/15 ] |
|
For your information, I think this issue is really important because some libraries like netCDF (which is widely used) do unwise things: if netCDF faces a short read, it zeroes the unread part of the buffer instead of doing another read, continues, and does not treat this as an error. Some of our netCDF users hit this bug with short reads. We do not understand why some filesystems are impacted and others are not. So far, the smallest filesystem where we can reproduce this has 120 OSTs and quite some load. However, some bigger filesystems are not impacted, for unknown reasons. |
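For comparison, a hedged sketch of the kind of retry loop a consumer library could use instead of zero-filling the remainder of the buffer. The helper name is made up; this is not netCDF or Lustre code, just an illustration of handling a short read() on a regular file.

```c
/* Hypothetical helper (not netCDF or Lustre code): keep calling read()
 * until the requested count is transferred, end-of-file is reached, or a
 * real error occurs, rather than zero-filling the unread part. */
#include <errno.h>
#include <unistd.h>

ssize_t read_full(int fd, void *buf, size_t count)
{
	size_t done = 0;

	while (done < count) {
		ssize_t n = read(fd, (char *)buf + done, count - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			return done ? (ssize_t)done : -1;
		}
		if (n == 0)
			break;			/* end of file */
		done += n;
	}
	return (ssize_t)done;
}
```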
| Comment by Aurelien Degremont (Inactive) [ 21/Mar/15 ] |
|
Hi Jinshan! To reproduce this bug, it is mostly a timing issue AFAIK. The layout lock has to be canceled by an ldlm_bl thread during I/O, exactly between 2 calls to vvp_io_write_start() within the same write(). If the layout lock is dropped between 2 separate writes, it will be enqueued again before the 2nd write and everything is OK. If your I/O does not cover several stripes, it is also fine. |
| Comment by Bill Barth (Inactive) [ 21/Mar/15 ] |
|
We have also experienced problems with NetCDF-based codes, but I don't have a compact reproducer yet. In our cases, we're seeing large writes from WRF hitting asserts in posixio.c: assert(*posp == OFF_NONE || *posp == lseek(nciop->fd, 0, SEEK_CUR));
I haven't been able to prove that this follows a short read or write, but I have had complaints of files coming out shorter than expected. All of this also followed an update to 2.5.3 on our servers, but there are at least 2 months between our update and the first known case. I would say that activity on the filesystems has been much higher this month, but not in a way that substantially increases the load reported by Linux, etc. |
| Comment by Bill Barth (Inactive) [ 21/Mar/15 ] |
|
Also, for the record, we have not been able to reproduce this problem with the attached Fortran code using gfortran. strace'ing the gfortran version shows, basically, one giant read() call for the whole Fortran record rather than a long sequence of 4MB reads. This may or may not point in the right direction, but block-based reads of sufficient size seem to be required. Also, changing read() to fread() in the attached C program makes the problem go away as well, in my experience. |
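This is consistent with how buffered stdio behaves in common libc implementations: fread() loops over read() internally until it has the requested amount or hits EOF/error, so short counts from the kernel are hidden from the caller. A rough stdio equivalent of the same sequential scan might look like the sketch below (the file name is an assumption, and this is not the attached program).

```c
/* Rough stdio equivalent of the block-based scan.  In common libc
 * implementations fread() retries short read() results internally, which
 * matches the observation that the fread() variant does not fail. */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK (4 * 1024 * 1024)

int main(void)
{
	FILE *fp = fopen("bigfile.dat", "rb");	/* assumed file name */
	char *buf = malloc(BLOCK);
	size_t n;

	if (fp == NULL || buf == NULL)
		return 1;

	/* Full blocks until the (legitimately short) final one. */
	while ((n = fread(buf, 1, BLOCK, fp)) == BLOCK)
		;

	if (ferror(fp))
		perror("fread");

	free(buf);
	fclose(fp);
	return 0;
}
```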
| Comment by Jinshan Xiong (Inactive) [ 21/Mar/15 ] |
|
The root cause of this problem is clear and Bobijam is working on a fix, please stay tuned. |
| Comment by Andreas Dilger [ 22/Mar/15 ] |
|
Aurelien, have you considered filing a bug with the upstream NetCDF to handle short read/write? I don't think it is a replacement for a solution to this bug, but it does seem like a bug in that code that should be fixed. |
| Comment by Zhenyu Xu [ 23/Mar/15 ] |
|
As Jinshan pointed out, the layout lock could be lost and that would cause a short read/write. Based on "man 2 read": "The system guarantees to read the number of bytes requested if the descriptor references a normal file that has that many bytes left before the end-of-file, but in no other case." So read() should not return less than the available bytes, and I think we need to retry short reads/writes. |
| Comment by Aurelien Degremont (Inactive) [ 23/Mar/15 ] |
|
It seems this note comes from the OS X man page. There is nothing like that in the Linux man page. I think Lustre should do its best to avoid short reads, but user code should not assume that a short read will never happen. Both should be fixed. There is nothing like this for write, even in the OS X man page. |
| Comment by Zhenyu Xu [ 23/Mar/15 ] |
|
I've updated #14123 to retry short reads/writes, since user code does not cope with them. |
| Comment by Bill Barth (Inactive) [ 23/Mar/15 ] |
|
Aurelien, I take the following from POSIX to be the exhaustive list of reasons a short read may occur:
Since this is a regular file, only the first two may occur. Do you know of another interpretation, or am I maybe looking in the wrong place? |
| Comment by Aurelien Degremont (Inactive) [ 23/Mar/15 ] |
|
I think this is correct. That means:
|
| Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ] |
|
Hi Aurelien, In any case, when a read or write encounters an error, such as ENOMEM or EIO, it should return to the application the number of bytes it has already transferred. A robust program must handle short reads and writes. This is just FYI. We'll make a fix for Lustre. |
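As a sketch of those semantics on the write side — a hypothetical helper mirroring the read_full sketch earlier in this ticket, not Lustre code — the caller retries from where the previous call stopped and only sees -1 when nothing was transferred before the error.

```c
/* Hypothetical helper mirroring read_full above (not Lustre code): retry a
 * short write() from where it stopped; return -1 only if nothing was
 * written before the error, otherwise return the bytes already done. */
#include <errno.h>
#include <unistd.h>

ssize_t write_full(int fd, const void *buf, size_t count)
{
	size_t done = 0;

	while (done < count) {
		ssize_t n = write(fd, (const char *)buf + done, count - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			return done ? (ssize_t)done : -1;
		}
		done += n;
	}
	return (ssize_t)done;
}
```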
| Comment by jacques-charles lafoucriere [ 23/Mar/15 ] |
|
Hi Jinshan, on error read/write return -1, not the size moved. The only case of partial read/write that developers handle is for network file descriptors over protocols like UDP. |
| Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ] |
|
It returns -1 only if no bytes have been read or written yet. |
| Comment by Patrick Farrell (Inactive) [ 23/Mar/15 ] |
|
jcf, Should Lustre then also reset the file pointer to the point it was at before the partial read or write? What about the contents of the file or the buffer? I feel like returning -1 in this case misrepresents what has happened: some data has already been read or written, so the contents of your buffer or of the disk have changed. Lustre cannot undo that, and it's an important difference from simply failing to read or write any data. I think Jinshan is right and we should follow POSIX semantics as he described. |
| Comment by Stephane Thiell [ 23/Mar/15 ] |
|
Some literature, like Robert Love’s Linux System Programming[1], says that “for regular files, write() is guaranteed to perform the entire requested write, unless an error occurs”. [1] https://books.google.fr/books?id=K1vXEb1SgawC&lpg=PA37&ots=fdB0D4uUUC&pg=PA37#v=onepage&q&f=false |
| Comment by Patrick Farrell (Inactive) [ 24/Mar/15 ] |
|
Yes, that's correct - the point is that an error sometimes occurs, and in that case the number of bytes successfully read or written is returned. Note that the currently proposed patch will stop Lustre from returning short reads or writes except in the case of interruption/error. |
| Comment by jacques-charles lafoucriere [ 24/Mar/15 ] |
|
I agree we always have to follow POSIX (I misunderstood Jinshan's use case). My suggestion is to add a mount option to allow partial R/W in case we find some interest in such behaviour (e.g. if the patch introduces a performance decrease). The idea is to make it tunable, as is already done for POSIX locking support. |
| Comment by Bill Barth (Inactive) [ 24/Mar/15 ] |
|
Just FYI, we tried to apply this patch to our 2.5.2 client source in order to test, and it didn't take. Once y'all are happy that it is correct, we're going to need a 2.5.2 applicable version as well. |
| Comment by Christopher Morrone [ 24/Mar/15 ] |
|
Livermore has hit this problem in production as well. The HPSS movers are hitting this. This problem crops up every few years, and each time we repeat the same debate about whether or not POSIX allows short reads. Yes, it may technically allow them, but for filesystems no applications expect them. Our design choice in Lustre has been (for probably well over a decade) that Lustre must not return short reads or writes, except in the case of a fatal error. For fatal errors, all further IO to the file will fail as well, so it should be fairly obvious to the application that something has gone wrong. If that design choice is not recorded anywhere, it would be very good for someone to write it down this time. We should be treating the issue in this ticket as a regression. |
| Comment by Gerrit Updater [ 24/Mar/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14160 |
| Comment by Jinshan Xiong (Inactive) [ 24/Mar/15 ] |
|
Hi Bill, Just in case you didn't notice that, Bobijam has backported the patch to b2_5 at http://review.whamcloud.com/14160. |
| Comment by Bill Barth (Inactive) [ 24/Mar/15 ] |
|
Thanks. We're looking at it. |
| Comment by Bill Barth (Inactive) [ 24/Mar/15 ] |
|
Testing with the 2.5 patch seems to be going fine. I'll let it run all night, but no failures so far. Two questions: How do y'all feel about this patch (i.e. are we safe to push it to production on 3/31)? And, does anyone have a clue why this might have become a problem for us recently instead of immediately after a client or server change? Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 25/Mar/15 ] |
|
Hi Bill, It should be pretty safe to apply, especially since you will have run it on the test system for several days by 3/31. Have you upgraded the MDT node or started a new job recently? |
| Comment by Bill Barth (Inactive) [ 25/Mar/15 ] |
|
We have not updated the MDT recently. We have thousands of new jobs starting every day, so I'm not sure what you mean by a new job. |
| Comment by Jay Lan (Inactive) [ 25/Mar/15 ] |
|
NASA Ames hit the same problem in production. Can I take it that the patch would do what Christopher Morrone said: "Our design choice in Lustre has been (for probably well over a decade), that Lustre must not return short reads or writes, except in the cases of a fatal error"? |
| Comment by Gerrit Updater [ 26/Mar/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14190 |
| Comment by Zhenyu Xu [ 26/Mar/15 ] |
|
Yes, the patch would restart and finish the IO from the point it had already accomplished, until EOF or an error is encountered. |
| Comment by Bill Barth (Inactive) [ 27/Mar/15 ] |
|
I was slightly wrong. We applied patches from Also, we have been testing patchset 4 from the most recent 2.5 patch with good success. Our current plan is to deploy to production on 3/31/15. |
| Comment by Patrick Farrell (Inactive) [ 31/Mar/15 ] |
|
Stepping back, is there any intent to address the underlying layout lock revocation? Is that even addressable, or is it a permanent part of the design? I'm curious to try to better understand the underlying cause. Aurelien - I don't think I understand this comment: What's the actual race condition here? You say if it is dropped between the two writes, it is re-enqueued and all is well. So when exactly does it need to be dropped for this to be a problem? (And shouldn't the code that needs it ensure the lock is taken, rather than return?) |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/15 ] |
|
The layout lock can be lost in any of the following situations: false sharing, LRU aging, or memory pressure; and the tricky thing is that the client doesn't know whether the layout is still the same after it re-enqueues the lock and gets the new layout. |
| Comment by Gerrit Updater [ 17/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14123/ |
| Comment by Peter Jones [ 18/May/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 03/Feb/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/18275 |
| Comment by Gerrit Updater [ 20/Feb/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18275/ |