[LU-6389] read()/write() returning less than available bytes intermittently

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.2, Lustre 2.5.3
    • Environment: CentOS 6.5 2.6.32-431.17.1.el6.x86_64. 2.5.2 client. 2.5.3 server.
    • Severity: 4

    Description

      Since March 10, 2015, we have been tracking an increasing number of user reports of intermittent I/O problems with our largest Lustre filesystem on Stampede (SCRATCH). This is affecting dozens of users, with multiple jobs per user. The problem was first detected in Fortran programs and reduced to a 10-line reproducer (test_break.f). We have now also generated a C reproducer (rwb.c) that does not depend on a specific Fortran runtime library. This case was designed to mimic the underlying libc calls that the Fortran case was making, without the interference from the runtime library. The attached case fails with either icc or gcc on our system.

      The basic case involves a long sequence of ~4MB read() or write() calls which eventually should read or write all of a large file. Intermittently, but reproducibly, one of these calls will come back short before getting to the last block of the file. For example, a 4MB read may only read 2.5MB somewhere in the middle of the file. The number of bytes read on the short call and the position in the sequence are apparently random. This issue does not occur if the file has only 1 stripe, but does consistently occur with 2 stripes or more. The problem does not occur on either of our other Lustre filesystems on Stampede, and nothing appears to have changed that is correlated in time with the start of the problems.
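
      The attached rwb.c is not reproduced here. As an illustration only, the kind of check described above might look like the following sketch. The helper name count_short_reads and the exact 4 MiB request size are assumptions made for this example, not details taken from the attachment.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (4u * 1024u * 1024u) /* assumed request size, ~4MB */

/*
 * Hypothetical helper: read fd sequentially from its current offset
 * (assumed to be 0) in CHUNK_SIZE requests and count the calls that
 * return fewer bytes than were still available before end of file.
 * Returns the number of short reads, or -1 on error.
 */
static long count_short_reads(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1;

    off_t pos = 0;
    long short_reads = 0;
    while (pos < st.st_size) {
        off_t remaining = st.st_size - pos;
        size_t expected = remaining < (off_t)CHUNK_SIZE
                              ? (size_t)remaining : CHUNK_SIZE;

        ssize_t n = read(fd, buf, CHUNK_SIZE);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            free(buf);
            return -1;
        }
        if (n == 0)
            break;                  /* EOF earlier than fstat() suggested */
        if ((size_t)n < expected) {
            short_reads++;
            fprintf(stderr, "short read at offset %lld: got %zd of %zu bytes\n",
                    (long long)pos, n, expected);
        }
        pos += n;
    }
    free(buf);
    return short_reads;
}
```

      Based on the report, such a check would be expected to count zero short reads on a single-stripe file and to flag occasional short reads on files with 2 or more stripes on the affected filesystem.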

      The short read/write does not report an error when running the C code, and subsequent reads continue as normal. Writing behaves identically. Some codes, including the Intel Fortran runtime, do not tolerate short reads (though they potentially could), and the codes abort (including the attached one). No codes that I know of are designed to tolerate shorter-than-requested writes generally. We can find no client or server error messages associated with these short read/write events.
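
      For reference, an application can tolerate short reads and writes by retrying until the full request is satisfied. The sketch below is a generic user-space workaround, not the fix discussed in this ticket. The helper names read_full and write_full are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling read() until count bytes have been
 * read, end of file is reached, or an error occurs. Returns the number
 * of bytes read (less than count only at EOF), or -1 on error. Note that
 * on error any partial progress is not reported to the caller.
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same request */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Hypothetical helper: keep calling write() until count bytes have been
 * written or an error occurs. Returns the number of bytes written
 * (less than count only if write() stops making progress), or -1 on error.
 */
static ssize_t write_full(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same request */
            return -1;
        }
        if (n == 0)
            break;              /* no progress: avoid looping forever */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

      With wrappers like these, a short return from an individual read() or write() call in the middle of a file would be absorbed by the loop rather than surfacing to the caller.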

      We would be happy to provide access to Stampede for testing and verification.

          Activity

            bbarth Bill Barth (Inactive) added a comment:

            Thanks. We're looking at it.

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Bill,

            Just in case you didn't notice, Bobijam has backported the patch to b2_5 at http://review.whamcloud.com/14160.

            gerrit Gerrit Updater added a comment:

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14160
            Subject: LU-6389 llite: restart short read/write for normal IO
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 08085c551a2202594762ad999d511511be2f1c70

            morrone Christopher Morrone (Inactive) added a comment (edited):

            Livermore has hit this problem in production as well. The HPSS movers are hitting this.

            This problem crops up every few years, and each time we repeat the same debate about whether or not POSIX allows short reads. Yes, it may technically allow it, but for filesystems no applications expect it.

            Our design choice in Lustre has been (for probably well over a decade) that Lustre must not return short reads or writes, except in the case of a fatal error. For fatal errors, all further IO to the file will fail as well, so it should be fairly obvious to the application that something has gone wrong.

            If that design choice is not recorded anywhere, it would be very good for someone to write it down this time.

            We should be treating the issue in this ticket as a regression.

            bbarth Bill Barth (Inactive) added a comment:

            Just FYI, we tried to apply this patch to our 2.5.2 client source in order to test, and it didn't take. Once y'all are happy that it is correct, we're going to need a 2.5.2 applicable version as well.

            jcl jacques-charles lafoucriere added a comment:

            I agree we have to always follow POSIX (I misunderstood Jinshan's use case). My suggestion is to add a mount option to allow partial reads/writes in case we find some interest in such behavior (e.g. if the patch introduces some performance decrease). The idea is to make it tunable, as is already done for POSIX locking support.

            paf Patrick Farrell (Inactive) added a comment:

            Yes, that's correct - The point is that an error sometimes occurs, and in that case, the number of bytes successfully read or written is returned.

            Note that the patch proposed currently will stop Lustre from returning short reads or writes except in case of interruption/error.

            thiells Stephane Thiell added a comment:

            Some literature, like Robert Love’s Linux System Programming[1], says that “for regular files, write() is guaranteed to perform the entire requested write, unless an error occurs”.

            [1] https://books.google.fr/books?id=K1vXEb1SgawC&lpg=PA37&ots=fdB0D4uUUC&pg=PA37#v=onepage&q&f=false

            paf Patrick Farrell (Inactive) added a comment:

            jcf,

            Should Lustre then also reset the file pointer to the point it was at before the partial read or write? What about the contents of the file or the buffer? I feel like returning -1 in this case is misrepresenting what has happened: some data has already been read or written, so the contents of your buffer or disk have changed. Lustre cannot undo that, and it's an important difference from simply failing to read or write any data.

            I think Jinshan is right and we should follow POSIX semantics as he described.

            jay Jinshan Xiong (Inactive) added a comment:

            It returns -1 only if no bytes have been read or written yet.
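
            The rule stated in this comment can be written down as a small sketch. The helper io_result is hypothetical and exists only to illustrate the return-value convention being discussed. It is not Lustre code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: choose the value a read()/write()-style call
 * reports, given how many bytes were transferred before it stopped and
 * the error (0 for none) that stopped it.
 */
static ssize_t io_result(size_t transferred, int err)
{
    if (transferred > 0)
        return (ssize_t)transferred; /* partial progress is reported as a count */
    if (err != 0) {
        errno = err;                 /* nothing transferred: report the error */
        return -1;
    }
    return 0;                        /* nothing transferred, no error (e.g. EOF) */
}
```

            Under this convention, an error that hits after 2.5MB of a 4MB request yields a return value of 2.5MB, and the caller learns about the error only on a later call that transfers nothing.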

            jcl jacques-charles lafoucriere added a comment:

            Hi Jinshan,

            On error, read/write return -1, not the size moved. The only case of partial read/write that developers handle is for network file descriptors over protocols like UDP.
            Even if it is allowed by POSIX, it is not usual for storage, so maybe we should add a mount option to allow partial read/write (if anyone finds an interest / use case).

            People

              bobijam Zhenyu Xu
              bbarth Bill Barth (Inactive)