LU-359: Confused error message after write failure

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.3.0, Lustre 1.8.9
    • Affects Version/s: Lustre 2.3.0, Lustre 1.8.6
    • Labels: None
    • 3
    • 4501

    Description

      The issue was found by the sanity-quota test. When an application's write fails because it is out of quota (-EDQUOT), the application closes the file and exits. But close() then returns -EIO, related to the earlier write failure, like this:

      ===============
      running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
      [dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-1] [bs=1024] [count=9410] [seek=9410]
      dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-1': Disk quota exceeded
      dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error
      running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
      [dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-1] [bs=1024] [count=1024] [seek=18821]
      dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-1': Disk quota exceeded
      dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error
      0
      ===============

      The message "dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error" is confusing, and quite different from what "dd" prints against a local filesystem. The expected output is:

      ===============
      running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
      [dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-0] [bs=1024] [count=14631] [seek=14631]
      dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-0': Disk quota exceeded
      13182+0 records in
      13181+0 records out
      13497344 bytes (13 MB) copied, 0.999215 seconds, 13.5 MB/s
      running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
      [dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-0] [bs=1024] [count=1024] [seek=29262]
      dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-0': Disk quota exceeded
      1+0 records in
      0+0 records out
      0 bytes (0 B) copied, 0.00409484 seconds, 0.0 kB/s
      0
      ===============
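
      For illustration, the transcripts above boil down to the userspace sketch below (this program is not part of the test suite, and the path is only a placeholder): write 1 KiB blocks until the quota is exhausted, report the -EDQUOT, then check what close() returns. The expected behaviour is the second transcript's, where close() succeeds because write() already returned the failure; the bug is the first transcript's, where close() fails again with -EIO.

      ===============
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          /* Placeholder path; any file on a quota-limited Lustre mount would do. */
          const char *path = argc > 1 ? argv[1] : "/mnt/lustre/quota-test-file";
          char buf[1024];
          int fd;

          memset(buf, 0, sizeof(buf));
          fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          /* Write 1 KiB blocks until the quota is exceeded, like the dd runs above. */
          for (;;) {
              if (write(fd, buf, sizeof(buf)) < 0) {
                  if (errno == EDQUOT)
                      printf("write: Disk quota exceeded (failure reported here)\n");
                  else
                      perror("write");
                  break;
              }
          }

          /* The write failure was already reported above, so close() is expected
           * to return 0; the bug in this ticket is that it fails with EIO. */
          if (close(fd) < 0)
              perror("close");   /* the confusing "Input/output error" */
          else
              printf("close: OK\n");
          return 0;
      }
      ===============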

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.3

            yong.fan nasf (Inactive) added a comment -

            The patches have been updated:
            master: http://review.whamcloud.com/#change,1497
            b1_8: http://review.whamcloud.com/#change,596

            yong.fan nasf (Inactive) added a comment -

            New patch for master is available:

            http://review.whamcloud.com/#change,1497

            yong.fan nasf (Inactive) added a comment -

            I renamed the per-inode 'lli_write_rc' to a per-file-structure 'fd_last_write' that tracks the result (success or failure) of the last write/fsync through the 'file' structure. So when 'sys_close()' is called on that 'file' structure, we know whether the caller has already seen the write/fsync failure, and we can avoid reporting a confusing failure a second time.

            As for 'lli_async_rc', the two do not conflict:
            1) if 'fd_last_write' is set, the last write/fsync failed and the caller already knows it, so 'sys_close()->ll_flush()' returns success.
            2) if 'fd_last_write' is clear, it is processed with the original logic.

            Is anything wrong?
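
            For readers following along, here is a minimal userspace sketch of the close/flush decision described in the comment above. It is not the actual Lustre code: the structures are simplified stand-ins, and only the intent of 'fd_last_write', 'lli_async_rc' and the ll_flush()-time check is taken from the comment.

            ===============
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Simplified stand-in for the per-inode state: lli_async_rc records an
             * asynchronous writeback failure not yet returned to userspace. */
            struct inode_info {
                int lli_async_rc;
            };

            /* Simplified stand-in for the per-open-file state: fd_last_write_failed
             * remembers whether the last write/fsync on this descriptor already
             * returned a failure to the caller. */
            struct file_data {
                struct inode_info *inode;
                bool fd_last_write_failed;
            };

            /* A synchronous write result (e.g. -EDQUOT) goes straight back to the
             * caller, so record whether the caller has just seen a failure. */
            static int do_write(struct file_data *fd, int rc)
            {
                fd->fd_last_write_failed = (rc < 0);
                return rc;
            }

            /* The flush-on-close decision from the comment:
             * 1) the caller already saw the last write/fsync fail -> return success;
             * 2) otherwise fall back to the original logic and report any pending
             *    asynchronous writeback error recorded on the inode. */
            static int do_flush(struct file_data *fd)
            {
                if (fd->fd_last_write_failed)
                    return 0;
                return fd->inode->lli_async_rc;
            }

            int main(void)
            {
                struct inode_info inode = { .lli_async_rc = -EIO };
                struct file_data fd = { .inode = &inode, .fd_last_write_failed = false };

                do_write(&fd, -EDQUOT);
                printf("flush after caller saw -EDQUOT: %d\n", do_flush(&fd));        /* 0 */

                do_write(&fd, 0);
                printf("flush with only a pending async error: %d\n", do_flush(&fd)); /* -5 (-EIO) */
                return 0;
            }
            ===============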

            yong.fan nasf (Inactive) added a comment -

            Johann, sorry, I did not understand the reason you disapproved of the former patch for this issue. Would you please give some examples of what problems introducing "ll_write_rc" could cause? Thanks!

            yong.fan nasf (Inactive) added a comment - patch for master: http://review.whamcloud.com/#change,597
            yong.fan nasf (Inactive) added a comment - patch for b1_8: http://review.whamcloud.com/#change,596

            yong.fan nasf (Inactive) added a comment -

            In fact, the -EDQUOT is returned by ll_file_aio_write(), which is outside the control of "lli_async_rc". We could perhaps use more complex logic to track all write-related failures through the single "lli_async_rc", but a separate "ll_write_rc" is a much simpler way to resolve these issues.
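
            As a rough illustration of that point (again a simplified userspace sketch, not the real ll_file_aio_write()): a quota failure is detected synchronously in the write path and returned straight to the caller, so it leaves no trace in the asynchronous writeback state that 'lli_async_rc' records.

            ===============
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Pending asynchronous writeback error on the inode (what lli_async_rc models). */
            static int lli_async_rc;

            /* Rough stand-in for the synchronous part of the write path: the quota
             * check fails before any pages are queued for writeback, so -EDQUOT is
             * returned directly and lli_async_rc is never involved. */
            static int fake_aio_write(bool over_quota)
            {
                if (over_quota)
                    return -EDQUOT;   /* delivered to the caller immediately */
                /* Success path: data is queued; a later writeback failure would be
                 * recorded in lli_async_rc and only surface at fsync()/close() time. */
                return 1024;
            }

            int main(void)
            {
                int rc = fake_aio_write(true);
                printf("write rc = %d, lli_async_rc = %d\n", rc, lli_async_rc);
                /* On Linux this prints "write rc = -122, lli_async_rc = 0": the quota
                 * error was synchronous and left the async error state untouched. */
                return 0;
            }
            ===============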

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: yong.fan nasf (Inactive)
              Votes: 0
              Watchers: 4
