[LU-359] Confused error message after write failure Created: 24/May/11  Updated: 22/Feb/13  Resolved: 09/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 1.8.6
Fix Version/s: Lustre 2.3.0, Lustre 1.8.9

Type: Bug Priority: Blocker
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4501

 Description   

The issue was found in the sanity-quota test. When an application's write fails because it is out of quota (-EDQUOT), the application closes the file and exits. But the close then returns "-EIO" for the same write failure the application has already seen, like this:

===============
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
[dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-1] [bs=1024] [count=9410] [seek=9410]
dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-1': Disk quota exceeded
dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
[dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-1] [bs=1024] [count=1024] [seek=18821]
dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-1': Disk quota exceeded
dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error
0
===============

The message "dd: closing output file `/mnt/lustre/d0.sanity-quota/d1/f1-1': Input/output error" is confusing, and quite different from what "dd" prints against a local file system. The expected output is:

===============
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
[dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-0] [bs=1024] [count=14631] [seek=14631]
dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-0': Disk quota exceeded
13182+0 records in
13181+0 records out
13497344 bytes (13 MB) copied, 0.999215 seconds, 13.5 MB/s
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
[dd] [if=/dev/zero] [of=/mnt/lustre/d0.sanity-quota/d1/f1-0] [bs=1024] [count=1024] [seek=29262]
dd: writing `/mnt/lustre/d0.sanity-quota/d1/f1-0': Disk quota exceeded
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00409484 seconds, 0.0 kB/s
0
===============
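
For illustration, here is a minimal userspace reproducer sketch of the sequence dd performs. This is not from the ticket; the path is taken from the test output above, and it must be run as a quota-limited user on a Lustre mount:

===============
/* Hypothetical reproducer sketch: write until the quota is exhausted,
 * then close. On an affected client, close() fails with EIO even
 * though the writer already saw the EDQUOT error. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[1024] = { 0 };
        ssize_t rc;
        int fd = open("/mnt/lustre/d0.sanity-quota/d1/f1-1",
                      O_WRONLY | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Keep writing until a write fails (expected: EDQUOT). */
        do {
                rc = write(fd, buf, sizeof(buf));
        } while (rc == (ssize_t)sizeof(buf));
        if (rc < 0)
                perror("write");        /* "Disk quota exceeded" */

        /* Buggy behavior: this reports "Input/output error" even
         * though the write error was already delivered above. */
        if (close(fd) < 0)
                perror("close");

        return 0;
}
===============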



 Comments   
Comment by nasf (Inactive) [ 24/May/11 ]

In fact, the "-EDQUOT" is returned by ll_file_aio_write(), which is outside the control of "lli_async_rc". We could use more complex logic to track all write-related failures through the single "lli_async_rc", but a separate "ll_write_rc" is a much simpler way to resolve the issue.
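
For context, my understanding of why close() reports the error today, as a simplified sketch of the pre-patch close path (an assumption, not verbatim tree code):

===============
/* Simplified sketch (assumption, not the actual source): an async
 * write error latched on the inode in 'lli_async_rc' is returned
 * from the flush done at close time, even when the application
 * already received -EDQUOT from the synchronous write path. */
int ll_flush(struct file *file)
{
        struct ll_inode_info *lli = ll_i2info(file->f_dentry->d_inode);
        int rc = lli->lli_async_rc;

        lli->lli_async_rc = 0;
        return rc;      /* surfaces as the confusing EIO from close() */
}
===============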

Comment by nasf (Inactive) [ 24/May/11 ]

patch for b1_8:

http://review.whamcloud.com/#change,596

Comment by nasf (Inactive) [ 24/May/11 ]

patch for master:

http://review.whamcloud.com/#change,597

Comment by nasf (Inactive) [ 24/May/11 ]

Johann, sorry, but I did not understand why you disapproved the former patch for this issue. Could you please give some examples of what problems introducing "ll_write_rc" might cause? Thanks!

Comment by nasf (Inactive) [ 25/May/11 ]

I renamed the per-inode 'lli_write_rc' to the per-file 'fd_last_write', which tracks the last write/fsync result (success or failure) through the 'file' structure. So when 'sys_close()' is called against that 'file' structure, we know whether the caller has already seen the write/fsync failure, and can avoid reporting a confusing failure again.

As for 'lli_async_rc', the two do not conflict:
1) If 'fd_last_write' is set, the last write/fsync failed and the caller already knows that, so 'sys_close()->ll_flush()' returns success (see the sketch below).
2) If 'fd_last_write' is clear, we process as in the original logic.

Is anything wrong?
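
A simplified sketch of that logic as I read it (hypothetical field names and structure layout, not the landed patch):

===============
/* Simplified sketch of the proposed logic (not the landed patch):
 * 'fd_last_write' lives in the per-open-file ll_file_data and
 * records the result of the last write/fsync through this fd. */
int ll_flush(struct file *file)
{
        struct ll_file_data *fd = file->private_data;
        struct ll_inode_info *lli = ll_i2info(file->f_dentry->d_inode);
        int rc;

        /* Case 1: the last write/fsync through this fd failed and
         * the caller already saw that error, so report success. */
        if (fd->fd_last_write != 0)
                return 0;

        /* Case 2: fall back to the original logic and surface any
         * asynchronous write error recorded on the inode. */
        rc = lli->lli_async_rc;
        lli->lli_async_rc = 0;
        return rc;
}
===============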

Comment by nasf (Inactive) [ 08/Oct/11 ]

New patch for master is available:

http://review.whamcloud.com/#change,1497

Comment by nasf (Inactive) [ 25/Jul/12 ]

The patches have been updated:

master: http://review.whamcloud.com/#change,1497
b1_8: http://review.whamcloud.com/#change,596

Comment by Peter Jones [ 09/Aug/12 ]

Landed for 2.3
