[LU-1940] Test failure on test suite sanity, subtest test_118c Created: 14/Sep/12  Updated: 15/Feb/13  Resolved: 15/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: MB

Issue Links:
Duplicate
is duplicated by LU-2604 Test failure on test suite sanity te... Resolved
Severity: 3
Rank (Obsolete): 4199

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/d98c0f7a-fe88-11e1-a707-52540035b04c.

The sub-test test_118c failed with the following error:

Multiop fsync failed, rc=30

Info required for matching: sanity 118c



 Comments   
Comment by Ian Colle (Inactive) [ 19/Sep/12 ]

https://maloo.whamcloud.com/test_sets/807ed662-0273-11e2-ab94-52540035b04c

Comment by Ian Colle (Inactive) [ 04/Oct/12 ]

https://maloo.whamcloud.com/test_sets/ffd3d9f0-0e0b-11e2-bf2b-52540035b04c

Comment by Ian Colle (Inactive) [ 04/Oct/12 ]

Hit this failure three times last night on three different patches.

Comment by Ian Colle (Inactive) [ 04/Oct/12 ]

https://maloo.whamcloud.com/test_sets/8cf491dc-0e0a-11e2-91a3-52540035b04c

Comment by Ian Colle (Inactive) [ 04/Oct/12 ]

https://maloo.whamcloud.com/test_sets/b5a029e6-0e07-11e2-bf2b-52540035b04c

Comment by Andreas Dilger [ 04/Oct/12 ]

https://maloo.whamcloud.com/test_sets/8cf491dc-0e0a-11e2-91a3-52540035b04c

Comment by Andreas Dilger [ 19/Oct/12 ]

https://maloo.whamcloud.com/sub_tests/594d6ba4-17f8-11e2-a41f-52540035b04c

Comment by Peng Tao [ 25/Oct/12 ]

https://maloo.whamcloud.com/test_sets/38aa57a4-1ea1-11e2-8b41-52540035b04c

Comment by Keith Mannthey (Inactive) [ 26/Oct/12 ]

Note 30 is EROFS:

22:40:54:Lustre: DEBUG MARKER: == sanity test 118c: Fsync blocks on EROFS until dirty pages are flushed ============ 22:40:50 (1351143650)
22:41:05:LustreError: 11-0: an error occurred while communicating with 10.10.4.161@tcp. The ost_write operation failed with -30
22:41:05:LustreError: 2990:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff880079ad4c00 x1416774108278040/t0(0) o4->lustre-OST0003-osc-ffff88007a8f6000@10.10.4.161@tcp:6/4 lens 488/192 e 0 to 0 dl 1351143702 ref 2 fl Interpret:R/0/0 rc -30/-30
22:41:05:LustreError: 11-0: an error occurred while communicating with 10.10.4.161@tcp. The ost_write operation failed with -30
22:41:05:LustreError: 11-0: an error occurred while communicating with 10.10.4.161@tcp. The ost_write operation failed with -30
22:41:05:LustreError: 11-0: an error occurred while communicating with 10.10.4.161@tcp. The ost_write operation failed with -30
22:41:06:LustreError: 11-0: an error occurred while communicating with 10.10.4.161@tcp. The ost_write operation failed with -30
22:41:06:LustreError: 2990:0:(osc_request.c:1931:brw_interpret()) lustre-OST0003-osc-ffff88007a8f6000: too many resent retries for object: 1118:0, rc = -30.
22:41:17:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118c: @@@@@@ FAIL: Multiop fsync failed, rc=30 
22:41:17:Lustre: DEBUG MARKER: sanity test_118c: @@@@@@ FAIL: Multiop fsync failed, rc=30

We seem to be giving up (too many retries) before the file system has become writable again. EROFS is the error returned when trying to write to a read-only file system. This test appears to check that fsync blocks properly in this condition. More investigation is needed.
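
The rc-to-name mapping noted above can be checked with a minimal sketch like the following. The helper name describe_rc is hypothetical and is not part of the Lustre test framework; the numeric value 30 for EROFS is assumed to match the Linux hosts that produced these logs.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a return code such as 30 or -30 to its symbolic errno name.

    Kernel logs print negative values (-30), while the test script
    reports the positive value (rc=30), so the sign is ignored here.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # Both spellings seen in the log excerpt refer to the same error.
    print(describe_rc(30))
    print(describe_rc(-30))
```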

Comment by Hongchao Zhang [ 12/Nov/12 ]

Status update: this issue is still under investigation.

Comment by Hongchao Zhang [ 14/Nov/12 ]

This bug is caused by the resend limit in OSC (the default value is 10) for recoverable errors (EIO, EROFS, ENOMEM, EAGAIN, EINPROGRESS). We can disable this check for this test to fix it.
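
The effect of a resend cap on a recoverable error can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the OSC code: write_with_resend_limit and make_server are hypothetical names, and whether the real limit counts attempts or resends is not confirmed here. Only the default of 10 and the list of recoverable errors come from the comment above.

```python
import errno

# Errors described above as recoverable (resent instead of failed at once).
RECOVERABLE = {errno.EIO, errno.EROFS, errno.ENOMEM,
               errno.EAGAIN, errno.EINPROGRESS}


def write_with_resend_limit(send, max_resends=10):
    """Call send() until it succeeds, hits a non-recoverable error,
    or the resend cap is reached. send() returns 0 or a negative errno.
    max_resends=None models disabling the check (hypothetical knob)."""
    resends = 0
    while True:
        rc = send()
        if rc == 0 or -rc not in RECOVERABLE:
            return rc
        if max_resends is not None and resends >= max_resends:
            return rc  # give up: "too many resent retries"
        resends += 1


def make_server(fail_count, err=errno.EROFS):
    """Fake target that rejects the first fail_count writes with -err."""
    state = {"calls": 0}

    def send():
        state["calls"] += 1
        return -err if state["calls"] <= fail_count else 0

    return send


if __name__ == "__main__":
    # Target becomes writable quickly: the write eventually succeeds.
    print(write_with_resend_limit(make_server(5)))
    # Target stays read-only longer than the cap allows: rc is -EROFS.
    print(write_with_resend_limit(make_server(20)))
    # Same slow target with the cap disabled: the write succeeds.
    print(write_with_resend_limit(make_server(20), max_resends=None))
```

In this model the failure depends only on how long the target stays read-only relative to the cap, which matches the log pattern of repeated -30 replies followed by a final give-up.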

Comment by Hongchao Zhang [ 20/Nov/12 ]

the patch is tracked at http://review.whamcloud.com/#change,4622

Comment by Peter Jones [ 26/Nov/12 ]

Landed for 2.4

Comment by Nathaniel Clark [ 27/Nov/12 ]

https://maloo.whamcloud.com/test_sets/df9e51ca-3899-11e2-8c55-52540035b04c

Comment by Hongchao Zhang [ 28/Nov/12 ]

The new occurrence is still caused by the resend count. Normally there is a 1s interval between setting fail_loc=OBD_FAIL_OST_EROFS and resetting fail_loc=0, but for some reason the actual interval was much longer.

the extra patch is tracked at http://review.whamcloud.com/#change,4694

Comment by Keith Mannthey (Inactive) [ 03/Jan/13 ]

From December 30: 1 error out of the last 100 runs.

https://maloo.whamcloud.com/test_sets/90b395d0-5319-11e2-908e-52540035b04c

Comment by Andreas Dilger [ 15/Feb/13 ]

Patch 4694 was landed for 2.4.0
