[LU-802] Test failure on test suite parallel-scale, subtest test_write_append_truncate Created: 28/Oct/11  Updated: 31/Oct/13  Resolved: 31/Oct/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.1.1, Lustre 2.1.2, Lustre 1.8.8, Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Chris Gearing (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4130

 Description   

This issue was created by maloo for Chris Gearing <chris@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/0219bc62-00f6-11e1-bb4f-52540025f9af.

See dmesg client vm1

The sub-test test_write_append_truncate failed with the following error:

test failed to respond and timed out

Info required for matching: parallel-scale write_append_truncate



 Comments   
Comment by Jian Yu [ 13/Feb/12 ]

Lustre Tag: v2_1_1_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/41/
Distro/Arch: RHEL6/x86_64(server), RHEL6/i686(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The same issue occurred: https://maloo.whamcloud.com/test_sets/e04b00fe-55d3-11e1-9aa8-5254004bbbd3

The stack trace on Client 1 (client-11vm1) showed the following:

12:26:40:write_append_ S 0002b90d     0 23901  23898 0x00000080
12:26:40: f63c5570 00000086 4f381feb 0002b90d 00000000 000003c8 00005d5d 0000a53a
12:26:40: 00000000 c3318200 00002643 00000000 00002643 c0b0f0c0 c0b0f0c0 f63c5818
12:26:40: c0b0f0c0 c0b0aaa4 c0b0f0c0 f63c5818 c69b4000 f98351ed f982663f 0000008f
12:26:40:Call Trace:
12:26:40: [<f980de2e>] ? cl_lock_mutex_put+0x3e/0x80 [obdclass]
12:26:40: [<f980e328>] ? cl_lock_state_wait+0x1d8/0x340 [obdclass]
12:26:40: [<f980b8f5>] ? cl_lock_used_mod+0x25/0x60 [obdclass]
12:26:40: [<f980bbad>] ? cl_lock_user_add+0x4d/0x200 [obdclass]
12:26:40: [<c044d4d0>] ? default_wake_function+0x0/0x10
12:26:40: [<f9813969>] ? cl_enqueue_locked+0x1a9/0x2d0 [obdclass]
12:26:40: [<f93faee6>] ? lov_io_call+0x76/0x250 [lov]
12:26:40: [<f9813d69>] ? cl_lock_request+0x89/0x2d0 [obdclass]
12:26:40: [<f93fccc2>] ? lov_io_lock+0x72/0x260 [lov]
12:26:40: [<f89ec572>] ? vvp_io_init+0xe2/0x380 [lustre]
12:26:40: [<f84c4eaf>] ? libcfs_debug_vmsg2+0x34f/0x820 [libcfs]
12:26:40: [<f981bbe0>] ? cl_io_lock+0x4e0/0xa10 [obdclass]
12:26:40: [<f9814fee>] ? cl_io_iter_init+0x6e/0x240 [obdclass]
12:26:41: [<f981c232>] ? cl_io_loop+0x122/0x2a0 [obdclass]
12:26:41: [<f896497a>] ? ll_file_io_generic+0x41a/0x6c0 [lustre]
12:26:41: [<f9801af6>] ? cl_env_get+0x16/0x480 [obdclass]
12:26:41: [<f84c8ce2>] ? cfs_hash_dual_bd_unlock+0x22/0x50 [libcfs]
12:26:41: [<f84ccc4e>] ? cfs_hash_find_or_add+0x7e/0x160 [libcfs]
12:26:41: [<f8964d41>] ? ll_file_aio_read+0x121/0x4d0 [lustre]
12:26:41: [<c052ee2e>] ? cp_new_stat64+0xee/0x100
12:26:41: [<f8970c65>] ? ll_file_read+0x165/0x440 [lustre]
12:26:41: [<c05a24fc>] ? security_file_permission+0xc/0x10
12:26:41: [<c052a656>] ? rw_verify_area+0x66/0xe0
12:26:41: [<f8970b00>] ? ll_file_read+0x0/0x440 [lustre]
12:26:41: [<c052b01d>] ? vfs_read+0x9d/0x190
12:26:41: [<c04afccc>] ? audit_syscall_entry+0x21c/0x240
12:26:41: [<c052b151>] ? sys_read+0x41/0x70
12:26:41: [<c0409a9f>] ? sysenter_do_call+0x12/0x28
Comment by Peter Jones [ 15/Feb/12 ]

Jinshan

Could you please look at this one?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 15/Feb/12 ]

From the debug log, it seems everything was fine. How much time was set for this test? Let's allow more time before timing it out.

Comment by Jian Yu [ 16/Feb/12 ]

> From the debug log, it seems everything was fine. How much time was set for this test? Let's allow more time before timing it out.

It seems the TIMEOUT value set by the autotest system is 3600 seconds.

Per the following search result on Maloo:
https://maloo.whamcloud.com/sub_tests/query?utf8=%E2%9C%93&test_set[test_set_script_id]=b10ed7ea-55b4-11e0-bb3d-52540025f9af&sub_test[sub_test_script_id]=6c0ca7dc-55d0-11e0-bb3d-52540025f9af&sub_test[status]=&sub_test[query_bugs]=&test_session[test_host]=&test_session[test_group]=&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=1209600&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

The write_append_truncate test took a very long time (longer than 2000s) to run on the VMs, but less than 10 minutes on the physical nodes with real devices:
https://maloo.whamcloud.com/sub_tests/e59823a4-584c-11e1-9df1-5254004bbbd3
https://maloo.whamcloud.com/sub_tests/4f4f93d0-579a-11e1-99fa-5254004bbbd3

Per the parallel-scale test script:

#
# write_append_truncate
#
# threads per client
write_THREADS=${write_THREADS:-8}
write_REP=${write_REP:-10000}
[ "$SLOW" = "no" ] && write_REP=100

I think we could have the autotest system specify smaller values for write_REP/write_THREADS to reduce the run time on VMs, as sketched below.
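
For illustration only, a minimal sketch of how smaller values could be passed in, assuming the autotest harness exports these environment variables before invoking the suite. The TEST_NODE_TYPE check is hypothetical (not an existing autotest knob), and the ONLY= invocation follows the usual Lustre test-framework convention:

# Sketch only: shrink the write_append_truncate workload on VM test nodes.
# write_THREADS and write_REP are the variables read by parallel-scale.sh
# (quoted above); TEST_NODE_TYPE is a hypothetical flag the harness might set.
if [ "${TEST_NODE_TYPE:-vm}" = "vm" ]; then
        export write_THREADS=4   # fewer threads per client than the default 8
        export write_REP=500     # far fewer repetitions than the 10000 default
fi
ONLY=write_append_truncate sh parallel-scale.sh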

Comment by Jinshan Xiong (Inactive) [ 16/Feb/12 ]

Thank you, yujian. Please let me know if you encounter this problem again.

Comment by Sarah Liu [ 05/Mar/12 ]

Hit a similar issue: https://maloo.whamcloud.com/test_sets/9ff40214-6701-11e1-a9a4-5254004bbbd3

build lustre-master/493/RHEL6-x86_64

Comment by Jian Yu [ 11/May/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/194/
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.2/x86_64(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

A similar issue occurred: https://maloo.whamcloud.com/test_sets/3c4bbb96-9b3e-11e1-a0a0-52540035b04c

Comment by Jian Yu [ 31/May/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Distro/Arch: RHEL6.2/x86_64(server), RHEL6.2/i686(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The same issue occurred: https://maloo.whamcloud.com/test_sets/19c1ae10-aac9-11e1-bd84-52540035b04c

Comment by Jian Yu [ 18/Dec/12 ]

Lustre Client: 2.3.0
Lustre Server: v2_1_4_RC1
Distro/Arch: RHEL6.3/x86_64
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The same issue occurred: https://maloo.whamcloud.com/test_sets/4f1c6314-4864-11e2-8cdc-52540035b04c

Comment by Jian Yu [ 09/Sep/13 ]

The timeout issue keeps occurring while testing Lustre 2.4.1 RC2:
https://maloo.whamcloud.com/test_sets/8e327ac8-1796-11e3-8f44-52540035b04c
https://maloo.whamcloud.com/test_sets/6edeb926-17b3-11e3-8f44-52540035b04c
https://maloo.whamcloud.com/test_sets/da608a9e-1830-11e3-b39a-52540035b04c

The test timed out after 3600s. However, the same test passed on Lustre 2.4.1 RC1 several days ago, taking about 5570s.

Chris, could you please check whether the timeout mechanism in autotest was changed or not? Thanks.

The timeout failure is preventing the remaining sub-tests in parallel-scale.sh from running, so I have flagged this ticket as a blocker.

Comment by Jian Yu [ 31/Oct/13 ]

The parallel-scale write_append_truncate test passed in the latest full group test sessions without the timeout issue. Let's close this ticket. If it hits the timeout failure again, please feel free to reopen the ticket.
