[LU-3393] 2.1.5<->2.4.0 interop: Test timeout on test suite sanity, subtest test_31k Created: 24/May/13  Updated: 28/May/13  Resolved: 28/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Emoly Liu
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

client: 2.1 build 197
server: master with patch 6426


Issue Links:
Duplicate
duplicates LU-3401 2.1.5<->2.4.0 interop: sanity test 27... Closed
Severity: 3
Rank (Obsolete): 8404

 Description   

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/a315c4c2-c485-11e2-ac71-52540035b04c.

The sub-test test_31k failed with the following error:

test failed to respond and timed out

Info required for matching: sanity 31k

On the client console:

12:15:27:Lustre: DEBUG MARKER: == sanity test 31k: link to file: the same, non-existing, dir================= 12:15:24 (1369250124)
12:15:27:LustreError: 11-0: an error occurred while communicating with 10.10.4.183@tcp. The ost_connect operation failed with -19
12:15:38:LustreError: 11-0: an error occurred while communicating with 10.10.4.183@tcp. The ost_connect operation failed with -19
12:15:50:LustreError: 11-0: an error occurred while communicating with 10.10.4.183@tcp. The ost_connect operation failed with -19
12:15:51:LustreError: Skipped 1 previous similar message
...
12:18:47:INFO: task touch:7292 blocked for more than 120 seconds.
12:18:47:touch         D 0000000000000000     0  7292   7169 0x00000080
12:18:47: ffff880067497a68 0000000000000082 ffff880067497a18 ffffffff810097cc
12:18:47: ffff88007d2c20b8 0000000000000000 0000000000497a28 ffff880002214200
12:18:47: ffff88007bfa3ab8 ffff880067497fd8 000000000000fb88 ffff88007bfa3ab8
12:18:47:Call Trace:
12:18:47: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
12:18:47: [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
12:18:47: [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
12:18:48: [<ffffffff814ea743>] wait_for_common+0x123/0x180
12:18:48: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
12:18:48: [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
12:18:48: [<ffffffffa087f518>] osc_io_setattr_end+0x28/0xb0 [osc]
12:18:48: [<ffffffffa05f28f0>] cl_io_end+0x60/0x150 [obdclass]
12:18:48: [<ffffffffa0994210>] ? lov_io_end_wrapper+0x0/0x100 [lov]
12:18:48: [<ffffffffa0994301>] lov_io_end_wrapper+0xf1/0x100 [lov]
12:18:48: [<ffffffffa0993d81>] lov_io_call+0x71/0x120 [lov]
12:18:48: [<ffffffffa0994e0c>] lov_io_end+0x4c/0x110 [lov]
12:18:48: [<ffffffffa05f28f0>] cl_io_end+0x60/0x150 [obdclass]
12:18:48: [<ffffffffa05f7a8a>] cl_io_loop+0xda/0x190 [obdclass]
12:18:48: [<ffffffffa0a6f5b3>] cl_setattr_ost+0x1c3/0x240 [lustre]
12:18:48: [<ffffffffa0a43508>] ll_setattr_raw+0x978/0xf30 [lustre]
12:18:48: [<ffffffffa0a43b1f>] ll_setattr+0x5f/0x100 [lustre]
12:18:48: [<ffffffff81192398>] notify_change+0x168/0x340
12:18:48: [<ffffffff811a59bc>] utimes_common+0xdc/0x1b0
12:18:48: [<ffffffffa043ae81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:18:48: [<ffffffffa0434a38>] ? libcfs_log_return+0x28/0x40 [libcfs]
12:18:48: [<ffffffff811a5b29>] do_utimes+0x99/0xf0
12:18:48: [<ffffffff811a5c82>] sys_utimensat+0x32/0x90
12:18:48: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
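
The return code in the ost_connect messages above is a negated Linux errno value, so -19 corresponds to ENODEV. As a rough illustration only, the Python sketch below decodes such codes from console lines. The sample text is copied from the log above; the regular expression and the function name decode_failures are assumptions made for this example and are not part of Lustre or Maloo.

```python
import errno
import os
import re

# Sample lines copied from the client console output in this ticket.
CONSOLE = """\
12:15:27:LustreError: 11-0: an error occurred while communicating with 10.10.4.183@tcp. The ost_connect operation failed with -19
12:15:38:LustreError: 11-0: an error occurred while communicating with 10.10.4.183@tcp. The ost_connect operation failed with -19
12:18:47:INFO: task touch:7292 blocked for more than 120 seconds.
"""

# Assumed message shape, based only on the lines quoted above.
FAILED_RE = re.compile(r"The (\w+) operation failed with (-\d+)")


def decode_failures(text):
    """Return (operation, errno name, description) for each failure line."""
    results = []
    for op, rc in FAILED_RE.findall(text):
        code = -int(rc)  # the log reports the errno value negated
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append((op, name, os.strerror(code)))
    return results


failures = decode_failures(CONSOLE)
for op, name, msg in failures:
    print(f"{op}: {name} ({msg})")
```

Read this way, the client keeps failing to connect to the OST target, which is consistent with the touch process blocking in osc_io_setattr_end while it waits for the setattr to complete.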


 Comments   
Comment by Jian Yu [ 27/May/13 ]

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/204
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1508
Distro/Arch: RHEL6.4/x86_64

The same issue occurred:
https://maloo.whamcloud.com/test_sets/c8747630-c621-11e2-9bf1-52540035b04c

The issue keeps occurring in the Lustre b2_1<->b2_4 and b2_1<->master interop test runs and blocks the remaining sanity tests from being run.

From the historical reports on Maloo, the issue has been occurring since 2013-01-19 and is related to the patches for LU-1866.

Do we need to add interop code to the Lustre b2_1 branch?

Comment by Peter Jones [ 27/May/13 ]

Hi Emoly

I had a brief chat with Fanyong about this and he did not think it likely that this was related to LU-1866. Could you please investigate more deeply and see if you can establish what is causing this problem?

Thanks

Peter

Comment by Andreas Dilger [ 28/May/13 ]

This looks to be fallout from the previous test_27z() failure (filed as LU-3401), because the b2_1 version of this test doesn't remount the OST if the test hits an error.

This is fixed in master by http://review.whamcloud.com/5838, but that patch also includes a new feature.
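
The failure mode described above, where a test that hits an error returns without remounting the OST and leaves later tests without a target, can be sketched generically. The Python below is a minimal illustration under stated assumptions: the dictionary of mount states, the helper names, and the simulated failure are all hypothetical. It is not the sanity.sh code and not the content of the change at review.whamcloud.com/5838.

```python
from contextlib import contextmanager


@contextmanager
def ost_stopped(mounted, ost):
    """Hypothetical helper: stop an OST for a test and always remount it."""
    mounted[ost] = False
    try:
        yield
    finally:
        # Runs on success and on failure, so later tests still have the OST.
        mounted[ost] = True


def run_subtest(mounted, ost, body):
    """Run a test body with the OST stopped and report PASS or FAIL."""
    try:
        with ost_stopped(mounted, ost):
            body()
    except AssertionError as exc:
        return f"FAIL: {exc}"
    return "PASS"


def failing_body():
    # Stand-in for a test that hits an error while the OST is stopped.
    raise AssertionError("simulated failure")


mounted = {"OST0000": True}
result = run_subtest(mounted, "OST0000", failing_body)
print(result, mounted)
```

Without the cleanup step in the finally clause, the simulated failure would leave the OST stopped, which is the kind of leftover state that would make a later subtest such as test_31k hang.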
