[LU-1966] Test failure on test suite replay-ost-single, subtest test_6: Destroys weren't done in 5 sec Created: 17/Sep/12  Updated: 05/Mar/13  Resolved: 05/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-2620 Failure on test suite replay-ost-sing... Resolved
Severity: 3
Rank (Obsolete): 3989

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/0d43f964-fff5-11e1-9f3c-52540035b04c.

The sub-test test_6 failed with the following error:

== replay-ost-single test 6: Fail OST before obd_destroy == 19:17:36 (1347761856)
Waiting for orphan cleanup...
CMD: client-28vm4 /usr/sbin/lctl get_param -n obdfilter.*.mds_sync
Waiting for destroy to be done...
Waiting 0 secs for destroys to be done.
Waiting 1 secs for destroys to be done.
Waiting 2 secs for destroys to be done.
Waiting 3 secs for destroys to be done.
Waiting 4 secs for destroys to be done.
Destroys weren't done in 5 sec.
 replay-ost-single test_6: @@@@@@ FAIL: test_6 failed with 5

Info required for matching: replay-ost-single 6
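
For context, the "Waiting N secs for destroys to be done." messages above come from a polling loop in the test framework that repeatedly checks whether outstanding OST object destroys have drained, and gives up after a timeout. The following is only a minimal sketch of such a loop, assuming a destroys_in_flight counter as the completion signal; it is illustrative, not the actual test-framework.sh code:

wait_destroys_done() {
    local max_wait=${1:-5}   # the log above shows a 5-second limit
    local waited=0

    echo "Waiting for destroy to be done..."
    while [ $waited -lt $max_wait ]; do
        # Assumed completion check: no destroy RPCs still in flight
        # (summed across the OSC devices).
        local inflight=$(lctl get_param -n "osc.*.destroys_in_flight" 2>/dev/null |
                         awk '{sum += $1} END {print sum + 0}')
        [ "$inflight" -eq 0 ] && return 0

        echo "Waiting $waited secs for destroys to be done."
        sleep 1
        waited=$((waited + 1))
    done

    echo "Destroys weren't done in $max_wait sec."
    return 1
}

The failure here means that counter (or its equivalent) never reached zero within the limit, which is consistent with the OSS log analysis in the comments below showing the OST_DESTROY request was never processed.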



 Comments   
Comment by Jian Yu [ 17/Sep/12 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_3/19
Test group: failover

replay-ost-single test 7 also failed with the same issue: https://maloo.whamcloud.com/test_sets/0d43f964-fff5-11e1-9f3c-52540035b04c

Comment by Peter Jones [ 17/Sep/12 ]

Bobijam

Could you please look into this one?

Peter

Comment by Zhenyu Xu [ 18/Sep/12 ]

Looks like interference from residue left by previous tests.

OSS debug log

line:9295 00000100:00100000:0.0:1347761823.542774:0:2427:0:(service.c:1786:ptlrpc_server_handle_req_in()) got req x1413227639926489
line:9296 00000100:00080000:0.0:1347761823.542777:0:2427:0:(service.c:1000:ptlrpc_update_export_timer()) updating export 3c3577cd-0751-66ec-7591-46dd3a204f77 at 1347761823 exp ffff88007c77a800
line:9297 00000100:00100000:0.0:1347761823.542788:0:2427:0:(service.c:1961:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io00_009:3c3577cd-0751-66ec-7591-46dd3a204f77+996:3262:x1413227639926489:12345-10.10.4.172@tcp:6
...
line:10263 00000001:02000400:0.0:1347761856.433208:0:4282:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark == replay-ost-single test 6: Fail OST before obd_destroy == 19:17:36 (1347761856)

... pid 2427 leaves no footprint in the OSS log thereafter; even after test 6 and test 7 fail, the OSS log does not show it processing the OST_DESTROY request.

Comment by Zhenyu Xu [ 19/Sep/12 ]

Status update:

Booked one toro node (only one was available), ran "bash /usr/lib64/lustre/tests/auster -rsv replay-ost-single", and it finished without error.

Comment by Jian Yu [ 20/Sep/12 ]

Hi Bobi,
If you want to reproduce or debug this issue, you can upload a patch to Gerrit with the following test parameters:

Test-Parameters: fortestonly envdefinitions=SLOW=yes \
clientcount=4 osscount=2 mdscount=2 austeroptions=-R \
failover=true useiscsi=true testlist=replay-ost-single
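
For reference, that Test-Parameters: trailer goes into the commit message of the test-only patch pushed to Gerrit so that autotest runs the requested session. The commit message below is purely a hypothetical illustration, not the actual reproduction patch:

LU-1966 tests: reproduce replay-ost-single test_6 failure

Test-only change to rerun replay-ost-single under the failover
test group with the parameters above.

Test-Parameters: fortestonly envdefinitions=SLOW=yes \
clientcount=4 osscount=2 mdscount=2 austeroptions=-R \
failover=true useiscsi=true testlist=replay-ost-single
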
Comment by Zhenyu Xu [ 21/Sep/12 ]

http://review.whamcloud.com/4067 just for issue reproduction.

Comment by Jian Yu [ 23/Sep/12 ]

Here is a link to the historical Maloo reports for replay-ost-single test 6 in failover test group:
http://tinyurl.com/8s62nfj

As we can see, the last successful run of this test was on 2012-09-07, and the failure has been occurring since 2012-09-13.

Comment by Zhenyu Xu [ 24/Sep/12 ]

I checked the patches landed between 09/07 and 09/13; nothing in the OSS changes during that period looks like it could cause this issue, so I tend to think it's a test issue.

Comment by Jian Yu [ 25/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/23
FAILURE_MODE=HARD

autotest run failed: https://maloo.whamcloud.com/test_sets/b5c9fc00-0694-11e2-9b17-52540035b04c
manual run passed: https://maloo.whamcloud.com/test_sets/52192992-0716-11e2-ac99-52540035b04c

Comment by Peter Jones [ 25/Sep/12 ]

I think that there is enough evidence to suggest that this is purely a testing issue. We should still continue working on a resolution and include a fix in an RC2 if one is ready and we need one, but it would not warrant holding the release on its own merits.

Comment by Jian Yu [ 13/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64
Test Group: failover

The same issue occurred: https://maloo.whamcloud.com/test_sets/c1790860-1509-11e2-9adb-52540035b04c
