[LU-1640] Test failure on test suite lustre-rsync-test, subtest test_2c Created: 17/Jul/12  Updated: 13/Dec/12  Resolved: 13/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-2492 MDT thread stuck: mdd_object_find -> ... Resolved
is duplicated by LU-1483 Test failure on test suite lustre-rsy... Resolved
Severity: 3
Rank (Obsolete): 6363

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/dcdbf220-cd1f-11e1-957a-52540035b04c.

The sub-test test_2c failed with the following error:

test failed to respond and timed out

It seems the MDS is stuck for some reason; the blocked mdt thread's stack trace follows:

mdt00_000     D 0000000000000001     0 12165      2 0x00000080
 ffff8800673a5aa0 0000000000000046 ffff8800779fdb40 ffff8800673a5b20
 ffff8800673a5a50 ffffc900018c502c 0000000000000246 0000000000000246
 ffff8800355fc6b8 ffff8800673a5fd8 000000000000f4e8 ffff8800355fc6b8
Call Trace:
 [<ffffffffa053b5d4>] ? htable_lookup+0x1a4/0x1c0 [obdclass]
 [<ffffffffa0ced77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
 [<ffffffffa053b6a0>] lu_object_find_at+0xb0/0x450 [obdclass]
 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
 [<ffffffffa053ba7f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa095c160>] mdd_object_find+0x10/0x70 [mdd]
 [<ffffffffa096395f>] mdd_path+0x35f/0x1060 [mdd]
 [<ffffffffa053b67c>] ? lu_object_find_at+0x8c/0x450 [obdclass]
 [<ffffffffa0963600>] ? mdd_path+0x0/0x1060 [mdd]
 [<ffffffffa0af47da>] cml_path+0x6a/0x180 [cmm]
 [<ffffffffa09c9db6>] ? mdt_object_find+0x66/0x170 [mdt]
 [<ffffffffa09ce3ff>] mdt_get_info+0x64f/0xa90 [mdt]
 [<ffffffffa09c9f0d>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
 [<ffffffffa09d2922>] mdt_handle_common+0x922/0x1740 [mdt]
 [<ffffffffa09d3815>] mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa066757d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
 [<ffffffffa0ced65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa065ea37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
 [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
 [<ffffffffa0668b79>] ptlrpc_main+0xb69/0x1870 [ptlrpc]
 [<ffffffffa0668010>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa0668010>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa0668010>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Peter Jones [ 20/Jul/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 21/Jul/12 ]

Patch tracked at http://review.whamcloud.com/3439:

obdclass: htable_lookup could miss a wake-up signal

In lu_object_free(), a wake-up signal is issued to the hash bucket
wait queue to tell blocked threads that a dying object has been freed,
but the signal is sent without taking the bucket lock. Without the
lock, there is a chance that a thread calling htable_lookup() is added
to the bucket wait queue just as the signal fires, misses the wake-up,
and waits forever.
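
The race can be illustrated with a minimal, hypothetical sketch in C using POSIX threads (this is not Lustre code; the names bucket_lock, bucket_wq, object_freed, waker and waiter are invented for illustration). The waker stands in for lu_object_free() signalling without the bucket lock, and the waiter stands in for the thread blocked under htable_lookup():

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bucket_wq   = PTHREAD_COND_INITIALIZER;
static bool object_freed;                 /* "the dying object is gone" */

/* Broken waker: sets the condition and signals WITHOUT the bucket lock,
 * mirroring the lock-less wake-up described in the commit message. */
static void *waker(void *arg)
{
    (void)arg;
    object_freed = true;
    pthread_cond_broadcast(&bucket_wq);
    return NULL;
}

/* Waiter: blocks until the dying object has been freed, like the thread
 * sleeping under htable_lookup()/lu_object_find_at() in the stack above. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&bucket_lock);
    while (!object_freed) {
        /* Race window: if the waker runs after the check above but before
         * this thread is queued on the condition variable, the broadcast
         * is lost and this thread sleeps forever. */
        pthread_cond_wait(&bucket_wq, &bucket_lock);
    }
    pthread_mutex_unlock(&bucket_lock);
    return NULL;
}

int main(void)
{
    pthread_t w, s;

    pthread_create(&s, NULL, waiter, NULL);
    pthread_create(&w, NULL, waker, NULL);
    pthread_join(w, NULL);
    pthread_join(s, NULL);      /* hangs here if the wake-up was missed */
    return 0;
}

If the waker runs between the waiter's predicate check and its entry into pthread_cond_wait(), the broadcast is lost and the join on the waiter hangs, analogous to the stuck mdt thread above. Having the waker set the flag and broadcast while holding the bucket lock closes that window, which is presumably what the patch at the review link does.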

Comment by Peter Jones [ 08/Aug/12 ]

As per Bobijam, this issue rarely occurs (not seen in the last three tags), so it is being decreased in priority to focus on more frequently hit issues.

Comment by Jian Yu [ 10/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64

The same issue occurred again: https://maloo.whamcloud.com/test_sets/664cd250-12ac-11e2-bd97-52540035b04c

Comment by Zhenyu Xu [ 13/Dec/12 ]

Discussion moved to LU-2492.
