[LU-6272] sanity-lfsck test_17: MDS deadlock Created: 23/Feb/15  Updated: 26/Feb/15  Resolved: 24/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-6230 open handle leak Resolved
duplicates LU-6301 open handle leak Resolved
Related
is related to LU-5791 LFSCK 5: use bottom object for consis... Resolved
Severity: 3
Rank (Obsolete): 17586

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2e423416-ba54-11e4-a7c7-5254006e85c2.

The sub-test test_17 failed with the following error:

test failed to respond and timed out

It looks like there is an MDS deadlock:

23:07:57:INFO: task mdt00_000:6380 blocked for more than 120 seconds.
23:07:57:      Not tainted 2.6.32-504.8.1.el6_lustre.g0ef66b1.x86_64 #1
23:07:57:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
23:07:57:mdt00_000     D 0000000000000001     0  6380      2 0x00000080
23:07:57: ffff88006ca2b940 0000000000000046 0000000000000000 0000000000000000
23:07:57: ffff88007af106c0 ffff880079699300 ffff88007b104000 ffff880079699300
23:07:57: ffff88006ca2b940 ffffffffa05f62af ffff880078429098 ffff88006ca2bfd8
23:07:57:Call Trace:
23:07:57: [<ffffffffa05f62af>] ? lu_object_find_try+0x9f/0x260 [obdclass]
23:07:57: [<ffffffffa05f64ad>] lu_object_find_at+0x3d/0xe0 [obdclass]
23:07:57: [<ffffffffa0fad725>] ? lod_index_lookup+0x25/0x30 [lod]
23:07:57: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
23:07:57: [<ffffffffa05f6566>] lu_object_find+0x16/0x20 [obdclass]
23:07:57: [<ffffffffa0ebe056>] mdt_object_find+0x56/0x170 [mdt]
23:07:57: [<ffffffffa0ef5407>] mdt_reint_open+0x1527/0x2c70 [mdt]
23:07:57: [<ffffffffa04ae83c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
23:07:57: [<ffffffffa06130b0>] ? lu_ucred+0x20/0x30 [obdclass]
23:07:57: [<ffffffffa0edd0cd>] mdt_reint_rec+0x5d/0x200 [mdt]
23:07:57: [<ffffffffa0ec123b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
23:07:57: [<ffffffffa0ec1706>] mdt_intent_reint+0x1f6/0x430 [mdt]
23:07:57: [<ffffffffa0ebfcf4>] mdt_intent_policy+0x494/0xce0 [mdt]
23:07:57: [<ffffffffa07c24f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
23:07:57: [<ffffffffa07ee48b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
23:07:57: [<ffffffffa086e951>] tgt_enqueue+0x61/0x230 [ptlrpc]
23:07:57: [<ffffffffa086f59e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
23:07:57: [<ffffffffa081f5c1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
23:07:57: [<ffffffffa081e780>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
23:07:57: [<ffffffff8109e66e>] kthread+0x9e/0xc0
23:07:57: [<ffffffff8100c20a>] child_rip+0xa/0x20
23:07:57: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
23:07:57: [<ffffffff8100c200>] ? child_rip+0x0/0x20
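For illustration only (this is not Lustre code): a minimal userspace sketch, under assumed names, of the pattern the trace suggests -- a cache lookup (the lu_object_find_try() frame) that waits for a dying object's last reference to be dropped, which never happens if some other path leaks a reference such as the open handle leak tracked in LU-6230. Every identifier in the sketch is hypothetical.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct cache_object {
	int		refcount;	/* references held by users */
	bool		dying;		/* teardown started, no new users */
	pthread_mutex_t	lock;
	pthread_cond_t	freed;		/* signalled when refcount hits 0 */
};

static struct cache_object obj = {
	.refcount	= 1,		/* held by an open handle */
	.dying		= true,
	.lock		= PTHREAD_MUTEX_INITIALIZER,
	.freed		= PTHREAD_COND_INITIALIZER,
};

/* Drop one reference and wake waiters once the object is unused. */
static void object_put(struct cache_object *o)
{
	pthread_mutex_lock(&o->lock);
	if (--o->refcount == 0)
		pthread_cond_broadcast(&o->freed);
	pthread_mutex_unlock(&o->lock);
}

/*
 * Analogue of the blocked lu_object_find_try() frame: a lookup that finds
 * a dying object must wait for its last reference to go away before a
 * fresh instance can be set up.
 */
static void object_find(struct cache_object *o)
{
	pthread_mutex_lock(&o->lock);
	while (o->dying && o->refcount > 0)
		pthread_cond_wait(&o->freed, &o->lock);
	pthread_mutex_unlock(&o->lock);
	printf("lookup completed\n");
}

/* Stands in for the mdt00_000 thread doing mdt_reint_open() -> lookup. */
static void *service_thread(void *arg)
{
	(void)arg;
	object_find(&obj);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, service_thread, NULL);
	sleep(1);
	/*
	 * Releasing the handle lets the lookup finish.  If this put is
	 * skipped (a leaked handle), the service thread blocks forever,
	 * which matches the hung-task report above.
	 */
	object_put(&obj);
	pthread_join(tid, NULL);
	return 0;
}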

Info required for matching: sanity-lfsck 17



 Comments   
Comment by Oleg Drokin [ 23/Feb/15 ]

This started to pop up on Feb 21. Currently it appears that master fails 100% of the time: https://testing.hpdd.intel.com/sub_tests/query?utf8=%E2%9C%93&test_set%5Btest_set_script_id%5D=4f25830c-64fe-11e2-bfb2-52540035b04c&sub_test%5Bsub_test_script_id%5D=eb9933f0-25ce-11e3-ae15-52540035b04c&sub_test%5Bstatus%5D=&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

The first ever failure is from a full test run after the landing of "LU-5791 lfsck: use bottom device to locate object" http://review.whamcloud.com/13392, even though the actual testing of that patch was clean.
So I guess there's some sort of interaction with some other patch at play?

Comment by Isaac Huang (Inactive) [ 23/Feb/15 ]

Same failure but with osd-zfs:
https://testing.hpdd.intel.com/test_sets/b33a2d4a-bba3-11e4-a61a-5254006e85c2

Comment by nasf (Inactive) [ 24/Feb/15 ]

This is a duplicate of LU-6230; the patch is http://review.whamcloud.com/#/c/13709/.

Comment by nasf (Inactive) [ 24/Feb/15 ]

Originally, the patch for LU-5791 depended on the patch for LU-6230, but the former patch was landed before the latter one...
