[LU-9045] conf-sanity test_32c: test failed to respond and timed out
Created: 25/Jan/17  Updated: 08/Feb/17  Resolved: 30/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: WC Triage
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9048 conf-sanity test_32c: test failed to ... Resolved
is related to LU-8840 sanity-lfsck test_2e: @@@@@@ FAIL: (5... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/881f72c6-e2ae-11e6-bf0a-5254006e85c2.

The sub-test test_32c failed with the following error:

test failed to respond and timed out

Panic seen on MDS1:

00:44:09:[15158.281966] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) ASSERTION( dcb->dcb_magic == TRANS_COMMIT_CB_MAGIC ) failed: commit callback entry: magic=0 name='tgt_cb_last_committed'
00:44:09:[15158.285311] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) LBUG
00:44:09:[15158.286863] Pid: 27900, comm: jbd2/loop1-8
00:44:09:[15158.288497] 
00:44:09:[15158.288497] Call Trace:
00:44:09:[15158.290800]  [<ffffffffa06727f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
00:44:09:[15158.292320]  [<ffffffffa0672861>] lbug_with_loc+0x41/0xb0 [libcfs]
00:44:09:[15158.293952]  [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
00:44:09:[15158.295497]  [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
00:44:09:[15158.297241]  [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
00:44:09:[15158.298788]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
00:44:09:[15158.300346]  [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
00:44:09:[15158.301683]  [<ffffffff810b1720>] ? autoremove_wake_function+0x0/0x40
00:44:09:[15158.303246]  [<ffffffffa0186dd0>] ? kjournald2+0x0/0x260 [jbd2]
00:44:09:[15158.304571]  [<ffffffff810b064f>] kthread+0xcf/0xe0
00:44:09:[15158.305997]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
00:44:09:[15158.307375]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
00:44:09:[15158.308660]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
00:44:09:[15158.310084] 
00:44:09:[15158.311178] Kernel panic - not syncing: LBUG
00:44:09:[15158.312166] CPU: 0 PID: 27900 Comm: jbd2/loop1-8 Tainted: G           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
00:44:09:[15158.312166] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
00:44:09:[15158.312166]  ffffffffa068fccc 00000000bedb0a5e ffff880039d43b80 ffffffff816863f8
00:44:09:[15158.312166]  ffff880039d43c00 ffffffff8167f823 ffffffff00000008 ffff880039d43c10
00:44:09:[15158.312166]  ffff880039d43bb0 00000000bedb0a5e 00000000bedb0a5e ffff88007fc0f838
00:44:09:[15158.312166] Call Trace:
00:44:09:[15158.312166]  [<ffffffff816863f8>] dump_stack+0x19/0x1b
00:44:09:[15158.312166]  [<ffffffff8167f823>] panic+0xe3/0x1f2
00:44:09:[15158.312166]  [<ffffffffa0672879>] lbug_with_loc+0x59/0xb0 [libcfs]
00:44:09:[15158.312166]  [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
00:44:09:[15158.312166]  [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
00:44:09:[15158.312166]  [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
00:44:09:[15158.312166]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
00:44:09:[15158.312166]  [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
00:44:09:[15158.312166]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
00:44:09:[15158.312166]  [<ffffffffa0186dd0>] ? commit_timeout+0x10/0x10 [jbd2]
00:44:09:[15158.312166]  [<ffffffff810b064f>] kthread+0xcf/0xe0
00:44:09:[15158.312166]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
00:44:09:[15158.312166]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
00:44:09:[15158.312166]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140

Info required for matching: conf-sanity 32c
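
The assertion in the panic above checks a magic number stored in the commit callback entry before the callback is run. The following is a minimal stand-alone sketch of that kind of check, not the Lustre source: the struct, the function names, and the magic value are all hypothetical, and where the real code raises an LBUG this sketch only returns an error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical magic value; the real TRANS_COMMIT_CB_MAGIC may differ. */
#define SKETCH_CB_MAGIC 0xa0a00a0aU

/* Simplified stand-in for a commit callback entry (not the Lustre struct). */
struct sketch_commit_cb {
	uint32_t magic;                           /* SKETCH_CB_MAGIC when valid */
	char     name[32];                        /* label printed on failure */
	void   (*func)(struct sketch_commit_cb *cb);
};

/* Counts how many callbacks actually ran. */
static int sketch_cb_runs;

static void sketch_cb_func(struct sketch_commit_cb *cb)
{
	(void)cb;
	sketch_cb_runs++;
}

/* Initialise an entry so that it passes the magic check. */
static void sketch_cb_init(struct sketch_commit_cb *cb, const char *name)
{
	assert(cb != NULL && name != NULL);
	memset(cb, 0, sizeof(*cb));
	cb->magic = SKETCH_CB_MAGIC;
	snprintf(cb->name, sizeof(cb->name), "%s", name);
	cb->func = sketch_cb_func;
}

/*
 * Run one entry at "commit" time.
 * Returns 0 if the callback ran, -1 if the entry failed the magic check
 * (the point at which the real code would assert and panic).
 */
static int sketch_cb_run(struct sketch_commit_cb *cb)
{
	if (cb->magic != SKETCH_CB_MAGIC) {
		fprintf(stderr, "commit callback entry: magic=%x name='%s'\n",
			(unsigned int)cb->magic, cb->name);
		return -1;
	}
	cb->func(cb);
	return 0;
}
```

In this sketch, an entry set up with sketch_cb_init() runs normally, while an entry whose magic field has been cleared to 0 is rejected with a message of the same shape as the one in the log (magic=0, name still intact).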



 Comments   
Comment by Joseph Gmitter (Inactive) [ 25/Jan/17 ]

The LU tickets that landed recently that could be a cause here are LU-8821, LU-8562, LU-8840, and LU-8753.

Bob, would you be able to put together a patch that reverts these recent landings and test again, to see whether the various test_32c failures go away before we do a full git bisect to find the cause?

Comment by Jian Yu [ 26/Jan/17 ]

I just submitted 6 patches to revert those commits and will vet the test results. The tip of the patch series is https://review.whamcloud.com/25111.

Comment by Bob Glossman (Inactive) [ 26/Jan/17 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/4c758afe-e3c5-11e6-9069-5254006e85c2

Comment by Jian Yu [ 27/Jan/17 ]

Test results showed that the following commit is the root cause:

commit 555d02f47401340182b47b3245a657b52fc3e68a
Author: Fan Yong <fan.yong@intel.com>
Date:   Thu Sep 22 16:54:55 2016 +0800

    LU-8840 osp: handle EA cache properly

Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ]

Thank you yujian for the quick root cause identification.

Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ]

The revert patch of the above commit is at https://review.whamcloud.com/#/c/25134/

Comment by Gerrit Updater [ 27/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25134/
Subject: LU-9045 osp: Revert "LU-8840 osp: handle EA cache properly"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: db1ef0a322f41314abd37b5ec4ad153d63c9b405

Comment by Joseph Gmitter (Inactive) [ 30/Jan/17 ]

The revert of the LU-8840 patch has resolved these failures on master.
