[LU-9045] conf-sanity test_32c: test failed to respond and timed out Created: 25/Jan/17 Updated: 08/Feb/17 Resolved: 30/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/881f72c6-e2ae-11e6-bf0a-5254006e85c2.

The sub-test test_32c failed with the following error:

test failed to respond and timed out

Panic seen on MDS1:

0:44:09:[15158.281966] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) ASSERTION( dcb->dcb_magic == TRANS_COMMIT_CB_MAGIC ) failed: commit callback entry: magic=0 name='tgt_cb_last_committed'
00:44:09:[15158.285311] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) LBUG
00:44:09:[15158.286863] Pid: 27900, comm: jbd2/loop1-8
00:44:09:[15158.288497]
00:44:09:[15158.288497] Call Trace:
00:44:09:[15158.290800] [<ffffffffa06727f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
00:44:09:[15158.292320] [<ffffffffa0672861>] lbug_with_loc+0x41/0xb0 [libcfs]
00:44:09:[15158.293952] [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
00:44:09:[15158.295497] [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
00:44:09:[15158.297241] [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
00:44:09:[15158.298788] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
00:44:09:[15158.300346] [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
00:44:09:[15158.301683] [<ffffffff810b1720>] ? autoremove_wake_function+0x0/0x40
00:44:09:[15158.303246] [<ffffffffa0186dd0>] ? kjournald2+0x0/0x260 [jbd2]
00:44:09:[15158.304571] [<ffffffff810b064f>] kthread+0xcf/0xe0
00:44:09:[15158.305997] [<ffffffff810b0580>] ? kthread+0x0/0xe0
00:44:09:[15158.307375] [<ffffffff81696958>] ret_from_fork+0x58/0x90
00:44:09:[15158.308660] [<ffffffff810b0580>] ? kthread+0x0/0xe0
00:44:09:[15158.310084]
00:44:09:[15158.311178] Kernel panic - not syncing: LBUG
00:44:09:[15158.312166] CPU: 0 PID: 27900 Comm: jbd2/loop1-8 Tainted: G OE ------------ 3.10.0-514.6.1.el7_lustre.x86_64 #1
00:44:09:[15158.312166] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
00:44:09:[15158.312166] ffffffffa068fccc 00000000bedb0a5e ffff880039d43b80 ffffffff816863f8
00:44:09:[15158.312166] ffff880039d43c00 ffffffff8167f823 ffffffff00000008 ffff880039d43c10
00:44:09:[15158.312166] ffff880039d43bb0 00000000bedb0a5e 00000000bedb0a5e ffff88007fc0f838
00:44:09:[15158.312166] Call Trace:
00:44:09:[15158.312166] [<ffffffff816863f8>] dump_stack+0x19/0x1b
00:44:09:[15158.312166] [<ffffffff8167f823>] panic+0xe3/0x1f2
00:44:09:[15158.312166] [<ffffffffa0672879>] lbug_with_loc+0x59/0xb0 [libcfs]
00:44:09:[15158.312166] [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
00:44:09:[15158.312166] [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
00:44:09:[15158.312166] [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
00:44:09:[15158.312166] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
00:44:09:[15158.312166] [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
00:44:09:[15158.312166] [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
00:44:09:[15158.312166] [<ffffffffa0186dd0>] ? commit_timeout+0x10/0x10 [jbd2]
00:44:09:[15158.312166] [<ffffffff810b064f>] kthread+0xcf/0xe0
00:44:09:[15158.312166] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
00:44:09:[15158.312166] [<ffffffff81696958>] ret_from_fork+0x58/0x90
00:44:09:[15158.312166] [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140

Info required for matching: conf-sanity 32c |
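For readers unfamiliar with the failed assertion, here is a minimal sketch of a magic-number check on a commit callback entry. It is not the Lustre code: `struct demo_commit_cb`, `DEMO_COMMIT_CB_MAGIC`, and the `demo_*` functions are hypothetical names invented for illustration, and the magic value is arbitrary. It only illustrates the general pattern: an entry whose memory was zeroed, freed, or never initialized before the commit path reaches it will show `magic=0`, as in the log above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical magic value for this sketch; not Lustre's TRANS_COMMIT_CB_MAGIC. */
#define DEMO_COMMIT_CB_MAGIC 0xa0a00a0aU

/* Hypothetical callback entry, loosely modelled on the fields the log prints. */
struct demo_commit_cb {
    uint32_t magic;                                   /* set once at init time */
    char name[32];                                    /* label used in diagnostics */
    void (*func)(struct demo_commit_cb *cb, int err); /* called on commit */
    int fired;                                        /* counts invocations */
};

/* Example callback body: just record that it ran. */
static void demo_mark_fired(struct demo_commit_cb *cb, int err)
{
    (void)err;
    cb->fired++;
}

/* Initialize an entry; the magic marks it as valid. */
static void demo_cb_init(struct demo_commit_cb *cb, const char *name,
                         void (*func)(struct demo_commit_cb *, int))
{
    assert(name != NULL && func != NULL);
    memset(cb, 0, sizeof(*cb));
    cb->magic = DEMO_COMMIT_CB_MAGIC;
    snprintf(cb->name, sizeof(cb->name), "%s", name);
    cb->func = func;
}

/*
 * Run the callback only if the magic matches. Returns 0 on success and -1
 * when the entry looks corrupted. The code in the log asserts (LBUG) at this
 * point; this sketch reports the problem and returns an error instead.
 */
static int demo_run_commit_cb(struct demo_commit_cb *cb, int err)
{
    if (cb->magic != DEMO_COMMIT_CB_MAGIC) {
        fprintf(stderr, "commit callback entry: magic=%x name='%s'\n",
                (unsigned)cb->magic, cb->name);
        return -1;
    }
    cb->func(cb, err);
    return 0;
}
```

In this sketch, an entry set up with `demo_cb_init()` runs normally, while one that has been cleared with `memset()` is rejected by the magic check. Whether the real failure came from a stale, freed, or reused entry cannot be determined from the log alone.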
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 25/Jan/17 ] |
|
The LU tickets that landed recently that could be a cause here are:
Bob, would you be able to prepare a patch reverting these recent landings and test again, to see whether the various test_32c failures go away before we do a full git bisect to find the cause? |
| Comment by Jian Yu [ 26/Jan/17 ] |
|
I just submitted 6 patches to revert those commits and will vet the test results. The tip of the patch series is https://review.whamcloud.com/25111. |
| Comment by Bob Glossman (Inactive) [ 26/Jan/17 ] |
|
another on master: |
| Comment by Jian Yu [ 27/Jan/17 ] |
|
Test results showed that the following commit is the root cause:

commit 555d02f47401340182b47b3245a657b52fc3e68a
Author: Fan Yong <fan.yong@intel.com>
Date: Thu Sep 22 16:54:55 2016 +0800

LU-8840 osp: handle EA cache properly
|
| Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ] |
|
Thank you yujian for the quick root cause identification. |
| Comment by Joseph Gmitter (Inactive) [ 27/Jan/17 ] |
|
The revert patch of the above commit is at https://review.whamcloud.com/#/c/25134/ |
| Comment by Gerrit Updater [ 27/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25134/ |
| Comment by Joseph Gmitter (Inactive) [ 30/Jan/17 ] |
|
The revert of the |