[LU-9045] conf-sanity test_32c: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/881f72c6-e2ae-11e6-bf0a-5254006e85c2.

      The sub-test test_32c failed with the following error:

      test failed to respond and timed out
      

      Panic seen on MDS1:

      00:44:09:[15158.281966] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) ASSERTION( dcb->dcb_magic == TRANS_COMMIT_CB_MAGIC ) failed: commit callback entry: magic=0 name='tgt_cb_last_committed'
      00:44:09:[15158.285311] LustreError: 27900:0:(osd_handler.c:1562:osd_trans_commit_cb()) LBUG
      00:44:09:[15158.286863] Pid: 27900, comm: jbd2/loop1-8
      00:44:09:[15158.288497] 
      00:44:09:[15158.288497] Call Trace:
      00:44:09:[15158.290800]  [<ffffffffa06727f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      00:44:09:[15158.292320]  [<ffffffffa0672861>] lbug_with_loc+0x41/0xb0 [libcfs]
      00:44:09:[15158.293952]  [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
      00:44:09:[15158.295497]  [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
      00:44:09:[15158.297241]  [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
      00:44:09:[15158.298788]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
      00:44:09:[15158.300346]  [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
      00:44:09:[15158.301683]  [<ffffffff810b1720>] ? autoremove_wake_function+0x0/0x40
      00:44:09:[15158.303246]  [<ffffffffa0186dd0>] ? kjournald2+0x0/0x260 [jbd2]
      00:44:09:[15158.304571]  [<ffffffff810b064f>] kthread+0xcf/0xe0
      00:44:09:[15158.305997]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
      00:44:09:[15158.307375]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
      00:44:09:[15158.308660]  [<ffffffff810b0580>] ? kthread+0x0/0xe0
      00:44:09:[15158.310084] 
      00:44:09:[15158.311178] Kernel panic - not syncing: LBUG
      00:44:09:[15158.312166] CPU: 0 PID: 27900 Comm: jbd2/loop1-8 Tainted: G           OE  ------------   3.10.0-514.6.1.el7_lustre.x86_64 #1
      00:44:09:[15158.312166] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      00:44:09:[15158.312166]  ffffffffa068fccc 00000000bedb0a5e ffff880039d43b80 ffffffff816863f8
      00:44:09:[15158.312166]  ffff880039d43c00 ffffffff8167f823 ffffffff00000008 ffff880039d43c10
      00:44:09:[15158.312166]  ffff880039d43bb0 00000000bedb0a5e 00000000bedb0a5e ffff88007fc0f838
      00:44:09:[15158.312166] Call Trace:
      00:44:09:[15158.312166]  [<ffffffff816863f8>] dump_stack+0x19/0x1b
      00:44:09:[15158.312166]  [<ffffffff8167f823>] panic+0xe3/0x1f2
      00:44:09:[15158.312166]  [<ffffffffa0672879>] lbug_with_loc+0x59/0xb0 [libcfs]
      00:44:09:[15158.312166]  [<ffffffffa0f67588>] osd_trans_commit_cb+0x308/0x380 [osd_ldiskfs]
      00:44:09:[15158.312166]  [<ffffffffa0efb554>] ldiskfs_journal_commit_callback+0x84/0xc0 [ldiskfs]
      00:44:09:[15158.312166]  [<ffffffffa018260b>] jbd2_journal_commit_transaction+0x161b/0x19a0 [jbd2]
      00:44:09:[15158.312166]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
      00:44:09:[15158.312166]  [<ffffffffa0186e99>] kjournald2+0xc9/0x260 [jbd2]
      00:44:09:[15158.312166]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
      00:44:09:[15158.312166]  [<ffffffffa0186dd0>] ? commit_timeout+0x10/0x10 [jbd2]
      00:44:09:[15158.312166]  [<ffffffff810b064f>] kthread+0xcf/0xe0
      00:44:09:[15158.312166]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
      00:44:09:[15158.312166]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
      00:44:09:[15158.312166]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
      

      Info required for matching: conf-sanity 32c
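
For matching further occurrences of this panic, the assertion line in the console log above can be split into its parts. The following is a minimal sketch, assuming the field layout seen in that one line (PID, source file, line number, function, asserted expression, message); the pattern `LBUG_RE` and the helper `parse_assertion` are hypothetical names and not part of Lustre or Maloo, and other Lustre messages may be laid out differently.

```python
import re

# Pattern inferred from the single assertion line in the log above (an assumption,
# not a documented format): "LustreError: PID:N:(file:line:func()) ASSERTION( expr ) failed: msg"
LBUG_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"ASSERTION\( (?P<expr>.+?) \) failed: (?P<msg>.*)$"
)


def parse_assertion(log_line):
    """Return the assertion fields as a dict, or None if the line is not an assertion."""
    match = LBUG_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


# The assertion line from the MDS1 console log, copied from the description.
sample = (
    "00:44:09:[15158.281966] LustreError: 27900:0:"
    "(osd_handler.c:1562:osd_trans_commit_cb()) "
    "ASSERTION( dcb->dcb_magic == TRANS_COMMIT_CB_MAGIC ) failed: "
    "commit callback entry: magic=0 name='tgt_cb_last_committed'"
)

info = parse_assertion(sample)
print(info["func"], "|", info["expr"], "|", info["msg"])
```

The extracted function name and asserted expression give a compact signature (here `osd_trans_commit_cb` with the `dcb_magic` check) that could be compared across test runs.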

          Activity

            jgmitter Joseph Gmitter (Inactive) added a comment -

            The revert of the LU-8840 patch has resolved these failures on master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25134/
            Subject: LU-9045 osp: Revert "LU-8840 osp: handle EA cache properly"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: db1ef0a322f41314abd37b5ec4ad153d63c9b405

            jgmitter Joseph Gmitter (Inactive) added a comment -

            The revert patch of the above commit is at https://review.whamcloud.com/#/c/25134/
            jgmitter Joseph Gmitter (Inactive) added a comment - edited

            Thank you, yujian, for the quick root cause identification.
            yujian Jian Yu added a comment -

            Test results showed that the following commit is the root cause:

            commit 555d02f47401340182b47b3245a657b52fc3e68a
            Author: Fan Yong <fan.yong@intel.com>
            Date:   Thu Sep 22 16:54:55 2016 +0800
            
                LU-8840 osp: handle EA cache properly
            
            bogl Bob Glossman (Inactive) added a comment -

            Another failure on master: https://testing.hpdd.intel.com/test_sets/4c758afe-e3c5-11e6-9069-5254006e85c2
            yujian Jian Yu added a comment -

            I just submitted 6 patches to revert those commits and will vet the test results. The tip of the patch series is https://review.whamcloud.com/25111.
            jgmitter Joseph Gmitter (Inactive) added a comment - edited

            The recently landed LU tickets that could be the cause here are LU-8821, LU-8562, LU-8840, and LU-8753.

            Bob, would you be able to put together a patch reverting these recent landings and test again to see if the various test_32c failures go away, before we do a full-on git bisect to find the cause?
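
The fallback mentioned in the comment above, a full bisect, is a binary search over the commit history for the first commit at which the test starts failing. The following is a minimal sketch of that search, assuming a linear history ordered oldest to newest in which every commit after the culprit also fails. The helper `first_bad`, the predicate `fails_test_32c`, and the landing order of the four tickets are hypothetical; the stand-in predicate only encodes the root cause reported in this ticket and runs no actual test.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, the last one is known bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return commits[lo]


# Hypothetical landing order of the suspect tickets; the real order is not given here.
history = ["LU-8821", "LU-8562", "LU-8840", "LU-8753"]
culprit_index = history.index("LU-8840")


def fails_test_32c(commit):
    """Stand-in for building the tree at `commit` and running conf-sanity test_32c."""
    return history.index(commit) >= culprit_index


print(first_bad(history, fails_test_32c))
```

With only a handful of suspects, reverting them all and then re-testing, as was done in this ticket, can be quicker than bisecting; the search pays off when the candidate range is long.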

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo