Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.5.4
    • None
    • 2.5.4-2.6.32_504.30.3.el6.atlas.x86_64.x86_64
    • 3

    Description

      Wednesday morning one of our production MDS nodes hit an assertion:

      {{
      2015-11-18 10:58:35 [1912759.384335] LustreError: 14428:0:(osp_sync.c:352:osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed:
      2015-11-18 10:58:35 [1912759.396346] LustreError: 14428:0:(osp_sync.c:352:osp_sync_interpret()) LBUG
      2015-11-18 10:58:35 [1912759.404445] Pid: 14428, comm: ptlrpcd_2
      2015-11-18 10:58:35 [1912759.409032]
      2015-11-18 10:58:35 [1912759.409033] Call Trace:
      2015-11-18 10:58:35 [1912759.414039] [<ffffffffa0430895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2015-11-18 10:58:35 [1912759.422141] [<ffffffffa0430e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      2015-11-18 10:58:35 [1912759.429363] [<ffffffffa0f6e3db>] osp_sync_interpret+0x50b/0x510 [osp]
      2015-11-18 10:58:35 [1912759.437003] [<ffffffffa075aacd>] ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
      2015-11-18 10:58:35 [1912759.444806] [<ffffffff8108802b>] ? try_to_del_timer_sync+0x7b/0xe0
      2015-11-18 10:58:35 [1912759.452147] [<ffffffffa0788b13>] ptlrpcd_check+0x3d3/0x610 [ptlrpc]
      2015-11-18 10:58:35 [1912759.459582] [<ffffffffa078924b>] ptlrpcd+0x20b/0x370 [ptlrpc]
      2015-11-18 10:58:35 [1912759.466413] [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
      2015-11-18 10:58:35 [1912759.473654] [<ffffffffa0789040>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      2015-11-18 10:58:35 [1912759.480485] [<ffffffff8109e78e>] kthread+0x9e/0xc0
      2015-11-18 10:58:35 [1912759.486243] [<ffffffff8100c28a>] child_rip+0xa/0x20
      2015-11-18 10:58:35 [1912759.492100] [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      2015-11-18 10:58:35 [1912759.497954] [<ffffffff8100c280>] ? child_rip+0x0/0x20
      2015-11-18 10:58:35 [1912759.503992]
      2015-11-18 10:58:35 [1912759.506435] Kernel panic - not syncing: LBUG
      2015-11-18 10:58:35 [1912759.511515] Pid: 14428, comm: ptlrpcd_2 Not tainted 2.6.32-504.30.3.el6.atlas.x86_64 #1
      2015-11-18 10:58:35 [1912759.520881] Call Trace:
      2015-11-18 10:58:35 [1912759.523916] [<ffffffff81529cbc>] ? panic+0xa7/0x16f
      2015-11-18 10:58:35 [1912759.529778] [<ffffffffa0430eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2015-11-18 10:58:35 [1912759.537188] [<ffffffffa0f6e3db>] ? osp_sync_interpret+0x50b/0x510 [osp]
      2015-11-18 10:58:35 [1912759.545012] [<ffffffffa075aacd>] ? ptlrpc_check_set+0x31d/0x1c20 [ptlrpc]
      2015-11-18 10:58:35 [1912759.553002] [<ffffffff8108802b>] ? try_to_del_timer_sync+0x7b/0xe0
      2015-11-18 10:58:35 [1912759.560340] [<ffffffffa0788b13>] ? ptlrpcd_check+0x3d3/0x610 [ptlrpc]
      2015-11-18 10:58:35 [1912759.567961] [<ffffffffa078924b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      2015-11-18 10:58:35 [1912759.574979] [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
      2015-11-18 10:58:35 [1912759.582209] [<ffffffffa0789040>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      2015-11-18 10:58:35 [1912759.589034] [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
      2015-11-18 10:58:35 [1912759.594981] [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      2015-11-18 10:58:35 [1912759.601024] [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      2015-11-18 10:58:35 [1912759.606874] [<ffffffff8100c280>] ? child_rip+0x0/0x20
      }}

      Is this related to https://jira.hpdd.intel.com/browse/LU-5629 ?
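
      For readers less familiar with the check, the asserted expression in the log, `rc || req->rq_transno`, only fails when both operands are zero: the callback was handed a zero return code and the request carries no transaction number. The fragment below is a minimal userspace sketch of that condition, not the Lustre source. The `sketch_request` struct and the `sketch_*` helpers are hypothetical names introduced for illustration; only `rc` and `rq_transno` come from the assertion text.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical stand-in holding the one field the assertion reads.
       * This is NOT the real struct ptlrpc_request. */
      struct sketch_request {
          uint64_t rq_transno; /* transaction number attached to the request */
      };

      /* Mirrors the asserted expression `rc || req->rq_transno`:
       * true when rc is nonzero (an error) or a transno is present. */
      static bool sketch_reply_is_consistent(int rc, const struct sketch_request *req)
      {
          return rc != 0 || req->rq_transno != 0;
      }

      /* Userspace analogue of the check: aborts when rc == 0 and transno == 0,
       * which is the combination the MDS hit in the log above. */
      static void sketch_check(int rc, const struct sketch_request *req)
      {
          assert(sketch_reply_is_consistent(rc, req));
      }
      ```

      Under this reading, an error reply with no transno passes the check, and so does a zero `rc` with a transno; only the combination of zero `rc` and zero transno trips it.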

          Activity

            [LU-7453] osp_sync_interpret assertion
            pjones Peter Jones added a comment -

            As per ORNL ok to close - this just happened one time a long time ago


            bzzz Alex Zhuravlev added a comment -

            any additional patches on top of that?
            yujian Jian Yu added a comment -

            Thank you, Alex.

            The server version is Lustre 2.5.4 (2.5.4-2.6.32_504.30.3.el6.atlas.x86_64.x86_64).


            bzzz Alex Zhuravlev added a comment -

            yes, we can add a bit more debug. which branch should it target?
            bzzz Alex Zhuravlev added a comment - - edited

            it was a resend for OST_DESTROY, though I don't have a good idea of the root cause yet.

            yujian Jian Yu added a comment -

            Hi Alex,

            Jesse has uploaded the logs. Could you please investigate and suggest? Thank you.


            paf Patrick Farrell (Inactive) added a comment -

            Jesse -

            Just FYI, those ptlrpc functions aren't related to the higher level operations you're describing. They both act on sets of RPCs, so they don't have anything to do with SETATTR or DESTROY.

            - Patrick
            hanleyja Jesse Hanley added a comment -

            I've uploaded the log. Sorry for the delay.

            I've never seen this extension for crash before. Thanks for the info!


            Jesse

            bzzz Alex Zhuravlev added a comment - - edited

            there is a ticket with a binary and instructions: https://bugzilla.lustre.org/show_bug.cgi?id=13155
            if you've got time - please, try to extract the log from the dump. thanks in advance!

            yujian Jian Yu added a comment -

            Hi Alex,
            Could you please take a look at Jesse's comments and advise? Thank you.


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: hanleyja Jesse Hanley