LU-3433

Encountered an assertion for the ols_state being set to an impossible state

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
    • Environment: Cray Lustre 2.4 clients running on SLES11 SP2.
    • Severity: 3

    Description

      During a test run using Lustre 2.4, one of our clients encountered this LBUG:

      [2013-05-29 18:39:47][c7-1c1s3n3]LustreError:16573:0:(osc_lock.c:1165:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
      [2013-05-29 18:39:47][c7-1c1s3n3]LustreError: 16573:0:(osc_lock.c:1165:osc_lock_enqueue()) LBUG
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff81006451>] try_stack_unwind+0x161/0x1a0
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff81004ca9>] dump_trace+0x89/0x440
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa013a897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa013ade7>] lbug_with_loc+0x47/0xc0 [libcfs]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa0639f55>] osc_lock_enqueue+0x725/0x8b0 [osc]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa032d4eb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa06ccded>] lov_lock_enqueue+0x1fd/0x880 [lov]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa032d4eb>] cl_enqueue_try+0xfb/0x320 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa032e3bf>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa032efbe>] cl_lock_request+0x7e/0x270 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa0334274>] cl_io_lock+0x394/0x5c0 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa033453a>] cl_io_loop+0x9a/0x1a0 [obdclass]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa074b90f>] ll_file_io_generic+0x33f/0x5f0 [lustre]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa074c08b>] ll_file_aio_read+0x23b/0x290 [lustre]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffffa074d002>] ll_file_read+0x1f2/0x280 [lustre]
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff81135548>] vfs_read+0xc8/0x180
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff8113b799>] kernel_read+0x49/0x60
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff8113b885>] prepare_binprm+0xd5/0x100
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff8113c670>] do_execve_common+0x1c0/0x300
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff8113c83f>] do_execve+0x3f/0x50
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff8100ae1e>] sys_execve+0x4e/0x80
      [2013-05-29 18:39:47][c7-1c1s3n3] [<ffffffff81316a3c>] stub_execve+0x6c/0xc0
      [2013-05-29 18:39:47][c7-1c1s3n3] [<0000000020176437>] 0x20176437
      [2013-05-29 18:39:47][c7-1c1s3n3]Kernel panic - not syncing: LBUG
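
      For context, the assertion fires in osc_lock_enqueue(), which expects to be handed a freshly created lock. Below is a minimal sketch of the check, assuming the 2.4-era ordering of the osc_lock_state enum; if that ordering holds, the "Impossible state: 4" above would be OLS_GRANTED, i.e. an already-granted lock being enqueued a second time:

        /* Minimal sketch only: the assertion text is quoted from the
         * console log above, but the enum ordering is an assumption
         * based on the Lustre 2.4-era osc code. It is what maps the
         * raw number in the message to a state name. */
        enum osc_lock_state {
                OLS_NEW,             /* 0 - freshly created, expected here */
                OLS_ENQUEUE_WAIT,    /* 1 */
                OLS_UPCALL_RECEIVED, /* 2 */
                OLS_ENQUEUED,        /* 3 */
                OLS_GRANTED,         /* 4 - the state reported above */
                OLS_RELEASED,        /* 5 */
                OLS_BLOCKED,         /* 6 - the state in a later report below */
                OLS_CANCELLED        /* 7 */
        };

        /* osc_lock.c:1165 - anything but OLS_NEW trips the LBUG: */
        LASSERTF(ols->ols_state == OLS_NEW,
                 "Impossible state: %d\n", ols->ols_state);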

      Attachments

        1. test.sh (0.2 kB)
        2. write-eintr.c (3 kB)
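
        The attachment name write-eintr.c suggests a reproducer that interrupts writes with signals so the client I/O path sees EINTR while locks are being enqueued. The following is a purely hypothetical sketch of that kind of test, not the contents of the actual attachment:

          /* Hypothetical reproducer sketch - NOT the attached write-eintr.c.
           * It writes to a file while SIGALRM, installed without SA_RESTART,
           * interrupts the syscall, so write(2) periodically returns EINTR
           * while the filesystem client is busy enqueueing locks. */
          #include <errno.h>
          #include <fcntl.h>
          #include <signal.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/time.h>
          #include <unistd.h>

          static void handler(int sig) { (void)sig; /* just interrupt the syscall */ }

          int main(int argc, char **argv)
          {
                  static char buf[1 << 20];
                  struct sigaction sa;
                  int fd, i;

                  if (argc != 2) {
                          fprintf(stderr, "usage: %s <file-on-lustre>\n", argv[0]);
                          return 1;
                  }

                  /* No SA_RESTART, so interrupted syscalls fail with EINTR. */
                  memset(&sa, 0, sizeof(sa));
                  sa.sa_handler = handler;
                  sigaction(SIGALRM, &sa, NULL);

                  fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
                  if (fd < 0) {
                          perror("open");
                          return 1;
                  }

                  memset(buf, 'x', sizeof(buf));
                  for (i = 0; i < 10000; i++) {
                          /* One-shot 1 ms timer, aimed to fire mid-write. */
                          struct itimerval it = { .it_value = { .tv_usec = 1000 } };

                          setitimer(ITIMER_REAL, &it, NULL);
                          if (write(fd, buf, sizeof(buf)) < 0 && errno != EINTR) {
                                  perror("write");
                                  break;
                          }
                  }
                  close(fd);
                  return 0;
          }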


          Activity

            [LU-3433] Encountered an assertion for the ols_state being set to an impossible state
            pjones Peter Jones added a comment -

            Just timing. At the time, the fix for LU-3889 was unproven, so reverting to the previous known state was preferable. I think that moving forward with both fixes is a sound approach.

            niu Niu Yawei (Inactive) added a comment -

            The fix for LU-3889 has landed on master and b2_5; it looks to me like we should just re-add this patch and backport the LU-3889 fix to b2_4.

            Bob, what's your opinion? I see the revert patch was uploaded by you. Why did we revert this patch instead of backporting the LU-3889 fix?

            morrone Christopher Morrone (Inactive) added a comment -

            Patrick - I understand that. We hit the bug described in this ticket, not the bug caused by the fix for this ticket. We would like to be able to cherry-pick the fix so that our client does not hit this assertion, but if the patch needed to be reverted, it is not yet safe for me to do so.

            paf Patrick Farrell (Inactive) added a comment -

            Christopher - I'm not sure what exactly your 2.4.0-19 consists of, but Intel's 2.4.0 does not have this patch reverted. Only if you're pulling updated sources from the b2_4 (or whatever it's called - sorry) branch in the repo would you have the patch reverted.

            If you've got 2.4.0 (or even 2.4.1, I believe), you wouldn't have the patch to begin with.

            morrone Christopher Morrone (Inactive) added a comment -

            We saw the same assertion in production at LLNL with Lustre 2.4.0-19chaos. Unfortunately it did not dump the backtrace to the console, and the crash dump failed. But the assertion and particular unknown state (4) were the same as in this ticket, so I am assuming that it is the same bug.

            Since the fix needed to be reverted, it would appear that this issue needs to be reopened.

            adilger Andreas Dilger added a comment -

            Patch is being reverted from b2_4 due to test failures: http://review.whamcloud.com/8621

            yujian Jian Yu added a comment -

            Patch landed on Lustre b2_4 branch for 2.4.2.

            yujian Jian Yu added a comment -

            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/66/
            Distro/Arch: RHEL6.4/x86_64

            The same failure occurred while running parallel-scale test write_disjoint:
            https://maloo.whamcloud.com/test_sets/e255840a-6193-11e3-a871-52540035b04c

            Console log on client node:

            17:35:46:Lustre: DEBUG MARKER: == parallel-scale test write_disjoint: write_disjoint == 17:31:44 (1386639104)
            17:35:46:LustreError: 12360:0:(osc_lock.c:1165:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
            17:35:46:LustreError: 12360:0:(osc_lock.c:1165:osc_lock_enqueue()) LBUG
            17:35:46:Pid: 12360, comm: write_disjoint
            17:35:47:
            17:35:47:Call Trace:
            17:35:47: [<ffffffffa19ac895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            17:35:47: [<ffffffffa19ace97>] lbug_with_loc+0x47/0xb0 [libcfs]
            17:35:47: [<ffffffffa05e4fc0>] ? osc_lock_enqueue+0x0/0x870 [osc]
            17:35:47: [<ffffffffa05e5674>] osc_lock_enqueue+0x6b4/0x870 [osc]
            17:35:47: [<ffffffffa19ad717>] ? cfs_waitq_broadcast+0x17/0x20 [libcfs]
            17:35:47: [<ffffffffa1b15f5e>] ? cl_lock_state_signal+0x7e/0x160 [obdclass]
            17:35:47: [<ffffffffa1b19d4c>] cl_enqueue_try+0xfc/0x300 [obdclass]
            17:35:47: [<ffffffffa06620da>] lov_lock_enqueue+0x22a/0x850 [lov]
            17:35:47: [<ffffffffa1b19d4c>] cl_enqueue_try+0xfc/0x300 [obdclass]
            17:35:47: [<ffffffffa1b1b13f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
            17:35:47: [<ffffffffa1b1bdae>] cl_lock_request+0x7e/0x270 [obdclass]
            17:35:47: [<ffffffffa1b2104c>] cl_io_lock+0x3cc/0x560 [obdclass]
            17:35:48: [<ffffffffa1b21282>] cl_io_loop+0xa2/0x1b0 [obdclass]
            17:35:48: [<ffffffffa06e09e0>] ll_file_io_generic+0x450/0x600 [lustre]
            17:35:48: [<ffffffffa06e1922>] ll_file_aio_write+0x142/0x2c0 [lustre]
            17:35:48: [<ffffffffa06e1c0c>] ll_file_write+0x16c/0x2a0 [lustre]
            17:35:48: [<ffffffff81181398>] vfs_write+0xb8/0x1a0
            17:35:48: [<ffffffff81181c91>] sys_write+0x51/0x90
            17:35:48: [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
            17:35:48: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
            17:35:48:
            17:35:48:Kernel panic - not syncing: LBUG
            

            Patch back-ported to Lustre b2_4 branch: http://review.whamcloud.com/8530


            niu Niu Yawei (Inactive) added a comment -

            Patch landed for 2.5.

            vitaly_fertman Vitaly Fertman added a comment -

            http://review.whamcloud.com/6709

            paf Patrick Farrell (Inactive) added a comment -

            Niu, James,

            We have a candidate patch for this from Xyratex that we're testing on our end. It appears to resolve the issue, and once I'm sure it's solid, we'll make sure it's pushed upstream.

            - Patrick

            People

              Assignee:
              niu Niu Yawei (Inactive)
              Reporter:
              simmonsja James A Simmons
              Votes:
              0
              Watchers:
              16
