Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2779

LBUG in discard_cb: !(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))

Details

    • 3
    • 6732

    Description

      We hit this on Sequoia during shutdown time, I don't see an existing bug open for this crash either:

      2013-02-01 14:35:53.812430 {R4-llnl} [bgqadmin]{5}.3.1: 
      2013-02-01 14:35:53.812843 {R4-llnl} [bgqadmin]{5}.3.1: Broadcast message from root@seqio262-ib0
      2013-02-01 14:35:53.813165 {R4-llnl} [bgqadmin]{5}.3.1: 	(unknown) at 14:35 ...
      2013-02-01 14:35:53.813752 {R4-llnl} [bgqadmin]{5}.3.1: The system is going down for halt NOW!
      2013-02-01 14:35:53.814093 {R4-llnl} [bgqadmin]{5}.2.3: Stopping Common I/O Services: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))) ) failed: 
      2013-02-01 14:35:53.814429 {R4-llnl} [bgqadmin]{5}.2.3: LustreError: 4653:0:(cl_lock.c:1967:discard_cb()) LBUG
      2013-02-01 14:35:53.814746 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
      2013-02-01 14:35:53.815076 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b40] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
      2013-02-01 14:35:53.815397 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392bf0] [8000000000ab0c88] .libcfs_debug_dumpstack+0xd8/0x150 [libcfs]
      2013-02-01 14:35:53.815717 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab1450] .lbug_with_loc+0x50/0xc0 [libcfs]
      2013-02-01 14:35:53.816042 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
      2013-02-01 14:35:53.816392 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
      2013-02-01 14:35:53.816732 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
      2013-02-01 14:35:53.817047 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
      2013-02-01 14:35:53.817363 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
      2013-02-01 14:35:53.817877 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
      2013-02-01 14:35:53.818248 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
      2013-02-01 14:35:53.818565 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
      2013-02-01 14:35:53.818901 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
      2013-02-01 14:35:53.819221 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
      2013-02-01 14:35:53.819535 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
      2013-02-01 14:35:53.819908 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
      2013-02-01 14:35:53.820387 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
      2013-02-01 14:35:53.820707 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
      2013-02-01 14:35:53.821021 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
      2013-02-01 14:35:53.821337 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
      2013-02-01 14:35:53.821652 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
      2013-02-01 14:35:53.821966 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
      2013-02-01 14:35:53.822282 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
      2013-02-01 14:35:53.822595 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
      2013-02-01 14:35:53.822909 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-02-01 14:35:53.823222 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
      2013-02-01 14:35:53.823534 {R4-llnl} [bgqadmin]{5}.2.3: Kernel panic - not syncing: LBUG
      2013-02-01 14:35:53.823844 {R4-llnl} [bgqadmin]{5}.2.3: Call Trace:
      2013-02-01 14:35:53.824153 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392b60] [c000000000008d1c] .show_stack+0x7c/0x184 (unreliable)
      2013-02-01 14:35:53.824466 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392c10] [c000000000431ef4] .panic+0x80/0x1ac
      2013-02-01 14:35:53.824776 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ca0] [8000000000ab14b0] .lbug_with_loc+0xb0/0xc0 [libcfs]
      2013-02-01 14:35:53.825089 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392d30] [80000000024f15f8] .discard_cb+0x238/0x240 [obdclass]
      2013-02-01 14:35:53.825401 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392dd0] [80000000024ecadc] .cl_page_gang_lookup+0x26c/0x600 [obdclass]
      2013-02-01 14:35:53.825721 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392ef0] [80000000024f11f8] .cl_lock_discard_pages+0x188/0x2c0 [obdclass]
      2013-02-01 14:35:53.826045 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e392fa0] [80000000046aa390] .osc_lock_flush+0x290/0x4a0 [osc]
      2013-02-01 14:35:53.826358 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393090] [80000000046aa6dc] .osc_lock_cancel+0x13c/0x2c0 [osc]
      2013-02-01 14:35:53.826670 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393160] [80000000024eda90] .cl_lock_cancel0+0xd0/0x2b0 [obdclass]
      2013-02-01 14:35:53.826982 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393220] [80000000024f09f8] .cl_lock_hold_release+0x258/0x450 [obdclass]
      2013-02-01 14:35:53.827295 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3932f0] [80000000024f36fc] .cl_lock_unhold+0x8c/0x270 [obdclass]
      2013-02-01 14:35:53.827608 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3933c0] [800000000513e5b4] .lov_sublock_release+0x244/0x370 [lov]
      2013-02-01 14:35:53.827920 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393480] [8000000005141f68] .lov_lock_enqueue+0x388/0xb20 [lov]
      2013-02-01 14:35:53.828232 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3935c0] [80000000024f4d88] .cl_enqueue_try+0x1d8/0x510 [obdclass]
      2013-02-01 14:35:53.828649 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e3936d0] [80000000024f6d88] .cl_enqueue_locked+0xa8/0x2c0 [obdclass]
      2013-02-01 14:35:53.829092 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393780] [80000000024f72b0] .cl_lock_request+0xe0/0x370 [obdclass]
      2013-02-01 14:35:53.829361 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393850] [800000000695efb4] .cl_glimpse_lock+0x2b4/0x640 [lustre]
      2013-02-01 14:35:53.829629 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393940] [800000000695f538] .cl_glimpse_size0+0x1f8/0x270 [lustre]
      2013-02-01 14:35:53.829892 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393a10] [80000000068f1510] .ll_inode_revalidate_it+0x220/0x2c0 [lustre]
      2013-02-01 14:35:53.830155 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393ad0] [80000000068f15f0] .ll_getattr_it+0x40/0x180 [lustre]
      2013-02-01 14:35:53.830420 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393b70] [80000000068f1774] .ll_getattr+0x44/0x60 [lustre]
      2013-02-01 14:35:53.830686 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c20] [c0000000000d57d8] .vfs_getattr+0x38/0x60
      2013-02-01 14:35:53.830952 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393c90] [c0000000000d5e4c] .vfs_fstatat+0x78/0xa8
      2013-02-01 14:35:53.831217 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58
      2013-02-01 14:35:53.831483 {R4-llnl} [bgqadmin]{5}.2.3: [c00000038e393e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      Attachments

        Issue Links

          Activity

            [LU-2779] LBUG in discard_cb: !(page->cp_type == CPT_CACHEABLE) || (!PageWriteback(cl_page_vmpage(env, page)))
            jgmitter Joseph Gmitter (Inactive) made changes -
            Link New: This issue is related to LDEV-134 [ LDEV-134 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-4581 [ LU-4581 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.4.2 [ 10605 ]
            pjones Peter Jones made changes -
            Labels Original: LB mq413 sequoia topsequoia New: sequoia topsequoia

            indeed. patch set 2 allows the interrupt by setting interruption callback to LWI_ON_SIGNAL_NOOP. In that case, it makes sense to upgrade to patch set 4.

            jay Jinshan Xiong (Inactive) added a comment - indeed. patch set 2 allows the interrupt by setting interruption callback to LWI_ON_SIGNAL_NOOP. In that case, it makes sense to upgrade to patch set 4.

            But the fix looks different. Patch set 4 changes the LWI_INTR interrupt handler from LWI_ON_SIGNAL_NOOP to NULL, and does nothing else. Patch set 2 did not do that. Instead the only significant change in patch set 2 was the removal of the KLASSERT.

            I'm missing how those two things are equivalent.

            morrone Christopher Morrone (Inactive) added a comment - But the fix looks different. Patch set 4 changes the LWI_INTR interrupt handler from LWI_ON_SIGNAL_NOOP to NULL, and does nothing else. Patch set 2 did not do that. Instead the only significant change in patch set 2 was the removal of the KLASSERT. I'm missing how those two things are equivalent.

            The reason why this patch was made is that OSC has to wait for the IO RPC to finish anyway, no matter how much time it will need; otherwise it will hit the assertion in discard_cb(). If the OST is in recovery, it may take really long time for that OST to finish the RPC.

            jay Jinshan Xiong (Inactive) added a comment - The reason why this patch was made is that OSC has to wait for the IO RPC to finish anyway, no matter how much time it will need; otherwise it will hit the assertion in discard_cb(). If the OST is in recovery, it may take really long time for that OST to finish the RPC.
            jay Jinshan Xiong (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Patch set 2 has more debug information. The real fix is the same.

            jay Jinshan Xiong (Inactive) added a comment - Patch set 2 has more debug information. The real fix is the same.

            Patch set 2.

            morrone Christopher Morrone (Inactive) added a comment - Patch set 2.

            People

              jay Jinshan Xiong (Inactive)
              prakash Prakash Surya (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: