
SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0
    • Labels: None
    • Environment: SWL Hyperion/LLNL
    • Severity: 3
    • Rank (Obsolete): 4375

    Description

      The MDS crashed and dumped; attempting to locate the dump at this time.
      Message from MDS:

      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
      2012-09-16 11:35:57 LustreError: 5503:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG

      This looks like a possible dup of ORI-577; however, that bug was supposed to have been fixed.

      The MDS did not dump a stack; it was configured with panic_on_lbug.
      Will attempt to replicate.
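
      For illustration, the invariant that fired is the link-count check in osd_object_ref_del(): the OSD asserts that an inode still has a positive link count before decrementing it, so the LBUG means something unlinked an object whose count had already reached zero. A minimal userspace analogue (not Lustre source; fake_inode and ref_del are invented for this demo) fails the same way on a double unlink:

      /* nlink_demo.c - userspace illustration of the failed invariant.
       * Build: cc -o nlink_demo nlink_demo.c */
      #include <assert.h>

      struct fake_inode {
              unsigned int i_nlink;   /* stands in for inode->i_nlink */
      };

      static void ref_del(struct fake_inode *inode)
      {
              /* mirrors LASSERT(inode->i_nlink > 0) at osd_handler.c:2343 */
              assert(inode->i_nlink > 0);
              inode->i_nlink--;
      }

      int main(void)
      {
              struct fake_inode ino = { .i_nlink = 1 };

              ref_del(&ino);  /* fine: link count goes 1 -> 0 */
              ref_del(&ino);  /* a second del of the same object fires the
                               * assertion, as the LBUG did on the MDS */
              return 0;
      }

      On the MDS a failed LASSERT escalates to an LBUG, and with panic_on_lbug set the node panics rather than continuing.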


        Activity

          [LU-1951] SWL: osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:

          cliffw Cliff White (Inactive) added a comment -

          Hit this again while running SWL; backtrace:

          PID: 4891   TASK: ffff88016dedaaa0  CPU: 13  COMMAND: "mdt03_014"
           #0 [ffff88016e693918] machine_kexec at ffffffff8103281b
           #1 [ffff88016e693978] crash_kexec at ffffffff810ba792
           #2 [ffff88016e693a48] panic at ffffffff814fd591
           #3 [ffff88016e693ac8] lbug_with_loc at ffffffffa0395f6b [libcfs]
           #4 [ffff88016e693ae8] osd_object_ref_del at ffffffffa0a8b6c1 [osd_ldiskfs]
           #5 [ffff88016e693b18] mdo_ref_del at ffffffffa0ef0ffd [mdd]
           #6 [ffff88016e693b28] mdd_unlink at ffffffffa0ef6675 [mdd]
           #7 [ffff88016e693be8] cml_unlink at ffffffffa06bc037 [cmm]
           #8 [ffff88016e693c28] mdt_reint_unlink at ffffffffa0f7b454 [mdt]
           #9 [ffff88016e693ca8] mdt_reint_rec at ffffffffa0f78151 [mdt]
          #10 [ffff88016e693cc8] mdt_reint_internal at ffffffffa0f719aa [mdt]
          #11 [ffff88016e693d18] mdt_reint at ffffffffa0f71cf4 [mdt]
          #12 [ffff88016e693d38] mdt_handle_common at ffffffffa0f65802 [mdt]
          #13 [ffff88016e693d88] mdt_regular_handle at ffffffffa0f666f5 [mdt]
          #14 [ffff88016e693d98] ptlrpc_server_handle_request at ffffffffa095db3c [ptlrpc]
          #15 [ffff88016e693e98] ptlrpc_main at ffffffffa095f111 [ptlrpc]
          #16 [ffff88016e693f48] kernel_thread at ffffffff8100c14a
          

          liang Liang Zhen (Inactive) added a comment -

          Check the last comment posted by Cliff on LU-1948; we hit this again while running SWL.


          cliffw Cliff White (Inactive) added a comment -

          MDS crashed; panic stack:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Call Trace:
          2012-09-27 21:56:35 [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1
          2012-09-27 21:56:35 Call Trace:
          2012-09-27 21:56:35 [<ffffffff814fd58a>] ? panic+0xa0/0x168
          2012-09-27 21:56:35 [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          2012-09-27 21:56:35 [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20

          
          

          vmcore is on brent in ~/cliffw/lu1948/erofs
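
          For context, here is an abridged sketch (not the verbatim libcfs source) of why the node panics rather than just logging: lbug_with_loc() dumps the failing thread's stack, which produces the first trace above, and then, because this MDS runs with panic_on_lbug set, calls panic("LBUG"), which produces the "Kernel panic - not syncing: LBUG" line and the second, panic-path trace.

          /* Abridged sketch of the libcfs LBUG path; the function names match
           * the traces above, but the body is illustrative, not exact source. */
          void lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
          {
                  libcfs_debug_msg(msgdata, "LBUG\n");
                  libcfs_debug_dumpstack(NULL);   /* first "Call Trace" above */
                  if (libcfs_panic_on_lbug)
                          panic("LBUG");          /* second trace, via panic() */
                  /* without panic_on_lbug the thread would spin here instead */
          }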

          pjones Peter Jones added a comment -

          Hongchao

          If you think that it should be included, then please set inspectors on the patch.

          Thanks

          Peter

          pjones Peter Jones added a comment -

          No, it is not.


          hongchao.zhang Hongchao Zhang added a comment -

          Is the patch (http://review.whamcloud.com/#change,4041) included in this build (24)?


          cliffw Cliff White (Inactive) added a comment -

          /var/log/messages from the evicted client:

          Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1348812749/real 1348812749]  req@ffff8802116cec00 x1414295663929201/t0(0) o250->MGC192.168.127.6@o2ib@192.168.127.6@o2ib:26/25 lens 400/544 e 0 to 1 dl 1348812804 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          Sep 27 23:13:24 hyperion787 kernel: Lustre: 3928:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
          Sep 27 23:14:59 hyperion787 kernel: Lustre: Evicted from MGS (at MGC192.168.127.6@o2ib_0) after server handle changed from 0x5e1536afc64a7a2d to 0x9107144105607180
          Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Reactivating import
          Sep 27 23:14:59 hyperion787 kernel: Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 192.168.127.6@o2ib)
          Sep 27 23:16:21 hyperion787 kernel: Lustre: lustre-MDT0000-mdc-ffff88033ba61400: Connection restored to lustre-MDT0000 (at 192.168.127.6@o2ib1)
          

          cliffw Cliff White (Inactive) added a comment -

          Also, some issues when recovering the MDS after the dump (the rc -39 replay failure is decoded in the snippet after the log):

          Lustre: lustre-MDT0000: used disk, loading
          Lustre: 4204:0:(ldlm_lib.c:2139:target_recovery_init()) RECOVERY: service lustre-MDT0000, 91 recoverable clients, last_transno 8624944131
          Lustre: 4140:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log lustre-mdtir error -22
          Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 91 clients reconnect
          LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
          LustreError: 4216:0:(mdt_recovery.c:497:mdt_txn_stop_cb()) Replay transno 8624998654 failed: rc -39
          LustreError: 4216:0:(mds_lov.c:351:mds_lov_update_objids()) Unexpected gap in objids
          Lustre: lustre-MDT0000: disconnecting 1 stale clients
          Lustre: lustre-MDT0000: Recovery over after 1:22, of 91 clients 90 recovered and 1 was evicted.
          Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000f_UUID now active, resetting orphans
          Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST001b_UUID now active, resetting orphans
          Lustre: Skipped 5 previous similar messages
          Lustre: Skipped 45 previous similar messages
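
          The "Replay transno ... failed: rc -39" line above is worth decoding: on Linux, errno 39 is ENOTEMPTY ("Directory not empty"), which would fit an unlink/rmdir replay being refused. A quick self-contained check (plain C, not Lustre code):

          /* errno39.c - confirm what rc -39 means on this platform.
           * Build: cc -o errno39 errno39.c */
          #include <errno.h>
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
                  printf("ENOTEMPTY = %d\n", ENOTEMPTY);     /* 39 on Linux */
                  printf("errno 39  = %s\n", strerror(39));  /* Directory not empty */
                  return 0;
          }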
          

          cliffw Cliff White (Inactive) added a comment -

          Looks like we still have this issue with build 24:

          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          Sep 27 21:56:35 hyperion-rst6 kernel: LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          Sep 27 21:56:35 hyperion-rst6 kernel: Pid: 5611, comm: mdt00_015
          Sep 27 21:56:35 hyperion-rst6 kernel:
          Sep 27 21:56:35 hyperion-rst6 kernel: Call Trace:
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          Sep 27 21:56:35 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
          Sep 27 21:56:35 hyperion-rst6 kernel:
          Sep 27 21:56:35 hyperion-rst6 kernel: Kernel panic - not syncing: LBUG
          
          pjones Peter Jones added a comment -

          Dropping priority, as this has only occurred once in five days of testing.


          yong.fan nasf (Inactive) added a comment -

          We have found memory corruption in LU-1976; the crashed OI index node may cause access to unknown RAM areas. In such a case, some objects/inodes may be affected as well. Although there is no direct evidence that LU-1951 is a duplicate of LU-1976, I am inclined to treat it as one. Since the patch for LU-1976 is ready, we can wait for new results with that patch to check whether LU-1951 can still be reproduced.


          People

            Assignee: hongchao.zhang Hongchao Zhang
            Reporter: cliffw Cliff White (Inactive)
            Votes: 0
            Watchers: 12