Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1216

LBUG: ASSERTION(lli->lli_sai == NULL) failed using robinhood tool

    XMLWordPrintable

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • Lustre 2.1.3
    • None
    • None
    • Lustre-2.1.0 with 2.6.32-131.12.1 kernel
    • 3
    • 5157

    Description

      We're running a backup purpose application, robinhood, which reads changes from lustre changelog and then performs copies from lustre to another backend . We're often having the same crash on the client node performing all these actions.

      Before the crash I can see some llog threads blocked for more than 120 seconds (see dmesg attached). Then I get some errors reading the directory being accessed and finally the LBUG is raised.

      See attached file to see complete dmesg:

      [<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
      [<ffffffff81066790>] ? process_timeout+0x0/0x10
      [<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
      [<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
      [<ffffffff81004142>] ? kernel_thread+0x82/0xe0
      [<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
      [<ffffffff8147c793>] wait_for_common+0x123/0x180
      [<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
      [<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
      [<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
      [<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
      [<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
      [<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
      [<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
      [<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
      [<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
      [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20
      INFO: task mdc_clg_send_th:4499 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdc_clg_send_ D 0000000000000001 0 4499 2 0x00000080
      ffff8823b3e53c40 0000000000000046 ffff8823b3e53bd0 ffffffff8104c67d
      ffff8823b3e53c80 0000000000000282 ffff8823b3f7c900 0000000000000000
      ffff8823b3f7ccc8 ffff8823b3e53fd8 000000000000db00 ffff8823b3f7ccc8
      Call Trace:
      [<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
      [<ffffffff81066790>] ? process_timeout+0x0/0x10
      [<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
      [<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
      [<ffffffff81004142>] ? kernel_thread+0x82/0xe0
      [<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
      [<ffffffff8147c793>] wait_for_common+0x123/0x180
      [<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
      [<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
      [<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
      [<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
      [<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
      [<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
      [<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
      [<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
      [<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
      [<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffff810041aa>] child_rip+0xa/0x20
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
      [<ffffffff810041a0>] ? child_rip+0x0/0x20
      LustreError: 4491:0:(dir.c:478:ll_get_dir_page()) read cache page: [0x200000002:0x2:0x0] at 0: rc -1
      LustreError: 4491:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -1] [parent 4491]
      LustreError: 8003:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
      LustreError: 4485:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
      LustreError: 4485:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4485]
      LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
      LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
      LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4493]
      LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
      LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
      LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 0]
      LustreError: 4483:0:(file.c:525:ll_file_open()) ASSERTION(lli->lli_sai == NULL) failed
      LustreError: 4483:0:(file.c:525:ll_file_open()) LBUG
      Pid: 4483, comm: rbh-backup

      Call Trace:
      [<ffffffffa057e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa057ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
      [<ffffffffa058ae46>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa098398c>] ll_file_open+0xe0c/0xe60 [lustre]
      [<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
      [<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
      [<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
      [<ffffffffa096a917>] ll_dir_open+0x37/0x100 [lustre]
      [<ffffffff8115e5fa>] ? file_move+0x4a/0x70
      [<ffffffff8115a59a>] __dentry_open+0x10a/0x350
      [<ffffffff8115a979>] lookup_instantiate_filp+0x69/0x90
      [<ffffffffa0969701>] ll_revalidate_nd+0x331/0x400 [lustre]
      [<ffffffff8116b07a>] do_lookup+0x5a/0x220
      [<ffffffff8116b7a9>] __link_path_walk+0x569/0x820
      [<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
      [<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
      [<ffffffff8116c12a>] path_walk+0x6a/0xe0
      [<ffffffff8116c2fb>] do_path_lookup+0x5b/0xa0
      [<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
      [<ffffffff8116cf6b>] do_filp_open+0xfb/0xd30
      [<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
      [<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
      [<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
      [<ffffffff8115a349>] do_sys_open+0x69/0x140
      [<ffffffff810044f5>] ? math_state_restore+0x45/0x60
      [<ffffffff8115a460>] sys_open+0x20/0x30
      [<ffffffff81003172>] system_call_fastpath+0x16/0x1b

      Kernel panic - not syncing: LBUG
      Pid: 4483, comm: rbh-backup Tainted: G ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
      Call Trace:
      [<ffffffff8147b742>] ? panic+0x78/0x143
      [<ffffffffa057eeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
      [<ffffffffa058ae46>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
      [<ffffffffa098398c>] ? ll_file_open+0xe0c/0xe60 [lustre]
      [<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
      [<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
      [<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
      [<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
      [<ffffffff8115e5fa>] ? file_move+0x4a/0x70
      [<ffffffff8115a59a>] ? __dentry_open+0x10a/0x350
      [<ffffffff8115a979>] ? lookup_instantiate_filp+0x69/0x90
      [<ffffffffa0969701>] ? ll_revalidate_nd+0x331/0x400 [lustre]
      [<ffffffff8116b07a>] ? do_lookup+0x5a/0x220
      [<ffffffff8116b7a9>] ? __link_path_walk+0x569/0x820
      [<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
      [<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
      [<ffffffff8116c12a>] ? path_walk+0x6a/0xe0
      [<ffffffff8116c2fb>] ? do_path_lookup+0x5b/0xa0
      [<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
      [<ffffffff8116cf6b>] ? do_filp_open+0xfb/0xd30
      [<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
      [<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
      [<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
      [<ffffffff8115a349>] ? do_sys_open+0x69/0x140
      [<ffffffff810044f5>] ? math_state_restore+0x45/0x60
      [<ffffffff8115a460>] ? sys_open+0x20/0x30
      [<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b

      And the associated bt for rbh (robinhood running thread) process. The complete crash file could be provided if needed.

      crash> bt
      PID: 4483 TASK: ffff8823b3f28a80 CPU: 1 COMMAND: "rbh-backup"
      #0 [ffff8823b3e977f0] machine_kexec at ffffffff81027a2b
      #1 [ffff8823b3e97850] crash_kexec at ffffffff810a3a52
      #2 [ffff8823b3e97920] panic at ffffffff8147b749
      #3 [ffff8823b3e979a0] lbug_with_loc at ffffffffa057eeeb
      #4 [ffff8823b3e979f0] libcfs_assertion_failed at ffffffffa058ae46
      #5 [ffff8823b3e97a40] ll_file_open at ffffffffa098398c
      #6 [ffff8823b3e97b50] ll_dir_open at ffffffffa096a917
      #7 [ffff8823b3e97ba0] __dentry_open at ffffffff8115a59a
      #8 [ffff8823b3e97c00] lookup_instantiate_filp at ffffffff8115a979
      #9 [ffff8823b3e97c20] ll_revalidate_nd at ffffffffa0969701
      #10 [ffff8823b3e97c80] do_lookup at ffffffff8116b07a
      #11 [ffff8823b3e97ce0] __link_path_walk at ffffffff8116b7a9
      #12 [ffff8823b3e97d60] path_walk at ffffffff8116c12a
      #13 [ffff8823b3e97da0] do_path_lookup at ffffffff8116c2fb
      #14 [ffff8823b3e97dd0] do_filp_open at ffffffff8116cf6b
      #15 [ffff8823b3e97f20] do_sys_open at ffffffff8115a349
      #16 [ffff8823b3e97f70] sys_open at ffffffff8115a460
      #17 [ffff8823b3e97f80] system_call_fastpath at ffffffff81003172
      RIP: 000000354a20ed4d RSP: 00007fe88e2bf5e8 RFLAGS: 00010246
      RAX: 0000000000000002 RBX: ffffffff81003172 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fe8880071f0
      RBP: 00007fe88e2c09c2 R8: 0000000000000000 R9: 0000000000000000
      R10: fffffffffffffff0 R11: 0000000000000293 R12: ffffffff8115a460
      R13: ffff8823b3e97f78 R14: 0000000000000003 R15: 00007fe88e2be5c0
      ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

      Attachments

        Issue Links

          Activity

            People

              laisiyao Lai Siyao
              dmoreno Diego Moreno (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: