Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
None
-
None
-
Lustre-2.1.0 with 2.6.32-131.12.1 kernel
-
3
-
5157
Description
We're running a backup purpose application, robinhood, which reads changes from lustre changelog and then performs copies from lustre to another backend . We're often having the same crash on the client node performing all these actions.
Before the crash I can see some llog threads blocked for more than 120 seconds (see dmesg attached). Then I get some errors reading the directory being accessed and finally the LBUG is raised.
See attached file to see complete dmesg:
[<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
[<ffffffff81066790>] ? process_timeout+0x0/0x10
[<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81004142>] ? kernel_thread+0x82/0xe0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff8147c793>] wait_for_common+0x123/0x180
[<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
[<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
[<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
[<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041aa>] child_rip+0xa/0x20
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041a0>] ? child_rip+0x0/0x20
INFO: task mdc_clg_send_th:4499 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdc_clg_send_ D 0000000000000001 0 4499 2 0x00000080
ffff8823b3e53c40 0000000000000046 ffff8823b3e53bd0 ffffffff8104c67d
ffff8823b3e53c80 0000000000000282 ffff8823b3f7c900 0000000000000000
ffff8823b3f7ccc8 ffff8823b3e53fd8 000000000000db00 ffff8823b3f7ccc8
Call Trace:
[<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
[<ffffffff81066790>] ? process_timeout+0x0/0x10
[<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81004142>] ? kernel_thread+0x82/0xe0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff8147c793>] wait_for_common+0x123/0x180
[<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
[<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
[<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
[<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041aa>] child_rip+0xa/0x20
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041a0>] ? child_rip+0x0/0x20
LustreError: 4491:0:(dir.c:478:ll_get_dir_page()) read cache page: [0x200000002:0x2:0x0] at 0: rc -1
LustreError: 4491:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -1] [parent 4491]
LustreError: 8003:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4485:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4485:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4485]
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4493]
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 0]
LustreError: 4483:0:(file.c:525:ll_file_open()) ASSERTION(lli->lli_sai == NULL) failed
LustreError: 4483:0:(file.c:525:ll_file_open()) LBUG
Pid: 4483, comm: rbh-backup
Call Trace:
[<ffffffffa057e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa057ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa058ae46>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa098398c>] ll_file_open+0xe0c/0xe60 [lustre]
[<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
[<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8115e5fa>] ? file_move+0x4a/0x70
[<ffffffff8115a59a>] __dentry_open+0x10a/0x350
[<ffffffff8115a979>] lookup_instantiate_filp+0x69/0x90
[<ffffffffa0969701>] ll_revalidate_nd+0x331/0x400 [lustre]
[<ffffffff8116b07a>] do_lookup+0x5a/0x220
[<ffffffff8116b7a9>] __link_path_walk+0x569/0x820
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8116c12a>] path_walk+0x6a/0xe0
[<ffffffff8116c2fb>] do_path_lookup+0x5b/0xa0
[<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
[<ffffffff8116cf6b>] do_filp_open+0xfb/0xd30
[<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
[<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
[<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
[<ffffffff8115a349>] do_sys_open+0x69/0x140
[<ffffffff810044f5>] ? math_state_restore+0x45/0x60
[<ffffffff8115a460>] sys_open+0x20/0x30
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: LBUG
Pid: 4483, comm: rbh-backup Tainted: G ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
Call Trace:
[<ffffffff8147b742>] ? panic+0x78/0x143
[<ffffffffa057eeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
[<ffffffffa058ae46>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa098398c>] ? ll_file_open+0xe0c/0xe60 [lustre]
[<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
[<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8115e5fa>] ? file_move+0x4a/0x70
[<ffffffff8115a59a>] ? __dentry_open+0x10a/0x350
[<ffffffff8115a979>] ? lookup_instantiate_filp+0x69/0x90
[<ffffffffa0969701>] ? ll_revalidate_nd+0x331/0x400 [lustre]
[<ffffffff8116b07a>] ? do_lookup+0x5a/0x220
[<ffffffff8116b7a9>] ? __link_path_walk+0x569/0x820
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8116c12a>] ? path_walk+0x6a/0xe0
[<ffffffff8116c2fb>] ? do_path_lookup+0x5b/0xa0
[<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
[<ffffffff8116cf6b>] ? do_filp_open+0xfb/0xd30
[<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
[<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
[<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
[<ffffffff8115a349>] ? do_sys_open+0x69/0x140
[<ffffffff810044f5>] ? math_state_restore+0x45/0x60
[<ffffffff8115a460>] ? sys_open+0x20/0x30
[<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b
And the associated bt for rbh (robinhood running thread) process. The complete crash file could be provided if needed.
crash> bt
PID: 4483 TASK: ffff8823b3f28a80 CPU: 1 COMMAND: "rbh-backup"
#0 [ffff8823b3e977f0] machine_kexec at ffffffff81027a2b
#1 [ffff8823b3e97850] crash_kexec at ffffffff810a3a52
#2 [ffff8823b3e97920] panic at ffffffff8147b749
#3 [ffff8823b3e979a0] lbug_with_loc at ffffffffa057eeeb
#4 [ffff8823b3e979f0] libcfs_assertion_failed at ffffffffa058ae46
#5 [ffff8823b3e97a40] ll_file_open at ffffffffa098398c
#6 [ffff8823b3e97b50] ll_dir_open at ffffffffa096a917
#7 [ffff8823b3e97ba0] __dentry_open at ffffffff8115a59a
#8 [ffff8823b3e97c00] lookup_instantiate_filp at ffffffff8115a979
#9 [ffff8823b3e97c20] ll_revalidate_nd at ffffffffa0969701
#10 [ffff8823b3e97c80] do_lookup at ffffffff8116b07a
#11 [ffff8823b3e97ce0] __link_path_walk at ffffffff8116b7a9
#12 [ffff8823b3e97d60] path_walk at ffffffff8116c12a
#13 [ffff8823b3e97da0] do_path_lookup at ffffffff8116c2fb
#14 [ffff8823b3e97dd0] do_filp_open at ffffffff8116cf6b
#15 [ffff8823b3e97f20] do_sys_open at ffffffff8115a349
#16 [ffff8823b3e97f70] sys_open at ffffffff8115a460
#17 [ffff8823b3e97f80] system_call_fastpath at ffffffff81003172
RIP: 000000354a20ed4d RSP: 00007fe88e2bf5e8 RFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffff81003172 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fe8880071f0
RBP: 00007fe88e2c09c2 R8: 0000000000000000 R9: 0000000000000000
R10: fffffffffffffff0 R11: 0000000000000293 R12: ffffffff8115a460
R13: ffff8823b3e97f78 R14: 0000000000000003 R15: 00007fe88e2be5c0
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
Attachments
Issue Links
- Trackbacks
-
Changelog 2.1 Changes from version 2.1.2 to version 2.1.3 Server support for kernels: 2.6.18308.13.1.el5 (RHEL5) 2.6.32279.2.1.el6 (RHEL6) Client support for unpatched kernels: 2.6.18308.13.1.el5 (RHEL5) 2.6.32279.2.1....