[LU-1216] LBUG: ASSERTION(lli->lli_sai == NULL) failed using robinhood tool Created: 14/Mar/12  Updated: 26/Aug/12  Resolved: 29/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Diego Moreno (Inactive) Assignee: Lai Siyao
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre-2.1.0 with 2.6.32-131.12.1 kernel


Severity: 3
Rank (Obsolete): 5157

 Description   

We're running a backup purpose application, robinhood, which reads changes from lustre changelog and then performs copies from lustre to another backend . We're often having the same crash on the client node performing all these actions.

Before the crash I can see some llog threads blocked for more than 120 seconds (see dmesg attached). Then I get some errors reading the directory being accessed and finally the LBUG is raised.

See attached file to see complete dmesg:

[<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
[<ffffffff81066790>] ? process_timeout+0x0/0x10
[<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81004142>] ? kernel_thread+0x82/0xe0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff8147c793>] wait_for_common+0x123/0x180
[<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
[<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
[<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
[<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041aa>] child_rip+0xa/0x20
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041a0>] ? child_rip+0x0/0x20
INFO: task mdc_clg_send_th:4499 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdc_clg_send_ D 0000000000000001 0 4499 2 0x00000080
ffff8823b3e53c40 0000000000000046 ffff8823b3e53bd0 ffffffff8104c67d
ffff8823b3e53c80 0000000000000282 ffff8823b3f7c900 0000000000000000
ffff8823b3f7ccc8 ffff8823b3e53fd8 000000000000db00 ffff8823b3f7ccc8
Call Trace:
[<ffffffff8104c67d>] ? wake_up_new_task+0xed/0x150
[<ffffffff81066790>] ? process_timeout+0x0/0x10
[<ffffffff8147cb05>] schedule_timeout+0x205/0x2d0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff81004142>] ? kernel_thread+0x82/0xe0
[<ffffffffa05fcd10>] ? llog_process_thread+0x0/0xe70 [obdclass]
[<ffffffff8147c793>] wait_for_common+0x123/0x180
[<ffffffff8104cf40>] ? default_wake_function+0x0/0x20
[<ffffffffa0581cba>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
[<ffffffffa05ffa90>] ? llog_cat_process_cb+0x0/0x400 [obdclass]
[<ffffffff8147c8ad>] wait_for_completion+0x1d/0x20
[<ffffffffa05fbb33>] llog_process_flags+0xf3/0x660 [obdclass]
[<ffffffffa0751587>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[<ffffffffa05feef8>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[<ffffffffa05fdfef>] ? llog_init_handle+0x17f/0xa70 [obdclass]
[<ffffffffa0890240>] ? changelog_show_cb+0x0/0x310 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08998bb>] mdc_changelog_send_thread+0x4cb/0xb90 [mdc]
[<ffffffff810017bc>] ? __switch_to+0x1ac/0x320
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041aa>] child_rip+0xa/0x20
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffffa08993f0>] ? mdc_changelog_send_thread+0x0/0xb90 [mdc]
[<ffffffff810041a0>] ? child_rip+0x0/0x20
LustreError: 4491:0:(dir.c:478:ll_get_dir_page()) read cache page: [0x200000002:0x2:0x0] at 0: rc -1
LustreError: 4491:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -1] [parent 4491]
LustreError: 8003:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4485:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4485:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4485]
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 4493]
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) dir page locate: [0x200000002:0x2:0x0] at 0: rc -5
LustreError: 4493:0:(dir.c:439:ll_get_dir_page()) Skipped 1 previous similar message
LustreError: 4493:0:(statahead.c:1001:is_first_dirent()) error reading dir [0x200000002:0x2:0x0] at 0: [rc -5] [parent 0]
LustreError: 4483:0:(file.c:525:ll_file_open()) ASSERTION(lli->lli_sai == NULL) failed
LustreError: 4483:0:(file.c:525:ll_file_open()) LBUG
Pid: 4483, comm: rbh-backup

Call Trace:
[<ffffffffa057e855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa057ee95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa058ae46>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa098398c>] ll_file_open+0xe0c/0xe60 [lustre]
[<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
[<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8115e5fa>] ? file_move+0x4a/0x70
[<ffffffff8115a59a>] __dentry_open+0x10a/0x350
[<ffffffff8115a979>] lookup_instantiate_filp+0x69/0x90
[<ffffffffa0969701>] ll_revalidate_nd+0x331/0x400 [lustre]
[<ffffffff8116b07a>] do_lookup+0x5a/0x220
[<ffffffff8116b7a9>] __link_path_walk+0x569/0x820
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8116c12a>] path_walk+0x6a/0xe0
[<ffffffff8116c2fb>] do_path_lookup+0x5b/0xa0
[<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
[<ffffffff8116cf6b>] do_filp_open+0xfb/0xd30
[<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
[<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
[<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
[<ffffffff8115a349>] do_sys_open+0x69/0x140
[<ffffffff810044f5>] ? math_state_restore+0x45/0x60
[<ffffffff8115a460>] sys_open+0x20/0x30
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 4483, comm: rbh-backup Tainted: G ---------------- T 2.6.32-131.12.1.bl6.Bull.26.x86_64 #1
Call Trace:
[<ffffffff8147b742>] ? panic+0x78/0x143
[<ffffffffa057eeeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
[<ffffffffa058ae46>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa098398c>] ? ll_file_open+0xe0c/0xe60 [lustre]
[<ffffffffa0990171>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]
[<ffffffffa09654f4>] ? ll_revalidate_it_finish+0x64/0x240 [lustre]
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8115e5fa>] ? file_move+0x4a/0x70
[<ffffffff8115a59a>] ? __dentry_open+0x10a/0x350
[<ffffffff8115a979>] ? lookup_instantiate_filp+0x69/0x90
[<ffffffffa0969701>] ? ll_revalidate_nd+0x331/0x400 [lustre]
[<ffffffff8116b07a>] ? do_lookup+0x5a/0x220
[<ffffffff8116b7a9>] ? __link_path_walk+0x569/0x820
[<ffffffffa096a8e0>] ? ll_dir_open+0x0/0x100 [lustre]
[<ffffffffa096a917>] ? ll_dir_open+0x37/0x100 [lustre]
[<ffffffff8116c12a>] ? path_walk+0x6a/0xe0
[<ffffffff8116c2fb>] ? do_path_lookup+0x5b/0xa0
[<ffffffff8115e811>] ? get_empty_filp+0xa1/0x170
[<ffffffff8116cf6b>] ? do_filp_open+0xfb/0xd30
[<ffffffffa0985784>] ? ll_file_release+0x2e4/0x430 [lustre]
[<ffffffff8117bbb0>] ? mntput_no_expire+0x30/0x110
[<ffffffff81179ba2>] ? alloc_fd+0x92/0x160
[<ffffffff8115a349>] ? do_sys_open+0x69/0x140
[<ffffffff810044f5>] ? math_state_restore+0x45/0x60
[<ffffffff8115a460>] ? sys_open+0x20/0x30
[<ffffffff81003172>] ? system_call_fastpath+0x16/0x1b

And the associated bt for rbh (robinhood running thread) process. The complete crash file could be provided if needed.

crash> bt
PID: 4483 TASK: ffff8823b3f28a80 CPU: 1 COMMAND: "rbh-backup"
#0 [ffff8823b3e977f0] machine_kexec at ffffffff81027a2b
#1 [ffff8823b3e97850] crash_kexec at ffffffff810a3a52
#2 [ffff8823b3e97920] panic at ffffffff8147b749
#3 [ffff8823b3e979a0] lbug_with_loc at ffffffffa057eeeb
#4 [ffff8823b3e979f0] libcfs_assertion_failed at ffffffffa058ae46
#5 [ffff8823b3e97a40] ll_file_open at ffffffffa098398c
#6 [ffff8823b3e97b50] ll_dir_open at ffffffffa096a917
#7 [ffff8823b3e97ba0] __dentry_open at ffffffff8115a59a
#8 [ffff8823b3e97c00] lookup_instantiate_filp at ffffffff8115a979
#9 [ffff8823b3e97c20] ll_revalidate_nd at ffffffffa0969701
#10 [ffff8823b3e97c80] do_lookup at ffffffff8116b07a
#11 [ffff8823b3e97ce0] __link_path_walk at ffffffff8116b7a9
#12 [ffff8823b3e97d60] path_walk at ffffffff8116c12a
#13 [ffff8823b3e97da0] do_path_lookup at ffffffff8116c2fb
#14 [ffff8823b3e97dd0] do_filp_open at ffffffff8116cf6b
#15 [ffff8823b3e97f20] do_sys_open at ffffffff8115a349
#16 [ffff8823b3e97f70] sys_open at ffffffff8115a460
#17 [ffff8823b3e97f80] system_call_fastpath at ffffffff81003172
RIP: 000000354a20ed4d RSP: 00007fe88e2bf5e8 RFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffff81003172 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fe8880071f0
RBP: 00007fe88e2c09c2 R8: 0000000000000000 R9: 0000000000000000
R10: fffffffffffffff0 R11: 0000000000000293 R12: ffffffff8115a460
R13: ffff8823b3e97f78 R14: 0000000000000003 R15: 00007fe88e2be5c0
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b



 Comments   
Comment by Peter Jones [ 14/Mar/12 ]

Lai

Could you look into this one please?

Thanks

Peter

Comment by Lai Siyao [ 15/Mar/12 ]

The first ll_get_dir_page() failed with errno -1 (-EPERM), this looks to be returned from MDS, could you check MDS log?

This assert is caused by statahead code, which is buggy in 2.1 code, and it's fixed in 2.2, the patch is at http://review.whamcloud.com/#change,1208, but unfortunately it depends on some other changes, so it can't be backport to 2.1 directly. To fix this issue, you can disable statahead by `echo 0 > /proc/fs/lustre/llite/<client_device>/statahead_max` on client, or else I can help backport the patch to 2.1, which may take a little time.

Comment by Peter Jones [ 29/Mar/12 ]

Bull will workaround this by disabling until they upgrade to 2.2+

Comment by Bob Glossman (Inactive) [ 01/Aug/12 ]

Putting in a fix to disable it by default.
This is just a temporary workaround until we have a proper fix.

http://review.whamcloud.com/#change,3512

Generated at Sat Feb 10 01:14:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.