Details
-
Bug
-
Resolution: Fixed
-
Critical
-
None
-
None
-
3
-
13869
Description
The client node crashed with the following stack trace:
May 5 06:13:59 iwc321 kernel: LustreError: 32047:0:(osc_io.c:834:osc_req_attr_set()) LBUG 2014-05-05 06:17:19 INFO: task ior:32047 blocked for more than 120 seconds. 2014-05-05 06:17:19 Not tainted 2.6.32-431.5.1.el6.x86_64 #1 2014-05-05 06:17:19 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2014-05-05 06:17:19 ior D 0000000000000006 0 32047 32038 0x00000080 2014-05-05 06:17:19 ffff8801b06a95c8 0000000000000082 0000000000000000 ffff8801bd6c5540 2014-05-05 06:17:19 ffff8801bd6c5540 ffffffffa0d4d160 ffff8801b06a9588 ffff8801b0791068 2014-05-05 06:17:19 ffff8801bd6c5af8 ffff8801b06a9fd8 000000000000fbc8 ffff8801bd6c5af8 2014-05-05 06:17:19 Call Trace: 2014-05-05 06:17:19 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20 2014-05-05 06:17:19 [<ffffffffa03acecd>] lbug_with_loc+0x7d/0xb0 [libcfs] 2014-05-05 06:17:19 [<ffffffffa0d241bc>] osc_req_attr_set+0x57c/0x5b0 [osc] 2014-05-05 06:17:19 [<ffffffffa03b7328>] ? libcfs_log_return+0x28/0x40 [libcfs] 2014-05-05 06:17:19 [<ffffffffa0524ef1>] cl_req_attr_set+0xd1/0x230 [obdclass] 2014-05-05 06:17:19 [<ffffffffa0d14ec7>] osc_build_rpc+0x487/0x1850 [osc] 2014-05-05 06:17:19 [<ffffffffa03bc911>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2014-05-05 06:17:19 [<ffffffffa0d304c7>] osc_io_unplug0+0x1257/0x1f00 [osc] 2014-05-05 06:17:19 [<ffffffffa0d32ef1>] osc_io_unplug+0x11/0x20 [osc] 2014-05-05 06:17:19 [<ffffffffa0d330cd>] osc_queue_sync_pages+0x1cd/0x350 [osc] 2014-05-05 06:17:19 [<ffffffffa0d2477f>] osc_io_submit+0x33f/0x4b0 [osc] 2014-05-05 06:17:20 [<ffffffffa052472e>] cl_io_submit_rw+0x6e/0x160 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0db80f9>] lov_io_submit+0x189/0x4b0 [lov] 2014-05-05 06:17:20 [<ffffffffa052472e>] cl_io_submit_rw+0x6e/0x160 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0526dc0>] cl_io_read_page+0x180/0x190 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0e72f81>] ll_readpage+0x91/0x1a0 [lustre] 2014-05-05 06:17:20 [<ffffffff811212fc>] generic_file_aio_read+0x1fc/0x700 2014-05-05 06:17:20 [<ffffffffa0ea4079>] vvp_io_read_start+0x259/0x470 [lustre] 2014-05-05 06:17:20 [<ffffffffa05249ba>] cl_io_start+0x6a/0x140 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0528b44>] cl_io_loop+0xb4/0x1b0 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0e45f06>] ll_file_io_generic+0x2b6/0x710 [lustre] 2014-05-05 06:17:20 [<ffffffffa05188f9>] ? cl_env_get+0x29/0x350 [obdclass] 2014-05-05 06:17:20 [<ffffffffa0e46faf>] ll_file_aio_read+0x13f/0x2c0 [lustre] 2014-05-05 06:17:20 [<ffffffffa0e4729c>] ll_file_read+0x16c/0x2a0 [lustre] 2014-05-05 06:17:20 [<ffffffff81189695>] vfs_read+0xb5/0x1a0 2014-05-05 06:17:20 [<ffffffff811897d1>] sys_read+0x51/0x90 2014-05-05 06:17:20 [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290 2014-05-05 06:17:20 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
This bug is introduced by the commit: http://review.whamcloud.com/8523 where read ahead lock matching process was optimized. However, it turns out there exists a race with lock cancellation.
As a workaround fix, I'm going to disable the optimization and will come up with a fix later, after 2.6 is released.