Details
Type: Bug
Resolution: Fixed
Priority: Critical
Version: Lustre 2.9.0
Description
Our clients have max_rpcs_in_flight set to 16. Some of them occasionally hit the assertion failure (LBUG) below, which crashes the node:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.118253] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]} The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.154188] LustreError: 1708:0:(osc_cache.c:1226:osc_extent_tree_dump0()) Dump object ffff882dc5cce4d0 extents at try_to_add_extent_for_io:1931, mppr: 4096.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.181833] LustreError: 1708:0:(osc_cache.c:1239:osc_extent_tree_dump0()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]}urgent 1.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.216119] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) ASSERTION( data->erd_page_count != 0 || chunk_count <= data->erd_max_chunks ) failed:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.244827] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) LBUG
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.259546] Pid: 1708, comm: ptlrpcd_00_86
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365] Call Trace:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.287055] [<ffffffffa07537f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.301366] [<ffffffffa0753861>] lbug_with_loc+0x41/0xb0 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.314695] [<ffffffffa0c69498>] try_to_add_extent_for_io.isra.24+0xf58/0x12e0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.328249] [<ffffffffa0c6b9dd>] osc_io_unplug0+0x3fd/0x1950 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.339730] [<ffffffff810d2372>] ? load_balance+0x192/0x990
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.350013] [<ffffffff810ce46c>] ? dequeue_entity+0x11c/0x5d0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.360444] [<ffffffffa0c6db30>] osc_io_unplug+0x10/0x20 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.370874] [<ffffffffa0c49441>] brw_queue_work+0x31/0xd0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.381510] [<ffffffffa0a5e3d7>] work_interpreter+0x37/0xf0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.392538] [<ffffffffa0a5b0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.405132] [<ffffffffa0a5cabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.416411] [<ffffffffa0a88a3b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.427367] [<ffffffffa0a88d57>] ptlrpcd+0x227/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.437727] [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.448479] [<ffffffffa0a88b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.458669] [<ffffffff810b064f>] kthread+0xcf/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.467800] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.477111] [<ffffffff81696818>] ret_from_fork+0x58/0x90
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.486513] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.495492]
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.500760] Kernel panic - not syncing: LBUG
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.509021] CPU: 203 PID: 1708 Comm: ptlrpcd_00_86 Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.524591] Hardware name: Penguin Computing Relion 1904GT/S7200AP, BIOS S72C610.86B.01.02.0001.112820162103 11/28/2016
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.540144] ffffffffa0770ccc 00000000e8cd63f9 ffff882ed87a7958 ffffffff816862ac
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.551919] ffff882ed87a79d8 ffffffff8167f6b3 ffffffff00000008 ffff882ed87a79e8
This seems to be related to LU-8680.
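As a rough sanity check of the numbers in the log, the sketch below counts how many chunks the logged extent [60858 -> 64953] touches. The chunk size of 256 pages (1 MiB chunks with 4 KiB pages) is an assumption and is not stated in the log; the function name `chunks_spanned` is illustrative only and is not a Lustre symbol.

```python
def chunks_spanned(start_page: int, end_page: int, pages_per_chunk: int) -> int:
    """Count the chunks touched by the inclusive page range [start_page, end_page]."""
    first_chunk = start_page // pages_per_chunk
    last_chunk = end_page // pages_per_chunk
    return last_chunk - first_chunk + 1


# Values taken from the logged extent "[60858 -> 64953/64953]".
START_PAGE = 60858
END_PAGE = 64953
# Assumption: 1 MiB chunks and 4 KiB pages, i.e. 256 pages per chunk.
PAGES_PER_CHUNK = 256

page_count = END_PAGE - START_PAGE + 1
chunk_count = chunks_spanned(START_PAGE, END_PAGE, PAGES_PER_CHUNK)
print(page_count, chunk_count)  # 4096 17
```

Under that assumption the extent holds 4096 pages, which matches "mppr: 4096", but it starts in the middle of a chunk, so it touches 17 chunks instead of 16. That is consistent with the "contains 17 chunks, which is over the limit 16" message. An extent of the same length that started on a chunk boundary (for example page 60672) would touch exactly 16 chunks.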