[LU-10239] Lustre crash (client): The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16. Created: 14/Nov/17 Updated: 26/Feb/20 Resolved: 04/Jan/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Murshid Azman | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
Our clients have max_rpcs_in_flight set to 16. Some of our clients hit this bug at times, which crashes these nodes:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.118253] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]} The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.154188] LustreError: 1708:0:(osc_cache.c:1226:osc_extent_tree_dump0()) Dump object ffff882dc5cce4d0 extents at try_to_add_extent_for_io:1931, mppr: 4096.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.181833] LustreError: 1708:0:(osc_cache.c:1239:osc_extent_tree_dump0()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]}urgent 1.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.216119] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) ASSERTION( data->erd_page_count != 0 || chunk_count <= data->erd_max_chunks ) failed:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.244827] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) LBUG
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.259546] Pid: 1708, comm: ptlrpcd_00_86
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365] Call Trace:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.287055] [<ffffffffa07537f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.301366] [<ffffffffa0753861>] lbug_with_loc+0x41/0xb0 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.314695] [<ffffffffa0c69498>] try_to_add_extent_for_io.isra.24+0xf58/0x12e0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.328249] [<ffffffffa0c6b9dd>] osc_io_unplug0+0x3fd/0x1950 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.339730] [<ffffffff810d2372>] ? load_balance+0x192/0x990
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.350013] [<ffffffff810ce46c>] ? dequeue_entity+0x11c/0x5d0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.360444] [<ffffffffa0c6db30>] osc_io_unplug+0x10/0x20 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.370874] [<ffffffffa0c49441>] brw_queue_work+0x31/0xd0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.381510] [<ffffffffa0a5e3d7>] work_interpreter+0x37/0xf0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.392538] [<ffffffffa0a5b0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.405132] [<ffffffffa0a5cabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.416411] [<ffffffffa0a88a3b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.427367] [<ffffffffa0a88d57>] ptlrpcd+0x227/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.437727] [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.448479] [<ffffffffa0a88b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.458669] [<ffffffff810b064f>] kthread+0xcf/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.467800] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.477111] [<ffffffff81696818>] ret_from_fork+0x58/0x90
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.486513] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.495492]
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.500760] Kernel panic - not syncing: LBUG
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.509021] CPU: 203 PID: 1708 Comm: ptlrpcd_00_86 Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.524591] Hardware name: Penguin Computing Relion 1904GT/S7200AP, BIOS S72C610.86B.01.02.0001.112820162103 11/28/2016
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.540144] ffffffffa0770ccc 00000000e8cd63f9 ffff882ed87a7958 ffffffff816862ac
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.551919] ffff882ed87a79d8 ffffffff8167f6b3 ffffffff00000008 ffff882ed87a79e8
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.118253] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]} The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.154188] LustreError: 1708:0:(osc_cache.c:1226:osc_extent_tree_dump0()) Dump object ffff882dc5cce4d0 extents at try_to_add_extent_for_io:1931, mppr: 4096.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.181833] LustreError: 1708:0:(osc_cache.c:1239:osc_extent_tree_dump0()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]}urgent 1.
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.216119] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) ASSERTION( data->erd_page_count != 0 || chunk_count <= data->erd_max_chunks ) failed:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.244827] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) LBUG
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.259546] Pid: 1708, comm: ptlrpcd_00_86
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365] Call Trace:
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.287055] [<ffffffffa07537f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.301366] [<ffffffffa0753861>] lbug_with_loc+0x41/0xb0 [libcfs]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.314695] [<ffffffffa0c69498>] try_to_add_extent_for_io.isra.24+0xf58/0x12e0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.328249] [<ffffffffa0c6b9dd>] osc_io_unplug0+0x3fd/0x1950 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.339730] [<ffffffff810d2372>] ? load_balance+0x192/0x990
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.350013] [<ffffffff810ce46c>] ? dequeue_entity+0x11c/0x5d0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.360444] [<ffffffffa0c6db30>] osc_io_unplug+0x10/0x20 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.370874] [<ffffffffa0c49441>] brw_queue_work+0x31/0xd0 [osc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.381510] [<ffffffffa0a5e3d7>] work_interpreter+0x37/0xf0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.392538] [<ffffffffa0a5b0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.405132] [<ffffffffa0a5cabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.416411] [<ffffffffa0a88a3b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.427367] [<ffffffffa0a88d57>] ptlrpcd+0x227/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.437727] [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.448479] [<ffffffffa0a88b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.458669] [<ffffffff810b064f>] kthread+0xcf/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.467800] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.477111] [<ffffffff81696818>] ret_from_fork+0x58/0x90
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.486513] [<ffffffff810b0580>] ? kthread+0x0/0xe0
2017-10-27T20:22:40-05:00 node0748 kernel: [14998.495492]
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.500760] Kernel panic - not syncing: LBUG
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.509021] CPU: 203 PID: 1708 Comm: ptlrpcd_00_86 Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.524591] Hardware name: Penguin Computing Relion 1904GT/S7200AP, BIOS S72C610.86B.01.02.0001.112820162103 11/28/2016
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.540144] ffffffffa0770ccc 00000000e8cd63f9 ffff882ed87a7958 ffffffff816862ac
2017-10-27T20:22:41-05:00 node0748 kernel: [14998.551919] ffff882ed87a79d8 ffffffff8167f6b3 ffffffff00000008 ffff882ed87a79e8
Seems to be related to |
| Comments |
| Comment by Phil Schwan (Inactive) [ 19/Nov/17 ] |
|
We just ran a job that blew up 1,600 nodes with this issue, so we're highly motivated to solve or work around it. My theory – which is admittedly not very deep – goes something like this:
Am I wrong? What prevents osc_io_submit() from building up a 4096-page I/O that contains more than 16 chunks? This is a 2.9.52 client, but I checked master and it didn't look like things were materially different in this area. I could see that cutting max_pages_per_rpc to 16 might work around it, but that would be an extremely harsh workaround, resulting in a lot of extremely tiny I/Os... Does anyone have a clue to spare, perhaps, to point us in the right direction? |
| Comment by James Nunez (Inactive) [ 20/Dec/17 ] |
|
Bobijam - Would you please look into this issue? Thank you |
| Comment by Gerrit Updater [ 21/Dec/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/30627 |
| Comment by Gerrit Updater [ 04/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30627/ |
| Comment by Peter Jones [ 04/Jan/19 ] |
|
Landed for 2.13 |
| Comment by Jacek Tomaka [ 08/Feb/20 ] |
|
Hey, Any chances to backport this fix to 2.12.4? |
| Comment by Jacek Tomaka [ 26/Feb/20 ] |
|
Hmm? |