Lustre / LU-10239

Lustre crash (client): The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Our clients have max_rpcs_in_flight set to 16. Some of them occasionally hit this bug, which crashes the node:

       

      2017-10-27T20:22:40-05:00 node0748 kernel: [14998.118253] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]} The first extent to be fit in a RPC contains 17 chunks, which is over the limit 16.
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.154188] LustreError: 1708:0:(osc_cache.c:1226:osc_extent_tree_dump0()) Dump object ffff882dc5cce4d0 extents at try_to_add_extent_for_io:1931, mppr: 4096.
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.181833] LustreError: 1708:0:(osc_cache.c:1239:osc_extent_tree_dump0()) extent ffff882e6c434000@{[60858 -> 64953/64953], [1|0|+|lockdone|wSu|ffff882dc5cce4d0], [0|4096|+|-| (null)|4096| (null)]}urgent 1.
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.216119] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) ASSERTION( data->erd_page_count != 0 || chunk_count <= data->erd_max_chunks ) failed: 
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.244827] LustreError: 1708:0:(osc_cache.c:1931:try_to_add_extent_for_io()) LBUG
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.259546] Pid: 1708, comm: ptlrpcd_00_86
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365] 
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.270365] Call Trace:
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.287055] [<ffffffffa07537f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.301366] [<ffffffffa0753861>] lbug_with_loc+0x41/0xb0 [libcfs]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.314695] [<ffffffffa0c69498>] try_to_add_extent_for_io.isra.24+0xf58/0x12e0 [osc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.328249] [<ffffffffa0c6b9dd>] osc_io_unplug0+0x3fd/0x1950 [osc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.339730] [<ffffffff810d2372>] ? load_balance+0x192/0x990
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.350013] [<ffffffff810ce46c>] ? dequeue_entity+0x11c/0x5d0
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.360444] [<ffffffffa0c6db30>] osc_io_unplug+0x10/0x20 [osc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.370874] [<ffffffffa0c49441>] brw_queue_work+0x31/0xd0 [osc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.381510] [<ffffffffa0a5e3d7>] work_interpreter+0x37/0xf0 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.392538] [<ffffffffa0a5b0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.405132] [<ffffffffa0a5cabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.416411] [<ffffffffa0a88a3b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.427367] [<ffffffffa0a88d57>] ptlrpcd+0x227/0x560 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.437727] [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.448479] [<ffffffffa0a88b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.458669] [<ffffffff810b064f>] kthread+0xcf/0xe0
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.467800] [<ffffffff810b0580>] ? kthread+0x0/0xe0
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.477111] [<ffffffff81696818>] ret_from_fork+0x58/0x90
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.486513] [<ffffffff810b0580>] ? kthread+0x0/0xe0
       2017-10-27T20:22:40-05:00 node0748 kernel: [14998.495492] 
       2017-10-27T20:22:41-05:00 node0748 kernel: [14998.500760] Kernel panic - not syncing: LBUG
       2017-10-27T20:22:41-05:00 node0748 kernel: [14998.509021] CPU: 203 PID: 1708 Comm: ptlrpcd_00_86 Tainted: P OE ------------ 3.10.0-514.6.1.el7.x86_64 #1
       2017-10-27T20:22:41-05:00 node0748 kernel: [14998.524591] Hardware name: Penguin Computing Relion 1904GT/S7200AP, BIOS S72C610.86B.01.02.0001.112820162103 11/28/2016
       2017-10-27T20:22:41-05:00 node0748 kernel: [14998.540144] ffffffffa0770ccc 00000000e8cd63f9 ffff882ed87a7958 ffffffff816862ac
       2017-10-27T20:22:41-05:00 node0748 kernel: [14998.551919] ffff882ed87a79d8 ffffffff8167f6b3 ffffffff00000008 ffff882ed87a79e8
       
       
      

       

      This appears to be related to LU-8680.
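The figures in the log can be reproduced with a little arithmetic. The extent covers pages 60858 through 64953, which is 4096 pages and matches the "mppr: 4096" value in the dump. The sketch below assumes 256 pages per chunk (for example 4 KiB pages with 1 MiB chunks) and assumes chunks are counted by chunk index; neither is stated in this ticket. The names chunks_spanned and PAGES_PER_CHUNK_BITS are hypothetical and this is not the Lustre code. Under those assumptions, an extent of exactly 4096 pages that does not start on a chunk boundary touches 17 chunks rather than 16.

```python
# Assumption (not from the ticket): 256 pages per chunk, so a page index
# is converted to a chunk index by shifting right by 8 bits.
PAGES_PER_CHUNK_BITS = 8  # hypothetical value: log2(256)


def chunks_spanned(start_page: int, end_page: int,
                   ppc_bits: int = PAGES_PER_CHUNK_BITS) -> int:
    """Count the chunks touched by the inclusive page range [start_page, end_page]."""
    return (end_page >> ppc_bits) - (start_page >> ppc_bits) + 1


# Extent reported in the log: [60858 -> 64953]
start, end = 60858, 64953
pages = end - start + 1
print(pages)                       # 4096 pages in the extent
print(chunks_spanned(start, end))  # 17: the range straddles a chunk boundary

# The same number of pages starting on a chunk boundary (238 * 256 = 60928)
aligned_start = 238 << PAGES_PER_CHUNK_BITS
print(chunks_spanned(aligned_start, aligned_start + pages - 1))  # 16
```

If the assumed chunk size is right, the extent is within the page limit but exceeds the chunk limit only because its start page is not chunk aligned.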

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: ma256 Murshid Azman (Inactive)
              Votes: 1
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: