Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.1.3
-
None
-
3
-
8869
Description
At the TGCC site, which is currently running Lustre 2.1.3, the customer occasionally gets crashes with the following assertion:
LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed:
LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) LBUG
Pid: 23580, comm: IMB-IO
Call Trace:
[<ffffffffa034d7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa034de07>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0917a8f>] lov_sub_get+0x47f/0x6f0 [lov]
[<ffffffffa0913ca2>] lov_sublock_env_get+0xd2/0x140 [lov]
[<ffffffffa0914e61>] lov_sublock_alloc+0xf1/0x470 [lov]
[<ffffffffa09162fc>] lov_lock_init_raid0+0x3dc/0xe30 [lov]
[<ffffffffa090eab4>] lov_lock_init+0x54/0xe0 [lov]
[<ffffffffa049215c>] cl_lock_hold_mutex+0x37c/0x6b0 [obdclass]
[<ffffffffa04925ee>] cl_lock_request+0x5e/0x1c0 [obdclass]
[<ffffffffa09ee9bf>] cl_glimpse_lock+0x16f/0x410 [lustre]
[<ffffffffa09f2f0a>] ccc_prep_size+0x10a/0x290 [lustre]
[<ffffffffa09f8425>] vvp_io_read_start+0xb5/0x3e0 [lustre]
[<ffffffffa04938da>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa0497bbc>] cl_io_loop+0xcc/0x190 [obdclass]
[<ffffffffa09a7f07>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09a81f9>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09a86b9>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff81163a15>] vfs_read+0xb5/0x1a0
[<ffffffff81163b51>] sys_read+0x51/0x90
[<ffffffff81487d7e>] ? do_device_not_available+0xe/0x10
[<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
After some investigation, this appears to be LU-2652, so we tried to backport http://review.whamcloud.com/5157, http://review.whamcloud.com/5158 and http://review.whamcloud.com/5159.
However, the corresponding files have changed a lot since Lustre 2.1 (layout lock), and 33 of the 45 hunks fail to apply.
Moreover, these 3 patches seem to be intended to fix deadlocks introduced by LU-1876 (Layout Lock Server Patch Landings to Master).
ok thanks Sebastien