[LU-3524] Lustre 2.1.3: lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed Created: 28/Jun/13  Updated: 18/Sep/14  Resolved: 18/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Patrick Valentin (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 8869

 Description   

At the TGCC site, which is currently running Lustre 2.1.3, the customer occasionally gets crashes with the following assertion:

LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed:
LustreError: 23580:0:(lov_io.c:212:lov_sub_get()) LBUG
Pid: 23580, comm: IMB-IO

Call Trace:
 [<ffffffffa034d7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa034de07>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0917a8f>] lov_sub_get+0x47f/0x6f0 [lov]
 [<ffffffffa0913ca2>] lov_sublock_env_get+0xd2/0x140 [lov]
 [<ffffffffa0914e61>] lov_sublock_alloc+0xf1/0x470 [lov]
 [<ffffffffa09162fc>] lov_lock_init_raid0+0x3dc/0xe30 [lov]
 [<ffffffffa090eab4>] lov_lock_init+0x54/0xe0 [lov]
 [<ffffffffa049215c>] cl_lock_hold_mutex+0x37c/0x6b0 [obdclass]
 [<ffffffffa04925ee>] cl_lock_request+0x5e/0x1c0 [obdclass]
 [<ffffffffa09ee9bf>] cl_glimpse_lock+0x16f/0x410 [lustre]
 [<ffffffffa09f2f0a>] ccc_prep_size+0x10a/0x290 [lustre]
 [<ffffffffa09f8425>] vvp_io_read_start+0xb5/0x3e0 [lustre]
 [<ffffffffa04938da>] cl_io_start+0x6a/0x140 [obdclass]
 [<ffffffffa0497bbc>] cl_io_loop+0xcc/0x190 [obdclass]
 [<ffffffffa09a7f07>] ll_file_io_generic+0x3a7/0x560 [lustre]
 [<ffffffffa09a81f9>] ll_file_aio_read+0x139/0x2c0 [lustre]
 [<ffffffffa09a86b9>] ll_file_read+0x169/0x2a0 [lustre]
 [<ffffffff81163a15>] vfs_read+0xb5/0x1a0
 [<ffffffff81163b51>] sys_read+0x51/0x90
 [<ffffffff81487d7e>] ? do_device_not_available+0xe/0x10
 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b

After some investigation, this seems to be LU-2652, so we tried to backport http://review.whamcloud.com/5157, http://review.whamcloud.com/5158 and http://review.whamcloud.com/5159.
However, the corresponding files have changed a lot since Lustre 2.1 (layout lock), and 33 of the 45 hunks fail to apply.
Moreover, these 3 patches appear to be intended to fix deadlocks introduced by LU-1876 (Layout Lock Server Patch Landings to Master).
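
For readers unfamiliar with the message, the failed check is a bounds test on a stripe index. The following is a minimal sketch, not the Lustre source: only the condition `stripe < lio->lis_stripe_count` comes from the log above. The `demo_` struct and functions, the `int` types, and the non-negativity check are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-IO state. Only the field named in the
 * assertion message is modelled; the real structure has many more members. */
struct demo_lov_io {
    int lis_stripe_count;   /* number of stripes the IO was set up with */
};

/* True when `stripe` satisfies the condition from the LBUG message.
 * The `stripe >= 0` part is an added assumption, not part of the message. */
static bool demo_stripe_in_range(const struct demo_lov_io *lio, int stripe)
{
    return stripe >= 0 && stripe < lio->lis_stripe_count;
}

/* Hypothetical defensive variant: report an error (-1) instead of
 * asserting when the index is out of range; return 0 otherwise. */
static int demo_sub_get(const struct demo_lov_io *lio, int stripe)
{
    if (!demo_stripe_in_range(lio, stripe))
        return -1;
    return 0;
}

/* Usage example: an IO set up with 4 stripes accepts indices 0..3 only. */
static void demo_usage(void)
{
    struct demo_lov_io lio = { .lis_stripe_count = 4 };

    assert(demo_stripe_in_range(&lio, 3));
    assert(!demo_stripe_in_range(&lio, 4));
    assert(demo_sub_get(&lio, 4) == -1);
}
```

In this sketch, an index equal to or larger than the stripe count recorded in the IO state is rejected. In the crash above, the same condition is violated on the glimpse-lock path (`lov_lock_init_raid0` down to `lov_sub_get`), but the ticket does not establish how the two values came to disagree.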



 Comments   
Comment by Peter Jones [ 28/Jun/13 ]

Bruno is looking into this one

Comment by Bruno Faccini (Inactive) [ 28/Jun/13 ]

Patrick,
Do you know if the different crashes occurred while running the same application/workload?
Do we have any details on how the "IMB-IO" process/application works, and in particular whether it uses any specific striping settings?
Also, do you know if this crash can be reproduced on demand?

Comment by Lustre Bull [ 28/Jun/13 ]

Hi Bruno,

I don't have any more information about this LBUG. I am forwarding your questions to the Bull support team to get more details.

Comment by Alexandre Louvet [ 01/Jul/13 ]

I guess it is a standard IMB-IO but with a Lustre-aware MPI-IO library. I have asked the end user to provide more details and will keep you updated.

Alex.

Comment by Bruno Faccini (Inactive) [ 04/Jul/13 ]

In the meantime, on my side, I am investigating the patches from LU-2652/LU-2766 to see whether they are really related.

Comment by Bruno Faccini (Inactive) [ 12/Jul/13 ]

To help me dig deeper into this issue, would it be possible to get the full stacks from the crash dump? And possibly more, such as the relevant data structures, if I ask for them later?

Comment by Sebastien Buisson (Inactive) [ 18/Sep/14 ]

As we are unable to provide the requested information, this ticket can be closed.

Thank you,
Sebastien.

Comment by Peter Jones [ 18/Sep/14 ]

OK, thanks Sebastien.
