[LU-2652] lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed Created: 19/Jan/13 Updated: 05/Jul/13 Resolved: 30/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HB | ||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 6194 | ||||||||||||||||
| Description |
|
HAd this crash running sanityn test 12: [45366.990436] Lustre: DEBUG MARKER: == sanityn test 12: test lock ordering (link, stat, unlink) ============= 10:01:13 (1358607673) [45367.146837] Lustre: DEBUG MARKER: start dir: /mnt/lustre/lockdir=144115205255725085 file: /mnt/lustre/lockdir/lockfile=144115205255725083 [45445.727378] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed: [45445.728276] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) LBUG [45445.728751] Pid: 20512, comm: statmany [45445.729089] [45445.729090] Call Trace: [45445.729696] [<ffffffffa0a7a915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [45445.730184] [<ffffffffa0a7af17>] lbug_with_loc+0x47/0xb0 [libcfs] [45445.730673] [<ffffffffa14041ba>] lov_sub_get+0x4aa/0x690 [lov] [45445.731129] [<ffffffffa1400132>] lov_sublock_env_get+0xd2/0x140 [lov] [45445.731612] [<ffffffffa1401841>] lov_sublock_alloc+0xf1/0x450 [lov] [45445.732086] [<ffffffffa1402a2c>] lov_lock_init_raid0+0x3ec/0xe50 [lov] [45445.732632] [<ffffffffa13fa2ae>] lov_lock_init+0x1e/0x60 [lov] [45445.733135] [<ffffffffa0bec52c>] cl_lock_hold_mutex+0x34c/0x660 [obdclass] [45445.733652] [<ffffffffa0bec9a2>] cl_lock_request+0x62/0x270 [obdclass] [45445.734143] [<ffffffffa090b1f9>] cl_glimpse_lock+0x179/0x480 [lustre] [45445.734635] [<ffffffffa090ba65>] cl_glimpse_size0+0x1a5/0x1d0 [lustre] [45445.735142] [<ffffffffa08c5298>] ll_inode_revalidate_it+0x198/0x1c0 [lustre] [45445.735748] [<ffffffffa08c5309>] ll_getattr_it+0x49/0x170 [lustre] [45445.736230] [<ffffffffa08c5467>] ll_getattr+0x37/0x40 [lustre] [45445.736719] [<ffffffff81214fa3>] ? security_inode_getattr+0x23/0x30 [45445.737183] [<ffffffff81180891>] vfs_getattr+0x51/0x80 [45445.737613] [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0 [45445.738070] [<ffffffff81180920>] vfs_fstatat+0x60/0x80 [45445.738508] [<ffffffff81180a6b>] vfs_stat+0x1b/0x20 [45445.738919] [<ffffffff81180a94>] sys_newstat+0x24/0x50 [45445.739332] [<ffffffff8109b05a>] ? do_gettimeofday+0x1a/0x50 [45445.740967] [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60 [45445.741415] [<ffffffff814fb72e>] ? do_device_not_available+0xe/0x10 [45445.741875] [<ffffffff8100befb>] ? device_not_available+0x1b/0x20 [45445.742317] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b [45445.742763] [45445.744348] Kernel panic - not syncing: LBUG Crashdump is in /exports/crashdumps/192.168.10.217-2013-01-19-10\:02\:36 |
| Comments |
| Comment by Oleg Drokin [ 19/Jan/13 ] |
|
Sometimes instead of the crash I am getting this sort of a locked up thread: .111182] Lustre: DEBUG MARKER: == sanityn test 12: test lock ordering (link, sta .378607] Lustre: DEBUG MARKER: start dir: /mnt/lustre/lockdir=144115205255725085 .812526] LNet: Service thread pid 12451 was inactive for 40.00s. The thread migh .813279] Pid: 12451, comm: mdt00_001 .813563] .813564] Call Trace: .813979] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30 .814234] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0 .814507] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10 .814762] [<ffffffffa0e00771>] cfs_waitq_timedwait+0x11/0x20 [libcfs] .815052] [<ffffffffa053184d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc] .815338] [<ffffffffa052d360>] ? ldlm_expired_completion_wait+0x0/0x260 [ptlrpc] .815791] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 .816062] [<ffffffffa0530f98>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc] .816351] [<ffffffffa0531370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] .816692] [<ffffffffa09dc830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] .816980] [<ffffffffa09dedc0>] mdt_object_lock0+0x360/0xb10 [mdt] .817278] [<ffffffffa09dc830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt] .817584] [<ffffffffa0531370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] .817878] [<ffffffffa09df614>] mdt_object_lock+0x14/0x20 [mdt] .818158] [<ffffffffa09fef3a>] mdt_reint_link+0x3fa/0x910 [mdt] .818459] [<ffffffffa0559c5e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] .818918] [<ffffffffa09faa91>] mdt_reint_rec+0x41/0xe0 [mdt] .819194] [<ffffffffa09f4133>] mdt_reint_internal+0x4f3/0x7e0 [mdt] .819495] [<ffffffffa09f4464>] mdt_reint+0x44/0xe0 [mdt] .819764] [<ffffffffa09e4782>] mdt_handle_common+0x8e2/0x1680 [mdt] .820052] [<ffffffffa0a19ed5>] mds_regular_handle+0x15/0x20 [mdt] .820352] [<ffffffffa0568483>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc] .820872] [<ffffffffa0e0065e>] ? cfs_timer_arm+0xe/0x10 [libcfs] .821172] [<ffffffffa0561221>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] .821474] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 .821758] [<ffffffffa056af8d>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc] .822055] [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] .822333] [<ffffffff8100c14a>] child_rip+0xa/0x20 .822608] [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] .822904] [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc] .823181] [<ffffffff8100c140>] ? child_rip+0x0/0x20 .823438] .. (this thread actually finished after 100 seconds) ... .864250] INFO: task statmany:14503 blocked for more than 120 seconds. .865404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag .867239] statmany D 0000000000000000 2608 14503 14484 0x00000000 .868449] ffff880055affb10 0000000000000086 ffff880055affad8 ffff880055affad4 .869752] ffff880055affaf8 ffff8800bcc23080 ffff8800063167c0 0000000000000000 .870198] ffff8800aa03e8b8 ffff880055afffd8 000000000000fba8 ffff8800aa03e8b8 .870647] Call Trace: .870842] [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0 .871102] [<ffffffff81054aaa>] ? enqueue_task_fair+0x14a/0x4e0 .871356] [<ffffffff81096d6a>] ? sched_clock_cpu+0x6a/0x110 .871610] [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30 .871867] [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30 .872156] [<ffffffff814f9ec7>] ? down_read+0x37/0x40 .872474] [<ffffffffa0717cba>] lov_lsm_addref+0x2a/0x130 [lov] .872796] [<ffffffffa0717dea>] lov_lsm_get+0x2a/0x40 [lov] .873063] [<ffffffffa0b28c95>] ccc_inode_lsm_get+0x15/0x20 [lustre] .873332] [<ffffffffa0addb83>] ll_merge_lvb+0x53/0x5e0 [lustre] .873600] [<ffffffff814f919e>] ? mutex_unlock+0xe/0x10 .873863] [<ffffffffa0f6dd0a>] ? cl_lock_mutex_put+0x7a/0x90 [obdclass] .874149] [<ffffffffa0b283a8>] cl_glimpse_lock+0x328/0x480 [lustre] .874445] [<ffffffffa0b28a65>] cl_glimpse_size0+0x1a5/0x1d0 [lustre] .874735] [<ffffffffa0ae2298>] ll_inode_revalidate_it+0x198/0x1c0 [lustre] .875036] [<ffffffffa0ae2309>] ll_getattr_it+0x49/0x170 [lustre] .875322] [<ffffffffa0ae2467>] ll_getattr+0x37/0x40 [lustre] .875598] [<ffffffff81214fa3>] ? security_inode_getattr+0x23/0x30 .875875] [<ffffffff81180891>] vfs_getattr+0x51/0x80 .876154] [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0 .876439] [<ffffffff81180920>] vfs_fstatat+0x60/0x80 .876696] [<ffffffff81180a6b>] vfs_stat+0x1b/0x20 .876947] [<ffffffff81180a94>] sys_newstat+0x24/0x50 .877203] [<ffffffff8109b05a>] ? do_gettimeofday+0x1a/0x50 .877473] [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60 .877751] [<ffffffff814fb72e>] ? do_device_not_available+0xe/0x10 .878001] [<ffffffff8100befb>] ? device_not_available+0x1b/0x20 .878277] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b .878552] INFO: task ll_imp_inval:14515 blocked for more than 120 seconds. .878838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag .879297] ll_imp_inval D 0000000000000003 4640 14515 2 0x00000000 .879607] ffff8800b5f31980 0000000000000046 0000000000000000 ffffffff810385d8 .880107] ffff8800ffffffff 00075e5906b4fe40 ffff8800ffffffff ffff8800b5f31980 .880590] ffff8800ac106af8 ffff8800b5f31fd8 000000000000fba8 ffff8800ac106af8 .881064] Call Trace: .881268] [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0 .881548] [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0 .881807] [<ffffffff814fad23>] rwsem_down_write_failed+0x23/0x30 .882054] [<ffffffff8127c133>] call_rwsem_down_write_failed+0x13/0x20 .882339] [<ffffffff814f9e85>] ? down_write+0x45/0x50 .882605] [<ffffffffa0716e44>] lov_conf_lock+0x34/0xc0 [lov] .882877] [<ffffffffa0717268>] lov_conf_set+0x48/0x6a0 [lov] .883163] [<ffffffffa0f65aad>] ? cl_env_nested_get+0x5d/0xc0 [obdclass] .883465] [<ffffffffa0e189eb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs] .883771] [<ffffffffa0f63d08>] cl_conf_set+0x58/0x100 [obdclass] .884076] [<ffffffffa0adb1c8>] ll_layout_conf+0x78/0x250 [lustre] .884365] [<ffffffffa0b0e947>] ll_md_blocking_ast+0x3e7/0x720 [lustre] .884655] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 .884936] [<ffffffffa051106c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc] .885243] [<ffffffffa052afda>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc] .885561] [<ffffffffa052f8cb>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc] .885848] [<ffffffffa0519c1d>] cleanup_resource+0x12d/0x300 [ptlrpc] .886128] [<ffffffffa0519e1f>] ldlm_resource_clean+0x2f/0x60 [ptlrpc] .886993] [<ffffffffa0e167af>] cfs_hash_for_each_relax+0x17f/0x360 [libcfs] .887434] [<ffffffffa0519df0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc] .887717] [<ffffffffa0519df0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc] .887997] [<ffffffffa0e1817f>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs] .888463] [<ffffffffa051685e>] ldlm_namespace_cleanup+0x2e/0xc0 [ptlrpc] .888800] [<ffffffffa04bbc81>] mdc_import_event+0x1c1/0xa30 [mdc] .889143] [<ffffffffa057993f>] ptlrpc_invalidate_import+0x29f/0x8c0 [ptlrpc] .889596] [<ffffffffa0e10401>] ? libcfs_debug_msg+0x41/0x50 [libcfs] .889870] [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr .890341] [<ffffffffa0579faf>] ptlrpc_invalidate_import_thread+0x4f/0x2e0 [ptlrp .890807] [<ffffffff8100c14a>] child_rip+0xa/0x20 .891077] [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr .892918] [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr .893398] [<ffffffff8100c140>] ? child_rip+0x0/0x20 (these two threads never complete) |
| Comment by Jinshan Xiong (Inactive) [ 22/Jan/13 ] |
[ 756.853670] LustreError: 25456:0:(lov_object.c:595:lov_layout_change()) ASSERTION( list_empty(&hdr->coh_locks) ) failed: [ 756.854210] LustreError: 25456:0:(lov_object.c:595:lov_layout_change()) LBUG [ 756.854498] Pid: 25456, comm: ll_sa_25403 [ 756.854724] [ 756.854725] Call Trace: [ 756.855090] [<ffffffffa0449915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [ 756.855355] [<ffffffffa0449f17>] lbug_with_loc+0x47/0xb0 [libcfs] [ 756.855617] [<ffffffffa0a711af>] lov_layout_change+0x2df/0x350 [lov] [ 756.855874] [<ffffffffa0a714b7>] lov_conf_set+0x297/0x6a0 [lov] [ 756.856124] [<ffffffffa045a401>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [ 756.856384] [<ffffffffa04629eb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs] [ 756.856807] [<ffffffffa061ad08>] cl_conf_set+0x58/0x100 [obdclass] [ 756.857073] [<ffffffffa0dd71c8>] ll_layout_conf+0x78/0x250 [lustre] [ 756.857334] [<ffffffffa0de62ea>] ll_layout_lock_set+0x40a/0xd30 [lustre] [ 756.857599] [<ffffffffa0947320>] ? lmv_set_lock_data+0x100/0x480 [lmv] [ 756.857862] [<ffffffffa0de73ef>] ll_layout_refresh+0x7df/0xd80 [lustre] [ 756.858128] [<ffffffffa0e0a560>] ? ll_md_blocking_ast+0x0/0x720 [lustre] [ 756.858411] [<ffffffffa0765370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc] [ 756.858685] [<ffffffffa0e2d2ed>] vvp_io_fini+0xfd/0x1b0 [lustre] [ 756.858939] [<ffffffffa0a7b7ad>] ? lov_io_fini+0x16d/0x3d0 [lov] [ 756.859202] [<ffffffffa062c807>] cl_io_fini+0x77/0x260 [obdclass] [ 756.859460] [<ffffffffa0e24985>] cl_glimpse_size0+0xc5/0x1d0 [lustre] [ 756.859725] [<ffffffffa0e1cde2>] ll_agl_trigger+0xd2/0x350 [lustre] [ 756.859983] [<ffffffffa0e233a7>] ll_statahead_thread+0x2d7/0xf40 [lustre] [ 756.860245] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20 [ 756.860501] [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre] [ 756.860776] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 756.861012] [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre] [ 756.861279] [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre] [ 756.861537] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 756.861769] [ 756.885413] Kernel panic - not syncing: LBUG |
| Comment by Jinshan Xiong (Inactive) [ 22/Jan/13 ] |
|
I've known the root cause of the problem, need time to cook a patch |
| Comment by Jinshan Xiong (Inactive) [ 24/Jan/13 ] |
|
patches are at: http://review.whamcloud.com/5157, http://review.whamcloud.com/5158, http://review.whamcloud.com/5159. Oleg is verifying them. |
| Comment by Oleg Drokin [ 08/Feb/13 ] |
|
Patches landed, issue no longer visible |
| Comment by John Hammond [ 28/Jun/13 ] |
|
Saw this running racer today on 2.4.51-3-g9f5eea8. Reproduces much more readily if you do: # llmount.sh
# cd /mnt/lustre
# touch 0 1
# while true; do
lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
done &
# while true; do
lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
done &
The same is true of |
| Comment by Andreas Dilger [ 28/Jun/13 ] |
|
Do you think this is the same bug as was previously being hit here, or a different bug with the same symptom (this wasn't hit with racer, and was apparently fixed by landing the patch)? Reopening this bug (which is marked a blocker) that was fixed and landed for 2.4.0 complicates the tracking process. It might be better to close this one again and open a separate bug related to racer + layout swap. While the symptom (LASSERT()) is serious, the stress of swapping the layout of two files in a loop is very unlikely to be seen in real life, so I won't rank this as a blocker. |
| Comment by John Hammond [ 29/Jun/13 ] |
|
Unfortunately this is still present in 2.4. Oleg's description matches up very well with what I'm seeing on 2.4 and 2.5, so reopening seems best. Based on stack traces and the shared reproducer, I assume this is really the same issue as |
| Comment by Jinshan Xiong (Inactive) [ 30/Jun/13 ] |
|
Duplicate of |
| Comment by Bruno Faccini (Inactive) [ 04/Jul/13 ] |
|
TGCC site hits the same issue/LBUG running an MPI-IO app on 2.1.3 ( |