[LU-2790] Failure to allocate osd keys leads to ofd_intent_policy()) ASSERTION( res_lvb != ((void *)0) ) failed Created: 10/Feb/13  Updated: 31/Mar/13  Resolved: 31/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-2867 2.1.4<->2.4.0 interop: parallel-scale... Resolved
Related
is related to LU-1431 Support for larger than 1MB sequentia... Resolved
is related to LU-2748 OSD uses kmalloc with high order to a... Resolved
Severity: 3
Rank (Obsolete): 6756

 Description   

After the recent landing of the LU-1431 patch and the corresponding crop of allocation failures due to LU-2748...

[25420.342529] ll_ost00_005: page allocation failure. order:5, mode:0x50
[25420.342845] Pid: 22594, comm: ll_ost00_005 Not tainted 2.6.32-debug #6
[25420.343134] Call Trace:
[25420.343356]  [<ffffffff81125bd6>] ? __alloc_pages_nodemask+0x976/0x9e0
[25420.343652]  [<ffffffff81160a62>] ? kmem_getpages+0x62/0x170
[25420.344062]  [<ffffffff8116349c>] ? fallback_alloc+0x1bc/0x270
[25420.344431]  [<ffffffff81162db7>] ? cache_grow+0x4d7/0x520
[25420.344748]  [<ffffffff81163188>] ? ____cache_alloc_node+0xa8/0x200
[25420.345035]  [<ffffffff81163838>] ? __kmalloc+0x208/0x2a0
[25420.345319]  [<ffffffffa09efc00>] ? cfs_alloc+0x30/0x60 [libcfs]
[25420.345614]  [<ffffffffa09efc00>] ? cfs_alloc+0x30/0x60 [libcfs]
[25420.345899]  [<ffffffffa048953e>] ? osd_key_init+0x1e/0x5d0 [osd_ldiskfs]
[25420.346231]  [<ffffffffa0eae3df>] ? keys_fill+0x6f/0x190 [obdclass]
[25420.346534]  [<ffffffffa0eb1e8b>] ? lu_context_init+0xab/0x260 [obdclass]
[25420.346842]  [<ffffffffa0eb205e>] ? lu_env_init+0x1e/0x30 [obdclass]
[25420.347134]  [<ffffffffa05bc90c>] ? ost_blocking_ast+0x5c/0xca0 [ost]
[25420.347443]  [<ffffffffa10ebded>] ? ldlm_work_bl_ast_lock+0xdd/0x290 [ptlrpc]
[25420.347770]  [<ffffffffa112c18f>] ? ptlrpc_set_wait+0x6f/0x880 [ptlrpc]
[25420.348102]  [<ffffffff81090154>] ? __init_waitqueue_head+0x24/0x40
[25420.348548]  [<ffffffffa09ef8a5>] ? cfs_waitq_init+0x15/0x20 [libcfs]
[25420.348977]  [<ffffffffa112876e>] ? ptlrpc_prep_set+0x11e/0x300 [ptlrpc]
[25420.349293]  [<ffffffffa10ebd10>] ? ldlm_work_bl_ast_lock+0x0/0x290 [ptlrpc]
[25420.349796]  [<ffffffffa10ee19b>] ? ldlm_run_ast_work+0x1db/0x460 [ptlrpc]
[25420.350126]  [<ffffffffa110580f>] ? ldlm_process_extent_lock+0x1af/0xa90 [ptlrpc]
[25420.350606]  [<ffffffffa10ee7b4>] ? ldlm_lock_enqueue+0x394/0x870 [ptlrpc]
[25420.350923]  [<ffffffffa1114e87>] ? ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc]
[25420.351417]  [<ffffffffa1115a86>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
[25420.351749]  [<ffffffffa1115a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[25420.352248]  [<ffffffffa05bc8b0>] ? ost_blocking_ast+0x0/0xca0 [ost]
[25420.352574]  [<ffffffffa11123c0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
[25420.353124]  [<ffffffffa05c4807>] ? ost_handle+0x1be7/0x4590 [ost]
[25420.353543]  [<ffffffffa09fb204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[25420.353945]  [<ffffffffa1144e03>] ? ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
[25420.354432]  [<ffffffffa09ef65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[25420.354741]  [<ffffffffa113de91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[25420.355023]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[25420.355299]  [<ffffffffa11478cd>] ? ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[25420.355606]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.355890]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[25420.356188]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.356495]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.356787]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
....
[25420.500609] LustreError: 22594:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 114: rc -12
[25420.502383] LustreError: 18292:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0000-osc-ffff88003f9d2bf0 lock: ffff880046658db0/0xd15dff8dc7742d63 lrc: 6/0,2 mode: --/PW res: 114/8589935616 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x0 nid: local remote: 0xd15dff8dc77438ca expref: -99 pid: 20430 timeout: 0 lvb_type: 1
...
[25420.604693] LustreError: 22594:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 116: rc -12
[25420.604777] LustreError: 18293:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0000-osc-ffff880054e39bf0 lock: ffff880084978db0/0xd15dff8dc7744183 lrc: 6/0,2 mode: --/PW res: 116/8589935616 rrc: 1 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x0 nid: local remote: 0xd15dff8dc77442a9 expref: -99 pid: 20443 timeout: 0 lvb_type: 1
[25420.620838]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[25420.621142]  [<ffffffffa11478cd>] ? ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[25420.621445]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.621760]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[25420.622044]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.622339]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.622648]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
...
[25420.702106] LustreError: 22594:0:(ofd_dlm.c:177:ofd_intent_policy()) ASSERTION( res_lvb != ((void *)0) ) failed: 
[25420.702490] LustreError: 22594:0:(ofd_dlm.c:177:ofd_intent_policy()) LBUG
[25420.702705] Pid: 22594, comm: ll_ost00_005
[25420.702853] 
[25420.702853] Call Trace:
[25420.703112]  [<ffffffffa09ee915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[25420.703314]  [<ffffffffa09eef17>] lbug_with_loc+0x47/0xb0 [libcfs]
[25420.703492]  [<ffffffffa0697c85>] ofd_intent_policy+0x795/0x7c0 [ofd]
[25420.703712]  [<ffffffffa10ee70a>] ldlm_lock_enqueue+0x2ea/0x870 [ptlrpc]
[25420.703906]  [<ffffffffa1114e87>] ldlm_handle_enqueue0+0x4f7/0x1090 [ptlrpc]
[25420.704121]  [<ffffffffa1115a86>] ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
[25420.704332]  [<ffffffffa1115a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[25420.704667]  [<ffffffffa05bc8b0>] ? ost_blocking_ast+0x0/0xca0 [ost]
[25420.704926]  [<ffffffffa11123c0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
[25420.705271]  [<ffffffffa05c4807>] ost_handle+0x1be7/0x4590 [ost]
[25420.705511]  [<ffffffffa09fb204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[25420.705715]  [<ffffffffa1144e03>] ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
[25420.706015]  [<ffffffffa09ef65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[25420.706213]  [<ffffffffa113de91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[25420.706397]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[25420.706585]  [<ffffffffa11478cd>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[25420.706775]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.706956]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[25420.707232]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.707441]  [<ffffffffa1146dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[25420.707634]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[25420.707805] 
[25420.708112] Kernel panic - not syncing: LBUG
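
For scale, the "order:5" in the first allocation failure above means the allocator was asked for 2^5 physically contiguous pages. The helper below is only an arithmetic sketch; the 4 KiB page size is an assumption (typical for x86_64, not stated in the log), and the names are hypothetical.

```c
#include <stddef.h>

/* Assumed page size: 4 KiB is typical on x86_64 but is not stated in the log. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Bytes of physically contiguous memory requested by an order-N page
 * allocation: the page size multiplied by 2^N.
 */
static unsigned long sketch_order_to_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}
```

Under that assumption an order-5 request is 128 KiB of contiguous memory, which is far harder to satisfy on a fragmented system than a single-page request.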


 Comments   
Comment by Alex Zhuravlev [ 10/Feb/13 ]

probably it makes sense to allocate I/O-related data (like dr_pages, etc), separately and on-demand.
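
A minimal user-space sketch of the "separately and on-demand" idea follows. It is not the osd_ldiskfs code: the structure, field, and function names are hypothetical, and calloc()/free() stand in for the kernel allocators. The point it illustrates is that initializing the per-thread state allocates nothing large, so it cannot fail on a high-order allocation; the page array is allocated only when an I/O path first asks for it.

```c
#include <stdlib.h>

/* Hypothetical per-thread state; not the real osd thread info layout. */
struct sketch_thread_info {
	int     sti_small_state;  /* cheap state, always present */
	void  **sti_pages;        /* large I/O array, allocated lazily */
	size_t  sti_npages;       /* capacity of sti_pages */
};

/* Initialization touches no large buffers, so it needs no big allocation. */
static void sketch_info_init(struct sketch_thread_info *info)
{
	info->sti_small_state = 0;
	info->sti_pages = NULL;
	info->sti_npages = 0;
}

/* Allocate (or grow) the page array only when an I/O path needs it. */
static void **sketch_info_get_pages(struct sketch_thread_info *info,
				    size_t npages)
{
	void **pages;

	if (info->sti_pages != NULL && info->sti_npages >= npages)
		return info->sti_pages;

	pages = calloc(npages, sizeof(*pages));
	if (pages == NULL)
		return NULL;  /* only this I/O request fails */

	free(info->sti_pages);
	info->sti_pages = pages;
	info->sti_npages = npages;
	return pages;
}

/* Release the lazily allocated array, if any. */
static void sketch_info_fini(struct sketch_thread_info *info)
{
	free(info->sti_pages);
	info->sti_pages = NULL;
	info->sti_npages = 0;
}
```

With this shape, threads that never perform bulk I/O (such as one that only runs a blocking AST) never pay for the large array.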

Comment by Jian Yu [ 07/Mar/13 ]

The 2.1.4<->2.4.0 interop testing is affected by the issue in this ticket. Please refer to the failure in LU-2867.

Comment by Alex Zhuravlev [ 07/Mar/13 ]

I think this will be solved by http://review.whamcloud.com/#change,5444

Comment by Peter Jones [ 07/Mar/13 ]

Thanks Alex! Yu Jian could you please test this patch to confirm whether this is indeed the case?

Comment by Jian Yu [ 07/Mar/13 ]

Yu Jian could you please test this patch to confirm whether this is indeed the case?

Sure, I created http://review.whamcloud.com/5647 to test patch with Lustre b2_1 client. Let's wait for the test result.

Comment by Jian Yu [ 11/Mar/13 ]

Sure, I created http://review.whamcloud.com/5647 to test patch with Lustre b2_1 client. Let's wait for the test result.

The parallel-scale test passed: https://maloo.whamcloud.com/test_sessions/a8135b98-8a1e-11e2-b891-52540035b04c

Comment by Peter Jones [ 11/Mar/13 ]

Closing as a duplicate of LU-2748

Comment by Oleg Drokin [ 11/Mar/13 ]

I disagree with Alex' assessment.

LU-2748 only masked the symptoms here by making the original allocation more robust.
But should it fail for other reasons, this bug will still occur.

Comment by Alex Zhuravlev [ 11/Mar/13 ]

This code was taken directly from obdfilter (which has the same assert) and it was never a problem. That said, I don't mean the code is absolutely correct, but I don't think this will be a problem with ofd.

Comment by nasf (Inactive) [ 11/Mar/13 ]

The failure occurred in ofd_lvbo_init() as follows:

==========================
OBD_ALLOC_PTR(lvb);
if (lvb == NULL)
        GOTO(out, rc = -ENOMEM);
==========================

The needed size for the LVB is just 56 bytes, very small.

Comment by Oleg Drokin [ 13/Mar/13 ]

Well, now that the failed caller is exposed, we just need to fix the caller to do something more sensible.

But this is not a huge priority because it's not expected to really hit ever.

Comment by nasf (Inactive) [ 13/Mar/13 ]

This is the patch to handle lvbo_init() failure:

http://review.whamcloud.com/#change,5699
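
The contents of that change are not reproduced in this ticket. Purely as an illustration of the general idea discussed above (returning an error when the LVB is missing instead of asserting), here is a user-space sketch. All names are hypothetical, the `simulate_oom` parameter exists only to exercise the failure path, and this should not be read as the actual patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the lock value block and the DLM resource. */
struct sketch_lvb {
	unsigned long long size;
};

struct sketch_resource {
	struct sketch_lvb *lvb;  /* NULL if initialization failed */
};

/* Mirrors the quoted allocation: fail with -ENOMEM, leaving lvb NULL. */
static int sketch_lvbo_init(struct sketch_resource *res, int simulate_oom)
{
	struct sketch_lvb *lvb;

	if (simulate_oom)
		return -ENOMEM;

	lvb = calloc(1, sizeof(*lvb));
	if (lvb == NULL)
		return -ENOMEM;

	res->lvb = lvb;
	return 0;
}

/*
 * The consumer checks for a missing LVB and propagates an error to the
 * caller rather than asserting, so a failed allocation fails one request
 * instead of bringing down the whole server.
 */
static int sketch_intent_policy(const struct sketch_resource *res,
				unsigned long long *size_out)
{
	if (res->lvb == NULL)
		return -ENOMEM;

	*size_out = res->lvb->size;
	return 0;
}
```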

Comment by Jodi Levi (Inactive) [ 13/Mar/13 ]

Per discussions with Oleg, reducing priority to major.

Comment by Peter Jones [ 31/Mar/13 ]

Landed for 2.4
