[LU-3650] osc_ldlm_blocking_ast() LBUG Created: 27/Jul/13  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Lustre 2.4.0-RC1_3chaos, PPC64 client


Severity: 3
Rank (Obsolete): 9393

 Description   

We are hitting the LBUG in osc_ldlm_blocking_ast() that says "This should never happen". The backtrace looks like:

PID: 29660  TASK: c00000098d26bab0  CPU: 42  COMMAND: "ldlm_bl_44"
 #0 [c000000dc9f3f7c0] .crash_kexec at c0000000000e5aa4
 #1 [c000000dc9f3f9c0] .panic at c0000000005c4f40
 #2 [c000000dc9f3fa50] .lbug_with_loc at d00000000acd14e0 [libcfs]
 #3 [c000000dc9f3fae0] .osc_ldlm_blocking_ast at d00000000afbe1d8 [osc]
 #4 [c000000dc9f3fbc0] .ldlm_cancel_callback at d00000000bf3148c [ptlrpc]
 #5 [c000000dc9f3fc60] .ldlm_cli_cancel_local at d00000000bf4db78 [ptlrpc]
 #6 [c000000dc9f3fd20] .ldlm_cli_cancel_list_local at d00000000bf4f80c [ptlrpc]
 #7 [c000000dc9f3fe40] .ldlm_bl_thread_main at d00000000bf5c07c [ptlrpc]
 #8 [c000000dc9f3ff90] .kernel_thread at c000000000032fd4

Both times we have seen this (that I am aware of), the LBUG was preceded shortly before by a kernel page allocation error, so that may be part of the recipe for triggering this bug.

This is a PPC64 client running lustre 2.4.0-RC1_3chaos.



 Comments   
Comment by Keith Mannthey (Inactive) [ 27/Jul/13 ]

Do you have the page allocation error? Do you know which allocations fail right before the crash?

Comment by Peter Jones [ 27/Jul/13 ]

Lai

Do you have anything else to add?

Peter

Comment by Christopher Morrone [ 29/Jul/13 ]

This was the most recent failure:

2013-07-26 14:41:40 ib0: failed to allocate receive buffer 75
2013-07-26 14:42:57 ib0: Reap connection for gid fe80:0000:0000:0000:0002:c903:004b:150d
2013-07-26 14:42:57 ib0: neigh free for 000048 fe80:0000:0000:0000:0002:c903:004b:150d
2013-07-26 14:42:57 ib0: Destroy active connection 0x5fa head 0x91b7 tail 0x91b7
2013-07-26 14:43:30 ib0: Request connection 0x5fc for gid fe80:0000:0000:0000:0002:c903:004b:150d qpn 0x48
2013-07-26 14:43:30 ib0: REP received.
2013-07-26 14:43:30 ib0: REQ arrived
2013-07-26 14:47:07 ldlm_bl_44: page allocation failure. order:0, mode:0x50
2013-07-26 14:47:07 Call Trace:
2013-07-26 14:47:07 [c000000dc9f3f2f0] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
2013-07-26 14:47:07 [c000000dc9f3f3a0] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-07-26 14:47:07 [c000000dc9f3f540] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-07-26 14:47:07 [c000000dc9f3f5d0] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-07-26 14:47:07 [c000000dc9f3f6b0] [c0000000001ae31c] .__kmalloc+0x1fc/0x260
2013-07-26 14:47:07 [c000000dc9f3f770] [d00000000acd2724] .cfs_alloc+0x34/0x80 [libcfs]
2013-07-26 14:47:07 [c000000dc9f3f800] [d00000000ba74c14] .lu_context_init+0x84/0x440 [obdclass]
2013-07-26 14:47:07 [c000000dc9f3f8c0] [d00000000ba80fa4] .cl_env_new+0x234/0x510 [obdclass]
2013-07-26 14:47:07 [c000000dc9f3f9a0] [d00000000ba814c8] .cl_env_get+0x88/0x230 [obdclass]
2013-07-26 14:47:07 [c000000dc9f3fa50] [d00000000ba816fc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-07-26 14:47:07 [c000000dc9f3fae0] [d00000000afbde78] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-07-26 14:47:07 [c000000dc9f3fbc0] [d00000000bf3148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc]
2013-07-26 14:47:07 [c000000dc9f3fc60] [d00000000bf4db78] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc]
2013-07-26 14:47:07 [c000000dc9f3fd20] [d00000000bf4f80c] .ldlm_cli_cancel_list_local+0x14c/0x410 [ptlrpc]
2013-07-26 14:47:07 [c000000dc9f3fe40] [d00000000bf5c07c] .ldlm_bl_thread_main+0x19c/0x610 [ptlrpc]
2013-07-26 14:47:07 [c000000dc9f3ff90] [c000000000032fd4] .kernel_thread+0x54/0x70

It looks like a hit a while back was preceded by several overlapping allocation failures, so its a little harder to read:

2013-05-17 10:35:01 ldlm_bl_145: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 [c00000032b1db180] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
2013-05-17 10:35:01 [c00000032b1db230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-05-17 10:35:01 [c00000032b1db3d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c00000032b1db460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-05-17 10:35:01 [c00000032b1db540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 [c00000032b1db5f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs]
2013-05-17 10:35:01 [c00000032b1db670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc]
2013-05-17 10:35:01 [c00000032b1db730] [d00000000b8534b0] .keys_fill+0xd0/0x280 [obdclass]
2013-05-17 10:35:01 ldlm_bl_08: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 ldlm_bl_121: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c00000032b1db800] [d00000000b857dc8] .lu_context_init+0x108/0x440 [obdclass]
2013-05-17 10:35:01 [c00000032b1db8c0] [d00000000b8638a4] .cl_env_new+0x234/0x510 [obdclass]
2013-05-17 10:35:01 ldlm_bl_01: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 ldlm_bl_32: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 [c000000dd2903180] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
2013-05-17 10:35:01 [c000000dd2903230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-05-17 10:35:01 [c000000c1d9ef180] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
2013-05-17 10:35:01 [c00000032b1db9a0] [d00000000b863dc8] .cl_env_get+0x88/0x230 [obdclass]
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 [c000000c1d9ef230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-05-17 10:35:01 [c000000f4e87f180] [c000000000012f04] .show_stack+0x74/0x1c0 (unreliable)
2013-05-17 10:35:01 [c000000f4a17f180] [c000000000012f04] .show_stack+0x74/0x1c0[c000000dd29033d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c000000dd2903460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-05-17 10:35:01 [c000000dd2903540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 [c000000dd29035f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs] (unreliable)
2013-05-17 10:35:01 [c000000f4e87f230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-05-17 10:35:01 [c000000f4e87f3d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c000000f4e87f460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-05-17 10:35:01 [c000000f4e87f540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 [c000000c1d9ef3d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c000000c1d9ef460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-05-17 10:35:01 [c000000c1d9ef540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 [c000000c1d9ef5f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs][c00000032b1dba50] [d00000000b863ffc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-05-17 10:35:01 ldlm_bl_38: page allocation failure. order:0, mode:0x50
2013-05-17 10:35:01 
2013-05-17 10:35:01 Call Trace:
2013-05-17 10:35:01 [c000000c1d9ef670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc][c000000ddb6d7180] [c000000000012f04] .show_stack+0x74/0x1c0
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000f4e87f5f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs]
2013-05-17 10:35:01 [c000000f4e87f670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc]
2013-05-17 10:35:01 [c00000032b1dbae0] [d00000000c0bcaf8] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-05-17 10:35:01 [c00000032b1dbbc0] [d00000000bd9148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc]
2013-05-17 10:35:01 
2013-05-17 10:35:01  (unreliable)
2013-05-17 10:35:01 [c000000f4e87f730] [d00000000b8534b0] .keys_fill+0xd0/0x280 [obdclass][c000000ddb6d7230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9ef730] [d00000000b8534b0] .keys_fill+0xd0/0x280 [obdclass][c000000f4a17f230] [c0000000001669e0] .__alloc_pages_nodemask+0x640/0x8c0[c00000032b1dbc60] [d00000000bdadd48] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc]
2013-05-17 10:35:01 [c000000f4e87f800] [d00000000b857dc8] .lu_context_init+0x108/0x440 [obdclass][c00000032b1dbd20] [d00000000bdaf9dc] .ldlm_cli_cancel_list_local+0x14c/0x410 [ptlrpc]
2013-05-17 10:35:01 
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d73d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c000000ddb6d7460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310[c000000f4a17f3d0] [c0000000001abf9c] .kmem_getpages+0x7c/0x1a0
2013-05-17 10:35:01 [c000000dd2903670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc][c000000f4a17f460] [c0000000001ad3a4] .fallback_alloc+0x244/0x310
2013-05-17 10:35:01 
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000dd2903730] [d00000000b8534b0] .keys_fill+0xd0/0x280 [obdclass]
2013-05-17 10:35:01 [c000000dd2903800] [d00000000b857dc8] .lu_context_init+0x108/0x440 [obdclass][c000000f4e87f8c0] [d00000000b8638a4] .cl_env_new+0x234/0x510 [obdclass]
2013-05-17 10:35:01 [c000000dd29038c0] [d00000000b8638a4] .cl_env_new+0x234/0x510 [obdclass]
2013-05-17 10:35:01 

2013-05-17 10:35:01 [c000000c1d9ef800] [d00000000b857dc8] .lu_context_init+0x108/0x440 [obdclass][c000000ddb6d7540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9ef8c0] [d00000000b8638a4] .cl_env_new+0x234/0x510 [obdclass][c000000ddb6d75f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9ef9a0] [d00000000b863dc8] .cl_env_get+0x88/0x230 [obdclass]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d7670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc][c000000c1d9efa50] [d00000000b863ffc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-05-17 10:35:01 [c00000032b1dbe40] [d00000000bdbc24c] .ldlm_bl_thread_main+0x19c/0x610 [ptlrpc]
2013-05-17 10:35:01 [c000000dd29039a0] [d00000000b863dc8] .cl_env_get+0x88/0x230 [obdclass][c000000ddb6d7730] [d00000000b8534b0] .keys_fill+0xd0/0x280 [obdclass]
2013-05-17 10:35:01 [c000000f4e87f9a0] [d00000000b863dc8] .cl_env_get+0x88/0x230 [obdclass][c000000dd2903a50] [d00000000b863ffc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-05-17 10:35:01 [c000000f4a17f540] [c0000000001af1cc] .kmem_cache_alloc+0x18c/0x220
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c00000032b1dbf90] [c000000000032fd4] .kernel_thread+0x54/0x70
2013-05-17 10:35:01 [c000000dd2903ae0] [d00000000c0bcaf8] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-05-17 10:35:01 [c000000dd2903bc0] [d00000000bd9148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc][c000000f4e87fa50] [d00000000b863ffc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d7800] [d00000000b857dc8] .lu_context_init+0x108/0x440 [obdclass][c000000c1d9efae0] [d00000000c0bcaf8] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d78c0] [d00000000b8638a4] .cl_env_new+0x234/0x510 [obdclass][c000000c1d9efbc0] [d00000000bd9148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d79a0] [d00000000b863dc8] .cl_env_get+0x88/0x230 [obdclass][c000000c1d9efc60] [d00000000bdadd48] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc]
2013-05-17 10:35:01 
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d7a50] [d00000000b863ffc] .cl_env_nested_get+0x8c/0xf0 [obdclass]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9efd20] [d00000000bdaf9dc] .ldlm_cli_cancel_list_local+0x14c/0x410 [ptlrpc][c000000ddb6d7ae0] [d00000000c0bcaf8] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-05-17 10:35:01 Mem-Info:
2013-05-17 10:35:01 [c000000dd2903c60] [d00000000bdadd48] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc][c000000f4a17f5f0] [d00000000aa72588] .cfs_mem_cache_alloc+0x28/0x40 [libcfs]
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9efe40] [d00000000bdbc24c] .ldlm_bl_thread_main+0x19c/0x610 [ptlrpc]
2013-05-17 10:35:01 
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000c1d9eff90] [c000000000032fd4] .kernel_thread+0x54/0x70[c000000f4e87fae0] [d00000000c0bcaf8] .osc_ldlm_blocking_ast+0x58/0x4f0 [osc]
2013-05-17 10:35:01 Node 0 DMA per-cpu:
2013-05-17 10:35:01 
2013-05-17 10:35:01 [c000000ddb6d7bc0] [d00000000bd9148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc][c000000f4a17f670] [d00000000c0b2cec] .osc_session_init+0x5c/0x350 [osc]
2013-05-17 10:35:01 [c000000f4e87fbc0] [d00000000bd9148c] .ldlm_cancel_callback+0x9c/0x310 [ptlrpc]CPU    0: hi:    6, btch:   1 usd:   4
2013-05-17 10:35:01 CPU    1: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    2: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    3: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    4: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    5: hi:    6, btch:   1 usd:   3
2013-05-17 10:35:01 CPU    6: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    7: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU    8: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 Mem-Info:
2013-05-17 10:35:01 
2013-05-17 10:35:01 Node 0 
2013-05-17 10:35:01 [c000000dd2903d20] [d00000000bdaf9dc] .ldlm_cli_cancel_list_local+0x14c/0x410 [ptlrpc]CPU    9: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 CPU   10: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 [c000000f4e87fc60] [d00000000bdadd48] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc]DMA per-cpu:
2013-05-17 10:35:01 [c000000ddb6d7c60] [d00000000bdadd48] .ldlm_cli_cancel_local+0xd8/0x7d0 [ptlrpc]CPU    0: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 
2013-05-17 10:35:01 CPU    1: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 [c000000ddb6d7d20] [d00000000bdaf9dc] .ldlm_cli_cancel_list_local+0x14c/0x410 [ptlrpc]CPU    2: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 
2013-05-17 10:35:01 CPU    3: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 [c000000ddb6d7e40] [d00000000bdbc24c] .ldlm_bl_thread_main+0x19c/0x610 [ptlrpc]CPU    4: hi:    6, btch:   1 usd:   5
2013-05-17 10:35:01 
2013-05-17 10:35:01 CPU    5: hi:    6, btch:   1 usd:   3
2013-05-17 10:35:01 [c000000ddb6d7f90] [c000000000032fd4] .kernel_thread+0x54/0x70CPU    6: hi:    6, btch:   1 usd:   5
Comment by Lai Siyao [ 30/Jul/13 ]

IMO it's best to prealloc env for ldlm blocking threads, I'll cook a patch later.

Comment by Lai Siyao [ 31/Jul/13 ]

Patch is on http://review.whamcloud.com/#/c/7189/

Comment by Lai Siyao [ 01/Aug/13 ]

This looks to be more complicated than original thought: osc_ldlm_blocking_ast can be called under several different context:
1. cancel lock in ldlm_callback_handler, because it failed to handle blocking AST to ldlm bl threads

    ptlrpc_main ->
        ldlm_callback_handler ->
            l_blocking_ast

2. cancel lock in context of ldlm bl threads. This is the case happened in this ticket.

    ldlm_bl_thread_main ->
        l_blocking_ast

3. cancel lock voluntarily, this is simple and fine
4. same as 3, but it may be called in the context of outer IO

    cl_enqueue ->
        osc_enqueue_base ->
            ldlm_prep_elc_req ->
                l_blocking_ast

The original patch only fixes case 2, and there doesn't look to be simple way to fix case 1 and 4. so we may have to get rid of the LBUG for env allocation failure, and return error anyway.

Comment by Lai Siyao [ 06/Aug/13 ]

On second thought, for case 4 this is an extra ELC (ELC count is too big to pack in request, and the remaining locks should be canceled in a separate request), failure to do it won't cause any trouble.

And for case 1, generally the blocking ast is handled to ldlm bl threads, and in rare case it will fail (men allocation failure), and upon its failure, we don't need to guarantee blocking ast success, and system will recover to normal state.

So in the latest patch we just let blocking ast fail upon env get failure instead of LBUG.

Generated at Sat Feb 10 01:35:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.