[LU-3629] vvp_env_session() ASSERTION( ses != ((void *)0) ) Created: 24/Jul/13  Updated: 12/Jan/19  Resolved: 12/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Lustre 2.4.0-RC1_3chaos, PPC64 lustre client


Severity: 3
Rank (Obsolete): 9346

 Description   

A few weeks ago we had a login node (Lustre client) die with the following assertion:

(llite_internal.h:1064:vvp_env_session()) ASSERTION( ses != ((void *)0) )

It was running Lustre 2.4.0-RC1_3chaos.
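
For reference, the assertion comes from the vvp_env_session() helper in lustre/llite/llite_internal.h. The following is only a sketch of the 2.4-era definition (the exact line number and layout may differ in the 2.4.0-RC1_3chaos branch); it shows that the LBUG fires when the session key lookup on the env returns NULL:

static inline struct vvp_session *vvp_env_session(const struct lu_env *env)
{
        struct vvp_session *ses;

        /* Look up the per-session slot registered under vvp_session_key.
         * Returns NULL if the lu_env has no session context attached. */
        ses = lu_context_key_get(env->le_ses, &vvp_session_key);
        LASSERT(ses != NULL);  /* reported as ASSERTION( ses != ((void *)0) ) */
        return ses;
}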

The backtrace from crash looks like:

crash> bt
PID: 10792  TASK: c000000ef9de7da0  CPU: 28  COMMAND: "slurm_prolog"
 #0 [c000000bc3706720] .crash_kexec at c0000000000e5aa4
 #1 [c000000bc3706920] .panic at c0000000005c4f40
 #2 [c000000bc37069b0] .lbug_with_loc at d00000000aa714e0 [libcfs]
 #3 [c000000bc3706a40] .vvp_io_init at d00000000c6cc03c [lustre]
 #4 [c000000bc3706b20] .cl_io_init0 at d00000000b808024 [obdclass]
 #5 [c000000bc3706bd0] .cl_pages_prune at d00000000b7fbc18 [obdclass]
 #6 [c000000bc3706c80] .cl_object_prune at d00000000b7f1f00 [obdclass]
 #7 [c000000bc3706d30] .lov_delete_raid0 at d00000000c1fa8a4 [lov]
 #8 [c000000bc3706e50] .lov_object_delete at d00000000c1f9240 [lov]
 #9 [c000000bc3706f00] .lu_object_free at d00000000b7e3520 [obdclass]
#10 [c000000bc3706fe0] .lu_object_put at d00000000b7e7360 [obdclass]
#11 [c000000bc37070b0] .cl_object_put at d00000000b7f2c90 [obdclass]
#12 [c000000bc3707120] .cl_inode_fini at d00000000c6bfd68 [lustre]
#13 [c000000bc3707230] .ll_clear_inode at d00000000c677264 [lustre]
#14 [c000000bc3707310] .clear_inode at c0000000001e1cc8
#15 [c000000bc37073a0] .dispose_list at c0000000001e2068
#16 [c000000bc3707450] .shrink_icache_memory at c0000000001e24c4
#17 [c000000bc3707540] .shrink_slab at c00000000016ecbc
#18 [c000000bc3707600] .do_try_to_free_pages at c0000000001716b0
#19 [c000000bc3707720] .try_to_free_pages at c000000000171a88
#20 [c000000bc3707820] .__alloc_pages_nodemask at c0000000001668c0
#21 [c000000bc37079c0] .alloc_pages_vma at c0000000001a2694
#22 [c000000bc3707a70] .handle_pte_fault at c00000000017fec4
#23 [c000000bc3707b80] .do_page_fault at c0000000005c14b0
#24 [c000000bc3707e30] handle_page_fault at c00000000000520c
 Data Access error  [301] exception frame:
 R0:  0000000000000000    R1:  00000fffffffd9c0    R2:  0000040000323268   
 R3:  0000040000320878    R4:  000000000000001d    R5:  00000fffffffdebe   
 R6:  0000000000000000    R7:  00000400003210c8    R8:  0000000000000218   
 R9:  0000000010320000    R10: 0000000000000031    R11: 0000000000020001   
 R12: 0000000028002482    R13: 000004000004f040   
 NIP: 00000400001f3ad4    MSR: 800000000000d032    OR3: 0000000010340000
 CTR: 0000000000000000    LR:  00000400001f494c    XER: 0000000000000010
 CCR: 0000000028002482    MQ:  0000000000000001    DAR: 0000000010320008
 DSISR: 0000000042000000     Syscall Result: 0000000000000000

This was a PPC64 login node.
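
The trace suggests the assertion was hit while the icache shrinker was evicting a Lustre inode under memory pressure (page fault -> try_to_free_pages -> shrink_icache_memory -> clear_inode -> cl_inode_fini -> cl_object_put -> ... -> vvp_io_init). As a rough illustration of why that path ends up in vvp_env_session(), the 2.4-era vvp_io_init() obtains its per-thread IO state through the env session, approximately like this (sketch only, based on the upstream llite code of that era):

static inline struct vvp_io *vvp_env_io(const struct lu_env *env)
{
        /* vvp_io_init() starts by calling vvp_env_io(env); if the lu_env used
         * for the prune carries no session context, vvp_env_session() trips
         * the LASSERT above instead of returning a valid vvp_session. */
        return &vvp_env_session(env)->vs_ios;
}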



 Comments   
Comment by Peter Jones [ 25/Jul/13 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 29/Jul/13 ]

I don't see how this can happen from the master code. Chris, how do I check out 2.4.0-RC1_3chaos? I'd like to see if there is any difference in your branch. Thanks.

Comment by Peter Jones [ 29/Jul/13 ]

Niu

Try looking at https://github.com/chaos/lustre

Peter

Comment by Niu Yawei (Inactive) [ 05/Aug/13 ]

I still don't see how this can happen after checking the chaos code. I suppose it's rare, isn't it? Do you know whether this was happening while memory was under pressure? And are there any other abnormal logs from the client? Thanks.

Comment by Christopher Morrone [ 07/Aug/13 ]

It is fairly rare, yes. Memory pressure is certainly a possibility. The login nodes are heavily used.

Comment by Peter Jones [ 20/Jul/17 ]

Ned

Is this issue still seen on 2.8.x releases?

Peter

Comment by Peter Jones [ 12/Jan/19 ]

Closing ancient ticket.
