[LU-7966] LNetError: 4231:0:(linux-cpu.c:1081:cfs_cpu_init()) LBUG Created: 31/Mar/16  Updated: 31/Mar/16  Resolved: 31/Mar/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Frank Heckes (Inactive)
Resolution: Done Votes: 0
Labels: soak
Environment:

lola
build: 2.8 GA + patches


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20160324' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160324).
LNet runs over IB (all nodes are equipped with Mellanox 4x QDR HCAs).

Sequence of events

  • The error occurred after an MDS node panicked (see LU-7935) during MDT failback at 2016-03-29 14:53 (umount of MDT). The MDT node (lola-9)
    was unusable (i.e. no primary or secondary resources mounted) when the error occurred on the Lustre client described below. This event is possibly unrelated.
  • The Lustre client crashed with the following error message:
    <0>LNetError: 4231:0:(linux-cpu.c:1081:cfs_cpu_init()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) || (((cpumask_size())) <= (2 << 12) && ((((((gfp_t)0x10u) | ((gfp_t)0x40u)))) & (((gfp_t)0x20u)))) != 0 ) failed: 
    <0>LNetError: 4231:0:(linux-cpu.c:1081:cfs_cpu_init()) LBUG
    <0>Kernel panic - not syncing: LBUG in interrupt.
    <0>
    <4>Pid: 4231, comm: modprobe Not tainted 2.6.32-504.30.3.el6.x86_64 #1
    <4>Call Trace:
    <4> [<ffffffff815293fc>] ? panic+0xa7/0x16f
    <4> [<ffffffffa0478ebd>] ? lbug_with_loc+0x8d/0xb0 [libcfs]
    <4> [<ffffffffa047dcfc>] ? cfs_cpu_init+0xc7c/0xcb0 [libcfs]
    <4> [<ffffffff810a5525>] ? atomic_notifier_chain_register+0x55/0x60
    <4> [<ffffffffa047875c>] ? libcfs_register_panic_notifier+0x1c/0x20 [libcfs]
    <4> [<ffffffffa0482b70>] ? init_libcfs_module+0x0/0x340 [libcfs]
    <4> [<ffffffffa0482b97>] ? init_libcfs_module+0x27/0x340 [libcfs]
    <4> [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
    <4> [<ffffffff810c0181>] ? sys_init_module+0xe1/0x250
    <4> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
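
The constant parts of the expanded assertion above can be evaluated by hand. The sketch below does only that arithmetic. The names are hypothetical labels, not identifiers from the Lustre or kernel sources, and reading the first clause as an interrupt-context test on preempt_count and the hex constants as allocation flags is an assumption based on the "LBUG in interrupt" panic line. The cpumask size used in the usage lines is made up.

```python
# Constant sub-expressions copied verbatim from the expanded assertion text.
# All names here are hypothetical labels chosen for readability.
MASK_A = ((1 << 10) - 1) << ((0 + 8) + 8)        # 10 bits starting at bit 16
MASK_B = ((1 << 8) - 1) << (0 + 8)               # 8 bits starting at bit 8
MASK_C = ((1 << 1) - 1) << (((0 + 8) + 8) + 10)  # 1 bit at bit 26
CONTEXT_MASK = MASK_A | MASK_B | MASK_C

ALLOC_FLAGS = 0x10 | 0x40   # flags passed in, per the assertion text
REQUIRED_FLAG = 0x20        # flag the second clause tests for
SIZE_LIMIT = 2 << 12        # 8192 bytes


def assertion_holds(preempt_count: int, cpumask_size: int) -> bool:
    """Evaluate the assertion for a given preempt_count and cpumask size."""
    in_masked_context = (preempt_count & CONTEXT_MASK) != 0
    small_and_flagged = (
        cpumask_size <= SIZE_LIMIT and (ALLOC_FLAGS & REQUIRED_FLAG) != 0
    )
    return (not in_masked_context) or small_and_flagged


if __name__ == "__main__":
    print(hex(CONTEXT_MASK))
    print(ALLOC_FLAGS & REQUIRED_FLAG)
    print(assertion_holds(0, 128))
    print(assertion_holds(1 << 8, 128))
```

Because 0x10 | 0x40 has no bit in common with 0x20, the second clause can never be true for these flags, so the assertion reduces to requiring that none of the masked preempt_count bits are set.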
    

Attached files:
console, messages, vmcore-dmsg.txt of the affected node lola-33.
A crash dump file is available.



 Comments   
Comment by Frank Heckes (Inactive) [ 31/Mar/16 ]

One related question: shouldn't an LBUG be added to health_check?

[root@lola-33 127.0.0.1-2015-09-29-16:30:51]# cat /proc/fs/lustre/health_check 
healthy
[root@lola-33 127.0.0.1-2015-09-29-16:30:51]# date ; uptime
Thu Mar 31 04:25:11 PDT 2016
 04:25:11 up 6 days, 20:06,  1 user,  load average: 4.00, 3.97, 3.79
Comment by Frank Heckes (Inactive) [ 31/Mar/16 ]

Sorry, the ticket can be closed. The error is obsolete. I was confused by the date on which the LBUG happened.

Generated at Sat Feb 10 02:13:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.