[LU-4886] Kernel Panic "cl_lock_put" Created: 12/Apr/14  Updated: 14/Apr/14  Resolved: 12/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Rustem Bikboulatov Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Linux 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP


Attachments: GIF File 20140113 - Hardware Diagram v0.1_R3.gif    
Severity: 3
Rank (Obsolete): 13522

 Description   

We have a kernel crash on Lustre Client 2.1.5 with the following log:

[root@r01 ~]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/127.0.0.1-2014-04-11-17\:36\:14/vmcore

crash 6.0.4-2.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2014-04-11-17:36:14/vmcore [PARTIAL DUMP]
CPUS: 16
DATE: Fri Apr 11 17:35:10 2014
UPTIME: 12 days, 06:09:56
LOAD AVERAGE: 1.22, 1.18, 1.57
TASKS: 604
NODENAME: r01
RELEASE: 2.6.32-279.19.1.el6_lustre.x86_64
VERSION: #1 SMP Wed Mar 20 16:37:18 PDT 2013
MACHINE: x86_64 (2400 Mhz)
MEMORY: 12 GB
PANIC: ""
PID: 28331
COMMAND: "ldlm_bl_00"
TASK: ffff880334904aa0 [THREAD_INFO: ffff88023c770000]
CPU: 6
STATE: TASK_RUNNING (PANIC)

crash> log

...

Pid: 28331, comm: ldlm_bl_00 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa04de478>] [<ffffffffa04de478>] cl_lock_put+0x118/0x490 [obdclass]
RSP: 0018:ffff88023c771c40 EFLAGS: 00010246
RAX: 0000000000000001 RBX: 5a5a5a5a5a5a5a5a RCX: ffff8801410d37b8
RDX: ffffffffa04ff485 RSI: 5a5a5a5a5a5a5a5a RDI: ffff880181fc3930
RBP: ffff88023c771c70 R08: ffffffffa04ef540 R09: 00000000000002f4
R10: 00000000deadbeef R11: 0000000000000000 R12: ffff880181fc3930
R13: ffff880181fc3930 R14: ffff88018be93420 R15: ffff88023c771ca0
FS: 00007f63a402f700(0000) GS:ffff8801c5840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003b1faabc30 CR3: 00000001cf1a7000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_00 (pid: 28331, threadinfo ffff88023c770000, task ffff880334904aa0)
Stack:
ffff88023c771c70 ffff880132aaf000 ffff88018be93420 ffff880181fc3930
<d> ffff88018be93420 ffff88023c771ca0 ffff88023c771ce0 ffffffffa0953b30
<d> ffff880200000000 ffff8801410d37b8 ffff8801410d37b8 00000002a0393092
Call Trace:
[<ffffffffa0953b30>] osc_ldlm_blocking_ast+0xb0/0x380 [osc]
[<ffffffffa05e4cc0>] ldlm_cancel_callback+0x60/0x100 [ptlrpc]
[<ffffffffa05ff14b>] ldlm_cli_cancel_local+0x7b/0x380 [ptlrpc]
[<ffffffffa0602fd8>] ldlm_cli_cancel+0x58/0x3a0 [ptlrpc]
[<ffffffffa0952af1>] osc_lock_cancel+0xe1/0x1b0 [osc]
[<ffffffffa04d544d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa04db225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04dbf0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0953bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa0606123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa0606561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 00 00 00 00 00 c7 05 5c db 06 00 01 00 00 00 48 c7 c7 a0 bf 54 a0 8b 13 4c 8b 45 08 31 c0 e8 d0 d9 ea ff eb 12 66 0f 1f 44 00 00 <48> 8b 43 28 48 8b 40 08 4c 8b 68 18 f0 ff 0b 0f 94 c0 84 c0 74
RIP [<ffffffffa04de478>] cl_lock_put+0x118/0x490 [obdclass]
RSP <ffff88023c771c40>

Cluster configuration:

Lustre Server MGS/MDS - mmp-2
Lustre Servers OSS - n11, n12, n13, n14, n15, n21, n22, n23, n24, n25
Lustre Clients - r01, r02, r03, r04, mmp-1, vn-1, cln01, cln02, cln03, cln04

(see the attached diagram "20140113 - Hardware Diagram v0.1_R3.gif")

Environment:
Linux 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP

Mount points:
OSS:
/dev/md11 on /lustre/ost type lustre (rw,noauto,_netdev,abort_recov)

MGS/MDS:
/dev/lustre_mgs on /lustre/mgs type lustre (rw,noauto,_netdev,abort_recov)
/dev/lustre_mdt1 on /lustre/mdt1 type lustre (rw,noauto,_netdev,abort_recov)

Clients (r01, r02, r03, r04, mmp-1, vn-1):
mmp-2@tcp:mmp-1@tcp:/lustre1 on /array1 type lustre (rw,noauto,_netdev,noflock,abort_recov,lazystatfs)

Clients (cln01, cln02, cln03, cln04):
mmp-2@tcp:mmp-1@tcp:/lustre1 on /array1 type lustre (rw,noauto,_netdev,localflock,abort_recov,lazystatfs)

Stripe config:
[root@mmp-1 ~]# lfs getstripe /array1/.
/array1/.
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1

kdump config:
core_collector makedumpfile -c --message-level 1 -d 31

We have a crash dump file, and if you need it for analysis, we are ready to upload it.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 12/Apr/14 ]

Can you please try the patch in LU-4558? The change for b2_1 is http://review.whamcloud.com/9939.

Comment by Jinshan Xiong (Inactive) [ 12/Apr/14 ]

Duplicate of LU-4558.

Comment by Rustem Bikboulatov [ 12/Apr/14 ]

Yes, I have seen the LU-4558 patch and reviewed it carefully. I have some questions:

1) This patch changes two functions:

cl_lock_delete0
cl_lock * cl_lock_find

The trace log in LU-4558 contains the function "cl_lock_delete0":

<4> [997.881412] [<ffffffffa05c55b5>] cl_lock_delete0+0xb5/0x1d0 [obdclass]

In my case the trace log does not contain "cl_lock_delete0"; it contains other functions instead:

[<ffffffffa04db225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04dbf0b>] cl_lock_cancel+0x13b/0x140 [obdclass]

Can the LU-4558 patch really help in my case?

2) Can I install the LU-4558 patch on only one client? Or do I need to patch every node (OSS, MDS, MGS, and all clients)?

Comment by Jinshan Xiong (Inactive) [ 13/Apr/14 ]

I think the problem you hit was caused by referencing a freed lock, which is exactly what the patch in LU-4558 tries to fix.

You don't need this patch on the server side, and yes, you can apply the patch to a few clients first and upgrade the other clients only if it works.
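
The register dump in the description fits the freed-lock reading: RBX and RSI both hold 5a5a5a5a5a5a5a5a, and the marked instruction bytes (48 8b 43 28) decode to a load from 0x28(%rbx). Below is a minimal sketch of how such a register value can be recognized. It assumes that 0x5a is the fill byte written over memory when it is freed, which is not confirmed in this ticket. The helper names are hypothetical and are not Lustre or crash APIs.

```c
#include <stdint.h>

/* Assumption: freed memory is overwritten with this byte. */
#define ASSUMED_FREE_FILL 0x5aU

/* Returns 1 if every byte of `value` equals `fill`. */
static int looks_poisoned(uint64_t value, uint8_t fill)
{
    /* Multiplying by 0x0101...01 replicates the byte into all 8 lanes. */
    uint64_t pattern = 0x0101010101010101ULL * fill;
    return value == pattern;
}

/*
 * Returns 1 if `addr` is canonical on x86_64 with 48-bit virtual
 * addresses: bits 63..47 must be all zeros or all ones.
 */
static int is_canonical_x86_64(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the upper 17 bits */
    return top == 0 || top == 0x1ffffULL;
}
```

With the values from the dump, looks_poisoned(0x5a5a5a5a5a5a5a5a, ASSUMED_FREE_FILL) returns 1, and the address 0x5a5a5a5a5a5a5a5a + 0x28 is non-canonical, whereas R12 (ffff880181fc3930) is an ordinary canonical kernel address.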

Comment by Rustem Bikboulatov [ 13/Apr/14 ]

When I try to compile lustre 2.1.5 with patch LU-4558, I get this error:

======================================================

Making all in obdclass
make[5]: Entering directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5/lustre/obdclass'
Making all in linux
make[6]: Entering directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5/lustre/obdclass'
cl_lock.c: In function 'cl_lock_delete0':
cl_lock.c:825: error: 'bool' undeclared (first use in this function)
cl_lock.c:825: error: (Each undeclared identifier is reported only once
cl_lock.c:825: error: for each function it appears in.)
cl_lock.c:825: error: expected ';' before 'in_cache'
cl_lock.c:833: error: 'in_cache' undeclared (first use in this function)
make[6]: *** [liblustreclass_a-cl_lock.o] Error 1
make[6]: *** Waiting for unfinished jobs....
make[6]: Leaving directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5/lustre/obdclass'
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5/lustre/obdclass'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5/lustre'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/build2/kernel/rpmbuild/BUILD/lustre-2.1.5'
error: Bad exit status from /var/tmp/rpm-tmp.XaX5IH (%build)

RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.XaX5IH (%build)
make[1]: *** [rpms-real] Error 1
make[1]: Leaving directory `/home/build2/lustre-release'
make: *** [rpms] Error 2

===========================================================================

It seems that the "bool" type is not defined in this build.
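
The failing object (liblustreclass_a-cl_lock.o) appears to be a userspace build of cl_lock.c, where the kernel's definition of "bool" would not be visible. Below is a minimal sketch of one possible local workaround. It assumes the patch uses "bool" only for simple flags such as in_cache. The shim and the lock_in_cache() helper are hypothetical illustrations and are not part of the LU-4558 patch.

```c
/*
 * Hypothetical compatibility shim: provide bool/true/false when the
 * compiler is not in C99 (or later) mode and <stdbool.h> is unavailable.
 */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdbool.h>
#else
typedef int bool;
#define true  1
#define false 0
#endif

/*
 * Stand-in for a flag like the `bool in_cache` declared at cl_lock.c:825.
 * Returns true when bit number `bit` is set in `flags` (bit must be
 * smaller than the width of unsigned long).
 */
static bool lock_in_cache(unsigned long flags, unsigned int bit)
{
    return ((flags >> bit) & 1UL) != 0;
}
```

An alternative under the same assumption would be to declare the flag as a plain int in the local copy of the patch.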

Comment by Rustem Bikboulatov [ 14/Apr/14 ]

In addition, we have hit a new kernel crash (on server r04) with a very similar trace log:

https://jira.hpdd.intel.com/browse/LU-3614?focusedCommentId=81519&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-81519
