[LU-6049] General Protection Fault at echo_session_key_fini+0xa9 Created: 18/Dec/14 Updated: 14/Jun/18 Resolved: 20/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Valentin (Inactive) | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
kernel 2.6.32-279.5.2 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 16855 | ||||||||
| Description |
|
General Protection Fault at echo_session_key_fini+0xa9 A crash on an OSS with the following trace has been encountered with lustre 2.1.6 and kernel 2.6.32-279.5.2 2013-10-18 07:57:21 general protection fault: 0000 [#1] SMP 2013-10-18 07:57:21 last sysfs file: /sys/devices/pci0000:80/0000:80:02.2/0000:84:00.0/host10/port-10:3/end_device-10:3/target10:0:3/10:0:3:4/state 2013-10-18 07:57:21 CPU 0 2013-10-18 07:57:21 Modules linked in: nfs fscache obdecho(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportf s sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod uinput usbhid hid sg mpt2sas scsi_transport_sas raid_class sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext4 mbcache jbd2 ehci_hcd sd_mod crc_t10dif ahci megaraid_sas [last unloaded: scsi_wait_scan] 2013-10-18 07:57:21 2013-10-18 07:57:21 Pid: 14573, comm: obd_zombid Not tainted 2.6.32-279.5.2.bl6.Bull.35_restricted.x86_64 #1 Bull SAS bullx R/X9DRH-7TF/7F/iTF/iF 2013-10-18 07:57:21 RIP: 0010:[<ffffffffa0a2dc49>] [<ffffffffa0a2dc49>] echo_session_key_fini+0xa9/0x100 [obdecho] 2013-10-18 07:57:21 RSP: 0018:ffff88105451bce0 EFLAGS: 00010246 2013-10-18 07:57:21 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000002 2013-10-18 07:57:21 RDX: ffff8808587c7540 RSI: 5a5a5a5a5a5a5a5a RDI: 0000000000000000 2013-10-18 07:57:21 RBP: ffff88105451bcf0 R08: ffffffff81ac5ee8 R09: 0000000000000140 2013-10-18 07:57:21 R10: 0000000000000000 R11: 000000000000000c R12: ffff880f1ad9a2f8 2013-10-18 07:57:21 R13: 0000000000000010 R14: ffff88105451be20 R15: ffff8802f65a46d0 2013-10-18 07:57:21 FS: 00007f14eea7f700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000 2013-10-18 07:57:21 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-10-18 07:57:21 CR2: 00007f5cb4fc8000 CR3: 0000000001a06000 CR4: 00000000000406f0 2013-10-18 07:57:21 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-10-18 07:57:21 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-10-18 07:57:21 Process obd_zombid (pid: 14573, threadinfo ffff881054518000, task ffff881068e0c080) 2013-10-18 07:57:21 Stack: 2013-10-18 07:57:21 0000000000000010 ffffffffa0a3cf40 ffff88105451bd20 ffffffffa0554d90 2013-10-18 07:57:21 <d> ffff88105451bd20 ffffffffa0a3cf40 ffff880f1ad9a2f8 ffff88071cd302c0 2013-10-18 07:57:21 <d> ffff88105451bd40 ffffffffa055514b 0000000000000000 ffff8802f65a46c8 2013-10-18 07:57:21 Call Trace: 2013-10-18 07:57:21 [<ffffffffa0554d90>] key_fini+0x60/0x1e0 [obdclass] 2013-10-18 07:57:21 [<ffffffffa055514b>] lu_context_key_quiesce+0x5b/0x90 [obdclass] 2013-10-18 07:57:21 [<ffffffffa05551d9>] lu_context_key_quiesce_many+0x59/0x80 [obdclass] 2013-10-18 07:57:21 [<ffffffffa0a2d910>] echo_type_stop+0x20/0x30 [obdecho] 2013-10-18 07:57:21 [<ffffffffa0554322>] lu_device_fini+0x52/0xd0 [obdclass] 2013-10-18 07:57:21 [<ffffffffa0a2fed7>] echo_device_free+0x247/0x510 [obdecho] 2013-10-18 07:57:22 [<ffffffffa053871d>] class_decref+0x46d/0x590 [obdclass] 2013-10-18 07:57:22 [<ffffffffa0523bae>] obd_zombie_impexp_cull+0x31e/0x620 [obdclass] 2013-10-18 07:57:22 [<ffffffffa0523f75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass] 2013-10-18 07:57:22 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass] 2013-10-18 07:57:22 [<ffffffff8100412a>] child_rip+0xa/0x20 2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass] 2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass] 2013-10-18 07:57:22 [<ffffffff81004120>] ? child_rip+0x0/0x20 2013-10-18 07:57:22 Code: 99 02 00 00 48 c7 05 d3 49 01 00 00 00 00 00 c7 05 c1 49 01 00 10 00 00 00 e8 14 74 a2 ff 48 b8 5a 5a 5a 5a 5a 5a 5a 5a 48 89 de <48> 89 03 48 8b 3d d5 69 01 00 e8 98 bb a1 ff 48 83 c4 08 5b c9 2013-10-18 07:57:22 RIP [<ffffffffa0a2dc49>] echo_session_key_fini+0xa9/0x100 [obdecho] 2013-10-18 07:57:22 RSP <ffff88105451bce0> 2013-10-18 07:57:22 Initializing cgroup subsys cpuset 2013-10-18 07:57:22 Initializing cgroup subsys cpu 2013-10-18 07:57:22 Linux version 2.6.32-279.5.2.bl6.Bull.35_restricted.x86_64 (derr@atlas.frec.bull.fr) (gcc version 4.4.6 20120305 (Bull 4.4.6-4) (GCC) ) #1 SMP Wed Apr 24 14:29:54 CEST 2013 The crash occured in echo_session_key_fini(). static void echo_session_key_fini(...)
{
struct echo_session_info *session = data;
OBD_SLAB_FREE_PTR(session, echo_session_kmem); <================ GPF here, session is 0x5a5a5a5a5a5a5a5a
}
After OBD_SLAB_FREE_PTR() expansion:
------------------------------------
static void echo_session_key_fini(...)
{
struct echo_session_info *session = data;
LASSERT(session);
lprocfs_counter_sub(obd_memory, OBD_MEMORY_STAT, (long)(sizeof *(session)));
CDEBUG(D_MALLOC, name " '" #session "': %d at %p.\n", (int)(sizeof *(session)), session);
memset(session, 0x5a, sizeof *(session)); <================ GPF here, session is 0x5a5a5a5a5a5a5a5a
cfs_mem_cache_free(echo_session_kmem, session);
(session) = (void *)0xdeadbeef;
}
|
| Comments |
| Comment by Patrick Valentin (Inactive) [ 18/Dec/14 ] |
|
I was not able to reproduce the same crash (probably a too small window), but with the following script, I was able to reproduce several crashes on a lustre client without any lustre file system mounted. Reproducer: N=0
while true; do
N=$((N+1))
D=$(((N%2000)*10))
modprobe obdecho
modprobe lustre > /dev/null &
usleep $D
rmmod obdecho
wait
rmmod lustre
[ -f /tmp/stop ] && break
echo -e ".\c"
[ $((N%100)) -eq 0 ] && echo "$N: (`date`)"
done 2> /dev/null
With this reproducer, we have one of the following crashes in a few tenth of minutes to one or two hours (lustre 2.1.6 and lustre 2.5.3). crash 1: key_fini(): ASSERTION( atomic_read(&key->lct_used) > 1 ) failed KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2014-12-11-13:13:52/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Thu Dec 11 13:12:04 2014
UPTIME: 00:33:04
LOAD AVERAGE: 1.07, 1.22, 1.01
TASKS: 641
NODENAME: mo48
RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64
VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014
MACHINE: x86_64 (2700 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: LBUG"
PID: 35296
COMMAND: "rmmod"
TASK: ffff88085dc78040 [THREAD_INFO: ffff88085d736000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> log | tail -60
Lustre: Echo OBD driver; http://www.lustre.org/
...
Lustre: Echo OBD driver; http://www.lustre.org/
slab error in kmem_cache_destroy(): cache `echo_thread_kmem': Can't free all objects
Pid: 35271, comm: rmmod Not tainted 2.6.32-504.1.3.el6.Bull.66.x86_64 #1
Call Trace:
[<ffffffff8117849b>] ? kmem_cache_destroy+0xbb/0xf0
[<ffffffffa070779d>] ? lu_kmem_fini+0x2d/0x50 [obdclass]
[<ffffffffa07ea181>] ? echo_client_exit+0x21/0x30 [obdecho]
[<ffffffffa07ea19e>] ? obdecho_exit+0xe/0x30 [obdecho]
[<ffffffff810bdf14>] ? sys_delete_module+0x194/0x260
[<ffffffff81531b7e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
LustreError: 35296:0:(lu_object.c:1406:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
LustreError: 35296:0:(lu_object.c:1406:key_fini()) LBUG
...
crash> bt 35296
PID: 35296 TASK: ffff88085dc78040 CPU: 1 COMMAND: "rmmod"
#0 [ffff88085d737be0] machine_kexec at ffffffff8103b7eb
#1 [ffff88085d737c40] crash_kexec at ffffffff810ca852
#2 [ffff88085d737d10] panic at ffffffff8152ae84
#3 [ffff88085d737d90] lbug_with_loc at ffffffffa05b7eeb [libcfs]
#4 [ffff88085d737db0] key_fini at ffffffffa0707bab [obdclass]
#5 [ffff88085d737de0] lu_context_key_quiesce at ffffffffa0707e8d [obdclass]
#6 [ffff88085d737e00] lu_context_key_degister at ffffffffa0707f5d [obdclass]
#7 [ffff88085d737e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass]
#8 [ffff88085d737e80] vvp_type_fini at ffffffffa0d08f3f [lustre]
#9 [ffff88085d737e90] lu_device_type_fini at ffffffffa07065e3 [obdclass]
#10 [ffff88085d737ea0] ccc_global_fini at ffffffffa0d0522c [lustre]
#11 [ffff88085d737ec0] vvp_global_fini at ffffffffa0d09d55 [lustre]
#12 [ffff88085d737ed0] cleanup_module at ffffffffa0d0f60a [lustre]
#13 [ffff88085d737ee0] sys_delete_module at ffffffff810bdf14
#14 [ffff88085d737f80] system_call_fastpath at ffffffff8100b072
RIP: 00007f43dbd61077 RSP: 00007fffbc8ea458 RFLAGS: 00010202
RAX: 00000000000000b0 RBX: ffffffff8100b072 RCX: 00007fffbc8ea380
RDX: 00007fffbc8ea44f RSI: 0000000000000880 RDI: 00007fffbc8ea460
RBP: 00007fffbc8ea4b0 R8: 00007f43dc21d700 R9: 0000000000000100
R10: 00007fffbc8ea1e0 R11: 0000000000000202 R12: 00007fffbc8ec6d7
R13: 0000000000000880 R14: 00007fffbc8ea460 R15: 0000000000000001
ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b
crash 2: lu_context_key_degister(): ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2 KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2014-12-11-14:19:24/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Thu Dec 11 14:18:03 2014
UPTIME: 01:02:10
LOAD AVERAGE: 1.30, 1.32, 1.23
TASKS: 648
NODENAME: mo48
RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64
VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014
MACHINE: x86_64 (2699 Mhz)
MEMORY: 64 GB
PANIC: "Kernel panic - not syncing: LBUG"
PID: 30694
COMMAND: "rmmod"
TASK: ffff88107b2a8040 [THREAD_INFO: ffff881069404000]
CPU: 14
STATE: TASK_RUNNING (PANIC)
crash> log | tail -60
Lustre: Echo OBD driver; http://www.lustre.org/
...
Lustre: Echo OBD driver; http://www.lustre.org/
LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2
LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) LBUG
...
crash> bt 30694
PID: 30694 TASK: ffff88107b2a8040 CPU: 14 COMMAND: "rmmod"
#0 [ffff881069405c30] machine_kexec at ffffffff8103b7eb
#1 [ffff881069405c90] crash_kexec at ffffffff810ca852
#2 [ffff881069405d60] panic at ffffffff8152ae84
#3 [ffff881069405de0] lbug_with_loc at ffffffffa05b7eeb [libcfs]
#4 [ffff881069405e00] lu_context_key_degister at ffffffffa070801d [obdclass]
#5 [ffff881069405e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass]
#6 [ffff881069405e80] echo_type_fini at ffffffffa07ea8c0 [obdecho]
#7 [ffff881069405e90] lu_device_type_fini at ffffffffa07065e3 [obdclass]
#8 [ffff881069405ea0] class_unregister_type at ffffffffa06cf921 [obdclass]
#9 [ffff881069405ec0] echo_client_exit at ffffffffa07ea175 [obdecho]
#10 [ffff881069405ed0] cleanup_module at ffffffffa07ea19e [obdecho]
#11 [ffff881069405ee0] sys_delete_module at ffffffff810bdf14
#12 [ffff881069405f80] system_call_fastpath at ffffffff8100b072
RIP: 00007f4b052d0077 RSP: 00007fff962eea48 RFLAGS: 00010206
RAX: 00000000000000b0 RBX: ffffffff8100b072 RCX: 00007fff962ee970
RDX: 00007fff962eea3f RSI: 0000000000000880 RDI: 00007fff962eea50
RBP: 00007fff962eeaa0 R8: 00007f4b0578c700 R9: 0000000000000100
R10: 00007fff962ee7d0 R11: 0000000000000206 R12: 00007fff962f06d6
R13: 0000000000000880 R14: 00007fff962eea50 R15: 0000000000000001
ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b
Sometimes we also have the following crash: lu_context_exit(): ASSERTION( key != ((void *)0) ) failed: Patches to fix this issue are available for lustre 2.1.6, lustre 2.5.3 and master branches. |
| Comment by Jodi Levi (Inactive) [ 18/Dec/14 ] |
|
Emoly, |
| Comment by Andreas Dilger [ 18/Dec/14 ] |
|
Patrick, when you write "Patches to fix this issue are available", are you referring to just the last lu_context_exit() assertion, or do you have patches for the original problem? I don't see any patches referenced here. |
| Comment by Bruno Faccini (Inactive) [ 18/Dec/14 ] |
|
I wonder if this could be related to |
| Comment by Patrick Valentin (Inactive) [ 19/Dec/14 ] |
|
Andreas, Bruno, |
| Comment by Gerrit Updater [ 22/Dec/14 ] |
|
patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13164 |
| Comment by Gerrit Updater [ 22/Dec/14 ] |
|
patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13165 |
| Comment by Gerrit Updater [ 22/Dec/14 ] |
|
patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13166 |
| Comment by Gerrit Updater [ 08/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13164/ |
| Comment by Peter Jones [ 20/May/15 ] |
|
Landed for 2.8 |