[LU-3614] Kernel Panic "osc_lock_detach" Created: 22/Jul/13  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Rustem Bikboulatov Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: None
Environment:

Linux 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP


Attachments: GIF File 20140113 - Hardware Diagram v0.1_R3.gif    
Severity: 3
Rank (Obsolete): 9290

 Description   

Hello,

Our OST servers periodically crash with the following kernel panic:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 9
Modules linked in: lmv(U) obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ses enclosure sg microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core ib_mthca ib_mad ib_core igb dca ext3 jbd mbcache sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 26235, comm: ldlm_bl_04 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0976751>] [<ffffffffa0976751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP: 0018:ffff88016223bd40 EFLAGS: 00010206
RAX: ffffffffa0993100 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
RDX: 000000000000e4f9 RSI: ffff8801f08ead78 RDI: ffffffffa0993100
RBP: ffff88016223bd70 R08: 0000000000000000 R09: ffff88014f9a2400
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff8801f08ead78
R13: ffff8801f41a9b58 R14: ffff8801f41a9b58 R15: ffff88020cf45240
FS: 00007f01e69b9700(0000) GS:ffff8800282a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000042b650 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_04 (pid: 26235, threadinfo ffff88016223a000, task ffff8801ad57e080)
Stack:
ffff88016223bd40 ffff8801f08ead78 0000000000000000 ffff88022ce47c18
<d> ffff8801f41a9b58 ffff88020cf45240 ffff88016223bdc0 ffffffffa0976ab4
<d> ffffffffa04f944d 0000000000000000 ffff88016223bda0 ffff8801f41a9b58
Call Trace:
[<ffffffffa0976ab4>] osc_lock_cancel+0xa4/0x1b0 [osc]
[<ffffffffa04f944d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa04ff225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04fff0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0977bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa062a123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa062a561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: fd 48 c7 c7 00 31 99 a0 e8 bd 60 b7 e0 49 8b 5c 24 28 48 85 db 0f 84 7f 00 00 00 49 c7 44 24 28 00 00 00 00 48 c7 c0 00 31 99 a0 <48> c7 83 70 01 00 00 00 00 00 00 49 c7 44 24 60 00 00 00 00 66
RIP [<ffffffffa0976751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP <ffff88016223bd40>



 Comments   
Comment by Rustem Bikboulatov [ 31/Jul/13 ]

Hello,

Today we hit yet another kernel panic, this time on a different OSS server; the console log is below. We have a crash dump file and are ready to send it for analysis if you need it. These problems worry us, since we use Lustre in a news television production workflow and the crashes put on-air operation at risk. Please help us diagnose and fix the problem. Thanks.

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 6
Modules linked in: lmv(U) obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ses enclosure microcode serio_raw sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core ioatdma ib_mthca ib_mad ib_core igb dca ext3 jbd mbcache sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 8815, comm: ldlm_bl_28 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0978751>] [<ffffffffa0978751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP: 0018:ffff8802d8525d40 EFLAGS: 00010206
RAX: ffffffffa0995100 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
RDX: 00000000000069ea RSI: ffff8801a4fd7ee8 RDI: ffffffffa0995100
RBP: ffff8802d8525d70 R08: 0000000000000000 R09: ffff8802f4493000
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff8801a4fd7ee8
R13: ffff8802e39ecaa0 R14: ffff8802e39ecaa0 R15: ffff880118e55240
FS: 00007fcdae134700(0000) GS:ffff8801c5840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001ed4d38 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_28 (pid: 8815, threadinfo ffff8802d8524000, task ffff88003ee8e080)
Stack:
ffff8802d8525d40 ffff8801a4fd7ee8 0000000000000000 ffff88011c005dd8
<d> ffff8802e39ecaa0 ffff880118e55240 ffff8802d8525dc0 ffffffffa0978ab4
<d> ffffffffa04fb44d 0000000000000000 ffff8802d8525da0 ffff8802e39ecaa0
Call Trace:
[<ffffffffa0978ab4>] osc_lock_cancel+0xa4/0x1b0 [osc]
[<ffffffffa04fb44d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa0501225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa0501f0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0979bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa062c123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa062c561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa062c2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa062c2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa062c2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: fd 48 c7 c7 00 51 99 a0 e8 bd 40 b7 e0 49 8b 5c 24 28 48 85 db 0f 84 7f 00 00 00 49 c7 44 24 28 00 00 00 00 48 c7 c0 00 51 99 a0 <48> c7 83 70 01 00 00 00 00 00 00 49 c7 44 24 60 00 00 00 00 66
RIP [<ffffffffa0978751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP <ffff8802d8525d40>

Comment by Rustem Bikboulatov [ 07/Aug/13 ]

It happened again on another server.

[root@n23 lustre_2.1.5]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/127.0.0.1-2013-08-07-17\:00\:44/vmcore

crash 6.0.4-2.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2013-08-07-17:00:44/vmcore [PARTIAL DUMP]
CPUS: 16
DATE: Wed Aug 7 16:59:40 2013
UPTIME: 28 days, 17:16:23
LOAD AVERAGE: 0.13, 0.32, 0.33
TASKS: 848
NODENAME: n23
RELEASE: 2.6.32-279.19.1.el6_lustre.x86_64
VERSION: #1 SMP Wed Mar 20 16:37:18 PDT 2013
MACHINE: x86_64 (2400 Mhz)
MEMORY: 12 GB
PANIC: ""
PID: 26404
COMMAND: "ldlm_bl_10"
TASK: ffff880287ea6080 [THREAD_INFO: ffff880287ea0000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 0
Modules linked in: lmv(U) obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ses enclosure microcode sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core ib_mthca ib_mad ib_core igb dca ext3 jbd mbcache sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 26404, comm: ldlm_bl_10 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0976751>] [<ffffffffa0976751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP: 0018:ffff880287ea1d40 EFLAGS: 00010206
RAX: ffffffffa0993100 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
RDX: 0000000000006415 RSI: ffff880193938e30 RDI: ffffffffa0993100
RBP: ffff880287ea1d70 R08: 0000000000000000 R09: ffff8801bafa1200
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff880193938e30
R13: ffff8801842aec10 R14: ffff8801842aec10 R15: ffff88016101bd80
FS: 00007fd3193ef700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000024585e0 CR3: 0000000001a85000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_10 (pid: 26404, threadinfo ffff880287ea0000, task ffff880287ea6080)
Stack:
ffff880287ea1d40 ffff880193938e30 0000000000000000 ffff880179cb26d8
<d> ffff8801842aec10 ffff88016101bd80 ffff880287ea1dc0 ffffffffa0976ab4
<d> ffffffffa04f944d 0000000000000000 ffff880287ea1da0 ffff8801842aec10
Call Trace:
[<ffffffffa0976ab4>] osc_lock_cancel+0xa4/0x1b0 [osc]
[<ffffffffa04f944d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa04ff225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04fff0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0977bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa062a123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa062a561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa062a2e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: fd 48 c7 c7 00 31 99 a0 e8 bd 60 b7 e0 49 8b 5c 24 28 48 85 db 0f 84 7f 00 00 00 49 c7 44 24 28 00 00 00 00 48 c7 c0 00 31 99 a0 <48> c7 83 70 01 00 00 00 00 00 00 49 c7 44 24 60 00 00 00 00 66
RIP [<ffffffffa0976751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP <ffff880287ea1d40>

crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 3014316 11.5 GB ----
FREE 596526 2.3 GB 19% of TOTAL MEM
USED 2417790 9.2 GB 80% of TOTAL MEM
SHARED 803219 3.1 GB 26% of TOTAL MEM
BUFFERS 3691 14.4 MB 0% of TOTAL MEM
CACHED 1671331 6.4 GB 55% of TOTAL MEM
SLAB 473084 1.8 GB 15% of TOTAL MEM

TOTAL SWAP 524286 2 GB ----
SWAP USED 11 44 KB 0% of TOTAL SWAP
SWAP FREE 524275 2 GB 99% of TOTAL SWAP

Hello, anybody there?

Comment by Rustem Bikboulatov [ 09/Aug/13 ]

Hello again!

We installed a Lustre client on a server separate from the OSS nodes; this server ran only the Lustre client and our application software. A day later we got the same symptom, a kernel panic. So the problem is most likely in the Lustre client alone.

[root@r01 ~]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/127.0.0.1-2013-08-08-20\:42\:27/vmcore

crash 6.0.4-2.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2013-08-08-20:42:27/vmcore [PARTIAL DUMP]
CPUS: 16
DATE: Thu Aug 8 20:41:22 2013
UPTIME: 04:19:08
LOAD AVERAGE: 0.04, 0.95, 1.89
TASKS: 568
NODENAME: r01
RELEASE: 2.6.32-279.19.1.el6_lustre.x86_64
VERSION: #1 SMP Wed Mar 20 16:37:18 PDT 2013
MACHINE: x86_64 (2400 Mhz)
MEMORY: 12 GB
PANIC: ""
PID: 3170
COMMAND: "ldlm_bl_00"
TASK: ffff88033482d540 [THREAD_INFO: ffff880338c30000]
CPU: 5
STATE: TASK_RUNNING (PANIC)

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 5
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) vfat fat usb_storage autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa microcode serio_raw ses enclosure sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core ib_mthca ib_mad ib_core igb dca ext3 jbd mbcache sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 3170, comm: ldlm_bl_00 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0953751>] [<ffffffffa0953751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP: 0018:ffff880338c31d40 EFLAGS: 00010206
RAX: ffffffffa0970100 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
RDX: 000000000000e841 RSI: ffff88010049dcc0 RDI: ffffffffa0970100
RBP: ffff880338c31d70 R08: 0000000000000000 R09: ffff8801d7171400
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff88010049dcc0
R13: ffff8801b858eaa0 R14: ffff8801b858eaa0 R15: ffff8801b6309900
FS: 00007f8ad6ef7700(0000) GS:ffff8801c5820000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8d0a82d000 CR3: 0000000001a85000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_00 (pid: 3170, threadinfo ffff880338c30000, task ffff88033482d540)
Stack:
ffff880338c31d40 ffff88010049dcc0 0000000000000000 ffff880186358698
<d> ffff8801b858eaa0 ffff8801b6309900 ffff880338c31dc0 ffffffffa0953ab4
<d> ffffffffa04d644d 0000000000000000 ffff880338c31da0 ffff8801b858eaa0
Call Trace:
[<ffffffffa0953ab4>] osc_lock_cancel+0xa4/0x1b0 [osc]
[<ffffffffa04d644d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa04dc225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04dcf0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0954bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa0607123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa0607561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa06072e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa06072e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa06072e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: fd 48 c7 c7 00 01 97 a0 e8 bd 90 b9 e0 49 8b 5c 24 28 48 85 db 0f 84 7f 00 00 00 49 c7 44 24 28 00 00 00 00 48 c7 c0 00 01 97 a0 <48> c7 83 70 01 00 00 00 00 00 00 49 c7 44 24 60 00 00 00 00 66
RIP [<ffffffffa0953751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP <ffff880338c31d40>

crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 3014310 11.5 GB ----
FREE 511917 2 GB 16% of TOTAL MEM
USED 2502393 9.5 GB 83% of TOTAL MEM
SHARED 1791768 6.8 GB 59% of TOTAL MEM
BUFFERS 357 1.4 MB 0% of TOTAL MEM
CACHED 1794256 6.8 GB 59% of TOTAL MEM
SLAB 599425 2.3 GB 19% of TOTAL MEM

TOTAL SWAP 524286 2 GB ----
SWAP USED 0 0 0% of TOTAL SWAP
SWAP FREE 524286 2 GB 100% of TOTAL SWAP

Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ]

Running the latest b2_1 build #220, I hit this same crash with the reproducer from LU-4112.
I will submit a patch soon.

Comment by Rustem Bikboulatov [ 15/Nov/13 ]

Bruno, that's good news; we will wait for the patch.
I also want to draw attention to LU-3766. Perhaps these two bugs are related, since both appear when applications make heavy use of file locking (the flock mount option).

Comment by Patrick Farrell (Inactive) [ 07/Jan/14 ]

Bruno - Any word on this one?

For what it's worth, on the Cray side we think this may be a cl_lock-related bug, similar to LU-3889.

Rustem - Can you share details of what you are running and what the setup is when you hit this bug? We'd like to try to get debugging information, but do not have a reliable reproducer.

Comment by Rustem Bikboulatov [ 14/Jan/14 ]

Lustre Cluster Diagram

Comment by Rustem Bikboulatov [ 14/Jan/14 ]

Patrick, here is the cluster configuration:

Lustre Server MGS/MDS - mmp-2
Lustre Servers OSS - n11, n12, n13, n14, n15, n21, n22, n23, n24, n25
Lustre Clients - r01, r02, r03, r04, mmp-1, vn-1, cln01, cln02, cln03, cln04

(see the attached diagram "20140113 - Hardware Diagram v0.1_R3.gif")

Environment:
Linux 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP

Mount points:

OSS:
/dev/md11 on /lustre/ost type lustre (rw,noauto,_netdev,abort_recov)

MGS/MDS:

/dev/lustre_mgs on /lustre/mgs type lustre (rw,noauto,_netdev,abort_recov)
/dev/lustre_mdt1 on /lustre/mdt1 type lustre (rw,noauto,_netdev,abort_recov)

Clients:
mmp-2@tcp:mmp-1@tcp:/lustre1 on /array1 type lustre (rw,noauto,_netdev,flock,abort_recov,lazystatfs)

Stripe config:

[root@mmp-1 ~]# lfs getstripe /array1/.
/array1/.
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1

kdump config:

core_collector makedumpfile -c --message-level 1 -d 31

Application Software Description (LRVfarm):

LRVfarm is software that processes media files (video + audio) and creates proxy video (low-resolution video). Each running task has a small task file located on the Lustre file system. LRVfarm runs several worker processes, 8 on each client (r01, r02, r03, r04), for a total of 32 LRVfarm processes in the cluster. Each process uses file locking while performing tasks: an LRVfarm process locks a task file and performs the task, while the other LRVfarm processes (on the local client and on remote clients) also attempt to lock that task file and fail as long as the task is running, because the file is already locked by another LRVfarm process. Periodically, all clients crash (kernel panic) with various errors. Here are the recent crash statistics:
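The task-claiming pattern described above (take an exclusive lock on a task file, skip the task if another worker already holds it) can be sketched roughly as follows; the function name and the task-file path are illustrative only, not taken from LRVfarm:

```python
import fcntl
import os

def try_claim(task_path):
    """Try to claim a task file with a non-blocking exclusive flock.

    Returns an open file descriptor on success (the caller keeps it open
    for the duration of the task), or None if another worker already
    holds the lock. With the 'flock' mount option, the exclusion also
    covers workers on other Lustre clients.
    """
    fd = os.open(task_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the attempt non-blocking: a busy task file
        # raises BlockingIOError instead of waiting for the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None
```

With the file system mounted with `flock`, these locks are coherent across all clients; with `localflock` they would only exclude workers on the same node.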

Client r01
==================

2013-12-30-13:27:09 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-12-18-17:53:58 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-11-27-11:33:23 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-11-22-19:13:07 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-29-18:38:51 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-06-13:47:27 osc_lock_detach+0x51/0x1b0 [osc]
2013-09-29-07:10:08 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-08-29-16:34:36 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-08-08-20:42:27 osc_lock_detach+0x51/0x1b0 [osc]

Client r02
==================

2014-01-13-16:50:05 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2014-01-07-23:05:13 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-12-25-10:01:04 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-12-16-00:34:43 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-11-25-12:54:33 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-11-14-21:59:24 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-09-11-15:57:46 ASSERTION( stripe < lio->lis_stripe_count ) failed:

Client r03
==================

2013-11-06-20:25:26 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-16-19:04:01 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-14-18:38:01 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-09-30-19:26:50 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-09-12-19:55:17 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-09-10-11:22:25 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-08-19-01:19:55 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-08-13-10:15:56 ASSERTION( stripe < lio->lis_stripe_count ) failed:

Client r04
==================

2013-12-08-06:19:11 cl_lock_mutex_get+0x2e/0xe0 [obdclass]
2013-12-03-13:32:04 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-11-16-08:49:25 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-21-22:14:46 ASSERTION( stripe < lio->lis_stripe_count ) failed:
2013-10-13-18:48:27 osc_lock_detach+0x51/0x1b0 [osc]
2013-09-01-14:26:14 osc_lock_detach+0x51/0x1b0 [osc]

The last crash with the "osc_lock_detach" error occurred on October 13, 2013. When this ticket was created, "osc_lock_detach" was the most common error; perhaps the load conditions on Lustre were different then. Now the most frequent error is "ASSERTION( stripe < lio->lis_stripe_count ) failed" (LU-3766). To reproduce the "osc_lock_detach" error we would have to wait a few months, with no guarantee that it would reproduce at all. Nevertheless, I can send the crash dump (vmcore) of the "osc_lock_detach" error from October 2013 in kdump format (see the kdump config above).

Comment by Rustem Bikboulatov [ 14/Apr/14 ]

Today the "osc_lock_detach" kernel panic occurred again:

==================================================================

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 4
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ses enclosure ib_mthca ib_mad ib_core microcode igb sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache mpt2sas(U) scsi_transport_sas raid_class sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 23021, comm: ldlm_bl_37 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa0952751>] [<ffffffffa0952751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP: 0018:ffff8801dacafd40 EFLAGS: 00010206
RAX: ffffffffa096f100 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000000
RDX: 000000000000677e RSI: ffff880190cf8e30 RDI: ffffffffa096f100
RBP: ffff8801dacafd70 R08: 0000000000000000 R09: ffff88026637d200
R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff880190cf8e30
R13: ffff880129b02ef0 R14: ffff880129b02ef0 R15: ffff8801bca1d6c0
FS: 00007f89a61e3700(0000) GS:ffff880145800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc7675a9000 CR3: 000000011d86a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ldlm_bl_37 (pid: 23021, threadinfo ffff8801dacae000, task ffff880292206ae0)
Stack:
ffff8801dacafd40 ffff880190cf8e30 0000000000000000 ffff8801d5c20518
<d> ffff880129b02ef0 ffff8801bca1d6c0 ffff8801dacafdc0 ffffffffa0952ab4
<d> ffffffffa04d544d 0000000000000000 ffff8801dacafda0 ffff880129b02ef0
Call Trace:
[<ffffffffa0952ab4>] osc_lock_cancel+0xa4/0x1b0 [osc]
[<ffffffffa04d544d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
[<ffffffffa04db225>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa04dbf0b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0953bba>] osc_ldlm_blocking_ast+0x13a/0x380 [osc]
[<ffffffffa0606123>] ldlm_handle_bl_callback+0x123/0x2e0 [ptlrpc]
[<ffffffffa0606561>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
[<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffffa06062e0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: fd 48 c7 c7 00 f1 96 a0 e8 bd a0 b9 e0 49 8b 5c 24 28 48 85 db 0f 84 7f 00 00 00 49 c7 44 24 28 00 00 00 00 48 c7 c0 00 f1 96 a0 <48> c7 83 70 01 00 00 00 00 00 00 49 c7 44 24 60 00 00 00 00 66
RIP [<ffffffffa0952751>] osc_lock_detach+0x51/0x1b0 [osc]
RSP <ffff8801dacafd40>

==========================================================================

Before this crash, we had changed the mount options for the Lustre clients on all servers: we dropped the "flock" mode, since we suspected the crashes were associated with it. The mount options are now:

Clients (r01, r02, r03, r04, mmp-1, vn-1):
mmp-2@tcp:mmp-1@tcp:/lustre1 on /array1 type lustre (rw,noauto,_netdev,noflock,abort_recov,lazystatfs)

Clients (cln01, cln02, cln03, cln04):
mmp-2@tcp:mmp-1@tcp:/lustre1 on /array1 type lustre (rw,noauto,_netdev,localflock,abort_recov,lazystatfs)

I also want to note that LU-4886 (also filed by me) is associated with this issue: the two errors have very similar stack traces, and both occur on the same servers (r01, r02, r03, r04) with the Lustre 2.1.5 client.

Generated at Sat Feb 10 06:30:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.