[LU-2970] ASSERTION( !list_empty(&h->loh_layers) ) failed, followed by a kernel panic Created: 15/Mar/13  Updated: 28/Mar/13  Resolved: 28/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: ETHz Support (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: LB
Environment:

CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
Lustre Client: v2_3_61_0 (git version)


Severity: 3
Rank (Obsolete): 7239

 Description   

One of our lustre clients crashed yesterday with the following kernel panic:

2013-03-14T16:24:25+01:00 brutus3 LustreError: 4488:0:(lu_object.h:759:lu_object_top()) ASSERTION( !list_empty(&h->loh_layers) ) failed:
2013-03-14T16:24:25+01:00 brutus3 general protection fault: 0000 [#1] SMP
2013-03-14T16:24:25+01:00 brutus3 last sysfs file: /sys/devices/system/cpu/cpu47/cache/index2/shared_cpu_map
2013-03-14T16:24:25+01:00 brutus3 CPU 28
2013-03-14T16:24:25+01:00 brutus3 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) sha512_generic sha256_generic libcfs(U) netconsole configfs panfs(P)(U) autofs4 nfs fscache nfs_acl auth_rpcgss lockd sunrpc bonding 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core power_meter sg hpilo hpwdt netxen_nic microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
2013-03-14T16:24:26+01:00 brutus3
2013-03-14T16:24:26+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P --------------- 2.6.32-279.22.1.el6.x86_64 #1 HP ProLiant DL585 G7
2013-03-14T16:24:26+01:00 brutus3 RIP: 0010:[<ffffffffa0d128cb>] [<ffffffffa0d128cb>] cl_object_top+0x1b/0x150 [obdclass]
2013-03-14T16:24:26+01:00 brutus3 RSP: 0018:ffff880551501ba0 EFLAGS: 00010206
2013-03-14T16:24:26+01:00 brutus3 RAX: 5a5a5a5a5a5a5a5a RBX: ffff880b44368400 RCX: ffff8801824afe08
2013-03-14T16:24:26+01:00 brutus3 RDX: 5a5a5a5a5a5a5a5a RSI: ffffffffa10dd860 RDI: ffff880388e153c8
2013-03-14T16:24:26+01:00 brutus3 RBP: ffff880551501bb0 R08: 0000000000000000 R09: 0000000000000000
2013-03-14T16:24:26+01:00 brutus3 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88099c331e00
2013-03-14T16:24:26+01:00 brutus3 R13: ffff8812ee62b290 R14: ffff880388e153c8 R15: ffff8811ca34ebc8
2013-03-14T16:24:27+01:00 brutus3 FS: 00007f248980f700(0000) GS:ffff88044e440000(0000) knlGS:0000000008ec3830
2013-03-14T16:24:27+01:00 brutus3 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2013-03-14T16:24:27+01:00 brutus3 CR2: 00000000006d3a30 CR3: 0000001835b46000 CR4: 00000000000006e0
2013-03-14T16:24:27+01:00 brutus3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-03-14T16:24:27+01:00 brutus3 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-03-14T16:24:27+01:00 brutus3 Process ldlm_bl_101 (pid: 9597, threadinfo ffff880551500000, task ffff8808351d6040)
2013-03-14T16:24:27+01:00 brutus3 Stack:
2013-03-14T16:24:27+01:00 brutus3 ffff880551501bc0 ffff880b44368400 ffff880551501bc0 ffffffffa0d12a2e
2013-03-14T16:24:27+01:00 brutus3 syslog-ng[2221]: Error processing log message: <d> ffff880551501c00 ffffffffa10b3b24 0000000000000000 ffff8812ee62b290
2013-03-14T16:24:27+01:00 brutus3 syslog-ng[2221]: Error processing log message: <d> ffff8812ee62b290 ffff88099c331e00 ffff88099c331e00 ffff880551501cb0
2013-03-14T16:24:27+01:00 brutus3 Call Trace:
2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa0d12a2e>] cl_object_attr_lock+0xe/0x20 [obdclass]
2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa10b3b24>] osc_lock_detach+0xf4/0x190 [osc]
2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa10b3c08>] osc_lock_delete+0x48/0xc0 [osc]
2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa0d1ab65>] cl_lock_delete0+0xb5/0x1d0 [obdclass]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1add3>] cl_lock_delete+0x153/0x1a0 [obdclass]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b5846>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e33f2c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e42dda>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e478db>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b4259>] osc_lock_cancel+0xf9/0x1c0 [osc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d19645>] cl_lock_cancel0+0x75/0x160 [obdclass]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1a1eb>] cl_lock_cancel+0x13b/0x140 [obdclass]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b583a>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e4b070>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e4b5c1>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
2013-03-14T16:24:28+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:29+01:00 brutus3 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:29+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14T16:24:29+01:00 brutus3 Code: c7 80 1f d6 a0 e8 a6 25 e9 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 07 0f 1f 80 00 00 00 00 48 89 c2 <48> 8b 80 88 00 00 00 48 85 c0 75 f1 48 8b 42 48 48 83 c2 48 48
2013-03-14T16:24:29+01:00 brutus3 RIP [<ffffffffa0d128cb>] cl_object_top+0x1b/0x150 [obdclass]
2013-03-14T16:24:29+01:00 brutus3 RSP <ffff880551501ba0>
2013-03-14T16:24:29+01:00 brutus3 --[ end trace 4537c3429b809b37 ]--
2013-03-14T16:24:29+01:00 brutus3 Kernel panic - not syncing: Fatal exception
2013-03-14T16:24:29+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P D --------------- 2.6.32-279.22.1.el6.x86_64 #1
2013-03-14T16:24:30+01:00 brutus3 Call Trace:
2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814e9903>] ? panic+0xa0/0x168
2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814eda94>] ? oops_end+0xe4/0x100
2013-03-14T16:24:30+01:00 brutus3 [<ffffffff8100f19b>] ? die+0x5b/0x90
2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814ed602>] ? do_general_protection+0x152/0x160
2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814ecdd5>] ? general_protection+0x25/0x30
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d128cb>] ? cl_object_top+0x1b/0x150 [obdclass]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d12a2e>] ? cl_object_attr_lock+0xe/0x20 [obdclass]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b3b24>] ? osc_lock_detach+0xf4/0x190 [osc]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b3c08>] ? osc_lock_delete+0x48/0xc0 [osc]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d1ab65>] ? cl_lock_delete0+0xb5/0x1d0 [obdclass]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d1add3>] ? cl_lock_delete+0x153/0x1a0 [obdclass]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b5846>] ? osc_ldlm_blocking_ast+0x146/0x350 [osc]
2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0e33f2c>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e42dda>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e478db>] ? ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa10b4259>] ? osc_lock_cancel+0xf9/0x1c0 [osc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d19645>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d1a1eb>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa10b583a>] ? osc_ldlm_blocking_ast+0x13a/0x350 [osc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b070>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b5c1>] ? ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:31+01:00 brutus3 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-03-14T16:24:32+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:32+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:32+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14T16:24:32+01:00 brutus3 -----------[ cut here ]-----------
2013-03-14T16:24:32+01:00 brutus3 WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60() (Tainted: P D --------------- )
2013-03-14T16:24:32+01:00 brutus3 Hardware name: ProLiant DL585 G7
2013-03-14T16:24:32+01:00 brutus3 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) sha512_generic sha256_generic libcfs(U) netconsole configfs panfs(P)(U) autofs4 nfs fscache nfs_acl auth_rpcgss lockd sunrpc bonding 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core power_meter sg hpilo hpwdt netxen_nic microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
2013-03-14T16:24:33+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P D --------------- 2.6.32-279.22.1.el6.x86_64 #1
2013-03-14T16:24:33+01:00 brutus3 Call Trace:
2013-03-14T16:24:33+01:00 brutus3 <IRQ> [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8106a2fa>] ? warn_slowpath_null+0x1a/0x20
2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8102a26c>] ? native_smp_send_reschedule+0x5c/0x60
2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8104e048>] ? resched_task+0x68/0x80
2013-03-14T16:24:33+01:00 brutus3 [<ffffffff81053a60>] ? check_preempt_wakeup+0x1c0/0x260
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8106210b>] ? enqueue_task_fair+0xfb/0x100
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8104e0fc>] ? check_preempt_curr+0x7c/0x90
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8105f873>] ? try_to_wake_up+0x213/0x3e0
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81094d40>] ? hrtimer_wakeup+0x0/0x30
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8105fa95>] ? wake_up_process+0x15/0x20
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81094d62>] ? hrtimer_wakeup+0x22/0x30
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8109535e>] ? __run_hrtimer+0x8e/0x1a0
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81012a69>] ? read_tsc+0x9/0x20
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81095706>] ? hrtimer_interrupt+0xe6/0x250
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff814f235b>] ? smp_apic_timer_interrupt+0x6b/0x9b
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
2013-03-14T16:24:34+01:00 brutus3 <EOI> [<ffffffff81274465>] ? delay_tsc+0x45/0x80
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81274447>] ? delay_tsc+0x27/0x80
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81274416>] ? __const_udelay+0x46/0x50
2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8102a3e3>] ? native_stop_other_cpus+0x83/0xd0
2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814e9919>] ? panic+0xb6/0x168
2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814eda94>] ? oops_end+0xe4/0x100
2013-03-14T16:24:35+01:00 brutus3 [<ffffffff8100f19b>] ? die+0x5b/0x90
2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814ed602>] ? do_general_protection+0x152/0x160
2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814ecdd5>] ? general_protection+0x25/0x30
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d128cb>] ? cl_object_top+0x1b/0x150 [obdclass]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d12a2e>] ? cl_object_attr_lock+0xe/0x20 [obdclass]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b3b24>] ? osc_lock_detach+0xf4/0x190 [osc]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b3c08>] ? osc_lock_delete+0x48/0xc0 [osc]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d1ab65>] ? cl_lock_delete0+0xb5/0x1d0 [obdclass]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d1add3>] ? cl_lock_delete+0x153/0x1a0 [obdclass]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b5846>] ? osc_ldlm_blocking_ast+0x146/0x350 [osc]
2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0e33f2c>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e42dda>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e478db>] ? ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa10b4259>] ? osc_lock_cancel+0xf9/0x1c0 [osc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d19645>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d1a1eb>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa10b583a>] ? osc_ldlm_blocking_ast+0x13a/0x350 [osc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b070>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b5c1>] ? ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:36+01:00 brutus3 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
2013-03-14T16:24:37+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:37+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
2013-03-14T16:24:37+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-14T16:24:37+01:00 brutus3 --[ end trace 4537c3429b809b38 ]--
2013-03-14T16:24:37+01:00 brutus3 panic occurred, switching back to text console

Unfortunately I have no idea what process caused the panic: the affected node is a login node and there were about 50 people logged in, so I have no easy way to reproduce the crash :-/

The lustre kernel module was compiled from the v2_3_61_0 git tag.
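
For reference, the assertion that fires is the sanity check in lu_object_top(), which returns the top layer of a layered object from its header. A roughly paraphrased sketch of that inline helper from lustre/include/lu_object.h (exact line numbers and helper names may differ between versions):

static inline struct lu_object *lu_object_top(struct lu_object_header *h)
{
        /* A live object header always keeps at least one layer linked on
         * loh_layers; an empty list means the object has already been torn
         * down, so reaching this point implies a use of a freed object. */
        LASSERT(!list_empty(&h->loh_layers));
        return container_of0(h->loh_layers.next, struct lu_object, lo_linkage);
}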



 Comments   
Comment by Peter Jones [ 17/Mar/13 ]

Adrian

Do I understand correctly that you are running a pre-release version of 2.4 in production?

Peter

Comment by Adrian Ulrich (Inactive) [ 17/Mar/13 ]

Yes, our compute nodes/clients are running git versions of the lustre client (the servers are running stock 2.2.0 - we will upgrade them to 2.4.0 after the release).

I am aware that this might not be a good idea (well, someone has to test it!) - but the 2_3 git releases turned out to be MUCH more stable than the official 2.2 and 2.3 client releases. Our users managed to crash 2.2 client nodes multiple times per day; with 2.3 we still had about 5-6 kernel panics per week, while we are down to ~1 panic per week with the current git version.

Comment by Peter Jones [ 17/Mar/13 ]

Adrian

As long as you are aware of the risks of running pre-release software then of course I am delighted that we are able to get feedback from a real production environment - 2.4 will be a better release for it.

While the focus of feature releases is always the new features provided, we also include all known bugfixes and the vast majority of the issues exposed by sites running 2.x releases have been issues in the underlying 2.0 code that we have built upon, rather than regressions associated with the new features. So, while I am disappointed to hear that you have had poor stability with 2.2 and 2.3 (others have reported a far better experience), I am not surprised to hear that things have been improving.

Do you mind if I mention publicly (on updates to the mailing lists, in presentations about Lustre 2.4) that ETHZ is doing this?

Oleg

Could you please review this report and advise next steps? Is there enough to work with here? If not, can you advise what Adrian should collect in the event of a future recurrence?

Thanks

Peter

Comment by Adrian Ulrich (Inactive) [ 17/Mar/13 ]

> Do you mind if I mention publicly (on updates to the mailing lists, in presentations about Lustre 2.4) that ETHZ is doing this?

I don't mind: That's fine with me.

Comment by Peter Jones [ 17/Mar/13 ]

Great - thanks Adrian!

Comment by Andreas Dilger [ 18/Mar/13 ]

Jinshan, can you please take a look at this to see if anything is obvious?

Comment by Jinshan Xiong (Inactive) [ 18/Mar/13 ]

Obviously the object was already freed when this issue happened. Hmm, did you set up crashdump on the machine, or is it impossible to collect a Lustre log?

Comment by Oleg Drokin [ 19/Mar/13 ]

I hit a very similar bug last Sunday. I have a crashdump in /exports/crashdumps/192.168.10.218-2013-03-17-21\:29\:49/

[363112.577950] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[363112.578318] last sysfs file: /sys/devices/system/cpu/possible
[363112.578589] CPU 1
[363112.578637] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[363112.580600]
[363112.580600] Pid: 451, comm: ldlm_bl_45 Not tainted 2.6.32-debug #6 Bochs Bochs
[363112.580600] RIP: 0010:[<ffffffffa0f9c90b>]  [<ffffffffa0f9c90b>] cl_object_top+0x1b/0x150 [obdclass]
[363112.580600] RSP: 0018:ffff88009e0edb90  EFLAGS: 00010206
[363112.580600] RAX: 000130b38d4c0000 RBX: ffff88000bfb1db0 RCX: ffff880080abef60
[363112.580600] RDX: 000130b38d4c0000 RSI: ffffffffa04b1940 RDI: ffff88004ee8deb0
[363112.580600] RBP: ffff88009e0edba0 R08: 0000000000000000 R09: 0000000000000000
[363112.580600] R10: 0000000000000003 R11: 000000000000000f R12: ffff8800a89e6f30
[363112.580600] R13: ffff88003e04df50 R14: ffff88004ee8deb0 R15: ffff8800790bbc18
[363112.580600] FS:  00007f8c05205700(0000) GS:ffff880006280000(0000) knlGS:0000000000000000
[363112.580600] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[363112.580600] CR2: 00007ff2a83e2cf6 CR3: 0000000072edd000 CR4: 00000000000006e0
[363112.580600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[363112.580600] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[363112.587593] Process ldlm_bl_45 (pid: 451, threadinfo ffff88009e0ec000, task ffff880088694440)
[363112.587593] Stack:
[363112.587593]  ffff88009e0edbb0 ffff88000bfb1db0 ffff88009e0edbb0 ffffffffa0f9ca4e
[363112.587593] <d> ffff88009e0edbf0 ffffffffa0488788 0000000000000002 ffff88003e04df50 
[363112.587593] <d> ffff88003e04df50 ffff8800a89e6f30 ffff8800a89e6f30 ffff88009e0edca0 
[363112.587593] Call Trace:
[363112.587593]  [<ffffffffa0f9ca4e>] cl_object_attr_lock+0xe/0x20 [obdclass]
[363112.587593]  [<ffffffffa0488788>] osc_lock_detach+0xe8/0x1a0 [osc]
[363112.587593]  [<ffffffffa0488888>] osc_lock_delete+0x48/0xc0 [osc]
[363112.587593]  [<ffffffffa0fa4ce5>] cl_lock_delete0+0xb5/0x1d0 [obdclass]
[363112.587593]  [<ffffffffa0fa4f53>] cl_lock_delete+0x153/0x1a0 [obdclass]
[363112.587593]  [<ffffffffa048a4f6>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
[363112.587593]  [<ffffffffa10c906c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
[363112.587593]  [<ffffffffa10e30da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
[363112.587593]  [<ffffffffa10e7bd0>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
[363112.587593]  [<ffffffffa0488ede>] osc_lock_cancel+0xfe/0x1c0 [osc]
[363112.587593]  [<ffffffffa0fa37c5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[363112.587593]  [<ffffffffa0fa436b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[363112.587593]  [<ffffffffa048a4ea>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
[363112.587593]  [<ffffffffa10eb970>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[363112.587593]  [<ffffffffa10ebec9>] ldlm_bl_thread_main+0x289/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[363112.587593] Code: c7 a0 e2 fe a0 e8 e6 95 e9 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 07 0f 1f 80 00 00 00 00 48 89 c2 <48> 8b 80 b0 00 00 00 48 85 c0 75 f1 48 8b 42 48 48 83 c2 48 48

Comment by Oleg Drokin [ 19/Mar/13 ]

Adrian, with panics once per week, do you happen to have any other interesting panics that you can share with us?

Comment by Adrian Ulrich (Inactive) [ 20/Mar/13 ]

@ Jinshan Xiong
> Hmm.. did you set up crashdump on the machine or it's impossible to collect lustre log?

Unfortunately, crashdump was not enabled on this kind of node. It is now enabled, and I should be able to provide a crashdump if it happens again.

@ Oleg Drokin
> Adrian, with panics once per week, any other interesting panics you happen to have that you can share with us?

No, I don't have any other interesting panics right now.
All other 'recent' crashes happened on nodes with some older version of the lustre client (and in most cases the issue was already fixed in the -git version).
This was our first crash with 2_3_61.

Comment by Jinshan Xiong (Inactive) [ 20/Mar/13 ]

Hi Adrian, I'm going to work out a debug patch. From the symptoms so far, the top object was freed while a sublock was still being canceled. This must be a race, but I need more information.
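
To illustrate the shape of that race, here is a purely hypothetical user-space sketch with made-up names - not the Lustre code paths and not the eventual patch. The idea: if the thread cancelling a sublock does not hold its own reference on the parent object, another thread can drop the last reference and free the object while the cancel path is still dereferencing it, which matches the freed-object symptom in the traces above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_object {
        atomic_int      refcount;
        pthread_mutex_t attr_lock;      /* stands in for cl_object_attr_lock() */
};

static struct toy_object *toy_get(struct toy_object *o)
{
        atomic_fetch_add(&o->refcount, 1);
        return o;
}

static void toy_put(struct toy_object *o)
{
        /* Last reference frees the object; any later access is use-after-free. */
        if (atomic_fetch_sub(&o->refcount, 1) == 1) {
                pthread_mutex_destroy(&o->attr_lock);
                free(o);
        }
}

static void cancel_sublock(struct toy_object *parent)
{
        /* Safe only because the caller pinned the parent first; without that
         * extra reference this could run against freed memory - the kind of
         * race described above. */
        pthread_mutex_lock(&parent->attr_lock);
        /* ... update attributes ... */
        pthread_mutex_unlock(&parent->attr_lock);
}

int main(void)
{
        struct toy_object *o = calloc(1, sizeof(*o));

        atomic_init(&o->refcount, 1);
        pthread_mutex_init(&o->attr_lock, NULL);

        toy_get(o);             /* pin the parent before cancelling */
        cancel_sublock(o);
        toy_put(o);             /* drop the pin */
        toy_put(o);             /* drop the original reference; frees o */
        return 0;
}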

Comment by Jinshan Xiong (Inactive) [ 21/Mar/13 ]

I've found the root cause of this problem and will compose a patch.

Comment by Jinshan Xiong (Inactive) [ 22/Mar/13 ]

Patch is at: http://review.whamcloud.com/5812

Comment by Adrian Ulrich (Inactive) [ 25/Mar/13 ]

Thanks! I'll rebuild our client-RPM with the patch included ASAP.

Comment by Peter Jones [ 28/Mar/13 ]

Landed for 2.4
