  1. Lustre
  2. LU-2970

ASSERTION( !list_empty(&h->loh_layers) ) failed, followed by a kernel panic

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
      Lustre Client: v2_3_61_0 (git version)
    • 3
    • 7239

    Description

      One of our Lustre clients crashed yesterday with the following kernel panic:

      2013-03-14T16:24:25+01:00 brutus3 LustreError: 4488:0:(lu_object.h:759:lu_object_top()) ASSERTION( !list_empty(&h->loh_layers) ) failed:
      2013-03-14T16:24:25+01:00 brutus3 general protection fault: 0000 [#1] SMP
      2013-03-14T16:24:25+01:00 brutus3 last sysfs file: /sys/devices/system/cpu/cpu47/cache/index2/shared_cpu_map
      2013-03-14T16:24:25+01:00 brutus3 CPU 28
      2013-03-14T16:24:25+01:00 brutus3 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) sha512_generic sha256_generic libcfs(U) netconsole configfs panfs(P)(U) autofs4 nfs fscache nfs_acl auth_rpcgss lockd sunrpc bonding 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core power_meter sg hpilo hpwdt netxen_nic microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      2013-03-14T16:24:26+01:00 brutus3
      2013-03-14T16:24:26+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P --------------- 2.6.32-279.22.1.el6.x86_64 #1 HP ProLiant DL585 G7
      2013-03-14T16:24:26+01:00 brutus3 RIP: 0010:[<ffffffffa0d128cb>] [<ffffffffa0d128cb>] cl_object_top+0x1b/0x150 [obdclass]
      2013-03-14T16:24:26+01:00 brutus3 RSP: 0018:ffff880551501ba0 EFLAGS: 00010206
      2013-03-14T16:24:26+01:00 brutus3 RAX: 5a5a5a5a5a5a5a5a RBX: ffff880b44368400 RCX: ffff8801824afe08
      2013-03-14T16:24:26+01:00 brutus3 RDX: 5a5a5a5a5a5a5a5a RSI: ffffffffa10dd860 RDI: ffff880388e153c8
      2013-03-14T16:24:26+01:00 brutus3 RBP: ffff880551501bb0 R08: 0000000000000000 R09: 0000000000000000
      2013-03-14T16:24:26+01:00 brutus3 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88099c331e00
      2013-03-14T16:24:26+01:00 brutus3 R13: ffff8812ee62b290 R14: ffff880388e153c8 R15: ffff8811ca34ebc8
      2013-03-14T16:24:27+01:00 brutus3 FS: 00007f248980f700(0000) GS:ffff88044e440000(0000) knlGS:0000000008ec3830
      2013-03-14T16:24:27+01:00 brutus3 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-03-14T16:24:27+01:00 brutus3 CR2: 00000000006d3a30 CR3: 0000001835b46000 CR4: 00000000000006e0
      2013-03-14T16:24:27+01:00 brutus3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-03-14T16:24:27+01:00 brutus3 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-03-14T16:24:27+01:00 brutus3 Process ldlm_bl_101 (pid: 9597, threadinfo ffff880551500000, task ffff8808351d6040)
      2013-03-14T16:24:27+01:00 brutus3 Stack:
      2013-03-14T16:24:27+01:00 brutus3 ffff880551501bc0 ffff880b44368400 ffff880551501bc0 ffffffffa0d12a2e
      2013-03-14T16:24:27+01:00 brutus3 syslog-ng[2221]: Error processing log message: <d> ffff880551501c00 ffffffffa10b3b24 0000000000000000 ffff8812ee62b290
      2013-03-14T16:24:27+01:00 brutus3 syslog-ng[2221]: Error processing log message: <d> ffff8812ee62b290 ffff88099c331e00 ffff88099c331e00 ffff880551501cb0
      2013-03-14T16:24:27+01:00 brutus3 Call Trace:
      2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa0d12a2e>] cl_object_attr_lock+0xe/0x20 [obdclass]
      2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa10b3b24>] osc_lock_detach+0xf4/0x190 [osc]
      2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa10b3c08>] osc_lock_delete+0x48/0xc0 [osc]
      2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa0d1ab65>] cl_lock_delete0+0xb5/0x1d0 [obdclass]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1add3>] cl_lock_delete+0x153/0x1a0 [obdclass]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b5846>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e33f2c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e42dda>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e478db>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b4259>] osc_lock_cancel+0xf9/0x1c0 [osc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d19645>] cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1a1eb>] cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa10b583a>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e4b070>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0e4b5c1>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      2013-03-14T16:24:28+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:29+01:00 brutus3 [<ffffffff8100c0ca>] child_rip+0xa/0x20
      2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:29+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:29+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14T16:24:29+01:00 brutus3 Code: c7 80 1f d6 a0 e8 a6 25 e9 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 07 0f 1f 80 00 00 00 00 48 89 c2 <48> 8b 80 88 00 00 00 48 85 c0 75 f1 48 8b 42 48 48 83 c2 48 48
      2013-03-14T16:24:29+01:00 brutus3 RIP [<ffffffffa0d128cb>] cl_object_top+0x1b/0x150 [obdclass]
      2013-03-14T16:24:29+01:00 brutus3 RSP <ffff880551501ba0>
      2013-03-14T16:24:29+01:00 brutus3 --[ end trace 4537c3429b809b37 ]--
      2013-03-14T16:24:29+01:00 brutus3 Kernel panic - not syncing: Fatal exception
      2013-03-14T16:24:29+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P D --------------- 2.6.32-279.22.1.el6.x86_64 #1
      2013-03-14T16:24:30+01:00 brutus3 Call Trace:
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814e9903>] ? panic+0xa0/0x168
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814eda94>] ? oops_end+0xe4/0x100
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffff8100f19b>] ? die+0x5b/0x90
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814ed602>] ? do_general_protection+0x152/0x160
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffff814ecdd5>] ? general_protection+0x25/0x30
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d128cb>] ? cl_object_top+0x1b/0x150 [obdclass]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d12a2e>] ? cl_object_attr_lock+0xe/0x20 [obdclass]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b3b24>] ? osc_lock_detach+0xf4/0x190 [osc]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b3c08>] ? osc_lock_delete+0x48/0xc0 [osc]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d1ab65>] ? cl_lock_delete0+0xb5/0x1d0 [obdclass]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0d1add3>] ? cl_lock_delete+0x153/0x1a0 [obdclass]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa10b5846>] ? osc_ldlm_blocking_ast+0x146/0x350 [osc]
      2013-03-14T16:24:30+01:00 brutus3 [<ffffffffa0e33f2c>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e42dda>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e478db>] ? ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa10b4259>] ? osc_lock_cancel+0xf9/0x1c0 [osc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d19645>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0d1a1eb>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa10b583a>] ? osc_ldlm_blocking_ast+0x13a/0x350 [osc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b070>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b5c1>] ? ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:31+01:00 brutus3 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      2013-03-14T16:24:32+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:32+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:32+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14T16:24:32+01:00 brutus3 -----------[ cut here ]-----------
      2013-03-14T16:24:32+01:00 brutus3 WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60() (Tainted: P D --------------- )
      2013-03-14T16:24:32+01:00 brutus3 Hardware name: ProLiant DL585 G7
      2013-03-14T16:24:32+01:00 brutus3 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) lnet(U) sha512_generic sha256_generic libcfs(U) netconsole configfs panfs(P)(U) autofs4 nfs fscache nfs_acl auth_rpcgss lockd sunrpc bonding 8021q garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core power_meter sg hpilo hpwdt netxen_nic microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      2013-03-14T16:24:33+01:00 brutus3 Pid: 9597, comm: ldlm_bl_101 Tainted: P D --------------- 2.6.32-279.22.1.el6.x86_64 #1
      2013-03-14T16:24:33+01:00 brutus3 Call Trace:
      2013-03-14T16:24:33+01:00 brutus3 <IRQ> [<ffffffff8106a2a7>] ? warn_slowpath_common+0x87/0xc0
      2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8106a2fa>] ? warn_slowpath_null+0x1a/0x20
      2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8102a26c>] ? native_smp_send_reschedule+0x5c/0x60
      2013-03-14T16:24:33+01:00 brutus3 [<ffffffff8104e048>] ? resched_task+0x68/0x80
      2013-03-14T16:24:33+01:00 brutus3 [<ffffffff81053a60>] ? check_preempt_wakeup+0x1c0/0x260
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8106210b>] ? enqueue_task_fair+0xfb/0x100
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8104e0fc>] ? check_preempt_curr+0x7c/0x90
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8105f873>] ? try_to_wake_up+0x213/0x3e0
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81094d40>] ? hrtimer_wakeup+0x0/0x30
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8105fa95>] ? wake_up_process+0x15/0x20
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81094d62>] ? hrtimer_wakeup+0x22/0x30
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8109535e>] ? __run_hrtimer+0x8e/0x1a0
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81012a69>] ? read_tsc+0x9/0x20
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81095706>] ? hrtimer_interrupt+0xe6/0x250
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff814f235b>] ? smp_apic_timer_interrupt+0x6b/0x9b
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
      2013-03-14T16:24:34+01:00 brutus3 <EOI> [<ffffffff81274465>] ? delay_tsc+0x45/0x80
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81274447>] ? delay_tsc+0x27/0x80
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff81274416>] ? __const_udelay+0x46/0x50
      2013-03-14T16:24:34+01:00 brutus3 [<ffffffff8102a3e3>] ? native_stop_other_cpus+0x83/0xd0
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814e9919>] ? panic+0xb6/0x168
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814eda94>] ? oops_end+0xe4/0x100
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffff8100f19b>] ? die+0x5b/0x90
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814ed602>] ? do_general_protection+0x152/0x160
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffff814ecdd5>] ? general_protection+0x25/0x30
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d128cb>] ? cl_object_top+0x1b/0x150 [obdclass]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d12a2e>] ? cl_object_attr_lock+0xe/0x20 [obdclass]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b3b24>] ? osc_lock_detach+0xf4/0x190 [osc]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b3c08>] ? osc_lock_delete+0x48/0xc0 [osc]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d1ab65>] ? cl_lock_delete0+0xb5/0x1d0 [obdclass]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0d1add3>] ? cl_lock_delete+0x153/0x1a0 [obdclass]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa10b5846>] ? osc_ldlm_blocking_ast+0x146/0x350 [osc]
      2013-03-14T16:24:35+01:00 brutus3 [<ffffffffa0e33f2c>] ? ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e42dda>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e478db>] ? ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa10b4259>] ? osc_lock_cancel+0xf9/0x1c0 [osc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d19645>] ? cl_lock_cancel0+0x75/0x160 [obdclass]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0d1a1eb>] ? cl_lock_cancel+0x13b/0x140 [obdclass]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa10b583a>] ? osc_ldlm_blocking_ast+0x13a/0x350 [osc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b070>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b5c1>] ? ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:36+01:00 brutus3 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      2013-03-14T16:24:37+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:37+01:00 brutus3 [<ffffffffa0e4b340>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
      2013-03-14T16:24:37+01:00 brutus3 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      2013-03-14T16:24:37+01:00 brutus3 --[ end trace 4537c3429b809b38 ]--
      2013-03-14T16:24:37+01:00 brutus3 panic occurred, switching back to text console

      Unfortunately, I have no idea which process caused the panic. The affected node is a login node and about 50 people were logged in at the time, so I have no easy way to reproduce the crash :-/

      The Lustre kernel module was compiled from the v2_3_61_0 git tag.
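
      The oops header in the log above reports the faulting thread as ldlm_bl_101 (pid 9597), and the first call trace gives the path that led to cl_object_top. As a rough aid for reading such a dump, the sketch below pulls the first "Call Trace:" block out of syslog-prefixed lines. The function name first_call_trace and the regular expression are illustrative assumptions, not part of Lustre or of any kernel tool, and the pattern only covers frames in the format shown in this report.

```python
import re

# Matches resolved frames such as
#   "[<ffffffffa0d12a2e>] cl_object_attr_lock+0xe/0x20 [obdclass]".
# An optional "? " after the address marks a frame the unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def first_call_trace(lines):
    """Return (function, module, reliable) tuples for the first 'Call Trace:' block."""
    frames = []
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            if in_trace:
                break  # a second trace starts; stop at the first one
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # the first non-frame line (e.g. "Code: ...") ends the block
        frames.append(
            (match["func"], match["module"], match["unreliable"] is None)
        )
    return frames


# A few lines copied from the log in this report.
sample = [
    "2013-03-14T16:24:27+01:00 brutus3 Call Trace:",
    "2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa0d12a2e>] cl_object_attr_lock+0xe/0x20 [obdclass]",
    "2013-03-14T16:24:27+01:00 brutus3 [<ffffffffa10b3b24>] osc_lock_detach+0xf4/0x190 [osc]",
    "2013-03-14T16:24:28+01:00 brutus3 [<ffffffffa0d1392d>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]",
    "2013-03-14T16:24:29+01:00 brutus3 [<ffffffff8100c0ca>] child_rip+0xa/0x20",
    "2013-03-14T16:24:29+01:00 brutus3 Code: c7 80 1f d6",
]

frames = first_call_trace(sample)
for func, module, reliable in frames:
    note = "" if reliable else "(unreliable)"
    print(f"{func:24} {module or 'kernel':10} {note}")
```

      Frames without a bracketed module name are reported with module None and printed as "kernel". Stopping at the first non-frame line keeps the later panic and WARNING traces, which repeat the same chain, out of the result.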

          People

            jay Jinshan Xiong (Inactive)
            ethz.support ETHz Support (Inactive)