[LU-1979] SWL - MDS crash after recovery osd_iam_lfix.c:190:iam_lfix_init()) Wrong magic in node 81689 (#56): 0x0 != 0x1976 or wrong count Created: 19/Sep/12 Updated: 20/Sep/12 Resolved: 20/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
LLNL Hyperion |
||
| Severity: | 3 |
| Rank (Obsolete): | 6318 |
| Description |
|
The MDS crashes hard after completing recovery.
2012-09-19 07:40:19 Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000e_UUID now active, resetting orphans
Backtrace:
bt
PID: 4439 TASK: ffff88032a1ee040 CPU: 1 COMMAND: "mdt02_000"
#0 [ffff8802cf7756f0] machine_kexec at ffffffff8103281b
#1 [ffff8802cf775750] crash_kexec at ffffffff810ba792
#2 [ffff8802cf775820] oops_end at ffffffff81501700
#3 [ffff8802cf775850] no_context at ffffffff81043bab
#4 [ffff8802cf7758a0] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff8802cf7758f0] bad_area_nosemaphore at ffffffff81043f03
#6 [ffff8802cf775900] __do_page_fault at ffffffff81044661
#7 [ffff8802cf775a20] do_page_fault at ffffffff815036de
#8 [ffff8802cf775a50] page_fault at ffffffff81500a95
[exception RIP: lu_context_key_get+27]
RIP: ffffffffa072f00b RSP: ffff8802cf775b00 RFLAGS: 00010246
RAX: 0000000000000015 RBX: ffff88014362c8c0 RCX: ffffffffa076546f
RDX: 0000000000000000 RSI: ffffffffa0ee14e0 RDI: ffff880116f9f4c0
RBP: ffff8802cf775b00 R8: fffffffffffffffe R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000004 R12: ffff8802cf775b60
R13: ffff880116f9f4c0 R14: ffffffffa076546f R15: ffff88012f4436f0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff8802cf775b08] osd_xattr_get at ffffffffa0ebaf8f [osd_ldiskfs]
#10 [ffff8802cf775b58] dt_version_get at ffffffffa07330d4 [obdclass]
#11 [ffff8802cf775b88] mdt_obj_version_get at ffffffffa0e297cc [mdt]
#12 [ffff8802cf775bb8] mdt_version_get_check_save at ffffffffa0e29d0f [mdt]
#13 [ffff8802cf775be8] mdt_md_create at ffffffffa0e2a03d [mdt]
#14 [ffff8802cf775c68] mdt_reint_create at ffffffffa0e2a6b3 [mdt]
#15 [ffff8802cf775ca8] mdt_reint_rec at ffffffffa0e28151 [mdt]
#16 [ffff8802cf775cc8] mdt_reint_internal at ffffffffa0e219aa [mdt]
#17 [ffff8802cf775d18] mdt_reint at ffffffffa0e21cf4 [mdt]
#18 [ffff8802cf775d38] mdt_handle_common at ffffffffa0e15802 [mdt]
#19 [ffff8802cf775d88] mdt_regular_handle at ffffffffa0e166f5 [mdt]
#20 [ffff8802cf775d98] ptlrpc_server_handle_request at ffffffffa08b199d [ptlrpc]
#21 [ffff8802cf775e98] ptlrpc_main at ffffffffa08b2f89 [ptlrpc]
#22 [ffff8802cf775f48] kernel_thread at ffffffff8100c14a
|
| Comments |
| Comment by Peter Jones [ 19/Sep/12 ] |
|
Fanyong, could you please comment on this one too? Peter |
| Comment by Cliff White (Inactive) [ 19/Sep/12 ] |
|
The MDS is now in a state where all it does is crash, every time recovery completes.
2012-09-19 08:14:08 BUG: unable to handle kernel paging request at 00000000cee88efc
2012-09-19 08:14:08 IP: [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:08 PGD 0
2012-09-19 08:14:08 Oops: 0000 [#1]
2012-09-19 08:14:08 LustreError: 4138:0:(llog_lvfs.c:430:llog_lvfs_next_block()) Cant read llog block at log id 7340386/1716600893 offset 2048000
2012-09-19 08:14:08 LustreError: 4551:0:(llog_lvfs.c:430:llog_lvfs_next_block()) Cant read llog block at log id 7340336/1716602012 offset 2056192
2012-09-19 08:14:08 SMP
2012-09-19 08:14:08 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
2012-09-19 08:14:08 CPU 3
2012-09-19 08:14:08 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
2012-09-19 08:14:08
2012-09-19 08:14:08 Pid: 4610, comm: mdt02_029 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
2012-09-19 08:14:08 RIP: 0010:[<ffffffffa086168e>] [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:08 RSP: 0018:ffff8801613b3620 EFLAGS: 00010202
2012-09-19 08:14:08 RAX: ffffffffa09071d1 RBX: ffff88015f442d40 RCX: 0000000000000000
2012-09-19 08:14:08 RDX: ffffffffa09075ef RSI: ffffffffa092b4e0 RDI: ffff88015f442d40
2012-09-19 08:14:08 RBP: ffff8801613b3740 R08: 00000000fffffffb R09: 00000000fffffffe
2012-09-19 08:14:08 R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffa09075ef
2012-09-19 08:14:08 R13: ffffffffa092b4e0 R14: 00000000cee88ea8 R15: 0000000000000000
2012-09-19 08:14:08 FS: 00002aaaab47e700(0000) GS:ffff880028220000(0000) knlGS:0000000000000000
2012-09-19 08:14:08 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-09-19 08:14:08 CR2: 00000000cee88efc CR3: 000000032ab2f000 CR4: 00000000000006e0
2012-09-19 08:14:08 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-09-19 08:14:08 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-09-19 08:14:08 Process mdt02_029 (pid: 4610, threadinfo ffff8801613b2000, task ffff88013a370ae0)
2012-09-19 08:14:08 Stack:
2012-09-19 08:14:08 24f793cecee88ecb 0000000000000002 0000120200000001 0000052a00000000
2012-09-19 08:14:08 <d> ffffffffa09075c9 ffffffffa09075c9 000000020000346e 0000000000002bf9
2012-09-19 08:14:08 <d> 0000000000000002 ffff880100000038 ffffffffa09075eb 0000000004004000
2012-09-19 08:14:08 Call Trace:
2012-09-19 08:14:08 [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:08 [<ffffffffa086979c>] ldlm_resource_dump+0x12c/0x480 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0861c3f>] ldlm_granted_list_add_lock+0x5f/0x360 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa08651bc>] ldlm_grant_lock+0x38c/0x6f0 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa08927c2>] ldlm_process_inodebits_lock+0x212/0x400 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0865925>] ldlm_lock_enqueue+0x405/0x8f0 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa08848c9>] ldlm_cli_enqueue_local+0x179/0x560 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0884cb0>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0f65ab0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f687c0>] mdt_object_lock+0x320/0xb70 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f65ab0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2012-09-19 08:14:08 [<ffffffffa0884cb0>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0f69071>] mdt_object_find_lock+0x61/0x170 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f96fa9>] mdt_reint_open+0x499/0x18a0 [mdt]
2012-09-19 08:14:08 [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:08 [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f7af7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f77191>] mdt_intent_policy+0x371/0x6a0 [mdt]
2012-09-19 08:14:08 [<ffffffffa0865881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa088d9bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
2012-09-19 08:14:08 [<ffffffffa0f77506>] mdt_enqueue+0x46/0x130 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-19 08:14:08 [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-19 08:14:09 [<ffffffffa08bd99d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
2012-09-19 08:14:09 [<ffffffffa08b4f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-19 08:14:09 [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:09 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-19 08:14:09 [<ffffffffa08bef89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
2012-09-19 08:14:09 [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-19 08:14:09 [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09 [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09 [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-19 08:14:09 Code: 48 c7 c0 d1 71 90 a0 74 19 49 8b 87 d8 00 00 00 48 85 c0 0f 84 34 02 00 00 48 8b 78 18 e8 5b 2b b4 ff 4d 85 f6 0f 84 a2 04 00 00 <41> 8b 76 54 83 fe 0c 0f 84 45 02 00 00 83 fe 0d 0f 84 fc 00 00
2012-09-19 08:14:09 RIP [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:09 RSP <ffff8801613b3620>
2012-09-19 08:14:09 CR2: 00000000cee88efc
2012-09-19 08:14:09 Initializing cgroup subsys cpuset
2012-09-19 08:14:09 Initializing cgroup subsys cpu
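As a cross-check on the oops above (a reader's sketch, not part of the original report): the bytes marked `<41>` in the `Code:` line, `41 8b 76 54`, decode as a 32-bit load from `0x54(%r14)` into `%esi`, so the faulting address should equal R14 plus 0x54. The snippet below redoes that arithmetic with values copied from the log. The variable names are illustrative only, and the structure field that lives at offset 0x54 is not identified here.

```python
# Values copied from the oops log above.
r14 = 0x00000000CEE88EA8   # R14 at the time of the fault
cr2 = 0x00000000CEE88EFC   # CR2: the address that could not be paged in

# Bytes at the faulting RIP, starting at the <41> marker in the Code: line.
insn = bytes.fromhex("418b7654")
rex, opcode, modrm, disp8 = insn

# 0x41 is a REX prefix with only the B bit set; 0x8b is MOV r32, r/m32.
assert rex == 0x41 and opcode == 0x8B

mod = modrm >> 6                       # 1 -> base register plus 8-bit displacement
reg = (modrm >> 3) & 7                 # 6 -> destination register %esi
rm = (modrm & 7) | ((rex & 1) << 3)    # 14 -> base register %r14

# disp8 is below 0x80, so no sign extension is needed here.
fault_addr = r14 + disp8
print(hex(fault_addr), fault_addr == cr2)
```

The computed address matches CR2. R14 itself does not look like a kernel pointer (the valid pointers in the same register dump start with `ffff88...` or `ffffffffa0...`), which would be consistent with `_ldlm_lock_debug` following a corrupted or stale pointer, though the log alone does not show where that value came from.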
|
| Comment by Cliff White (Inactive) [ 19/Sep/12 ] |
|
The system now appears to be in a state where it cannot complete recovery.
crash> bt
PID: 4610 TASK: ffff88013a370ae0 CPU: 3 COMMAND: "mdt02_029"
#0 [ffff8801613b3210] machine_kexec at ffffffff8103281b
#1 [ffff8801613b3270] crash_kexec at ffffffff810ba792
#2 [ffff8801613b3340] oops_end at ffffffff81501700
#3 [ffff8801613b3370] no_context at ffffffff81043bab
#4 [ffff8801613b33c0] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff8801613b3410] bad_area_nosemaphore at ffffffff81043f03
#6 [ffff8801613b3420] __do_page_fault at ffffffff81044661
#7 [ffff8801613b3540] do_page_fault at ffffffff815036de
#8 [ffff8801613b3570] page_fault at ffffffff81500a95
[exception RIP: _ldlm_lock_debug+126]
RIP: ffffffffa086168e RSP: ffff8801613b3620 RFLAGS: 00010202
RAX: ffffffffa09071d1 RBX: ffff88015f442d40 RCX: 0000000000000000
RDX: ffffffffa09075ef RSI: ffffffffa092b4e0 RDI: ffff88015f442d40
RBP: ffff8801613b3740 R8: 00000000fffffffb R9: 00000000fffffffe
R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffa09075ef
R13: ffffffffa092b4e0 R14: 00000000cee88ea8 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff8801613b3748] ldlm_resource_dump at ffffffffa086979c [ptlrpc]
#10 [ffff8801613b37a8] ldlm_granted_list_add_lock at ffffffffa0861c3f [ptlrpc]
#11 [ffff8801613b37d8] ldlm_grant_lock at ffffffffa08651bc [ptlrpc]
#12 [ffff8801613b3838] ldlm_process_inodebits_lock at ffffffffa08927c2 [ptlrpc]
#13 [ffff8801613b38b8] ldlm_lock_enqueue at ffffffffa0865925 [ptlrpc]
#14 [ffff8801613b3918] ldlm_cli_enqueue_local at ffffffffa08848c9 [ptlrpc]
#15 [ffff8801613b39a8] mdt_object_lock at ffffffffa0f687c0 [mdt]
#16 [ffff8801613b3a48] mdt_object_find_lock at ffffffffa0f69071 [mdt]
#17 [ffff8801613b3a78] mdt_reint_open at ffffffffa0f96fa9 [mdt]
#18 [ffff8801613b3b48] mdt_reint_rec at ffffffffa0f81151 [mdt]
#19 [ffff8801613b3b68] mdt_reint_internal at ffffffffa0f7a9aa [mdt]
#20 [ffff8801613b3bb8] mdt_intent_reint at ffffffffa0f7af7d [mdt]
#21 [ffff8801613b3c08] mdt_intent_policy at ffffffffa0f77191 [mdt]
#22 [ffff8801613b3c48] ldlm_lock_enqueue at ffffffffa0865881 [ptlrpc]
#23 [ffff8801613b3ca8] ldlm_handle_enqueue0 at ffffffffa088d9bf [ptlrpc]
#24 [ffff8801613b3d18] mdt_enqueue at ffffffffa0f77506 [mdt]
#25 [ffff8801613b3d38] mdt_handle_common at ffffffffa0f6e802 [mdt]
#26 [ffff8801613b3d88] mdt_regular_handle at ffffffffa0f6f6f5 [mdt]
#27 [ffff8801613b3d98] ptlrpc_server_handle_request at ffffffffa08bd99d [ptlrpc]
#28 [ffff8801613b3e98] ptlrpc_main at ffffffffa08bef89 [ptlrpc]
#29 [ffff8801613b3f48] kernel_thread at ffffffff8100c14a
|
| Comment by nasf (Inactive) [ 20/Sep/12 ] |
|
This is a duplicate of |