[LU-1979] SWL - MDS crash after recovery osd_iam_lfix.c:190:iam_lfix_init()) Wrong magic in node 81689 (#56): 0x0 != 0x1976 or wrong count Created: 19/Sep/12  Updated: 20/Sep/12  Resolved: 20/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

LLNL Hyperion


Severity: 3
Rank (Obsolete): 6318

 Description   

MDS crashes hard after completing recovery.

2012-09-19 07:40:19 Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000e_UUID now active, resetting orphans
2012-09-19 07:40:19 Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0026_UUID now active, resetting orphans
2012-09-19 07:40:19 Lustre: Skipped 16 previous similar messages
2012-09-19 07:40:28 LustreError: 4748:0:(osd_iam_lfix.c:190:iam_lfix_init()) Wrong magic in node 81689 (#56): 0x0 != 0x1976 or wrong count: 0Initializing cgroup subsys cpuset
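
For context, the message above comes from an on-disk consistency check: the IAM leaf node read from the MDT device carries magic 0x0 and record count 0 instead of the expected magic 0x1976, i.e. the block is effectively empty or corrupted. Below is a minimal illustrative sketch of that kind of header validation; it is not the actual osd_iam_lfix.c source, and the struct, field, and function names are hypothetical (only the expected magic value 0x1976 is taken from the log).

/*
 * Illustrative sketch only -- not the actual osd_iam_lfix.c code.
 * Shows the kind of leaf-header validation whose failure produces
 * "Wrong magic in node ...: 0x0 != 0x1976 or wrong count: 0".
 * Struct, field and function names are hypothetical; the expected
 * magic value 0x1976 is taken from the log message above.
 */
#include <stdint.h>
#include <stdio.h>

#define IAM_LFIX_LEAF_MAGIC 0x1976           /* expected on-disk magic */

struct iam_lfix_leaf_head {
        uint16_t ill_magic;                   /* must match IAM_LFIX_LEAF_MAGIC */
        uint16_t ill_count;                   /* records stored in this node */
};

/* Return 0 if the node header looks sane, -1 after logging otherwise. */
static int iam_lfix_check_head(const struct iam_lfix_leaf_head *head,
                               unsigned long blk, unsigned int limit)
{
        if (head->ill_magic != IAM_LFIX_LEAF_MAGIC ||
            head->ill_count == 0 || head->ill_count > limit) {
                fprintf(stderr,
                        "Wrong magic in node %lu: 0x%x != 0x%x or wrong count: %u\n",
                        blk, head->ill_magic, IAM_LFIX_LEAF_MAGIC,
                        head->ill_count);
                return -1;
        }
        return 0;
}

int main(void)
{
        /* A zeroed header, as seen in the crash log (magic 0x0, count 0). */
        struct iam_lfix_leaf_head bad = { 0, 0 };

        return iam_lfix_check_head(&bad, 81689, 56) ? 1 : 0;
}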

Backtrace:

 bt
PID: 4439   TASK: ffff88032a1ee040  CPU: 1   COMMAND: "mdt02_000"
 #0 [ffff8802cf7756f0] machine_kexec at ffffffff8103281b
 #1 [ffff8802cf775750] crash_kexec at ffffffff810ba792
 #2 [ffff8802cf775820] oops_end at ffffffff81501700
 #3 [ffff8802cf775850] no_context at ffffffff81043bab
 #4 [ffff8802cf7758a0] __bad_area_nosemaphore at ffffffff81043e35
 #5 [ffff8802cf7758f0] bad_area_nosemaphore at ffffffff81043f03
 #6 [ffff8802cf775900] __do_page_fault at ffffffff81044661
 #7 [ffff8802cf775a20] do_page_fault at ffffffff815036de
 #8 [ffff8802cf775a50] page_fault at ffffffff81500a95
    [exception RIP: lu_context_key_get+27]
    RIP: ffffffffa072f00b  RSP: ffff8802cf775b00  RFLAGS: 00010246
    RAX: 0000000000000015  RBX: ffff88014362c8c0  RCX: ffffffffa076546f
    RDX: 0000000000000000  RSI: ffffffffa0ee14e0  RDI: ffff880116f9f4c0
    RBP: ffff8802cf775b00   R8: fffffffffffffffe   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000004  R12: ffff8802cf775b60
    R13: ffff880116f9f4c0  R14: ffffffffa076546f  R15: ffff88012f4436f0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8802cf775b08] osd_xattr_get at ffffffffa0ebaf8f [osd_ldiskfs]
#10 [ffff8802cf775b58] dt_version_get at ffffffffa07330d4 [obdclass]
#11 [ffff8802cf775b88] mdt_obj_version_get at ffffffffa0e297cc [mdt]
#12 [ffff8802cf775bb8] mdt_version_get_check_save at ffffffffa0e29d0f [mdt]
#13 [ffff8802cf775be8] mdt_md_create at ffffffffa0e2a03d [mdt]
#14 [ffff8802cf775c68] mdt_reint_create at ffffffffa0e2a6b3 [mdt]
#15 [ffff8802cf775ca8] mdt_reint_rec at ffffffffa0e28151 [mdt]
#16 [ffff8802cf775cc8] mdt_reint_internal at ffffffffa0e219aa [mdt]
#17 [ffff8802cf775d18] mdt_reint at ffffffffa0e21cf4 [mdt]
#18 [ffff8802cf775d38] mdt_handle_common at ffffffffa0e15802 [mdt]
#19 [ffff8802cf775d88] mdt_regular_handle at ffffffffa0e166f5 [mdt]
#20 [ffff8802cf775d98] ptlrpc_server_handle_request at ffffffffa08b199d [ptlrpc]
#21 [ffff8802cf775e98] ptlrpc_main at ffffffffa08b2f89 [ptlrpc]
#22 [ffff8802cf775f48] kernel_thread at ffffffff8100c14a


 Comments   
Comment by Peter Jones [ 19/Sep/12 ]

Fanyong

Could you please comment on this one too?

Peter

Comment by Cliff White (Inactive) [ 19/Sep/12 ]

The MDS is now in a state where it crashes every time recovery completes.
Latest:

2012-09-19 08:14:08 BUG: unable to handle kernel paging request at 00000000cee88efc
2012-09-19 08:14:08 IP: [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:08 PGD 0
2012-09-19 08:14:08 Oops: 0000 [#1]
2012-09-19 08:14:08 LustreError: 4138:0:(llog_lvfs.c:430:llog_lvfs_next_block()) Cant read llog block at log id 7340386/1716600893 offset 2048000
2012-09-19 08:14:08 LustreError: 4551:0:(llog_lvfs.c:430:llog_lvfs_next_block()) Cant read llog block at log id 7340336/1716602012 offset 2056192
2012-09-19 08:14:08 SMP
2012-09-19 08:14:08 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
2012-09-19 08:14:08 CPU 3
2012-09-19 08:14:08 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
2012-09-19 08:14:08
2012-09-19 08:14:08 Pid: 4610, comm: mdt02_029 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
2012-09-19 08:14:08 RIP: 0010:[<ffffffffa086168e>]  [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:08 RSP: 0018:ffff8801613b3620  EFLAGS: 00010202
2012-09-19 08:14:08 RAX: ffffffffa09071d1 RBX: ffff88015f442d40 RCX: 0000000000000000
2012-09-19 08:14:08 RDX: ffffffffa09075ef RSI: ffffffffa092b4e0 RDI: ffff88015f442d40
2012-09-19 08:14:08 RBP: ffff8801613b3740 R08: 00000000fffffffb R09: 00000000fffffffe
2012-09-19 08:14:08 R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffa09075ef
2012-09-19 08:14:08 R13: ffffffffa092b4e0 R14: 00000000cee88ea8 R15: 0000000000000000
2012-09-19 08:14:08 FS:  00002aaaab47e700(0000) GS:ffff880028220000(0000) knlGS:0000000000000000
2012-09-19 08:14:08 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-09-19 08:14:08 CR2: 00000000cee88efc CR3: 000000032ab2f000 CR4: 00000000000006e0
2012-09-19 08:14:08 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-09-19 08:14:08 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-09-19 08:14:08 Process mdt02_029 (pid: 4610, threadinfo ffff8801613b2000, task ffff88013a370ae0)
2012-09-19 08:14:08 Stack:
2012-09-19 08:14:08  24f793cecee88ecb 0000000000000002 0000120200000001 0000052a00000000
2012-09-19 08:14:08 <d> ffffffffa09075c9 ffffffffa09075c9 000000020000346e 0000000000002bf9
2012-09-19 08:14:08 <d> 0000000000000002 ffff880100000038 ffffffffa09075eb 0000000004004000
2012-09-19 08:14:08 Call Trace:

2012-09-19 08:14:08  [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:08  [<ffffffffa086979c>] ldlm_resource_dump+0x12c/0x480 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0861c3f>] ldlm_granted_list_add_lock+0x5f/0x360 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa08651bc>] ldlm_grant_lock+0x38c/0x6f0 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa08927c2>] ldlm_process_inodebits_lock+0x212/0x400 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0865925>] ldlm_lock_enqueue+0x405/0x8f0 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa08848c9>] ldlm_cli_enqueue_local+0x179/0x560 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0884cb0>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0f65ab0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f687c0>] mdt_object_lock+0x320/0xb70 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f65ab0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
2012-09-19 08:14:08  [<ffffffffa0884cb0>] ? ldlm_completion_ast+0x0/0x730 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0f69071>] mdt_object_find_lock+0x61/0x170 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f96fa9>] mdt_reint_open+0x499/0x18a0 [mdt]
2012-09-19 08:14:08  [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:08  [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f7af7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f77191>] mdt_intent_policy+0x371/0x6a0 [mdt]
2012-09-19 08:14:08  [<ffffffffa0865881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa088d9bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
2012-09-19 08:14:08  [<ffffffffa0f77506>] mdt_enqueue+0x46/0x130 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
2012-09-19 08:14:08  [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-09-19 08:14:09  [<ffffffffa08bd99d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
2012-09-19 08:14:09  [<ffffffffa08b4f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
2012-09-19 08:14:09  [<ffffffffa03a85b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2012-09-19 08:14:09  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
2012-09-19 08:14:09  [<ffffffffa08bef89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
2012-09-19 08:14:09  [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-09-19 08:14:09  [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09  [<ffffffffa08be430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
2012-09-19 08:14:09  [<ffffffff8100c140>] ? child_rip+0x0/0x20
2012-09-19 08:14:09 Code: 48 c7 c0 d1 71 90 a0 74 19 49 8b 87 d8 00 00 00 48 85 c0 0f 84 34 02 00 00 48 8b 78 18 e8 5b 2b b4 ff 4d 85 f6 0f 84 a2 04 00 00 <41> 8b 76 54 83 fe 0c 0f 84 45 02 00 00 83 fe 0d 0f 84 fc 00 00
2012-09-19 08:14:09 RIP  [<ffffffffa086168e>] _ldlm_lock_debug+0x7e/0x5d0 [ptlrpc]
2012-09-19 08:14:09  RSP <ffff8801613b3620>
2012-09-19 08:14:09 CR2: 00000000cee88efc
2012-09-19 08:14:09 Initializing cgroup subsys cpuset
2012-09-19 08:14:09 Initializing cgroup subsys cpu
Comment by Cliff White (Inactive) [ 19/Sep/12 ]

The system now appears to be in a state where it cannot complete recovery.
Latest crash:

crash> bt
PID: 4610   TASK: ffff88013a370ae0  CPU: 3   COMMAND: "mdt02_029"
 #0 [ffff8801613b3210] machine_kexec at ffffffff8103281b
 #1 [ffff8801613b3270] crash_kexec at ffffffff810ba792
 #2 [ffff8801613b3340] oops_end at ffffffff81501700
 #3 [ffff8801613b3370] no_context at ffffffff81043bab
 #4 [ffff8801613b33c0] __bad_area_nosemaphore at ffffffff81043e35
 #5 [ffff8801613b3410] bad_area_nosemaphore at ffffffff81043f03
 #6 [ffff8801613b3420] __do_page_fault at ffffffff81044661
 #7 [ffff8801613b3540] do_page_fault at ffffffff815036de
 #8 [ffff8801613b3570] page_fault at ffffffff81500a95
    [exception RIP: _ldlm_lock_debug+126]
    RIP: ffffffffa086168e  RSP: ffff8801613b3620  RFLAGS: 00010202
    RAX: ffffffffa09071d1  RBX: ffff88015f442d40  RCX: 0000000000000000
    RDX: ffffffffa09075ef  RSI: ffffffffa092b4e0  RDI: ffff88015f442d40
    RBP: ffff8801613b3740   R8: 00000000fffffffb   R9: 00000000fffffffe
    R10: 0000000000000000  R11: 0000000000000004  R12: ffffffffa09075ef
    R13: ffffffffa092b4e0  R14: 00000000cee88ea8  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8801613b3748] ldlm_resource_dump at ffffffffa086979c [ptlrpc]
#10 [ffff8801613b37a8] ldlm_granted_list_add_lock at ffffffffa0861c3f [ptlrpc]
#11 [ffff8801613b37d8] ldlm_grant_lock at ffffffffa08651bc [ptlrpc]
#12 [ffff8801613b3838] ldlm_process_inodebits_lock at ffffffffa08927c2 [ptlrpc]
#13 [ffff8801613b38b8] ldlm_lock_enqueue at ffffffffa0865925 [ptlrpc]
#14 [ffff8801613b3918] ldlm_cli_enqueue_local at ffffffffa08848c9 [ptlrpc]
#15 [ffff8801613b39a8] mdt_object_lock at ffffffffa0f687c0 [mdt]
#16 [ffff8801613b3a48] mdt_object_find_lock at ffffffffa0f69071 [mdt]
#17 [ffff8801613b3a78] mdt_reint_open at ffffffffa0f96fa9 [mdt]
#18 [ffff8801613b3b48] mdt_reint_rec at ffffffffa0f81151 [mdt]
#19 [ffff8801613b3b68] mdt_reint_internal at ffffffffa0f7a9aa [mdt]
#20 [ffff8801613b3bb8] mdt_intent_reint at ffffffffa0f7af7d [mdt]
#21 [ffff8801613b3c08] mdt_intent_policy at ffffffffa0f77191 [mdt]
#22 [ffff8801613b3c48] ldlm_lock_enqueue at ffffffffa0865881 [ptlrpc]
#23 [ffff8801613b3ca8] ldlm_handle_enqueue0 at ffffffffa088d9bf [ptlrpc]
#24 [ffff8801613b3d18] mdt_enqueue at ffffffffa0f77506 [mdt]
#25 [ffff8801613b3d38] mdt_handle_common at ffffffffa0f6e802 [mdt]
#26 [ffff8801613b3d88] mdt_regular_handle at ffffffffa0f6f6f5 [mdt]
#27 [ffff8801613b3d98] ptlrpc_server_handle_request at ffffffffa08bd99d [ptlrpc]
#28 [ffff8801613b3e98] ptlrpc_main at ffffffffa08bef89 [ptlrpc]
#29 [ffff8801613b3f48] kernel_thread at ffffffff8100c14a
Comment by nasf (Inactive) [ 20/Sep/12 ]

It is a duplicate of LU-1976.
