Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 1.8.9
- None
- 3
- 10344
Description
NOAA ran into a kernel panic on an MDS that appears to be related to the MGS procfs system:
Unable to handle kernel NULL pointer dereference at 0000000000000050
RIP: [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
PGD 914f35067 PUD 914f36067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband_mad/umad0/port
CPU 21
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ib_iser(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ib_srp(U) rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_sa(U) dm_round_robin(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) joydev(U) sg(U) i2c_i801(U) igb(U) i2c_core(U) tpm_tis(U) tpm(U) tpm_bios(U) 8021q(U) mlx4_core(U) pcspkr(U) dca(U) serio_raw(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) ahci(U) libata(U) shpchp(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 11472, comm: ll_mgs_12 Tainted: G ---- 2.6.18-348.1.1.el5_lustre.es52 #1
RIP: 0010:[<ffffffff8af85f39>] [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
RSP: 0018:ffff8102082f1ad0 EFLAGS: 00010202
RAX: ffff81121bc82cc0 RBX: ffff8104af91d400 RCX: 0000000000000681
RDX: 0000000000000000 RSI: ffff81121bc82cc8 RDI: ffff81121bc82cc8
RBP: ffff81120c246140 R08: 0000000000000001 R09: 0000000000000000
R10: ffff81120c246140 R11: 0000000000000058 R12: ffff8104af91d400
R13: ffff8103879c8038 R14: ffff810384ffd5b0 R15: ffff8102082f1b5c
FS: 00002b68894c66e0(0000) GS:ffff81123fdda8c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000914f34000 CR4: 00000000000006a0
Process ll_mgs_12 (pid: 11472, threadinfo ffff8102082f0000, task ffff8102fcd63860)
Stack: 0000000000000000 0000000000000000 ffffffff8afb3986 ffff81038a2055c0
 0000003000000020 ffff8102082f1bd0 ffff8102082f1b10 ffff81120c246148
 ffff81121bc82cc0 ffff8104af91d400 ffff81038485e128 ffff8102082f1ca0
Call Trace:
 [<ffffffff8b2f4a70>] :mgs:mgs_handle+0x0/0x16d0
 [<ffffffff8b2f9450>] :mgs:mgs_export_stats_init+0x20/0xe0
 [<ffffffff8b2f34de>] :mgs:mgs_reconnect+0x14e/0x1e0
 [<ffffffff8b03c307>] :ptlrpc:lustre_msg_add_op_flags+0x47/0x120
 [<ffffffff8b03cea5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff8b006cf0>] :ptlrpc:target_handle_connect+0x24c0/0x2e80
 [<ffffffff8af27b00>] :lnet:lnet_match_blocked_msg+0x360/0x390
 [<ffffffff80158202>] __next_cpu+0x19/0x28
 [<ffffffff8b2f4f5e>] :mgs:mgs_handle+0x4ee/0x16d0
 [<ffffffff800471ee>] try_to_wake_up+0x472/0x484
 [<ffffffff8b046874>] :ptlrpc:ptlrpc_server_handle_request+0x984/0xe00
 [<ffffffff8b046fd5>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
 [<ffffffff8008d7a6>] __wake_up_common+0x3e/0x68
 [<ffffffff8b047f16>] :ptlrpc:ptlrpc_main+0xf16/0x10e0
 [<ffffffff8005dfc1>] child_rip+0xa/0x11
 [<ffffffff8b047000>] :ptlrpc:ptlrpc_main+0x0/0x10e0
 [<ffffffff8005dfb7>] child_rip+0x0/0x11
I ran gdb on obdclass and it looks like the panic is here:
(gdb) list *(lprocfs_exp_setup+0x449)
0x2cf39 is in lprocfs_exp_setup (/vault/builds/workspace/Lustre_ES_1.5/build-area/BUILD/lustre-1.8.9/lustre/obdclass/lprocfs_status.c:1729).
1724 atomic_read(&new_stat->nid_exp_ref_count));
1725
1726 /* we need to release old stats because lprocfs_exp_cleanup() hasn't
1727 * been and will never be called. */
1728 if (exp->exp_nid_stats != NULL) {
1729 nidstat_putref(exp->exp_nid_stats);
1730 exp->exp_nid_stats = NULL;
1731 }
1732
1733 /* Return -EALREADY here so that we know that the /proc
Is it possible that lprocfs_exp_setup() was called concurrently by two separate threads? If so, it seems like this crash could happen if one thread executed line 1730 after the other had already passed the NULL check at line 1728: the second thread would then reach line 1729 with exp->exp_nid_stats already set to NULL and call nidstat_putref() on a NULL pointer, which matches the fault address 0000000000000050 (NULL plus a small field offset).
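For what it's worth, below is a minimal user-space sketch of that suspected interleaving, assuming nidstat_putref() dereferences a field of its argument at offset 0x50 (consistent with CR2 in the oops above). The struct layouts, the refcount field, and the putref body are hypothetical stand-ins; only the check-then-dereference pattern mirrors lines 1728-1731 of lprocfs_status.c:

/* Hypothetical stand-ins for Lustre's types; the field below plays the
 * role of whatever nidstat_putref() touches at offset 0x50. */
#include <pthread.h>
#include <stdio.h>

struct nid_stat {
        char pad[0x50];
        int  nid_exp_ref_count;          /* lands at offset 0x50 */
};

struct obd_export {
        struct nid_stat *exp_nid_stats;  /* shared, no lock held here */
};

/* Stand-in for nidstat_putref(): dereferences its argument, so a NULL
 * pointer faults at NULL + 0x50, like the oops above. */
static void nidstat_putref(struct nid_stat *stat)
{
        stat->nid_exp_ref_count--;
}

static struct obd_export exp;

/* Each thread runs the equivalent of lprocfs_status.c lines 1728-1731. */
static void *setup_thread(void *unused)
{
        if (exp.exp_nid_stats != NULL) {           /* line 1728: check  */
                /* the other thread can store NULL between the check
                 * above and the reload below */
                nidstat_putref(exp.exp_nid_stats); /* line 1729: deref  */
                exp.exp_nid_stats = NULL;          /* line 1730: clear  */
        }
        return NULL;
}

int main(void)
{
        static struct nid_stat stat = { .nid_exp_ref_count = 1 };
        pthread_t t1, t2;

        exp.exp_nid_stats = &stat;
        pthread_create(&t1, NULL, setup_thread, NULL);
        pthread_create(&t2, NULL, setup_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("no crash this run; the window is narrow\n");
        return 0;
}

Built with cc -pthread, this usually exits cleanly because the window is only a few instructions wide, but the equivalent window in the kernel would eventually be hit under repeated MGS reconnects like the mgs_reconnect frame in the trace above.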
I've attached the crash backtrace and log files.