Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version: Lustre 1.8.9
Fix Version: None
Severity: 3
Rank: 10344
Description
NOAA ran into a kernel panic on an MDS that appears to be related to the MGS procfs code:
Unable to handle kernel NULL pointer dereference at 0000000000000050
RIP: [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
PGD 914f35067 PUD 914f36067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband_mad/umad0/port
CPU 21
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ib_iser(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ib_srp(U) rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_sa(U) dm_round_robin(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) joydev(U) sg(U) i2c_i801(U) igb(U) i2c_core(U) tpm_tis(U) tpm(U) tpm_bios(U) 8021q(U) mlx4_core(U) pcspkr(U) dca(U) serio_raw(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) ahci(U) libata(U) shpchp(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 11472, comm: ll_mgs_12 Tainted: G ---- 2.6.18-348.1.1.el5_lustre.es52 #1
RIP: 0010:[<ffffffff8af85f39>] [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
RSP: 0018:ffff8102082f1ad0 EFLAGS: 00010202
RAX: ffff81121bc82cc0 RBX: ffff8104af91d400 RCX: 0000000000000681
RDX: 0000000000000000 RSI: ffff81121bc82cc8 RDI: ffff81121bc82cc8
RBP: ffff81120c246140 R08: 0000000000000001 R09: 0000000000000000
R10: ffff81120c246140 R11: 0000000000000058 R12: ffff8104af91d400
R13: ffff8103879c8038 R14: ffff810384ffd5b0 R15: ffff8102082f1b5c
FS: 00002b68894c66e0(0000) GS:ffff81123fdda8c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000914f34000 CR4: 00000000000006a0
Process ll_mgs_12 (pid: 11472, threadinfo ffff8102082f0000, task ffff8102fcd63860)
Stack: 0000000000000000 0000000000000000 ffffffff8afb3986 ffff81038a2055c0
 0000003000000020 ffff8102082f1bd0 ffff8102082f1b10 ffff81120c246148
 ffff81121bc82cc0 ffff8104af91d400 ffff81038485e128 ffff8102082f1ca0
Call Trace:
 [<ffffffff8b2f4a70>] :mgs:mgs_handle+0x0/0x16d0
 [<ffffffff8b2f9450>] :mgs:mgs_export_stats_init+0x20/0xe0
 [<ffffffff8b2f34de>] :mgs:mgs_reconnect+0x14e/0x1e0
 [<ffffffff8b03c307>] :ptlrpc:lustre_msg_add_op_flags+0x47/0x120
 [<ffffffff8b03cea5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff8b006cf0>] :ptlrpc:target_handle_connect+0x24c0/0x2e80
 [<ffffffff8af27b00>] :lnet:lnet_match_blocked_msg+0x360/0x390
 [<ffffffff80158202>] __next_cpu+0x19/0x28
 [<ffffffff8b2f4f5e>] :mgs:mgs_handle+0x4ee/0x16d0
 [<ffffffff800471ee>] try_to_wake_up+0x472/0x484
 [<ffffffff8b046874>] :ptlrpc:ptlrpc_server_handle_request+0x984/0xe00
 [<ffffffff8b046fd5>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
 [<ffffffff8008d7a6>] __wake_up_common+0x3e/0x68
 [<ffffffff8b047f16>] :ptlrpc:ptlrpc_main+0xf16/0x10e0
 [<ffffffff8005dfc1>] child_rip+0xa/0x11
 [<ffffffff8b047000>] :ptlrpc:ptlrpc_main+0x0/0x10e0
 [<ffffffff8005dfb7>] child_rip+0x0/0x11
I ran gdb on obdclass and it looks like the panic is here:
(gdb) list *(lprocfs_exp_setup+0x449)
0x2cf39 is in lprocfs_exp_setup (/vault/builds/workspace/Lustre_ES_1.5/build-area/BUILD/lustre-1.8.9/lustre/obdclass/lprocfs_status.c:1729).
1724 atomic_read(&new_stat->nid_exp_ref_count));
1725
1726 /* we need to release old stats because lprocfs_exp_cleanup() hasn't
1727 * been and will never be called. */
1728 if (exp->exp_nid_stats != NULL) {
1729 nidstat_putref(exp->exp_nid_stats);
1730 exp->exp_nid_stats = NULL;
1731 }
1732
1733 /* Return -EALREADY here so that we know that the /proc
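The fault address is consistent with line 1729 being the crash site: nidstat_putref() drops the same nid_exp_ref_count reference that line 1724 reads, and CR2 in the oops is 0x0000000000000050, i.e. a dereference of NULL plus a small member offset rather than NULL itself. A standalone sketch of that arithmetic (the layout below is an assumption for illustration, not the real struct nid_stat from the Lustre headers):

#include <stdio.h>
#include <stddef.h>

/* Hypothetical layout: if nid_exp_ref_count sits 0x50 bytes into the
 * nid_stat structure, decrementing it through a NULL exp_nid_stats
 * pointer touches address 0x50, matching CR2 in the oops above. */
struct nid_stat_sketch {
        char preceding_members[0x50];  /* hash/list linkage, nid, etc. */
        int  nid_exp_ref_count;        /* counter nidstat_putref() decrements */
};

int main(void)
{
        printf("expected fault address = %#zx\n",
               offsetof(struct nid_stat_sketch, nid_exp_ref_count));
        return 0;
}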
Is it possible that lprocfs_exp_setup() was called twice by two separate threads? If so, it seems like this could happen if line 1730 was executed by one thread and then line 1729 by the other: both threads pass the NULL check on line 1728, the first thread puts the reference and sets exp->exp_nid_stats to NULL on line 1730, and the second thread then calls nidstat_putref() on the now-NULL pointer at line 1729.
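If that race is real, one way to close the window would be to make the test and the NULL store a single atomic step, snapshotting and clearing the pointer under a lock before dropping the reference. A minimal sketch, assuming a per-export spinlock such as exp->exp_lock covers exp_nid_stats (the lock choice is an assumption, not taken from the source):

/* Sketch only: swap the pointer out under the lock so exactly one
 * thread ever sees a given nid_stat here, then put the reference
 * outside the lock. */
struct nid_stat *old_stat;

spin_lock(&exp->exp_lock);      /* assumed lock protecting exp_nid_stats */
old_stat = exp->exp_nid_stats;
exp->exp_nid_stats = NULL;
spin_unlock(&exp->exp_lock);

if (old_stat != NULL)
        nidstat_putref(old_stat);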
I've attached the crash backtrace and log files.
Landed for 2.5