Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version: Lustre 1.8.9
Fix Version: None
Severity: 3
Rank: 10344
Description
NOAA ran into a kernel panic on an MDS that appears to be related to the MGS procfs code:
Unable to handle kernel NULL pointer dereference at 0000000000000050
RIP: [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
PGD 914f35067 PUD 914f36067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/infiniband_mad/umad0/port
CPU 21
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ib_iser(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ib_srp(U) rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_sa(U) dm_round_robin(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_en(U) joydev(U) sg(U) i2c_i801(U) igb(U) i2c_core(U) tpm_tis(U) tpm(U) tpm_bios(U) 8021q(U) mlx4_core(U) pcspkr(U) dca(U) serio_raw(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) ahci(U) libata(U) shpchp(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 11472, comm: ll_mgs_12 Tainted: G ---- 2.6.18-348.1.1.el5_lustre.es52 #1
RIP: 0010:[<ffffffff8af85f39>] [<ffffffff8af85f39>] :obdclass:lprocfs_exp_setup+0x449/0xd90
RSP: 0018:ffff8102082f1ad0 EFLAGS: 00010202
RAX: ffff81121bc82cc0 RBX: ffff8104af91d400 RCX: 0000000000000681
RDX: 0000000000000000 RSI: ffff81121bc82cc8 RDI: ffff81121bc82cc8
RBP: ffff81120c246140 R08: 0000000000000001 R09: 0000000000000000
R10: ffff81120c246140 R11: 0000000000000058 R12: ffff8104af91d400
R13: ffff8103879c8038 R14: ffff810384ffd5b0 R15: ffff8102082f1b5c
FS: 00002b68894c66e0(0000) GS:ffff81123fdda8c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000914f34000 CR4: 00000000000006a0
Process ll_mgs_12 (pid: 11472, threadinfo ffff8102082f0000, task ffff8102fcd63860)
Stack: 0000000000000000 0000000000000000 ffffffff8afb3986 ffff81038a2055c0
 0000003000000020 ffff8102082f1bd0 ffff8102082f1b10 ffff81120c246148
 ffff81121bc82cc0 ffff8104af91d400 ffff81038485e128 ffff8102082f1ca0
Call Trace:
 [<ffffffff8b2f4a70>] :mgs:mgs_handle+0x0/0x16d0
 [<ffffffff8b2f9450>] :mgs:mgs_export_stats_init+0x20/0xe0
 [<ffffffff8b2f34de>] :mgs:mgs_reconnect+0x14e/0x1e0
 [<ffffffff8b03c307>] :ptlrpc:lustre_msg_add_op_flags+0x47/0x120
 [<ffffffff8b03cea5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [<ffffffff8b006cf0>] :ptlrpc:target_handle_connect+0x24c0/0x2e80
 [<ffffffff8af27b00>] :lnet:lnet_match_blocked_msg+0x360/0x390
 [<ffffffff80158202>] __next_cpu+0x19/0x28
 [<ffffffff8b2f4f5e>] :mgs:mgs_handle+0x4ee/0x16d0
 [<ffffffff800471ee>] try_to_wake_up+0x472/0x484
 [<ffffffff8b046874>] :ptlrpc:ptlrpc_server_handle_request+0x984/0xe00
 [<ffffffff8b046fd5>] :ptlrpc:ptlrpc_wait_event+0x2e5/0x310
 [<ffffffff8008d7a6>] __wake_up_common+0x3e/0x68
 [<ffffffff8b047f16>] :ptlrpc:ptlrpc_main+0xf16/0x10e0
 [<ffffffff8005dfc1>] child_rip+0xa/0x11
 [<ffffffff8b047000>] :ptlrpc:ptlrpc_main+0x0/0x10e0
 [<ffffffff8005dfb7>] child_rip+0x0/0x11
I ran gdb on obdclass and it looks like the panic is here:
(gdb) list *(lprocfs_exp_setup+0x449)
0x2cf39 is in lprocfs_exp_setup (/vault/builds/workspace/Lustre_ES_1.5/build-area/BUILD/lustre-1.8.9/lustre/obdclass/lprocfs_status.c:1729).
1724 atomic_read(&new_stat->nid_exp_ref_count));
1725
1726 /* we need to release old stats because lprocfs_exp_cleanup() hasn't
1727 * been and will never be called. */
1728 if (exp->exp_nid_stats != NULL) {
1729 nidstat_putref(exp->exp_nid_stats);
1730 exp->exp_nid_stats = NULL;
1731 }
1732
1733 /* Return -EALREADY here so that we know that the /proc
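The fault address is consistent with line 1729 being the crash site: nidstat_putref() drops the same nid_exp_ref_count reference that line 1724 reads, and CR2 in the oops is 0x0000000000000050, i.e. a dereference of NULL plus a small member offset rather than NULL itself. A standalone sketch of that arithmetic (the layout below is an assumption for illustration, not the real struct nid_stat from the Lustre headers):

#include <stdio.h>
#include <stddef.h>

/* Hypothetical layout: if nid_exp_ref_count sits 0x50 bytes into the
 * nid_stat structure, decrementing it through a NULL exp_nid_stats
 * pointer touches address 0x50, matching CR2 in the oops above. */
struct nid_stat_sketch {
        char preceding_members[0x50];  /* hash/list linkage, nid, etc. */
        int  nid_exp_ref_count;        /* counter nidstat_putref() decrements */
};

int main(void)
{
        printf("expected fault address = %#zx\n",
               offsetof(struct nid_stat_sketch, nid_exp_ref_count));
        return 0;
}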
Is it possible that lprocfs_exp_setup() was called twice by two separate threads? If so, it seems like this could happen if line 1730 was executed by one thread and then line 1729 by the other: both threads pass the NULL check on line 1728, the first thread puts the reference and sets exp->exp_nid_stats to NULL on line 1730, and the second thread then calls nidstat_putref() on the now-NULL pointer at line 1729.
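If that race is real, one way to close the window would be to make the test and the NULL store a single atomic step, snapshotting and clearing the pointer under a lock before dropping the reference. A minimal sketch, assuming a per-export spinlock such as exp->exp_lock covers exp_nid_stats (the lock choice is an assumption, not taken from the source):

/* Sketch only: swap the pointer out under the lock so exactly one
 * thread ever sees a given nid_stat here, then put the reference
 * outside the lock. */
struct nid_stat *old_stat;

spin_lock(&exp->exp_lock);      /* assumed lock protecting exp_nid_stats */
old_stat = exp->exp_nid_stats;
exp->exp_nid_stats = NULL;
spin_unlock(&exp->exp_lock);

if (old_stat != NULL)
        nidstat_putref(old_stat);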
I've attached the crash backtrace and log files.
Landed for 2.5