Details
- Bug
- Resolution: Unresolved
- Critical
- None
- Lustre 2.5.5
- TOSS 2.4-7 (RHEL 6.7), Lustre 2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64
- 3
- 9223372036854775807
Description
On one compute cluster we are seeing multiple crashes per day. Console log shows:
<ConMan> Console [syrah143] log at 2016-03-16 03:00:00 PDT.
2016-03-16 03:47:44 BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
2016-03-16 03:47:44 IP: [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44 PGD fe9c89067 PUD fc112d067 PMD 0
2016-03-16 03:47:44 Oops: 0002 [#1] SMP
2016-03-16 03:47:44 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/infiniband/qib0/ports/1/state
2016-03-16 03:47:44 CPU 28
2016-03-16 03:47:44 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput microcode iTCO_wdt iTCO_vendor_support wmi isci libsas scsi_transport_sas ahci joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma ib_qib(U) ib_mad ib_core ib_addr xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_LOG xt_multiport iptable_filter ip_tables ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca i2c_algo_bit i2c_core ptp pps_core [last unloaded: cpufreq_ondemand]
2016-03-16 03:47:44 Pid: 88396, comm: react_charge_tr Not tainted 2.6.32-573.18.1.1chaos.ch5.4.x86_64 #1 Cray Inc. SERVER-GB512X-CN/S2600JF
2016-03-16 03:47:44 RIP: 0010:[<ffffffffa0b987cd>] [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44 RSP: 0018:ffff880818b3f788 EFLAGS: 00010202
2016-03-16 03:47:44 RAX: 0000000000000000 RBX: 0000000000d162ac RCX: 0000000000000001
2016-03-16 03:47:44 RDX: ffff880a1321c9d4 RSI: ffffffffa0bc95a0 RDI: ffff880a1321c9d4
2016-03-16 03:47:44 RBP: ffff880818b3f7d8 R08: ffff880b49965650 R09: ffff880e736d0ee8
2016-03-16 03:47:44 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880b49965608
2016-03-16 03:47:44 R13: ffff880a1321c638 R14: ffff880e736d0ee8 R15: 0000000000001000
2016-03-16 03:47:44 FS: 00002aaaaf64db20(0000) GS:ffff88085c580000(0000) knlGS:0000000000000000
2016-03-16 03:47:44 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-03-16 03:47:44 CR2: 0000000000000080 CR3: 0000000fc6de5000 CR4: 00000000000407e0
2016-03-16 03:47:44 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-03-16 03:47:44 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2016-03-16 03:47:44 Process react_charge_tr (pid: 88396, threadinfo ffff880818b3c000, task ffff88083045f520)
2016-03-16 03:47:44 Stack:
2016-03-16 03:47:44 ffff88093562cf08 ffff880b49965608 ffff88093562cf08 ffff88093562cf08
2016-03-16 03:47:44 <d> ffff880818b3f808 ffff880e736d0ee8 ffff880b49965608 ffff88093562cf08
2016-03-16 03:47:44 <d> ffff880b49965608 ffff880b61cadd80 ffff880818b3f808 ffffffffa0671e5a
2016-03-16 03:47:44 Call Trace:
2016-03-16 03:47:44 [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass]
2016-03-16 03:47:44 [<ffffffffa0b67579>] ll_cl_init+0x3a9/0x570 [lustre]
2016-03-16 03:47:44 [<ffffffffa066cd72>] ? cl_lock_mutex_try+0x112/0x120 [obdclass]
2016-03-16 03:47:44 [<ffffffffa0b67933>] ll_prepare_write+0x53/0x170 [lustre]
2016-03-16 03:47:44 [<ffffffffa0b845ee>] ll_write_begin+0x7e/0x1a0 [lustre]
2016-03-16 03:47:44 [<ffffffff81128443>] generic_file_buffered_write+0x123/0x300
2016-03-16 03:47:44 [<ffffffff8107ecf7>] ? current_fs_time+0x27/0x30
2016-03-16 03:47:44 [<ffffffff81129f60>] __generic_file_aio_write+0x260/0x490
2016-03-16 03:47:44 [<ffffffff8112a218>] generic_file_aio_write+0x88/0x100
2016-03-16 03:47:44 [<ffffffffa0b98676>] vvp_io_write_start+0x156/0x3d0 [lustre]
2016-03-16 03:47:44 [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass]
2016-03-16 03:47:44 [<ffffffffa0676964>] cl_io_loop+0xb4/0x1b0 [obdclass]
2016-03-16 03:47:44 [<ffffffffa0b35356>] ll_file_io_generic+0x1a6/0x750 [lustre]
2016-03-16 03:47:44 [<ffffffffa0b40169>] ll_file_aio_write+0x1b9/0x6f0 [lustre]
2016-03-16 03:47:44 [<ffffffffa0b3ffb0>] ? ll_file_aio_write+0x0/0x6f0 [lustre]
2016-03-16 03:47:44 [<ffffffff81192a3b>] do_sync_readv_writev+0xfb/0x140
2016-03-16 03:47:44 [<ffffffffa0ab8a83>] ? lov_io_fini+0x383/0x480 [lov]
2016-03-16 03:47:44 [<ffffffffa0b6a95b>] ? ll_stats_ops_tally+0x7b/0xa0 [lustre]
2016-03-16 03:47:44 [<ffffffff810a1d50>] ? autoremove_wake_function+0x0/0x40
2016-03-16 03:47:44 [<ffffffff81233a16>] ? security_file_permission+0x16/0x20
2016-03-16 03:47:44 [<ffffffff81193b26>] do_readv_writev+0xd6/0x1f0
2016-03-16 03:47:44 [<ffffffff811930fc>] ? generic_file_llseek_size+0x8c/0xd0
2016-03-16 03:47:44 [<ffffffffa0b359d2>] ? ll_file_seek+0xd2/0x2b0 [lustre]
2016-03-16 03:47:44 [<ffffffff81193c86>] vfs_writev+0x46/0x60
2016-03-16 03:47:44 [<ffffffff81193db1>] sys_writev+0x51/0xd0
2016-03-16 03:47:44 [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
2016-03-16 03:47:44 Code: a4 24 f4 00 00 00 fe e9 d9 fe ff ff 66 90 41 8b 4c 24 78 85 c9 0f 84 0e fe ff ff 49 8b 5d 68 49 89 5c 24 60 49 8b 86 b0 00 00 00 <48> 89 98 80 00 00 00 e9 f2 fd ff ff 0f 1f 80 00 00 00 00 8b 15
2016-03-16 03:47:44 RIP [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44 RSP <ffff880818b3f788>
2016-03-16 03:47:44 CR2: 0000000000000080
Latest node crashdump is 1.2GB compressed, so please advise if you want it.
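A rough reading of the Code: line in the oops, offered as a sketch only and not checked against the crashdump or the Lustre source: the bytes at the faulting IP, `<48> 89 98 80 00 00 00`, appear to encode a 64-bit store of RBX to offset 0x80 from RAX. With RAX = 0 in the register dump, that would give the faulting address 0x80 reported in CR2. The helper name `decode_mov_store` below is hypothetical, and it handles only this single instruction encoding; it is not a general x86-64 disassembler.

```python
# Sketch: decode the faulting instruction bytes from the oops "Code:" line.
# Handles ONLY `REX.W 89 /r` with a 32-bit displacement and no SIB byte
# (i.e. mov %reg64, disp32(%base)). Anything else raises ValueError.

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_store(code: bytes):
    """Return (source_register, base_register, displacement)."""
    if len(code) < 7:
        raise ValueError("need at least 7 bytes")
    rex, opcode, modrm = code[0], code[1], code[2]
    # REX prefix is 0100WRXB; require W=1 (64-bit operand size).
    if rex & 0xF8 != 0x48 or opcode != 0x89:
        raise ValueError("not a REX.W mov r/m64, r64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only the disp32, no-SIB form is handled")
    src = REGS[reg | ((rex & 0x4) << 1)]   # REX.R extends the reg field
    base = REGS[rm | ((rex & 0x1) << 3)]   # REX.B extends the r/m field
    disp = int.from_bytes(code[3:7], "little", signed=True)
    return src, base, disp


# Bytes starting at the <48> marker in the Code: line.
faulting = bytes.fromhex("48 89 98 80 00 00 00")
src, base, disp = decode_mov_store(faulting)

# Register values copied from the oops register dump.
registers = {"rax": 0x0, "rbx": 0xD162AC}
fault_addr = registers[base] + disp

print(f"mov %{src}, {disp:#x}(%{base})")
print(f"computed fault address: {fault_addr:#018x}")
```

If this reading is right, the store target comes from a pointer that was NULL at the time of the write, which is consistent with the "Oops: 0002" write fault, but the crashdump would be needed to confirm which structure field that is.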
Issue Links
- is related to LU-5108 osc: Performance tune for LRU (Resolved)