[LU-7951] NULL pointer dereference in vvp_io_write_start Created: 29/Mar/16 Updated: 01/Jul/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cameron Harr | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, llnlfixready | ||
| Environment: |
TOSS 2.4-7 (RHEL 6.7), Lustre 2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
On one compute cluster we are seeing multiple crashes per day. Console log shows: <ConMan> Console [syrah143] log at 2016-03-16 03:00:00 PDT. 2016-03-16 03:47:44 BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 2016-03-16 03:47:44 IP: [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre] 2016-03-16 03:47:44 PGD fe9c89067 PUD fc112d067 PMD 0 2016-03-16 03:47:44 Oops: 0002 [#1] SMP 2016-03-16 03:47:44 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/infiniband/qib0/ports/1/state 2016-03-16 03:47:44 CPU 28 2016-03-16 03:47:44 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput microcode iTCO_wdt iTCO_vendor_support wmi isci libsas scsi_transport_sas ahci joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma ib_qib(U) ib_mad ib_core ib_addr xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_LOG xt_multiport iptable_filter ip_tables ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca i2c_algo_bit i2c_core ptp pps_core [last unloaded: cpufreq_ondemand] 2016-03-16 03:47:44 Pid: 88396, comm: react_charge_tr Not tainted 2.6.32-573.18.1.1chaos.ch5.4.x86_64 #1 Cray Inc. SERVER-GB512X-CN/S2600JF 2016-03-16 03:47:44 RIP: 0010:[<ffffffffa0b987cd>] [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre] 2016-03-16 03:47:44 RSP: 0018:ffff880818b3f788 EFLAGS: 00010202 2016-03-16 03:47:44 RAX: 0000000000000000 RBX: 0000000000d162ac RCX: 0000000000000001 2016-03-16 03:47:44 RDX: ffff880a1321c9d4 RSI: ffffffffa0bc95a0 RDI: ffff880a1321c9d4 2016-03-16 03:47:44 RBP: ffff880818b3f7d8 R08: ffff880b49965650 R09: ffff880e736d0ee8 2016-03-16 03:47:44 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880b49965608 2016-03-16 03:47:44 R13: ffff880a1321c638 R14: ffff880e736d0ee8 R15: 0000000000001000 2016-03-16 03:47:44 FS: 00002aaaaf64db20(0000) GS:ffff88085c580000(0000) knlGS:0000000000000000 2016-03-16 03:47:44 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 2016-03-16 03:47:44 CR2: 0000000000000080 CR3: 0000000fc6de5000 CR4: 00000000000407e0 2016-03-16 03:47:44 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2016-03-16 03:47:44 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2016-03-16 03:47:44 Process react_charge_tr (pid: 88396, threadinfo ffff880818b3c000, task ffff88083045f520) 2016-03-16 03:47:44 Stack: 2016-03-16 03:47:44 ffff88093562cf08 ffff880b49965608 ffff88093562cf08 ffff88093562cf08 2016-03-16 03:47:44 <d> ffff880818b3f808 ffff880e736d0ee8 ffff880b49965608 ffff88093562cf08 2016-03-16 03:47:44 <d> ffff880b49965608 ffff880b61cadd80 ffff880818b3f808 ffffffffa0671e5a 2016-03-16 03:47:44 Call Trace: 2016-03-16 03:47:44 [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass] 2016-03-16 03:47:44 [<ffffffffa0b67579>] ll_cl_init+0x3a9/0x570 [lustre] 2016-03-16 03:47:44 [<ffffffffa066cd72>] ? cl_lock_mutex_try+0x112/0x120 [obdclass] 2016-03-16 03:47:44 [<ffffffffa0b67933>] ll_prepare_write+0x53/0x170 [lustre] 2016-03-16 03:47:44 [<ffffffffa0b845ee>] ll_write_begin+0x7e/0x1a0 [lustre] 2016-03-16 03:47:44 [<ffffffff81128443>] generic_file_buffered_write+0x123/0x300 2016-03-16 03:47:44 [<ffffffff8107ecf7>] ? current_fs_time+0x27/0x30 2016-03-16 03:47:44 [<ffffffff81129f60>] __generic_file_aio_write+0x260/0x490 2016-03-16 03:47:44 [<ffffffff8112a218>] generic_file_aio_write+0x88/0x100 2016-03-16 03:47:44 [<ffffffffa0b98676>] vvp_io_write_start+0x156/0x3d0 [lustre] 2016-03-16 03:47:44 [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass] 2016-03-16 03:47:44 [<ffffffffa0676964>] cl_io_loop+0xb4/0x1b0 [obdclass] 2016-03-16 03:47:44 [<ffffffffa0b35356>] ll_file_io_generic+0x1a6/0x750 [lustre] 2016-03-16 03:47:44 [<ffffffffa0b40169>] ll_file_aio_write+0x1b9/0x6f0 [lustre] 2016-03-16 03:47:44 [<ffffffffa0b3ffb0>] ? ll_file_aio_write+0x0/0x6f0 [lustre] 2016-03-16 03:47:44 [<ffffffff81192a3b>] do_sync_readv_writev+0xfb/0x140 2016-03-16 03:47:44 [<ffffffffa0ab8a83>] ? lov_io_fini+0x383/0x480 [lov] 2016-03-16 03:47:44 [<ffffffffa0b6a95b>] ? ll_stats_ops_tally+0x7b/0xa0 [lustre] 2016-03-16 03:47:44 [<ffffffff810a1d50>] ? autoremove_wake_function+0x0/0x40 2016-03-16 03:47:44 [<ffffffff81233a16>] ? security_file_permission+0x16/0x20 2016-03-16 03:47:44 [<ffffffff81193b26>] do_readv_writev+0xd6/0x1f0 2016-03-16 03:47:44 [<ffffffff811930fc>] ? generic_file_llseek_size+0x8c/0xd0 2016-03-16 03:47:44 [<ffffffffa0b359d2>] ? ll_file_seek+0xd2/0x2b0 [lustre] 2016-03-16 03:47:44 [<ffffffff81193c86>] vfs_writev+0x46/0x60 2016-03-16 03:47:44 [<ffffffff81193db1>] sys_writev+0x51/0xd0 2016-03-16 03:47:44 [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b 2016-03-16 03:47:44 Code: a4 24 f4 00 00 00 fe e9 d9 fe ff ff 66 90 41 8b 4c 24 78 85 c9 0f 84 0e fe ff ff 49 8b 5d 68 49 89 5c 24 60 49 8b 86 b0 00 00 00 <48> 89 98 80 00 00 00 e9 f2 fd ff ff 0f 1f 80 00 00 00 00 8b 15 2016-03-16 03:47:44 RIP [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre] 2016-03-16 03:47:44 RSP <ffff880818b3f788> 2016-03-16 03:47:44 CR2: 0000000000000080 Latest node crashdump is 1.2GB compressed, so please advise if you want it. |
| Comments |
| Comment by Peter Jones [ 30/Mar/16 ] |
|
Bobijam Could you please advise on this one? Thanks Peter |
| Comment by Zhenyu Xu [ 30/Mar/16 ] |
|
Can you get what code line vvp_io_write_start+0x2ad points to? |
| Comment by Cameron Harr [ 30/Mar/16 ] |
|
Sorry for the delay, I wasn't sure you were talking to me as I believe you all have access to our branch. Peter can direct you to it if need be. (gdb) l *(vvp_io_write_start+0x2ad) 0x797fd is in vvp_io_write_start (/usr/src/debug/lustre-2.5.5/lustre/llite/vvp_io.c:610). 605 /* 606 * PARALLEL IO This has to be changed for parallel IO doing 607 * out-of-order writes. 608 */ 609 pos = io->u.ci_wr.wr.crw_pos = i_size_read(inode); 610 cio->cui_iocb->ki_pos = pos; 611 } 612 613 CDEBUG(D_VFSTRACE, "write: [%lli, %lli)\n", pos, pos + (long long)cnt); |
| Comment by Zhenyu Xu [ 31/Mar/16 ] |
|
Jinshan, would please review this patch http://review.whamcloud.com/#/c/19254/ ? |
| Comment by Olaf Faaland [ 31/Mar/16 ] |
|
Peter, |
| Comment by Andreas Dilger [ 01/Apr/16 ] |
|
Olaf, I added you to the project to access the http://review.whamcloud.com/19254 patch. Please try again. I wasn't able to add Cameron since he doesn't have a Gerrit account. |
| Comment by Olaf Faaland [ 01/Apr/16 ] |
|
Andreas, |
| Comment by Cameron Harr [ 01/Apr/16 ] |
|
Andreas, |
| Comment by Andreas Dilger [ 01/Apr/16 ] |
|
Done. |
| Comment by Cameron Harr [ 04/Apr/16 ] |
|
Thank you Andreas. |