[LU-7951] NULL pointer dereference in vvp_io_write_start Created: 29/Mar/16  Updated: 01/Jul/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.5
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cameron Harr Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: llnl, llnlfixready
Environment:

TOSS 2.4-7 (RHEL 6.7), Lustre 2.5.5-3chaos_2.6.32_573.18.1.1chaos.ch5.4.x86_64.x86_64


Issue Links:
Related
is related to LU-5108 osc: Performance tune for LRU Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On one compute cluster we are seeing multiple crashes per day. Console log shows:

<ConMan> Console [syrah143] log at 2016-03-16 03:00:00 PDT.
2016-03-16 03:47:44 BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
2016-03-16 03:47:44 IP: [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44 PGD fe9c89067 PUD fc112d067 PMD 0
2016-03-16 03:47:44 Oops: 0002 [#1] SMP
2016-03-16 03:47:44 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/infiniband/qib0/ports/1/state
2016-03-16 03:47:44 CPU 28
2016-03-16 03:47:44 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ipmi_devintf ipmi_si ipmi_msghandler ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput microcode iTCO_wdt iTCO_vendor_support wmi isci libsas scsi_transport_sas ahci joydev sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma ib_qib(U) ib_mad ib_core ib_addr xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_LOG xt_multiport iptable_filter ip_tables ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca i2c_algo_bit i2c_core ptp pps_core [last unloaded: cpufreq_ondemand]
 
2016-03-16 03:47:44 Pid: 88396, comm: react_charge_tr Not tainted 2.6.32-573.18.1.1chaos.ch5.4.x86_64 #1 Cray Inc. SERVER-GB512X-CN/S2600JF
2016-03-16 03:47:44 RIP: 0010:[<ffffffffa0b987cd>]  [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44 RSP: 0018:ffff880818b3f788  EFLAGS: 00010202
2016-03-16 03:47:44 RAX: 0000000000000000 RBX: 0000000000d162ac RCX: 0000000000000001
2016-03-16 03:47:44 RDX: ffff880a1321c9d4 RSI: ffffffffa0bc95a0 RDI: ffff880a1321c9d4
2016-03-16 03:47:44 RBP: ffff880818b3f7d8 R08: ffff880b49965650 R09: ffff880e736d0ee8
2016-03-16 03:47:44 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880b49965608
2016-03-16 03:47:44 R13: ffff880a1321c638 R14: ffff880e736d0ee8 R15: 0000000000001000
2016-03-16 03:47:44 FS:  00002aaaaf64db20(0000) GS:ffff88085c580000(0000) knlGS:0000000000000000
2016-03-16 03:47:44 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-03-16 03:47:44 CR2: 0000000000000080 CR3: 0000000fc6de5000 CR4: 00000000000407e0
2016-03-16 03:47:44 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2016-03-16 03:47:44 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2016-03-16 03:47:44 Process react_charge_tr (pid: 88396, threadinfo ffff880818b3c000, task ffff88083045f520)
2016-03-16 03:47:44 Stack:
2016-03-16 03:47:44  ffff88093562cf08 ffff880b49965608 ffff88093562cf08 ffff88093562cf08
2016-03-16 03:47:44 <d> ffff880818b3f808 ffff880e736d0ee8 ffff880b49965608 ffff88093562cf08
2016-03-16 03:47:44 <d> ffff880b49965608 ffff880b61cadd80 ffff880818b3f808 ffffffffa0671e5a
2016-03-16 03:47:44 Call Trace:
2016-03-16 03:47:44  [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass]
2016-03-16 03:47:44  [<ffffffffa0b67579>] ll_cl_init+0x3a9/0x570 [lustre]
2016-03-16 03:47:44  [<ffffffffa066cd72>] ? cl_lock_mutex_try+0x112/0x120 [obdclass]
2016-03-16 03:47:44  [<ffffffffa0b67933>] ll_prepare_write+0x53/0x170 [lustre]
2016-03-16 03:47:44  [<ffffffffa0b845ee>] ll_write_begin+0x7e/0x1a0 [lustre]
2016-03-16 03:47:44  [<ffffffff81128443>] generic_file_buffered_write+0x123/0x300
2016-03-16 03:47:44  [<ffffffff8107ecf7>] ? current_fs_time+0x27/0x30
2016-03-16 03:47:44  [<ffffffff81129f60>] __generic_file_aio_write+0x260/0x490
2016-03-16 03:47:44  [<ffffffff8112a218>] generic_file_aio_write+0x88/0x100
2016-03-16 03:47:44  [<ffffffffa0b98676>] vvp_io_write_start+0x156/0x3d0 [lustre]
2016-03-16 03:47:44  [<ffffffffa0671e5a>] cl_io_start+0x6a/0x140 [obdclass]
2016-03-16 03:47:44  [<ffffffffa0676964>] cl_io_loop+0xb4/0x1b0 [obdclass]
2016-03-16 03:47:44  [<ffffffffa0b35356>] ll_file_io_generic+0x1a6/0x750 [lustre]
2016-03-16 03:47:44  [<ffffffffa0b40169>] ll_file_aio_write+0x1b9/0x6f0 [lustre]
2016-03-16 03:47:44  [<ffffffffa0b3ffb0>] ? ll_file_aio_write+0x0/0x6f0 [lustre]
2016-03-16 03:47:44  [<ffffffff81192a3b>] do_sync_readv_writev+0xfb/0x140
2016-03-16 03:47:44  [<ffffffffa0ab8a83>] ? lov_io_fini+0x383/0x480 [lov]
2016-03-16 03:47:44  [<ffffffffa0b6a95b>] ? ll_stats_ops_tally+0x7b/0xa0 [lustre]
2016-03-16 03:47:44  [<ffffffff810a1d50>] ? autoremove_wake_function+0x0/0x40
2016-03-16 03:47:44  [<ffffffff81233a16>] ? security_file_permission+0x16/0x20
2016-03-16 03:47:44  [<ffffffff81193b26>] do_readv_writev+0xd6/0x1f0
2016-03-16 03:47:44  [<ffffffff811930fc>] ? generic_file_llseek_size+0x8c/0xd0
2016-03-16 03:47:44  [<ffffffffa0b359d2>] ? ll_file_seek+0xd2/0x2b0 [lustre]
2016-03-16 03:47:44  [<ffffffff81193c86>] vfs_writev+0x46/0x60
2016-03-16 03:47:44  [<ffffffff81193db1>] sys_writev+0x51/0xd0
2016-03-16 03:47:44  [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
2016-03-16 03:47:44 Code: a4 24 f4 00 00 00 fe e9 d9 fe ff ff 66 90 41 8b 4c 24 78 85 c9 0f 84 0e fe ff ff 49 8b 5d 68 49 89 5c 24 60 49 8b 86 b0 00 00 00 <48> 89 98 80 00 00 00 e9 f2 fd ff ff 0f 1f 80 00 00 00 00 8b 15
2016-03-16 03:47:44 RIP  [<ffffffffa0b987cd>] vvp_io_write_start+0x2ad/0x3d0 [lustre]
2016-03-16 03:47:44  RSP <ffff880818b3f788>
2016-03-16 03:47:44 CR2: 0000000000000080

Latest node crashdump is 1.2GB compressed, so please advise if you want it.



 Comments   
Comment by Peter Jones [ 30/Mar/16 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 30/Mar/16 ]

Can you get what code line vvp_io_write_start+0x2ad points to?

Comment by Cameron Harr [ 30/Mar/16 ]

Sorry for the delay, I wasn't sure you were talking to me as I believe you all have access to our branch. Peter can direct you to it if need be.

(gdb) l *(vvp_io_write_start+0x2ad)
0x797fd is in vvp_io_write_start (/usr/src/debug/lustre-2.5.5/lustre/llite/vvp_io.c:610).
605	                /*
606	                 * PARALLEL IO This has to be changed for parallel IO doing
607	                 * out-of-order writes.
608	                 */
609	                pos = io->u.ci_wr.wr.crw_pos = i_size_read(inode);
610	                cio->cui_iocb->ki_pos = pos;
611	        }
612	
613	        CDEBUG(D_VFSTRACE, "write: [%lli, %lli)\n", pos, pos + (long long)cnt);
Comment by Zhenyu Xu [ 31/Mar/16 ]

Jinshan, would please review this patch http://review.whamcloud.com/#/c/19254/ ?

Comment by Olaf Faaland [ 31/Mar/16 ]

Peter,
Do you know why I'm unable to view the patch Zhenyu submitted?
thanks,
Olaf

Comment by Andreas Dilger [ 01/Apr/16 ]

Olaf, I added you to the project to access the http://review.whamcloud.com/19254 patch. Please try again. I wasn't able to add Cameron since he doesn't have a Gerrit account.

Comment by Olaf Faaland [ 01/Apr/16 ]

Andreas,
I can view the patch in gerritt now. Thank you.
-Olaf

Comment by Cameron Harr [ 01/Apr/16 ]

Andreas,
I was able to just log in to Gerrit, using charr@llnl.gov as an email address, if you want to add me.
Thanks.

Comment by Andreas Dilger [ 01/Apr/16 ]

Done.

Comment by Cameron Harr [ 04/Apr/16 ]

Thank you Andreas.

Generated at Sat Feb 10 02:13:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.