[LU-5260] Null pointer dereference in ll_cl_find Created: 26/Jun/14  Updated: 08/Jul/14  Resolved: 08/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug
Priority: Critical
Reporter: Patrick Farrell (Inactive)
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Environment:

SLES11 SP3 2.6 clients, CentOS 2.6 servers with striped directories.


Issue Links:
Related
is related to LU-5108 osc: Performance tune for LRU Resolved
Severity: 3
Rank (Obsolete): 14674

 Description   

During testing of 2.6 clients and servers (with striped directories), we lost a client to a null pointer dereference here:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffffa0801500>] ll_cl_find+0x50/0x80 [lustre]
PGD 4ef80f067 PUD 91ff68067 PMD 0
Oops: 0000 [#1] SMP
CPU 5
Modules linked in: xpmem dvspn(P) dvsof(P) dvsutil(P) dvsipc(P) dvsipc_lnet(P) dvsproc(P) bpmcdmod nic_compat cmsr osc mgc lustre lov mdc fid lmv fld kgnilnd ptlrpc obdclass lnet sha1_generic libcfs ib_core pcie_link_bw_monitor kdreg gpcd_ari ipogif_ari kgni_ari hwerr(P) rca hss_os(P) heartbeat simplex(P) ghal_ari craytrace

Pid: 13639, comm: memfill3 Tainted: P 3.0.101-0.15.1_1.0502.8131-cray_ari_c #1 Cray Inc. Cascade/Cascade
RIP: 0010:[<ffffffffa0801500>] [<ffffffffa0801500>] ll_cl_find+0x50/0x80 [lustre]
RSP: 0018:ffff8806e5c51ad8 EFLAGS: 00010217
RAX: ffff8804ef80c7f0 RBX: ffff880c611fdec0 RCX: ffff880c611fdf88
RDX: 0000000000000000 RSI: ffff880c611fbcc0 RDI: ffff880c611fdf84
RBP: ffff8806e5c51ae8 R08: 0000000000000000 R09: ffff8806e5c51c00
R10: 0000000000000320 R11: ffff880fb387bf68 R12: ffff880c611fdf84
R13: 0000000005d6df80 R14: 0000000000005d6d R15: 0000000000000000
FS: 00000000402569e0(0063) GS:ffff88107faa0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 00000004ef80e000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Process memfill3 (pid: 13639, threadinfo ffff8806e5c50000, task ffff8804ef80c7f0)
Stack:
ffff880c611fbcc0 ffff880fbfd149c0 ffff8806e5c51b78 ffffffffa08204d3
0000000000000000 ffff8806e5c51c00 0000008000000000 0000000000000000
0000000000000000 ffff880c7759cf48 00000f80e5c51ba8 ffffffffa0816c27
Call Trace:
[<ffffffffa08204d3>] ll_write_begin+0x83/0x760 [lustre]
[<ffffffff810fdede>] generic_file_buffered_write+0x10e/0x240
[<ffffffff81100cd9>] __generic_file_aio_write+0x259/0x450
[<ffffffff81100f29>] generic_file_aio_write+0x59/0xc0
[<ffffffffa08358ec>] vvp_io_write_start+0xfc/0x3e0 [lustre]
[<ffffffffa0377582>] cl_io_start+0x72/0x140 [obdclass]
[<ffffffffa037b134>] cl_io_loop+0xb4/0x1b0 [obdclass]
[<ffffffffa07d0c02>] ll_file_io_generic+0x5a2/0x8d0 [lustre]
[<ffffffffa07d115f>] ll_file_aio_write+0x22f/0x290 [lustre]
[<ffffffffa07d1b25>] ll_file_write+0x1e5/0x270 [lustre]
[<ffffffff811586bb>] vfs_write+0xcb/0x180
[<ffffffff81158865>] sys_write+0x55/0x90
[<ffffffff81427bab>] system_call_fastpath+0x16/0x1b
[<0000000020035811>] 0x20035810
Code: c8 00 00 00 48 8d 8b c8 00 00 00 48 39 ca 74 2d 65 48 8b 04 25 c0 b5 00 00 48 39 42 10 75 12 eb 2a 66 2e 0f 1f 84 00 00 00 00 00 <48> 39 42 10 74 1a 48 8b 12 48 39 ca 0f 1f 40 00 75 ee 31 c0 f0
RIP [<ffffffffa0801500>] ll_cl_find+0x50/0x80 [lustre]
RSP <ffff8806e5c51ad8>
CR2: 0000000000000010
---[ end trace 52e397e4bde9254f ]---
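
A note on reading the oops above: RDX is 0 and CR2 is 0x10, and the instruction bytes at the faulting RIP (48 39 42 10) decode to a 64-bit compare of RAX against the memory at 0x10(%rdx). RAX also matches the task pointer printed on the "Process memfill3" line. That pattern is what one would expect from reading a field at offset 0x10 of a list entry whose pointer is NULL. The sketch below only illustrates the arithmetic; the struct and function names are hypothetical and the layout (a two-pointer list link followed by a pointer-sized field) is an assumption, not the Lustre definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel-style doubly linked list link. */
struct list_link {
	struct list_link *next;
	struct list_link *prev;
};

/*
 * Hypothetical list entry, assumed layout for illustration only:
 * the link comes first, so the cookie lands at offset 0x10 on a
 * 64-bit build (two 8-byte pointers precede it).
 */
struct ctx_entry {
	struct list_link link;
	void *cookie;
};

/*
 * Address the CPU would touch when reading entry->cookie for an
 * entry located at entry_addr. With entry_addr == 0 this is the
 * field offset itself, which is the value reported as the fault
 * address.
 */
uintptr_t cookie_access_address(uintptr_t entry_addr)
{
	return entry_addr + offsetof(struct ctx_entry, cookie);
}
```

Under these assumptions, cookie_access_address(0) on a 64-bit build gives 0x10, the same value as CR2 in the dump.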

This is an untouched copy of master from last week, no other patches. The most recent commit:
Subject: LU-3188 osc: shorten IO calling path
Description:
By using osc_io_unplug_aync() for osc_queue_sync_pages() to shorten
the IO calling path, to reduce the chance of stack overflow.

This is revive of git commit 83ae17df2bdce837e62473aec27c03d67312c8ea.

Signed-off-by: Bobi Jam <bobijam.xu@intel.com>
Change-Id: I2ac32866f7adbc4701370704612c849a18a5d1ac
Reviewed-on: http://review.whamcloud.com/10292



 Comments   
Comment by Patrick Farrell (Inactive) [ 26/Jun/14 ]

I'll make the dump available shortly.

A few thoughts: This was encountered during testing of striped directories, though it is not clear whether that is relevant.
We have not previously seen this in our 2.6 client testing, so unless it's related to DNE2, it's likely to be rare.

Comment by Patrick Farrell (Inactive) [ 26/Jun/14 ]

Dump:
ftp.whamcloud.com:/uploads/LU-5260/LU_5260_140626.tar.gz

Comment by Jodi Levi (Inactive) [ 26/Jun/14 ]

Jinshan,
Can you please comment on this one?
Thank you!

Comment by Jinshan Xiong (Inactive) [ 26/Jun/14 ]

It looks like the ll_cl_context{} list was corrupted. Please try the following debug patch and see what happens.

http://review.whamcloud.com/10856

Comment by Patrick Farrell (Inactive) [ 26/Jun/14 ]

Jinshan -

We've only hit this once so far. Do you think that patch would be OK to temporarily commit to Cray's copy of master, so it's part of all of our master testing? If it's expected to have a large perf impact, I'd have to limit it to test runs looking for this issue.

Comment by Patrick Farrell (Inactive) [ 30/Jun/14 ]

On another test run, we hit what I assume is the same bug: a CPU stall when searching the list in ll_cl_find. We're going to arrange a test run with Jinshan's debug patch.
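
For illustration only: a NULL dereference and a CPU stall can both result from walking a corrupted circular list. A NULL link faults on the next access, while a link that bypasses the list head never terminates. The sketch below is not the debug patch referenced above (its contents are not shown in this ticket) and does not use the Lustre structures; all names are hypothetical. It shows a bounded walk that reports each kind of corruption instead of crashing or spinning.

```c
#include <stddef.h>

/* Hypothetical list node; not the Lustre ll_cl_context definition. */
struct node {
	struct node *next;
	struct node *prev;
	const void *cookie;
};

enum walk_result {
	WALK_FOUND,          /* an entry with the requested cookie exists */
	WALK_NOT_FOUND,      /* walked back to the head without a match */
	WALK_NULL_LINK,      /* hit a NULL next pointer (would oops) */
	WALK_NO_TERMINATION  /* step budget exhausted (would stall) */
};

/*
 * Search a circular list anchored at head for an entry whose cookie
 * matches. max_steps must be larger than the longest list expected;
 * exceeding it is treated as a sign that the list no longer leads
 * back to head.
 */
enum walk_result checked_find(const struct node *head, const void *cookie,
			      size_t max_steps, const struct node **out)
{
	const struct node *pos = head->next;
	size_t steps;

	for (steps = 0; steps < max_steps; steps++) {
		if (pos == NULL)
			return WALK_NULL_LINK;
		if (pos == head)
			return WALK_NOT_FOUND;
		if (pos->cookie == cookie) {
			if (out != NULL)
				*out = pos;
			return WALK_FOUND;
		}
		pos = pos->next;
	}
	return WALK_NO_TERMINATION;
}
```

A check like this costs one extra comparison and a counter per step, which is why it would more likely sit in a debug build than in a production I/O path.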

Comment by Jinshan Xiong (Inactive) [ 02/Jul/14 ]

After taking a further look, I think I found the problem. I'm creating the patch and will share it with you shortly.

Comment by Jinshan Xiong (Inactive) [ 02/Jul/14 ]

patch is at: http://review.whamcloud.com/10955

Comment by Andreas Dilger [ 04/Jul/14 ]

Was introduced by http://review.whamcloud.com/10503 .

Comment by Jodi Levi (Inactive) [ 08/Jul/14 ]

Patch landed to Master.
