[LU-6227] Master testing: (osc_request.c:1219:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off) Created: 09/Feb/15  Updated: 01/Jul/16  Resolved: 07/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch
Environment:

CentOS 6.5 servers & clients, current master (tag 2.6.94).


Issue Links:
Related
is related to LU-6666 osc_brw_prep_request()) ASSERTION( pa... Resolved
Severity: 3
Rank (Obsolete): 17428

 Description   

Spawning multiple copies of diotest1 from LTP in the same directory causes the assertion from LU-3192 to reappear.

I was able to replicate with full debug enabled and will make the dump & KO files available momentarily. I'll also include the diotest1 binary, but note that it is unchanged from LTP.

Stack trace:
<0>LustreError: 7700:0:(osc_request.c:1219:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 3 p_c 10 pg ffffea00017a5208 [pri 0 ind 2771] off 16384 prev_pg ffffea00017a51d0 [pri 0 ind 2256] off 16384
<0>LustreError: 7700:0:(osc_request.c:1219:osc_brw_prep_request()) LBUG
<4>Pid: 7700, comm: diotest1
<4>
<4>Call Trace:
<4> [<ffffffffa0302895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0302e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b14dd1>] osc_brw_prep_request+0xba1/0x10b0 [osc]
<4> [<ffffffffa0b15b40>] osc_build_rpc+0x860/0x15c0 [osc]
<4> [<ffffffffa0b30ab4>] osc_io_unplug0+0xe64/0x1b30 [osc]
<4> [<ffffffffa03131c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffffa0b33d21>] osc_io_unplug+0x11/0x20 [osc]
<4> [<ffffffffa0b38986>] osc_cache_writeback_range+0xda6/0x1280 [osc]
<4> [<ffffffffa0b25d30>] osc_io_fsync_start+0x90/0x360 [osc]
<4> [<ffffffffa047cb40>] ? cl_io_start+0x0/0x140 [obdclass]
<4> [<ffffffffa047cbaa>] cl_io_start+0x6a/0x140 [obdclass]
<4> [<ffffffffa047cb40>] ? cl_io_start+0x0/0x140 [obdclass]
<4> [<ffffffffa09790fe>] lov_io_call+0x8e/0x130 [lov]
<4> [<ffffffffa097cd8c>] lov_io_start+0xcc/0x180 [lov]
<4> [<ffffffffa047cbaa>] cl_io_start+0x6a/0x140 [obdclass]
<4> [<ffffffffa04808b4>] cl_io_loop+0xb4/0x1b0 [obdclass]
<4> [<ffffffffa09f283b>] cl_sync_file_range+0x31b/0x500 [lustre]
<4> [<ffffffffa0a1e9cc>] ll_writepages+0x9c/0x220 [lustre]
<4> [<ffffffff81134eb1>] do_writepages+0x21/0x40
<4> [<ffffffff8112031b>] __filemap_fdatawrite_range+0x5b/0x60
<4> [<ffffffff8112037a>] filemap_write_and_wait_range+0x5a/0x90
<4> [<ffffffff81121728>] generic_file_aio_read+0x418/0x700
<4> [<ffffffff81078fd7>] ? current_fs_time+0x27/0x30
<4> [<ffffffff811a5ef1>] ? touch_atime+0x71/0x1a0
<4> [<ffffffffa0a4f053>] vvp_io_read_start+0x233/0x460 [lustre]
<4> [<ffffffffa047cbaa>] cl_io_start+0x6a/0x140 [obdclass]
<4> [<ffffffffa04808b4>] cl_io_loop+0xb4/0x1b0 [obdclass]
<4> [<ffffffffa09efef1>] ll_file_io_generic+0x461/0xa40 [lustre]
<4> [<ffffffffa09f0600>] ll_file_aio_read+0x130/0x2b0 [lustre]
<4> [<ffffffffa09f0aa9>] ll_file_read+0x159/0x290 [lustre]
<4> [<ffffffff81189a75>] vfs_read+0xb5/0x1a0
<4> [<ffffffff81189bb1>] sys_read+0x51/0x90
<4> [<ffffffff810e202e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Given that this problem has recurred, it seems sensible to add a test for it to the test suite.
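
As a plain illustration of the invariant the assertion enforces (standalone example code, not Lustre source): every page placed in a single bulk RPC must have a strictly greater file offset than the page before it, so two pages both covering offset 16384 in one request trip the check.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the per-page fields the assertion message reports. */
struct page_desc {
        unsigned long index;        /* page index within the object */
        unsigned long long off;     /* byte offset covered by the page */
};

/* Mirrors the spirit of "i == 0 || pg->off > pg_prev->off": offsets within
 * one bulk request must be strictly increasing. */
static bool offsets_strictly_increasing(const struct page_desc *pages, int count)
{
        for (int i = 1; i < count; i++)
                if (pages[i].off <= pages[i - 1].off)
                        return false;
        return true;
}

int main(void)
{
        /* Two pages covering the same 16384 offset, as in the crash above. */
        struct page_desc pages[] = {
                { .index = 2256, .off = 16384 },
                { .index = 2771, .off = 16384 },
        };

        printf("request is %s\n",
               offsets_strictly_increasing(pages, 2) ? "well-formed" : "malformed");
        return 0;
}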



 Comments   
Comment by Patrick Farrell (Inactive) [ 09/Feb/15 ]

On ftp.whamcloud.com (uploading now):

uploads/LU-6227/

Dump, log, vmlinux and ko files:
150209_full_debug.tar.gz

Test script and test:
test.sh
diotest1

Comment by Patrick Farrell (Inactive) [ 09/Feb/15 ]

Transfer died... Trying again: 150209_full_debug2.tar.gz

Comment by Patrick Farrell (Inactive) [ 09/Feb/15 ]

I've also made it available for non-Intel folks on the Cray FTP.

ftp.cray.com
anonymous

outbound/xyratex/lu-6227
Files:
150209_full_debug.tar.gz
diotest1
test.sh

Comment by Alexander Boyko [ 10/Feb/15 ]

Patrick, the test was added with the patch - test_241 in sanity.sh, dio vs bio - and we saw a regression for it without the fix. Yes, the issue looks like LU-3192 and requires further analysis.

Comment by Andreas Dilger [ 24/Feb/15 ]

It looks like the client is not invalidating the buffered page from cache before submitting the dio page. There are a couple of solutions to that. One is to actually drop the buffered page; the other is to disallow merging these into a single RPC.
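
As a rough sketch of the second option only (an illustration under my own assumptions, not the actual patch or any Lustre API): when assembling pages for a bulk request, a page whose offset does not strictly exceed the previous page's offset would start a new RPC instead of being merged into the current one.

#include <stdio.h>

/* Illustrative page descriptor; only the byte offset matters here. */
struct page_desc {
        unsigned long long off;
};

/* Walk the candidate pages in submission order and cut over to a new "RPC"
 * whenever merging the next page would break the strictly-increasing-offset
 * rule that the assertion checks. */
static void assign_rpcs(const struct page_desc *pages, int count)
{
        int rpc = 0;

        for (int i = 0; i < count; i++) {
                if (i > 0 && pages[i].off <= pages[i - 1].off)
                        rpc++;        /* do not merge: start a new request */
                printf("page at off %llu -> rpc %d\n", pages[i].off, rpc);
        }
}

int main(void)
{
        /* A buffered page and a direct IO page both covering offset 16384. */
        struct page_desc pages[] = {
                { .off = 0 }, { .off = 4096 }, { .off = 16384 }, { .off = 16384 },
        };

        assign_rpcs(pages, 4);
        return 0;
}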

Comment by Patrick Farrell (Inactive) [ 24/Feb/15 ]

My understanding is that http://review.whamcloud.com/#/c/10930 took the second route - disallowing merging - but it appears that it's not working any more. (It fixed the problem originally, at least in 2.5 for Cray. I'm not sure we tested it on master at the time.)

Comment by Alexander Boyko [ 18/Mar/15 ]

Based on the assert message:

<0>LustreError: 7700:0:(osc_request.c:1219:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 3 p_c 10 pg ffffea00017a5208 [pri 0 ind 2771] off 16384 prev_pg ffffea00017a51d0 [pri 0 ind 2256] off 16384

Both pages are direct IO pages, and they have the same offset. So this is not LU-3192, where we had a mix of dio and buffered pages.

I have found the commit which introduced this regression.

commit e6f592fc57b048223e2676408232b6662aad712d
Author: Prakash Surya <surya1@llnl.gov>
Date:   Wed Oct 2 17:16:51 2013 -0700

    LU-1669 vvp: Use lockless __generic_file_aio_write
    
    Testing multi-threaded single shard file write performance has shown
    the inode mutex to be a limiting factor when using the
    generic_file_aio_write function. To work around this bottle neck, this
    change replaces the locked version of that call with the lock less
    version, specifically, __generic_file_aio_write.

Comment by Patrick Farrell (Inactive) [ 18/Mar/15 ]

Hmmm. Unfortunately, that makes some sense... The purpose of that patch was to allow multiple threads on a single client to write to one file at the same time, so there are some possible concurrency issues.

I wonder if this issue is only in the context of direct IO... Obviously this crash is with direct IO pages, but I wonder about the underlying problem.

Comment by Oleg Drokin [ 24/Mar/15 ]

I think the parallel IO is still guarded with some sort of range lock, so shouldn't we first understand why that did not help?

Comment by Patrick Farrell (Inactive) [ 24/Mar/15 ]

I looked at this the other day, and the range lock is taken in ll_file_io_generic, which is not used in the direct IO path.

The two conflicting pages in this instance are created by calls into ll_direct_IO_26; the assert then fires while the pages are being written out afterwards, somewhere in the course of an ll_file_read call.

The ll_file_read call uses the range lock, but two conflicting pages are already present at that time. The direct IO path does not use the range lock. Should it?

I actually don't see how the patch called out by Alexander Boyko above changes the behavior of the direct IO path in a way that would have prevented this issue - it only takes the inode mutex on the read path. What about the write path? (I'm not arguing the patch isn't the source of the problem, just saying I can't see how. It seems there's something, perhaps several somethings, I don't understand here.)

Comment by Patrick Farrell (Inactive) [ 07/Apr/15 ]

Looking closer...
The particular crash is two reads competing.

00000008:00040000:3.0:1423457026.801250:0:7700:0:(osc_request.c:1219:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 3 p_c 10 pg ffffea00017a5208 [pri 0 ind 2771] off 16384 prev_pg ffffea00017a51d0 [pri 0 ind 2256] off 16384

00000080:00000001:2.0:1423457026.799792:0:1562:0:(file.c:1285:ll_file_aio_read()) Process entered
00000020:00001000:2.0:1423457026.799792:0:1562:0:(cl_object.c:849:cl_env_peek()) 2@ffff88013b8cebb8
00000080:00000001:2.0:1423457026.799793:0:1562:0:(file.c:1145:ll_file_io_generic()) Process entered
00000080:00200000:2.0:1423457026.799793:0:1562:0:(file.c:1148:ll_file_io_generic()) file: infile, type: 0 ppos: 16384, count: 8192
00000020:00008000:2.0:1423457026.800026:0:1562:0:(cl_page.c:250:cl_page_find()) 4@[0x20000a040:0x316:0x0] ffffea00017a51d0 0 2
00000008:00000040:2.0:1423457026.800036:0:1562:0:(osc_cache.c:2272:osc_prep_async_page()) oap ffff8800bf9d7550 page ffffea00017a51d0 obj off 16384

00000080:00000001:3.0:1423457026.800382:0:1915:0:(file.c:1285:ll_file_aio_read()) Process entered
00000020:00001000:3.0:1423457026.800383:0:1915:0:(cl_object.c:849:cl_env_peek()) 2@ffff88005c8eb9c0
00000080:00000001:3.0:1423457026.800384:0:1915:0:(file.c:1145:ll_file_io_generic()) Process entered
00000080:00200000:3.0:1423457026.800384:0:1915:0:(file.c:1148:ll_file_io_generic()) file: infile, type: 0 ppos: 16384, count: 8192
00000020:00008000:3.0:1423457026.800593:0:1915:0:(cl_page.c:250:cl_page_find()) 4@[0x20000a040:0x316:0x0] ffffea00017a5208 0 2
00000008:00000040:3.0:1423457026.800603:0:1915:0:(osc_cache.c:2272:osc_prep_async_page()) oap ffff8800bf31e750 page ffffea00017a5208 obj off 16384

Reads do not take the range lock, but for direct IO, this can mean they try to read the same page at the same time.

This doesn't happen for non-direct IO reads - presumably it happens for direct IO because some of the other locking machinery is disabled or not used.

I'll submit a patch to use the range lock for direct IO reads... But if someone else feels this concurrency should be handled at another layer, please let me know where and how.
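
To make the intended fix concrete, here is a userspace analogy only (all names and the implementation below are hypothetical, not the llite code): two direct IO reads whose byte ranges overlap take a range lock before preparing their pages, so the overlapping read waits, while reads of disjoint ranges could still run in parallel.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RANGES 16

/* Hypothetical, minimal range lock: a table of held byte ranges protected by
 * a mutex.  Overlapping lockers wait; disjoint lockers proceed in parallel. */
struct held_range {
        long long start;
        long long end;
        bool held;
};

static struct held_range held[MAX_RANGES];
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tree_cond = PTHREAD_COND_INITIALIZER;

static bool overlaps_held_range(long long start, long long end)
{
        for (int i = 0; i < MAX_RANGES; i++)
                if (held[i].held && start <= held[i].end && end >= held[i].start)
                        return true;
        return false;
}

/* Returns the slot to pass to range_unlock(), or -1 if the table is full
 * (a real implementation would have no such limit). */
static int range_lock(long long start, long long end)
{
        pthread_mutex_lock(&tree_lock);
        while (overlaps_held_range(start, end))
                pthread_cond_wait(&tree_cond, &tree_lock);
        for (int i = 0; i < MAX_RANGES; i++) {
                if (!held[i].held) {
                        held[i] = (struct held_range){ start, end, true };
                        pthread_mutex_unlock(&tree_lock);
                        return i;
                }
        }
        pthread_mutex_unlock(&tree_lock);
        return -1;
}

static void range_unlock(int slot)
{
        pthread_mutex_lock(&tree_lock);
        held[slot].held = false;
        pthread_cond_broadcast(&tree_cond);
        pthread_mutex_unlock(&tree_lock);
}

/* Each "reader" stands in for a direct IO read of ppos 16384, count 8192. */
static void *dio_read(void *arg)
{
        long id = (long)arg;
        int slot = range_lock(16384, 16384 + 8192 - 1);

        printf("reader %ld: preparing pages for off 16384\n", id);
        usleep(100000);                /* pretend to build and send the RPC */
        printf("reader %ld: done\n", id);
        if (slot >= 0)
                range_unlock(slot);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, dio_read, (void *)1L);
        pthread_create(&t2, NULL, dio_read, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}

In this demo both readers cover the same range, so the second one blocks until the first unlocks; the same structure leaves non-overlapping reads fully parallel, which is the property that makes a range lock preferable to serializing everything on the inode mutex.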

Comment by Gerrit Updater [ 07/Apr/15 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: http://review.whamcloud.com/14385
Subject: LU-6227 vvp: Use range lock for direct IO reads
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 462a5a1cac1e20c07fcc1eab5715dc370b4f71d4

Comment by Patrick Farrell (Inactive) [ 07/Apr/15 ]

The patch submitted above passes testing with Cray's reproducer for this bug.

Comment by Patrick Farrell (Inactive) [ 07/Apr/15 ]

The bug was introduced specifically in this part of the change highlighted by Alexander:
http://review.whamcloud.com/#/c/6672/12/lustre/llite/rw26.c

The comment there does not mention that DIO reads must also be kept separate from one another, so it appeared safe to remove that concurrency control, since the range locking covers the cases the comment does describe.

Comment by Jinshan Xiong (Inactive) [ 13/Apr/15 ]

Hi Patrick,

I didn't read all the comments above, only the patch. Isn't it a better idea to take the inode mutex in the direct IO path?

Comment by Patrick Farrell (Inactive) [ 13/Apr/15 ]

Jinshan -

I think not, for the same reason we don't take it (any more) in the non-direct IO path: not taking it allows parallelism between multiple threads doing direct IO. Why do you think we should take the mutex?

Also, if the old comment removed by LU-1669 was right, we also need to protect direct IO from the buffered IO path, and since that path doesn't take the mutex any more, that means the range lock:
/* 0. Need locking between buffered and direct access. and race with
 *    size changing by concurrent truncates and writes.
 * 1. Need inode mutex to operate transient pages.
 */
if (rw == READ)
        mutex_lock(&inode->i_mutex);

Comment by Gerrit Updater [ 28/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14385/
Subject: LU-6227 vvp: Use range lock for direct IO reads
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b464b45746153889e3dead4e2254d3ebda77f8d

Comment by Peter Jones [ 07/Jul/15 ]

Landed for 2.8

Comment by James A Simmons [ 20/Jan/16 ]

ORNL just ran into this for our 2.7 clients. Looks like a backport is needed.

Comment by Patrick Farrell (Inactive) [ 20/Jan/16 ]

James, the affects version is 2.7.0. So, of course it's needed there. The patch shouldn't need any particular porting, though.
