[LU-2170] osc_extent_merge()) ASSERTION( cur->oe_osclock == victim->oe_osclock) while running racer Created: 13/Oct/12  Updated: 17/Jul/17  Resolved: 13/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 5202

 Description   

Hit this assertion running racer in a loop in a single node setup:

[ 8788.630974] LustreError: 4352:0:(vvp_io.c:1038:vvp_io_commit_write()) Skipped 833 previous similar messages
[ 8834.780249] LustreError: 22605:0:(osc_cache.c:506:osc_extent_merge()) ASSERTION( cur->oe_osclock == victim->oe_osclock ) failed: 
[ 8834.780830] LustreError: 22605:0:(osc_cache.c:506:osc_extent_merge()) LBUG
[ 8834.781139] Pid: 22605, comm: cat
[ 8834.781364] 
[ 8834.781364] Call Trace:
[ 8834.781773]  [<ffffffffa0c9e915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 8834.782075]  [<ffffffffa0c9ef27>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 8834.782366]  [<ffffffffa04ce847>] osc_extent_merge+0x697/0x6a0 [osc]
[ 8834.782657]  [<ffffffff81112df1>] ? generic_file_buffered_write+0x1f1/0x300
[ 8834.782947]  [<ffffffff8106fea7>] ? current_fs_time+0x27/0x30
[ 8834.783209]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 8834.783474]  [<ffffffffa04d601c>] osc_extent_release+0xfc/0x3e0 [osc]
[ 8834.783782]  [<ffffffffa059a610>] ? lov_io_end_wrapper+0x0/0x100 [lov]
[ 8834.784070]  [<ffffffffa04c7ddf>] osc_io_end+0x1f/0x30 [osc]
[ 8834.784371]  [<ffffffffa11dd7a0>] cl_io_end+0x60/0x150 [obdclass]
[ 8834.784661]  [<ffffffffa059a701>] lov_io_end_wrapper+0xf1/0x100 [lov]
[ 8834.785159]  [<ffffffffa059a1ae>] lov_io_call+0x8e/0x130 [lov]
[ 8834.785465]  [<ffffffffa059bdec>] lov_io_end+0x4c/0x110 [lov]
[ 8834.785758]  [<ffffffffa11dd7a0>] cl_io_end+0x60/0x150 [obdclass]
[ 8834.786013]  [<ffffffffa11e28d2>] cl_io_loop+0xc2/0x1b0 [obdclass]
[ 8834.786353]  [<ffffffffa0a7822b>] ll_file_io_generic+0x42b/0x550 [lustre]
[ 8834.786644]  [<ffffffffa0a7911c>] ll_file_aio_write+0x13c/0x2c0 [lustre]
[ 8834.811295] Call Trace:
[ 8834.811521]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[ 8834.811788]  [<ffffffffa0c9ef7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[ 8834.812079]  [<ffffffffa04ce847>] ? osc_extent_merge+0x697/0x6a0 [osc]
[ 8834.812364]  [<ffffffff81112df1>] ? generic_file_buffered_write+0x1f1/0x300
[ 8834.812605]  [<ffffffff8106fea7>] ? current_fs_time+0x27/0x30
[ 8834.812928]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 8834.813301]  [<ffffffffa04d601c>] ? osc_extent_release+0xfc/0x3e0 [osc]
[ 8834.813615]  [<ffffffffa059a610>] ? lov_io_end_wrapper+0x0/0x100 [lov]
[ 8834.813903]  [<ffffffffa04c7ddf>] ? osc_io_end+0x1f/0x30 [osc]
[ 8834.814164]  [<ffffffffa11dd7a0>] ? cl_io_end+0x60/0x150 [obdclass]
[ 8834.816218]  [<ffffffffa059a701>] ? lov_io_end_wrapper+0xf1/0x100 [lov]
[ 8834.816513]  [<ffffffffa059a1ae>] ? lov_io_call+0x8e/0x130 [lov]
[ 8834.816814]  [<ffffffffa059bdec>] ? lov_io_end+0x4c/0x110 [lov]
[ 8834.817184]  [<ffffffffa11dd7a0>] ? cl_io_end+0x60/0x150 [obdclass]
[ 8834.817462]  [<ffffffffa11e28d2>] ? cl_io_loop+0xc2/0x1b0 [obdclass]
[ 8834.817769]  [<ffffffffa0a7822b>] ? ll_file_io_generic+0x42b/0x550 [lustre]
[ 8834.818078]  [<ffffffffa0a7911c>] ? ll_file_aio_write+0x13c/0x2c0 [lustre]
[ 8834.818385]  [<ffffffffa0a79409>] ? ll_file_write+0x169/0x2a0 [lustre]
[ 8834.818676]  [<ffffffff8117b2e8>] ? vfs_write+0xb8/0x1a0
[ 8834.818903]  [<ffffffff810d5192>] ? audit_syscall_entry+0x272/0x2a0
[ 8834.819185]  [<ffffffff8117bbb1>] ? sys_write+0x51/0x90
[ 8834.819425]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

I have a crashdump for this one



 Comments   
Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ]

patch is at: http://review.whamcloud.com/4316

Comment by Cory Spitz [ 01/Nov/12 ]

The patch has been effective at eliminating our reproducer.

BTW, here are the details of the reproducer from our test engineer:

fsx_mpi - fsx(file system exerciser) This test is open sourced and part of LTP.

iozone - file system benchmark, freely available

growfiles_mpi - growfiles extends and truncates a file or files.
growfiles can do data validation. We have several
versions of growfiles(pre/post the Cary/SGI split), but I
think the one used here was the LTP version with the
addition of MPI to use unique file names if desired.

doio_mpi - doio_mpi uses the following system calls to do the I/O:
read, write, readv, writev, mmread, mmwrite, fsync2,
fdatasync

The program locks the regions of the file that are being
operated on, to prevent overlapping requests from
clobbering each other. This test does data checking.
doio_mpi was based on iogen/doio which are part of LTP,
but it was a large re-write to add the MPI functionality.

mmstress - This is a test program that performs general stress with
memory race conditions. It contains seven testcases that
will test race conditions between simultaneous read fault,
write fault, copy on write (COW) fault e.t.c. system and
can be easily ported to work on other operating systems
as well.
Part of LTP.

For the commands I was using, 'ubrun' is a tool that can do a little
setup/cleanup, and identifies which binaries to use when running on
different systems. 'aptun' is our application launching tool, it
determines the batch scheduler used if any, creates a job input
file(if needed), then launches the job.

I used a tool, FSLOAD, that randomly picks a command from the pool
of commands, and runs them on 5%-10% of the PEs on the system. I
typically had 5-10 jobs running at any time until the problem hit.

The commands I was using:
ubrun -t -A "rm ./fsx*" -e LTPROOT_CL -s "All operations completed -OK!" aptrun -n 4 LTPROOT_CL=testcases/bin/fsx-linux_mpi -q -WR -N 30000 ./fsx1.$$

ubrun -t -e APPS_CL -s "iozone test complete" aptrun -n 1 -M 1 APPS_CL=iozone/src/current/RUN/iozone -A2 -f ./ioz1.$$

ubrun -t -e OSTEST_CL -s "Hit time value" aptrun -n 4 OSTEST_CL=bin/growfiles_mpi -b -e 1 -i 0 -u -L 300 -d .

ubrun -t -e LTPROOT_CL -o aptrun -n 4 LTPROOT_CL=testcases/bin/doio_mpi -i 50000 10000b:./do1

ubrun -s "Hit time value of" -x -e OSTEST_CL -o -T gf1 -t -D aptrun -n 8 OSTEST_CL=bin/growfiles_mpi -b -i 0 -L 300 -g 4090 -T 100 -t 408990 -C 10 -c 1000 -d dir1 -S 200

ubrun s "Hit time value of" -x -e OSTEST_CL -o -T gf5 -t -D aptrun -n 8 OSTEST_CL=bin/growfiles_mpi -b -r 1-5000 -R 0-1 -i 0 -L 300 -C 1 -S 2

ubrun -t -D -o -T CL_LTPmtest05 -e LTPROOT_CL -p. -B "cp $LTPROOT_CL/testcases/bin/dummy ." aptrun -n 1 LTPROOT_CL=testcases/bin/mmstress

ubrun -t -D -o -T CL_LTPmtest05 -e LTPROOT_CL -p. -B "cp $LTPROOT_CL/testcases/bin/dummy ." aptrun -n 1 LTPROOT_CL=testcases/bin/mmstress

Comment by Jinshan Xiong (Inactive) [ 05/Nov/12 ]

Cory, thank you so much about the reproduce info.

Comment by Jinshan Xiong (Inactive) [ 13/Nov/12 ]

patch has been landed to master

Generated at Sat Feb 10 01:22:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.