[LU-8460] (osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: LBUG Created: 01/Aug/16  Updated: 14/Jun/18  Resolved: 17/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: John Salinas (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: LS_RZ

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As part of CORAL testing I was trying to use ior to write/read with 16MB RPCs. As soon as the test started, it appeared that all of the Lustre clients hit an LBUG, crashed, and rebooted.

[263032.826900] LustreError: 54616:0:(mgc_request.c:1536:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib
[263032.839625] Lustre: 54616:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2
[263032.940768] Lustre: Mounted lsdraid-client
[270650.452658] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
[270650.465098] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
[270650.472830] Pid: 64198, comm: ior
[270650.476666] 
Call Trace:
[270650.482816]  [<ffffffffa08267d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[270650.492278]  [<ffffffffa0826d75>] lbug_with_loc+0x45/0xc0 [libcfs]
[270650.502128]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
[270650.510720]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
[270650.518827]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
[270650.527638]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
[270650.537084]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
[270650.545977]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
[270650.554180]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
[270650.563226]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
[270650.572635]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
[270650.581481]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
[270650.590561]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
[270650.600035]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
[270650.609317]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
[270650.618397]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
[270650.627864]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
[270650.637324]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
[270650.646690]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
[270650.655374]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
[270650.663281]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
[270650.671879]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
[270650.680178]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
[270650.688744]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
[270650.696904]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
[270650.704945]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
[270650.713656]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
[270650.722373]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
[270650.731000]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
[270650.739126]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
[270650.745994]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
[270650.752766]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
[270650.760571] 
[270650.763508] Kernel panic - not syncing: LBUG
[270650.769875] CPU: 17 PID: 64198 Comm: ior Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
[270650.782689] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[270650.796038]  ffffffffa0843def 0000000067048a74 ffff88103fd9f560 ffffffff816360fc
[270650.806102]  ffff88103fd9f5e0 ffffffff8162f977 ffffffff00000008 ffff88103fd9f5f0
[270650.816170]  ffff88103fd9f590 0000000067048a74 ffffffffa0dc63b0 0000000000000000
[270650.826189] Call Trace:
[270650.830566]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
[270650.837897]  [<ffffffff8162f977>] panic+0xd8/0x1e7
[270650.844870]  [<ffffffffa0826ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[270650.853345]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
[270650.862000]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
[270650.869976]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
[270650.878736]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
[270650.888101]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
[270650.896965]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
[270650.905069]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
[270650.914081]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[270650.923740]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
[270650.932528]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
[270650.941515]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[270650.951198]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
[270650.960478]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
[270650.969481]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[270650.979160]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[270650.989192]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
[270650.998797]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
[270651.008434]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
[270651.017239]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
[270651.026725]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
[270651.035938]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
[270651.045484]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
[270651.054443]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
[270651.063300]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
[270651.072803]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
[270651.082339]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
[270651.091766]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
[270651.100727]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
[270651.108378]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
[270651.115918]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b 

Here is what IOR saw:

IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Mon Aug  1 18:28:46 2016
Command line used: /home/johnsali/wolf-3/ior/src/ior -i 5 -v -b 4096M -t 16m
Machine: Linux wolf-6.wolf.hpdd.intel.com
Start time skew across all tasks: 0.00 sec

Test 0 started: Mon Aug  1 18:28:46 2016
Path: /mnt/lustre
FS: 13.9 TiB   Used FS: 0.0%   Inodes: 4.2 Mi   Used Inodes: 0.0%
Participating tasks: 3
Summary:
	api                = POSIX
	test filename      = testFile
	access             = single-shared-file
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 3 (1 per node)
	repetitions        = 5
	xfersize           = 16 MiB
	blocksize          = 4 GiB
	aggregate filesize = 12 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
Commencing write performance test: Mon Aug  1 18:28:46 2016

Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
 kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 

Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
 kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG

Attempts to change the max_pages_per_rpc have not been successful:

[root@wolf-8 ~]#  lctl set_param osc.*.max_pages_per_rpc=1024
error: set_param: setting /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88104f0c1000/max_pages_per_rpc=1024: Numerical result out of range 

System crash files can be found at (all clients hit the LBUG/crash/reboot at the same time):
/scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-01-18:29:12
/scratch/dumps/wolf-7.wolf.hpdd.intel.com/10.8.1.7-2016-08-01-18:29:10
/scratch/dumps/wolf-8.wolf.hpdd.intel.com/10.8.1.8-2016-08-01-18:29:11



 Comments   
Comment by John Salinas (Inactive) [ 01/Aug/16 ]

This is currently blocking Lustre Streaming testing for CORAL

Comment by Gerrit Updater [ 02/Aug/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/21637
Subject: LU-8460 osc: improve LASSERT for RPC size check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 75bb06909b96e21b6b10e28969ed770da384115b

Comment by Andreas Dilger [ 02/Aug/16 ]

What version of Lustre are you running?

Also, based on your last comments, do you have 16MB RPCs enabled at all? The stock Lustre code should handle up to a 4MB RPC size out of the box, so it is strange that this fails when essentially just doing 16MB-sized writes.

If you are able to repeat this, could you please provide the steps used for setup (e.g. the current value of obdfilter.*.brw_size on the OSS, osc.*.max_pages_per_rpc and other relevant settings on the client) and apply the patch http://review.whamcloud.com/21637, which improves the LASSERT() in osc_extent_find(), to help debug this problem.
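
For reference, both values can be dumped with lctl (the same parameter names appear in the transcripts later in this ticket; the defaults noted are the ones seen on this setup):

# On the OSS: advertised bulk RPC size, in MB (default here: 1)
lctl get_param obdfilter.*.brw_size
# On the client: RPC size in pages (default here: 256 pages x 4KB = 1MB)
lctl get_param osc.*.max_pages_per_rpc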

Comment by John Salinas (Inactive) [ 02/Aug/16 ]

This specific hit was on master; it appears we have the same issue on 2.9 as well. We are not using 2.8 because it does not have the large 16MB RPC patches we are looking to use for Lustre streaming.

It appears that setting lctl set_param osc.*.max_pages_per_rpc=1024 on the clients failed because we had not set lctl set_param obdfilter.lsdraid-OST0000.brw_size=16 on the OSS nodes; the default brw_size is "1", which is what it was when we hit this issue. Last night we set both of these and were able to do a couple of IOR runs, and we saw 16MB RPCs being used.
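
In other words, the order of operations matters. A sketch of the working sequence (the exact client value is not quoted above, but the rpc_stats below are dominated by 4096-page RPCs, so presumably something like the following was used; 4096 pages x 4KB = 16MB):

# On every OSS first: advertise 16MB bulk RPCs
lctl set_param obdfilter.lsdraid-OST0000.brw_size=16
# Then on every client: raise the RPC size to match
lctl set_param osc.*.max_pages_per_rpc=4096

The brw_stats and rpc_stats from those runs follow: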

OSS0
[root@wolf-3 ~]# cat /proc/fs/lustre/obdfilter/lsdraid-OST0000/brw_stats
snapshot_time: 1470125909.249936 (secs.usecs)

read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 5 0 0 | 597115 50 50
2: 2 0 0 | 2 0 50
4: 0 0 0 | 0 0 50
8: 0 0 0 | 0 0 50
16: 0 0 0 | 0 0 50
32: 0 0 0 | 0 0 50
64: 0 0 0 | 0 0 50
128: 0 0 0 | 0 0 50
256: 43 0 0 | 82 0 50
512: 2 0 0 | 153548 12 62
1K: 2 0 0 | 49 0 62
2K: 39 0 0 | 141 0 62
4K: 369260 99 100 | 443071 37 100

read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 369353 100 100 | 597117 50 50
1: 0 0 100 | 0 0 50
2: 0 0 100 | 0 0 50
3: 0 0 100 | 0 0 50
4: 0 0 100 | 0 0 50
5: 0 0 100 | 0 0 50
6: 0 0 100 | 0 0 50
7: 0 0 100 | 0 0 50
8: 0 0 100 | 0 0 50
9: 0 0 100 | 0 0 50
10: 0 0 100 | 0 0 50
11: 0 0 100 | 0 0 50
12: 0 0 100 | 0 0 50
13: 0 0 100 | 0 0 50
14: 0 0 100 | 0 0 50
15: 0 0 100 | 0 0 50
16: 0 0 100 | 0 0 50
17: 0 0 100 | 0 0 50
18: 0 0 100 | 0 0 50
19: 0 0 100 | 0 0 50
20: 0 0 100 | 0 0 50
21: 0 0 100 | 0 0 50
22: 0 0 100 | 0 0 50
23: 0 0 100 | 0 0 50
24: 0 0 100 | 0 0 50
25: 0 0 100 | 0 0 50
26: 0 0 100 | 0 0 50
27: 0 0 100 | 0 0 50
28: 0 0 100 | 0 0 50
29: 0 0 100 | 0 0 50
30: 0 0 100 | 0 0 50
31: 0 0 100 | 596891 49 100

read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 13706 3 3 | 644001 53 53
2: 32744 8 12 | 204951 17 71
3: 26076 7 19 | 65415 5 76
4: 41077 11 30 | 9928 0 77
5: 62304 16 47 | 6889 0 77
6: 71872 19 67 | 5011 0 78
7: 63965 17 84 | 3906 0 78
8: 41954 11 95 | 3087 0 78
9: 15655 4 100 | 2566 0 79
10: 0 0 100 | 2222 0 79
11: 0 0 100 | 2485 0 79
12: 0 0 100 | 3936 0 79
13: 0 0 100 | 4464 0 80
14: 0 0 100 | 4682 0 80
15: 0 0 100 | 3413 0 80
16: 0 0 100 | 2469 0 81
17: 0 0 100 | 2331 0 81
18: 0 0 100 | 2830 0 81
19: 0 0 100 | 7537 0 82
20: 0 0 100 | 57506 4 87
21: 0 0 100 | 123813 10 97
22: 0 0 100 | 21147 1 99
23: 0 0 100 | 4920 0 99
24: 0 0 100 | 4496 0 99
25: 0 0 100 | 3 0 100

read | write
I/O time (1/1000s) ios % cum % | ios % cum %
1: 4 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 9997 2 2 | 0 0 0
64: 136162 36 39 | 0 0 0
128: 172673 46 86 | 0 0 0
256: 49096 13 99 | 0 0 0
512: 1317 0 99 | 0 0 0
1K: 22 0 99 | 0 0 0
2K: 21 0 100 | 0 0 0

read | write
disk I/O size ios % cum % | ios % cum %
4: 0 0 0 | 25 0 0
8: 0 0 0 | 45 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 1 0 0
64: 0 0 0 | 1 0 0
128: 0 0 0 | 597016 50 50
256: 0 0 0 | 18 0 50
512: 0 0 0 | 9 0 50
1K: 1 0 0 | 0 0 50
2K: 0 0 0 | 0 0 50
4K: 1 0 0 | 0 0 50
8K: 2 0 0 | 2 0 50
16K: 0 0 0 | 0 0 50
32K: 0 0 0 | 0 0 50
64K: 0 0 0 | 0 0 50
128K: 0 0 0 | 0 0 50
256K: 0 0 0 | 0 0 50
512K: 0 0 0 | 0 0 50
1M: 43 0 0 | 82 0 50
2M: 2 0 0 | 153548 12 62
4M: 2 0 0 | 49 0 62
8M: 39 0 0 | 141 0 62
16M: 369260 99 100 | 443071 37 100

From one client:
[root@wolf-6 mdisolation_step2_draid_testing]# lctl get_param osc.*.rpc_stats
osc.lsdraid-OST0000-osc-ffff88104e13d800.rpc_stats=
snapshot_time: 1470098591.939087 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 1
pending write pages: 0
pending read pages: 0

read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 17 0 0 | 10 0 0
512: 2 0 0 | 2 0 0
1024: 2 0 0 | 5 0 0
2048: 13 0 0 | 53 0 0
4096: 116941 99 100 | 146350 99 100

read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 14281 12 12 | 66644 45 45
2: 31530 26 39 | 25379 17 62
3: 71163 60 99 | 3173 2 65
4: 1 0 100 | 4759 3 68
5: 0 0 100 | 2206 1 69
6: 0 0 100 | 1957 1 71
7: 0 0 100 | 3053 2 73
8: 0 0 100 | 39249 26 100

read write
offset rpcs % cum % | rpcs % cum %
0: 20 0 0 | 40 0 0
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 5 0 0 | 4 0 0
512: 8 0 0 | 0 0 0
1024: 4 0 0 | 0 0 0
2048: 6 0 0 | 3 0 0
4096: 19 0 0 | 41 0 0
8192: 40 0 0 | 82 0 0
16384: 80 0 0 | 164 0 0
32768: 159 0 0 | 328 0 0
65536: 320 0 0 | 656 0 0
131072: 637 0 1 | 1311 0 1
262144: 1274 1 2 | 2621 1 3
524288: 2553 2 4 | 5241 3 7
1048576: 5107 4 8 | 8413 5 12
2097152: 10211 8 17 | 16784 11 24
4194304: 18374 15 33 | 22058 15 39
8388608: 36778 31 64 | 43061 29 68
16777216: 41380 35 100 | 45613 31 100

However, now it appears mdtest is hanging; we will have to find some time to collect data on that.

Comment by John Salinas (Inactive) [ 03/Aug/16 ]

Today I have hit this several times while trying to reproduce a different hang. In these cases I did NOT have 16MB RPCs set. It is a completely vanilla run, and as soon as mdtest starts I hit this:

]# mpirun -np 4 -machinefile hosts /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
-- started at 08/03/2016 21:55:07 --

mdtest-1.8.3 was launched with 4 total task(s) on 4 nodes
Command line used: /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
Path: /mnt/lustre
FS: 13.9 TiB Used FS: 0.0% Inodes: 55.8 Mi Used Inodes: 89.5%
4 tasks, 24 files/directories

Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed:
Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) LBUG

Comment by Andreas Dilger [ 03/Aug/16 ]

John, could you please reproduce the problem with the above patch applied, so that we can see why the assertion is failing.

Comment by Peter Jones [ 03/Aug/16 ]

Bobijam

Could you please assist with this issue?

John

Were you using Andreas's diagnostic patch when you were running?

Peter

Comment by Jinshan Xiong (Inactive) [ 04/Aug/16 ]

The issue is that brw_size is 16M on the server side, but max_pages_per_rpc is set to 1024 (i.e. 4M) on the client side.

There is a defect in handling the chunk size on the client side, which should be set to the maximum of the size reported from the OST and max_pages_per_rpc.
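
To put numbers on that scenario (quick bash arithmetic; a 4KB page size, as on these nodes, is assumed):

echo $(( 1024 * 4096 ))   # max_pages_per_rpc=1024 -> 4194304 bytes, a 4MB RPC
echo $(( 16 << 20 ))      # brw_size=16 -> 16777216 bytes, a 16MB chunk
# A 4MB RPC is smaller than one 16MB chunk, so the chunk-alignment
# check on max_pages in osc_extent_find() can never hold.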

Comment by John Salinas (Inactive) [ 08/Aug/16 ]

With the patch I see:

kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG

This is with default settings (no 16MB RPCs set):

Client:
[root@wolf-6 ~]# cat /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88084eccc800/max_pages_per_rpc
256 

OSS:

[root@wolf-3 kernel]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
obdfilter.lsdraid-OST0000.brw_size=1
 [root@wolf-4 ~]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
obdfilter.lsdraid-OST0000.brw_size=1
[413569.867436] Lustre: Mounted lsdraid-client
[414679.351734] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
[414679.373266] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG
[414679.382571] Pid: 137570, comm: mdtest
[414679.388269] 
Call Trace:
[414679.395587]  [<ffffffffa071d7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[414679.404872]  [<ffffffffa071dd75>] lbug_with_loc+0x45/0xc0 [libcfs]
[414679.413187]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
[414679.421671]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
[414679.430159]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
[414679.438756]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
[414679.446884]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
[414679.454653]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
[414679.463245]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
[414679.471818]  [<ffffffffa0e17b60>] ? osc_page_init+0x0/0x210 [osc]
[414679.479775]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
[414679.488686]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
[414679.496969]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
[414679.505470]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
[414679.514355]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
[414679.523097]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
[414679.531506]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
[414679.540302]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
[414679.549061]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
[414679.557714]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
[414679.566324]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
[414679.574846]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
[414679.582796]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
[414679.590648]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
[414679.599179]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
[414679.607620]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
[414679.615561]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
[414679.622248]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
[414679.628843]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
[414679.636498] 
[414679.639249] Kernel panic - not syncing: LBUG
[414679.644949] CPU: 15 PID: 137570 Comm: mdtest Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
[414679.657442] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[414679.670035]  ffffffffa073adef 00000000a1cd2686 ffff88103ee6f6f8 ffffffff816360fc
[414679.679336]  ffff88103ee6f778 ffffffff8162f977 ffffffff00000008 ffff88103ee6f788
[414679.688629]  ffff88103ee6f728 00000000a1cd2686 ffffffffa0e373b0 0000000000000000
[414679.697931] Call Trace:
[414679.701624]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
[414679.708306]  [<ffffffff8162f977>] panic+0xd8/0x1e7
[414679.714580]  [<ffffffffa071dddb>] lbug_with_loc+0xab/0xc0 [libcfs]
[414679.722387]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
[414679.730387]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
[414679.738450]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
[414679.746598]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
[414679.754346]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
[414679.761704]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
[414679.769940]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
[414679.778188]  [<ffffffffa0e17b60>] ? osc_lru_alloc+0x390/0x390 [osc]
[414679.786036]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[414679.794960]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
[414679.803203]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
[414679.811621]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[414679.820732]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
[414679.829440]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
[414679.837867]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[414679.846976]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
[414679.856076]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
[414679.864791]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
[414679.873498]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
[414679.882106]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
[414679.890136]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
[414679.898055]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
[414679.906650]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
[414679.915158]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
[414679.923158]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
[414679.929946]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
[414679.936584]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b

/scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-08-17:12:27/
Comment by Jinshan Xiong (Inactive) [ 08/Aug/16 ]

A chunk_bits of 24 means a 16MB block size on the OST side, while max_pages_per_rpc was unfortunately still set to 256 (a 1MB RPC), so the RPC size cannot be chunk-aligned.
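
Plugging the values from the diagnostic output above into the assertion shows the failure directly (bash arithmetic; PAGE_SHIFT of 12 for 4KB pages is assumed):

chunk_pages=$(( 1 << (24 - 12) ))        # chunk_bits=24 -> 4096 pages per chunk
printf '%#x\n' $(( ~(chunk_pages - 1) )) # 0xfffffffffffff000, the chunk_mask printed above
echo $(( 0x100 & (chunk_pages - 1) ))    # max_pages & ~chunk_mask = 256, non-zero -> LBUG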

Comment by Gerrit Updater [ 09/Aug/16 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/21825
Subject: LU-8460 osc: chunk size should not bigger than RPC size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bde1f813f387f8e15492a864e8c3a78eb0eb3ff4

Comment by Jinshan Xiong (Inactive) [ 11/Aug/16 ]

jsalians_intel can you please verify that patch 21825 actually fixes the problem?

Comment by John Salinas (Inactive) [ 11/Aug/16 ]

Currently this bug is blocking our progress: https://jira.hpdd.intel.com/browse/LU-8498

Comment by Andreas Dilger [ 12/Aug/16 ]

It seems that it would be possible to test this with a ZFS OST using recordsize=1024k and then manually setting brw_size=512K on an unpatched OST, verifying that the client no longer crashes. Similarly, it would be useful to test that it isn't possible to set brw_size smaller than recordsize on a patched OST, including when recordsize is changed after mount. A hedged sketch of that sequence follows.
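
(Dataset and target names below are placeholders, and whether set_param accepts a 512K suffix for brw_size is an assumption; elsewhere in this ticket brw_size is set in whole MB.)

# On the OSS, set the ZFS record size for the OST dataset:
zfs set recordsize=1024k ostpool/ost0
# Then force a bulk RPC size smaller than the record size:
lctl set_param obdfilter.*.brw_size=512K
# Per the suggestion above, a patched client should survive this and a
# patched OST should refuse the undersized brw_size.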

Comment by Gerrit Updater [ 15/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21825/
Subject: LU-8460 osc: max_pages_per_rpc should be chunk size aligned
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13834f5aeef42d3c358574ac59475c0758dce300
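
The retitled subject captures the client-side fix: max_pages_per_rpc must stay aligned to a whole number of chunks. Illustrative round-up arithmetic only (not the actual patch code):

rpc_pages=256; chunk_pages=4096
echo $(( (rpc_pages + chunk_pages - 1) & ~(chunk_pages - 1) ))   # 256 rounds up to 4096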

Comment by Peter Jones [ 17/Aug/16 ]

Landed for 2.9
