Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
None
-
3
-
9223372036854775807
Description
As part of CORAL testing I was trying to use ior to to write/read some 16MB RPCs. As soon as the test started it appeared all the Lustre clients took an LBUG, crashed, and rebooted.
[263032.826900] LustreError: 54616:0:(mgc_request.c:1536:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib [263032.839625] Lustre: 54616:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2 [263032.940768] Lustre: Mounted lsdraid-client [270650.452658] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: [270650.465098] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG [270650.472830] Pid: 64198, comm: ior [270650.476666] Call Trace: [270650.482816] [<ffffffffa08267d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] [270650.492278] [<ffffffffa0826d75>] lbug_with_loc+0x45/0xc0 [libcfs] [270650.502128] [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc] [270650.510720] [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0 [270650.518827] [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc] [270650.527638] [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70 [270650.537084] [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0 [270650.545977] [<ffffffff81285e38>] ? security_capable+0x18/0x20 [270650.554180] [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc] [270650.563226] [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre] [270650.572635] [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc] [270650.581481] [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc] [270650.590561] [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre] [270650.600035] [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass] [270650.609317] [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov] [270650.618397] [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre] [270650.627864] [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre] [270650.637324] [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass] [270650.646690] [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre] [270650.655374] [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre] [270650.663281] [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290 [270650.671879] [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0 [270650.680178] [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre] [270650.688744] [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass] [270650.696904] [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass] [270650.704945] [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre] [270650.713656] [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50 [270650.722373] [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre] [270650.731000] [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre] [270650.739126] [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0 [270650.745994] [<ffffffff811df6bf>] SyS_write+0x7f/0xe0 [270650.752766] [<ffffffff81646889>] system_call_fastpath+0x16/0x1b [270650.760571] [270650.763508] Kernel panic - not syncing: LBUG [270650.769875] CPU: 17 PID: 64198 Comm: ior Tainted: G IOE ------------ 3.10.0-327.22.2.el7.x86_64 #1 [270650.782689] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [270650.796038] ffffffffa0843def 0000000067048a74 ffff88103fd9f560 ffffffff816360fc [270650.806102] ffff88103fd9f5e0 ffffffff8162f977 ffffffff00000008 ffff88103fd9f5f0 [270650.816170] ffff88103fd9f590 0000000067048a74 ffffffffa0dc63b0 0000000000000000 [270650.826189] Call Trace: [270650.830566] [<ffffffff816360fc>] dump_stack+0x19/0x1b [270650.837897] [<ffffffff8162f977>] panic+0xd8/0x1e7 [270650.844870] [<ffffffffa0826ddb>] lbug_with_loc+0xab/0xc0 [libcfs] [270650.853345] [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc] [270650.862000] [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0 [270650.869976] [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc] [270650.878736] [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70 [270650.888101] [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0 [270650.896965] [<ffffffff81285e38>] ? security_capable+0x18/0x20 [270650.905069] [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc] [270650.914081] [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre] [270650.923740] [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc] [270650.932528] [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc] [270650.941515] [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre] [270650.951198] [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass] [270650.960478] [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov] [270650.969481] [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre] [270650.979160] [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre] [270650.989192] [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass] [270650.998797] [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre] [270651.008434] [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre] [270651.017239] [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290 [270651.026725] [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0 [270651.035938] [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre] [270651.045484] [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass] [270651.054443] [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass] [270651.063300] [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre] [270651.072803] [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50 [270651.082339] [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre] [270651.091766] [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre] [270651.100727] [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0 [270651.108378] [<ffffffff811df6bf>] SyS_write+0x7f/0xe0 [270651.115918] [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
Here is what IOR saw:
IOR-3.0.1: MPI Coordinated Test of Parallel I/O Began: Mon Aug 1 18:28:46 2016 Command line used: /home/johnsali/wolf-3/ior/src/ior -i 5 -v -b 4096M -t 16m Machine: Linux wolf-6.wolf.hpdd.intel.com Start time skew across all tasks: 0.00 sec Test 0 started: Mon Aug 1 18:28:46 2016 Path: /mnt/lustre FS: 13.9 TiB Used FS: 0.0% Inodes: 4.2 Mi Used Inodes: 0.0% Participating tasks: 3 Summary: api = POSIX test filename = testFile access = single-shared-file pattern = segmented (1 segment) ordering in a file = sequential offsets ordering inter file= no tasks offsets clients = 3 (1 per node) repetitions = 5 xfersize = 16 MiB blocksize = 4 GiB aggregate filesize = 12 GiB access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter ------ --------- ---------- --------- -------- -------- -------- -------- ---- Commencing write performance test: Mon Aug 1 18:28:46 2016 Message from syslogd@wolf-6 at Aug 1 18:28:48 ... kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: Message from syslogd@wolf-6 at Aug 1 18:28:48 ... kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
Attempts to change the max_pages_per_rpc have not been successful:
[root@wolf-8 ~]# lctl set_param osc.*.max_pages_per_rpc=1024 error: set_param: setting /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88104f0c1000/max_pages_per_rpc=1024: Numerical result out of range
System crash files can be found at (all clients hit the LBUG/crash/reboot at the same time):
/scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-01-18:29:12
/scratch/dumps/wolf-7.wolf.hpdd.intel.com/10.8.1.7-2016-08-01-18:29:10
/scratch/dumps/wolf-8.wolf.hpdd.intel.com/10.8.1.8-2016-08-01-18:29:11