Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version: Lustre 2.7.0
Environment: Bull Lustre distribution based on Lustre 2.7.2
Severity: 3
Description
In the last month, one of our customers has hit a crash more than 100 times with the following signature:
[506626.555125] SLUB: Unable to allocate memory on node -1 (gfp=0x80c0)
[506626.562216] cache: kvm_mmu_page_header(22:step_batch), object size: 168, buffer size: 168, default order: 1, min order: 0
[506626.574729] node 0: slabs: 0, objs: 0, free: 0
[506626.579974] node 1: slabs: 0, objs: 0, free: 0
[506626.585219] node 2: slabs: 60, objs: 2880, free: 0
[506626.590852] node 3: slabs: 0, objs: 0, free: 0
[506626.596112] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) ) failed: cp_state:0, cmd:1
[506626.612512] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) LBUG
[506626.620186] Pid: 41604, comm: cat
[506626.623978] Call Trace:
[506626.628573] [<ffffffffa05eb853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[506626.636448] [<ffffffffa05ebdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
[506626.643456] [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
[506626.651526] [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
[506626.659108] [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
[506626.666037] [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.673531] [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
[506626.680450] [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.687961] [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
[506626.694964] [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
[506626.702078] [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
[506626.709674] [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
[506626.716785] [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
[506626.723797] [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
[506626.731477] [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
[506626.738962] [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
[506626.745962] [<ffffffff811dea1c>] vfs_read+0x9c/0x170
[506626.751700] [<ffffffff811df56f>] SyS_read+0x7f/0xe0
[506626.757345] [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
[506626.764138]
[506626.765990] Kernel panic - not syncing: LBUG
[506626.770850] CPU: 53 PID: 41604 Comm: cat Tainted: G OE ------------ 3.10.0-327.22.2.el7.x86_64 #1
[506626.782104] Hardware name: BULL bullx blade/CHPU, BIOS BIOSX07.037.01.003 10/23/2015
[506626.790838] ffffffffa0610ced 000000000f6a3070 ffff8817799eb8c0 ffffffff816360f4
[506626.799228] ffff8817799eb940 ffffffff8162f96a ffffffff00000008 ffff8817799eb950
[506626.807618] ffff8817799eb8f0 000000000f6a3070 ffffffffa0e01466 0000000000000246
[506626.816005] Call Trace:
[506626.818839] [<ffffffff816360f4>] dump_stack+0x19/0x1b
[506626.824668] [<ffffffff8162f96a>] panic+0xd8/0x1e7
[506626.830128] [<ffffffffa05ebe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
[506626.837129] [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
[506626.845192] [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
[506626.852766] [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
[506626.859702] [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.867184] [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
[506626.874099] [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
[506626.881611] [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
[506626.888609] [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
[506626.895721] [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
[506626.903322] [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
[506626.910418] [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
[506626.917420] [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
[506626.925091] [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
[506626.932575] [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
[506626.939569] [<ffffffff811dea1c>] vfs_read+0x9c/0x170
[506626.945300] [<ffffffff811df56f>] SyS_read+0x7f/0xe0
[506626.950938] [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
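For context, equi(a, b) is the libcfs logical-equivalence helper (roughly !!(a) == !!(b)), so the failed assertion reads "the page is in CPS_PAGEIN if and only if the command is OBD_BRW_READ". The console output shows a read command (cmd:1, consistent with the cat/ll_file_read stack) but cp_state:0, so the equivalence does not hold and the client LBUGs. The following stand-alone sketch only restates that check; the enum values used are placeholders, not the real Lustre definitions:

#include <stdio.h>

#define equi(a, b) (!!(a) == !!(b))   /* logical equivalence, as in libcfs */

int main(void)
{
    /* Placeholder values for illustration only; the real enums live in
     * the Lustre headers. */
    const int CPS_PAGEIN   = 4;
    const int OBD_BRW_READ = 0x01;

    int cp_state = 0;   /* "cp_state:0" from the console log */
    int cmd      = 1;   /* "cmd:1"      from the console log */

    /* The command is a read but the page is not in the PAGEIN state, so
     * the equivalence fails; the real code reacts with LBUG(), which
     * panics the client node. */
    if (!equi(cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ))
        printf("ASSERTION( equi(...) ) failed: cp_state:%d, cmd:%d\n",
               cp_state, cmd);
    return 0;
}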
Since the customer is a black site, we can't provide the crash dump, but we will happily provide any text output you would find useful.
Bruno, this was exactly the purpose of this test. It seems it has uncovered other memory-management issues in the client code. I/O is not really expected to succeed under such constraints; it should only return EIO or ENOMEM, not crash.
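To illustrate what "return EIO or ENOMEM instead of crashing" means in practice, here is a small hedged sketch (hypothetical helper and structure names, not Lustre API and not the eventual fix): an allocation failure in the read-submission path becomes a negative errno propagated back to the caller, rather than leaving the page in a state that later trips an assertion.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the Lustre data structures. */
struct demo_page {
    void *buffer;       /* backing memory for the page */
    int   in_flight;    /* stand-in for a cl_page state transition */
};

static int submit_read_page(struct demo_page *pg)
{
    pg->buffer = malloc(4096);
    if (pg->buffer == NULL)
        return -ENOMEM;     /* fail the I/O instead of asserting */

    pg->in_flight = 1;      /* change state only once setup succeeded */
    /* ... hand the page to the transfer engine here ... */
    return 0;
}

int main(void)
{
    struct demo_page pg = { 0 };
    int rc = submit_read_page(&pg);

    if (rc < 0) {
        /* The application sees a clean error (ENOMEM here, EIO in other
         * failure modes) rather than a client panic. */
        fprintf(stderr, "read submit failed: %d\n", rc);
        return 1;
    }
    free(pg.buffer);
    return 0;
}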