Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
None
-
None
-
3
-
9223372036854775807
Description
While finding a root cause of client crash, Elena creates an simple reproducer.
+test_16bb() { + local fsxN=${FSX_NP:-5} + local fsxNfiles=${FSX_N_FILES:-200000} + local dropsleep=${DROP_SLEEP:-3} + local dpipd + local -a fsxpids + + while true; do + echo 3 > /proc/sys/vm/drop_caches + sleep $dropsleep + done & + dpipd=$! + + for i in $(seq 1 $fsxN); do + $FSX -N $fsxNfiles $DIR/fsxfile.${i} -l $((64 * 1024 * 1024)) & + fsxpids[$i]=$! + done + for i in $(seq 1 $fsxN); do + wait ${fsxpids[$i]} + echo fsx $i done + date + done + + kill -9 $dpipd +} +run_test 16bb "Parallel FSX"
It crashes a client in the different traces like
[ 884.267817] LustreError: 27636:0:(osc_cache.c:2508:osc_teardown_async_page()) extent 000000002888b266@{[11604 -> 11613/12287], [2|0|-|cache|wi|000000004ef92cbf], [65536|10|+|-|000000006eedea66|1024|000000005037061e]} trunc at 11604. [ 884.274776] LustreError: 27636:0:(osc_cache.c:2508:osc_teardown_async_page()) ### extent: 000000002888b266 ns: lustre-OST0003-osc-ffff947f5083f800 lock: 000000006eedea66/0x27c72ffb4af2f70f lrc: 4/0,1 mode: PW/PW res: [0x23:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x800020080000000 nid: local remote: 0x5637c3b005c04faa expref: -99 pid: 27636 timeout: 0 lvb_type: 1 [ 884.278875] sanityn.sh (27630): drop_caches: 3 [ 884.284307] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) page@00000000c74a8010[3 00000000664f3ef0 5 1 000000005037061e] [ 884.291905] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) vvp-page@0000000014641109(1:0) vm@000000005b1e88d7 17ffffc0001035 2:0 ffff947f4546a000 11604 lru [ 884.303045] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) lov-page@0000000067f0c66a [ 884.306045] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) osc-page@00000000a5e5fcf1 11604: 1< 0x845fed 2 + - > 2< 47529984 0 4096 0x0 0x40420 | 000000005037061e 0000000010910875 000000004ef92cbf > 3< 0 0 0 > 4< 0 0 8 482254848 - | - - + - > 5< - - + - | 0 - | 10 - -> [ 884.313281] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) end page@00000000c74a8010 [ 884.316247] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16 [ 884.318706] LustreError: 27636:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: [ 884.321020] LustreError: 27636:0:(osc_page.c:183:osc_page_delete()) LBUG [ 884.322937] Pid: 27636, comm: fsx 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 11:31:47 EST 2022 [ 884.325390] Call Trace TBD: [ 884.326710] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs] [ 884.328296] [<0>] lbug_with_loc+0x43/0x80 [libcfs] [ 884.329882] [<0>] osc_page_delete+0x4c4/0x4d0 [osc] [ 884.331629] [<0>] cl_page_delete0+0x82/0x210 [obdclass] [ 884.333249] [<0>] cl_page_delete+0x2a/0xf0 [obdclass] [ 884.334984] [<0>] ll_invalidatepage+0x9a/0x180 [lustre] [ 884.336640] [<0>] truncate_cleanup_page+0x6f/0xc0 [ 884.338178] [<0>] generic_error_remove_page+0x2e/0x50 [ 884.339811] [<0>] cl_page_discard+0x47/0x80 [obdclass] [ 884.341459] [<0>] cl_page_list_discard+0x4d/0x130 [obdclass] [ 884.343217] [<0>] ll_io_read_page+0x1cf/0x910 [lustre] [ 884.344887] [<0>] ll_readpage+0xfa/0x850 [lustre] [ 884.346433] [<0>] filemap_fault+0x5ed/0xa20 [ 884.347853] [<0>] ll_filemap_fault+0x21/0x50 [lustre] [ 884.349505] [<0>] vvp_io_fault_start+0x34f/0xc60 [lustre] [ 884.351254] [<0>] cl_io_start+0x5d/0x120 [obdclass] [ 884.352848] [<0>] cl_io_loop+0x9a/0x1f0 [obdclass] [ 884.354435] [<0>] ll_fault+0x689/0x850 [lustre] [ 884.355950] [<0>] __do_fault+0x38/0xb0 [ 884.357295] [<0>] do_fault+0x123/0x3c0 [ 884.358620] [<0>] __handle_mm_fault+0x4d5/0x820 [ 884.360139] [<0>] handle_mm_fault+0xbe/0x1e0 [ 884.361617] [<0>] __do_page_fault+0x1ed/0x4c0 [ 884.363089] [<0>] do_page_fault+0x37/0x130 [ 884.364513] [<0>] page_fault+0x1e/0x30 [ 884.365812] Kernel panic - not syncing: LBUG
or
2022-11-28 10:05:13 [128265.023680] LustreError: 61801:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16 2022-11-28 10:05:13 [128265.035263] LustreError: 61801:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: 2022-11-28 10:05:13 [128265.046208] LustreError: 61801:0:(osc_page.c:183:osc_page_delete()) LBUG 2022-11-28 10:05:13 [128265.055234] Pid: 61801, comm: IOR 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 2022-11-28 10:05:13 [128265.066025] Call Trace: 2022-11-28 10:05:13 [128265.070300] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs] 2022-11-28 10:05:13 [128265.077602] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs] 2022-11-28 10:05:13 [128265.084475] [<0>] osc_page_delete+0x49b/0x500 [osc] 2022-11-28 10:05:13 [128265.091428] [<0>] cl_page_delete0+0x85/0x250 [obdclass] 2022-11-28 10:05:13 [128265.098755] [<0>] cl_page_delete+0x33/0x110 [obdclass] 2022-11-28 10:05:13 [128265.105967] [<0>] ll_invalidatepage+0x7f/0x170 [lustre] 2022-11-28 10:05:13 [128265.113147] [<0>] do_invalidatepage_range+0x7d/0x90 2022-11-28 10:05:13 [128265.119971] [<0>] truncate_inode_page+0x77/0x80 2022-11-28 10:05:13 [128265.126346] [<0>] truncate_inode_pages_range+0x1ea/0x700 2022-11-28 10:05:13 [128265.133620] [<0>] truncate_inode_pages_final+0x4f/0x60 2022-11-28 10:05:13 [128265.140707] [<0>] ll_truncate_inode_pages_final+0x21/0xe0 [lustre] 2022-11-28 10:05:13 [128265.148908] [<0>] vvp_prune+0x73/0x220 [lustre] 2022-11-28 10:05:13 [128265.155186] [<0>] cl_object_prune+0x60/0x140 [obdclass] 2022-11-28 10:05:13 [128265.162243] [<0>] ll_layout_refresh+0x58c/0xaf0 [lustre] 2022-11-28 10:05:13 [128265.169355] [<0>] vvp_io_init+0x357/0x490 [lustre] 2022-11-28 10:05:13 [128265.175881] [<0>] cl_io_init0.isra.7+0x93/0x170 [obdclass] 2022-11-28 10:05:13 [128265.183283] [<0>] cl_io_init+0x43/0x80 [obdclass] 2022-11-28 10:05:13 [128265.189755] [<0>] cl_io_rw_init+0x77/0x210 [obdclass] 2022-11-28 10:05:13 [128265.196639] [<0>] ll_file_io_generic+0x2a8/0x1040 [lustre] 2022-11-28 10:05:13 [128265.203912] [<0>] ll_file_aio_write+0x5a0/0xa10 [lustre] 2022-11-28 10:05:13 [128265.211021] [<0>] ll_file_write+0x111/0x1e0 [lustre] 2022-11-28 10:05:13 [128265.217625] [<0>] vfs_write+0xc0/0x1f0 2022-11-28 10:05:13 [128265.222870] [<0>] SyS_write+0x7f/0xf0 2022-11-28 10:05:13 [128265.228010] [<0>] system_call_fastpath+0x22/0x27 2022-11-28 10:05:13 [128265.234187] [<0>] 0xfffffffffffffffe
All crashes related to incorrect osc extent state with removing page.
if (!list_empty(&oap->oap_rpc_item)) { CDEBUG(D_CACHE, "oap %p is not in cache.\n", oap); rc = -EBUSY; } else if (!list_empty(&oap->oap_pending_item)) { struct osc_extent *ext = NULL; osc_object_lock(obj); ext = osc_extent_lookup(obj, osc_index(oap2osc(oap))); osc_object_unlock(obj); /* only truncated pages are allowed to be taken out. * See osc_extent_truncate() and osc_cache_truncate_start() * for details. */ if (ext != NULL && ext->oe_state != OES_TRUNC) { OSC_EXTENT_DUMP(D_ERROR, ext, "trunc at %lu.\n", osc_index(oap2osc(oap))); rc = -EBUSY; } if (ext != NULL) osc_extent_put(env, ext); }
As I seen these extents created by readahead and wait to be read. It's related to the "uncovered page" issue (LU-16362). Extent created but don't sent into network - memory reclaim start to flush a pages and it caused a 100% panic as invalid extent state.
Same issue should be hit in FLR error case when read io completion tries to remove a pages which failed to read, but extent into RPC phase and it likely cause a panic.