Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16401

various crashes with cl_page_discard vs readhead race

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      While finding a root cause of client crash, Elena creates an simple reproducer.

      +test_16bb() {
      +       local fsxN=${FSX_NP:-5}
      +       local fsxNfiles=${FSX_N_FILES:-200000}
      +       local dropsleep=${DROP_SLEEP:-3}
      +       local dpipd
      +       local -a fsxpids
      +
      +       while true; do
      +               echo 3 > /proc/sys/vm/drop_caches
      +               sleep $dropsleep
      +        done &
      +       dpipd=$!
      +
      +       for i in $(seq 1 $fsxN); do
      +               $FSX -N $fsxNfiles $DIR/fsxfile.${i} -l $((64 * 1024 * 1024)) &
      +               fsxpids[$i]=$!
      +       done
      +       for i in $(seq 1 $fsxN); do
      +               wait ${fsxpids[$i]}
      +               echo fsx $i done
      +               date
      +       done
      +
      +       kill -9 $dpipd
      +}
      +run_test 16bb "Parallel FSX"
      

      It crashes a client in the different traces like

      [  884.267817] LustreError: 27636:0:(osc_cache.c:2508:osc_teardown_async_page()) extent 000000002888b266@{[11604 -> 11613/12287], [2|0|-|cache|wi|000000004ef92cbf], [65536|10|+|-|000000006eedea66|1024|000000005037061e]} trunc at 11604.
      [  884.274776] LustreError: 27636:0:(osc_cache.c:2508:osc_teardown_async_page()) ### extent: 000000002888b266 ns: lustre-OST0003-osc-ffff947f5083f800 lock: 000000006eedea66/0x27c72ffb4af2f70f lrc: 4/0,1 mode: PW/PW res: [0x23:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x800020080000000 nid: local remote: 0x5637c3b005c04faa expref: -99 pid: 27636 timeout: 0 lvb_type: 1
      [  884.278875] sanityn.sh (27630): drop_caches: 3
      [  884.284307] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) page@00000000c74a8010[3 00000000664f3ef0 5 1 000000005037061e]
      
      [  884.291905] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) vvp-page@0000000014641109(1:0) vm@000000005b1e88d7 17ffffc0001035 2:0 ffff947f4546a000 11604 lru
      
      [  884.303045] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) lov-page@0000000067f0c66a
      
      [  884.306045] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) osc-page@00000000a5e5fcf1 11604: 1< 0x845fed 2 + - > 2< 47529984 0 4096 0x0 0x40420 | 000000005037061e 0000000010910875 000000004ef92cbf > 3< 0 0 0 > 4< 0 0 8 482254848 - | - - + - > 5< - - + - | 0 - | 10 - ->
      
      [  884.313281] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) end page@00000000c74a8010
      
      [  884.316247] LustreError: 27636:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16
      [  884.318706] LustreError: 27636:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: 
      [  884.321020] LustreError: 27636:0:(osc_page.c:183:osc_page_delete()) LBUG
      [  884.322937] Pid: 27636, comm: fsx 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 11:31:47 EST 2022
      [  884.325390] Call Trace TBD:
      [  884.326710] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
      [  884.328296] [<0>] lbug_with_loc+0x43/0x80 [libcfs]
      [  884.329882] [<0>] osc_page_delete+0x4c4/0x4d0 [osc]
      [  884.331629] [<0>] cl_page_delete0+0x82/0x210 [obdclass]
      [  884.333249] [<0>] cl_page_delete+0x2a/0xf0 [obdclass]
      [  884.334984] [<0>] ll_invalidatepage+0x9a/0x180 [lustre]
      [  884.336640] [<0>] truncate_cleanup_page+0x6f/0xc0
      [  884.338178] [<0>] generic_error_remove_page+0x2e/0x50
      [  884.339811] [<0>] cl_page_discard+0x47/0x80 [obdclass]
      [  884.341459] [<0>] cl_page_list_discard+0x4d/0x130 [obdclass]
      [  884.343217] [<0>] ll_io_read_page+0x1cf/0x910 [lustre]
      [  884.344887] [<0>] ll_readpage+0xfa/0x850 [lustre]
      [  884.346433] [<0>] filemap_fault+0x5ed/0xa20
      [  884.347853] [<0>] ll_filemap_fault+0x21/0x50 [lustre]
      [  884.349505] [<0>] vvp_io_fault_start+0x34f/0xc60 [lustre]
      [  884.351254] [<0>] cl_io_start+0x5d/0x120 [obdclass]
      [  884.352848] [<0>] cl_io_loop+0x9a/0x1f0 [obdclass]
      [  884.354435] [<0>] ll_fault+0x689/0x850 [lustre]
      [  884.355950] [<0>] __do_fault+0x38/0xb0
      [  884.357295] [<0>] do_fault+0x123/0x3c0
      [  884.358620] [<0>] __handle_mm_fault+0x4d5/0x820
      [  884.360139] [<0>] handle_mm_fault+0xbe/0x1e0
      [  884.361617] [<0>] __do_page_fault+0x1ed/0x4c0
      [  884.363089] [<0>] do_page_fault+0x37/0x130
      [  884.364513] [<0>] page_fault+0x1e/0x30
      [  884.365812] Kernel panic - not syncing: LBUG
      

      or

      2022-11-28 10:05:13 [128265.023680] LustreError: 61801:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16
      2022-11-28 10:05:13 [128265.035263] LustreError: 61801:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: 
      2022-11-28 10:05:13 [128265.046208] LustreError: 61801:0:(osc_page.c:183:osc_page_delete()) LBUG
      2022-11-28 10:05:13 [128265.055234] Pid: 61801, comm: IOR 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018
      2022-11-28 10:05:13 [128265.066025] Call Trace:
      2022-11-28 10:05:13 [128265.070300] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      2022-11-28 10:05:13 [128265.077602] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      2022-11-28 10:05:13 [128265.084475] [<0>] osc_page_delete+0x49b/0x500 [osc]
      2022-11-28 10:05:13 [128265.091428] [<0>] cl_page_delete0+0x85/0x250 [obdclass]
      2022-11-28 10:05:13 [128265.098755] [<0>] cl_page_delete+0x33/0x110 [obdclass]
      2022-11-28 10:05:13 [128265.105967] [<0>] ll_invalidatepage+0x7f/0x170 [lustre]
      2022-11-28 10:05:13 [128265.113147] [<0>] do_invalidatepage_range+0x7d/0x90
      2022-11-28 10:05:13 [128265.119971] [<0>] truncate_inode_page+0x77/0x80
      2022-11-28 10:05:13 [128265.126346] [<0>] truncate_inode_pages_range+0x1ea/0x700
      2022-11-28 10:05:13 [128265.133620] [<0>] truncate_inode_pages_final+0x4f/0x60
      2022-11-28 10:05:13 [128265.140707] [<0>] ll_truncate_inode_pages_final+0x21/0xe0 [lustre]
      2022-11-28 10:05:13 [128265.148908] [<0>] vvp_prune+0x73/0x220 [lustre]
      2022-11-28 10:05:13 [128265.155186] [<0>] cl_object_prune+0x60/0x140 [obdclass]
      2022-11-28 10:05:13 [128265.162243] [<0>] ll_layout_refresh+0x58c/0xaf0 [lustre]
      2022-11-28 10:05:13 [128265.169355] [<0>] vvp_io_init+0x357/0x490 [lustre]
      2022-11-28 10:05:13 [128265.175881] [<0>] cl_io_init0.isra.7+0x93/0x170 [obdclass]
      2022-11-28 10:05:13 [128265.183283] [<0>] cl_io_init+0x43/0x80 [obdclass]
      2022-11-28 10:05:13 [128265.189755] [<0>] cl_io_rw_init+0x77/0x210 [obdclass]
      2022-11-28 10:05:13 [128265.196639] [<0>] ll_file_io_generic+0x2a8/0x1040 [lustre]
      2022-11-28 10:05:13 [128265.203912] [<0>] ll_file_aio_write+0x5a0/0xa10 [lustre]
      2022-11-28 10:05:13 [128265.211021] [<0>] ll_file_write+0x111/0x1e0 [lustre]
      2022-11-28 10:05:13 [128265.217625] [<0>] vfs_write+0xc0/0x1f0
      2022-11-28 10:05:13 [128265.222870] [<0>] SyS_write+0x7f/0xf0
      2022-11-28 10:05:13 [128265.228010] [<0>] system_call_fastpath+0x22/0x27
      2022-11-28 10:05:13 [128265.234187] [<0>] 0xfffffffffffffffe
      

      All crashes related to incorrect osc extent state with removing page.

              if (!list_empty(&oap->oap_rpc_item)) {
                      CDEBUG(D_CACHE, "oap %p is not in cache.\n", oap);
                      rc = -EBUSY;
              } else if (!list_empty(&oap->oap_pending_item)) {
                      struct osc_extent *ext = NULL;
      
                      osc_object_lock(obj);
                      ext = osc_extent_lookup(obj, osc_index(oap2osc(oap)));
                      osc_object_unlock(obj);
                      /* only truncated pages are allowed to be taken out.
                       * See osc_extent_truncate() and osc_cache_truncate_start()
                       * for details. */
                      if (ext != NULL && ext->oe_state != OES_TRUNC) {
                              OSC_EXTENT_DUMP(D_ERROR, ext, "trunc at %lu.\n",
                                              osc_index(oap2osc(oap)));
                              rc = -EBUSY;
                      }
                      if (ext != NULL)
                              osc_extent_put(env, ext);
              }
      

      As I seen these extents created by readahead and wait to be read. It's related to the "uncovered page" issue (LU-16362). Extent created but don't sent into network - memory reclaim start to flush a pages and it caused a 100% panic as invalid extent state.

      Same issue should be hit in FLR error case when read io completion tries to remove a pages which failed to read, but extent into RPC phase and it likely cause a panic.

      Attachments

        Issue Links

          Activity

            People

              bobijam Zhenyu Xu
              shadow Alexey Lyashkov
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated: