Lustre LU-4856

osc_lru_reserve()) ASSERTION( atomic_read(cli->cl_lru_left) >= 0 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
    • 3
    • 13394

    Description

      The atomic_t used to count LRU entries is overflowing on systems with large memory configurations:

      LustreError: 22141:0:(osc_page.c:892:osc_lru_reserve()) ASSERTION(atomic_read(cli->cl_lru_left) >= 0 ) failed:

      PID: 54214 TASK: ffff88fdef4e4100 CPU: 40 COMMAND: "cat"
      #3 [ffff88fdf0823900] lbug_with_loc at ffffffffa07fedc3 [libcfs]
      #4 [ffff88fdf0823920] osc_lru_reserve at ffffffffa0c2a28a [osc]
      #5 [ffff88fdf08239a0] cl_page_alloc at ffffffffa09a7122 [obdclass]
      #6 [ffff88fdf08239e0] cl_page_find0 at ffffffffa09a742d [obdclass]
      #7 [ffff88fdf0823a40] lov_page_init_raid0 at ffffffffa0cc0f21 [lov]
      #8 [ffff88fdf0823aa0] cl_page_alloc at ffffffffa09a7122 [obdclass]
      #9 [ffff88fdf0823ae0] cl_page_find0 at ffffffffa09a742d [obdclass]
      #10 [ffff88fdf0823b40] ll_cl_init at ffffffffa0d74123 [lustre]
      #11 [ffff88fdf0823bd0] ll_readpage at ffffffffa0d74485 [lustre]
      #12 [ffff88fdf0823c00] do_generic_file_read at ffffffff810fa39e
      #13 [ffff88fdf0823c80] generic_file_aio_read at ffffffff810fad4c
      #14 [ffff88fdf0823d40] vvp_io_read_start at ffffffffa0da2fb0 [lustre]
      #15 [ffff88fdf0823da0] cl_io_start at ffffffffa09af979 [obdclass]
      #16 [ffff88fdf0823dd0] cl_io_loop at ffffffffa09b3d33 [obdclass]
      #17 [ffff88fdf0823e00] ll_file_io_generic at ffffffffa0d49c32 [lustre]
      #18 [ffff88fdf0823e70] ll_file_aio_read at ffffffffa0d4a3b3 [lustre]
      #19 [ffff88fdf0823ec0] ll_file_read at ffffffffa0d4aec3 [lustre]
      #20 [ffff88fdf0823f10] vfs_read at ffffffff8115b237
      #21 [ffff88fdf0823f40] sys_read at ffffffff8115b3a3

      In this case, the atomic_t (signed int) held:
      crash> pd (int)0xffff943de11780fc
      $10 = -1506317746

      We've triggered this specific problem with configurations down to 11TB of physmem. A 10.5TB system can cat a small file without crashing.

      I noticed several other cases where page counts are handled using a signed int, and suspect anything more than 4TB is problematic. The kernel itself is consistently using unsigned long for page counts on all architectures.
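
The arithmetic behind the overflow can be sketched as follows. This is a minimal illustration, not Lustre code: it assumes 4 KiB pages (the report does not state the page size), and the helper names `pages_for_bytes`, `fits_in_int32`, and `wrap_to_int32` are hypothetical. It does not model how Lustre derives cl_lru_left from physical memory, so the thresholds it shows need not match the ones observed above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size: 4 KiB. This is an assumption, not taken from the report. */
#define PAGE_SHIFT_ASSUMED 12

/* Number of whole pages covering `bytes` of memory, kept as a 64-bit count. */
static uint64_t pages_for_bytes(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT_ASSUMED;
}

/* Returns 1 if `count` can be stored in a 32-bit signed int, else 0. */
static int fits_in_int32(uint64_t count)
{
    return count <= (uint64_t)INT32_MAX;
}

/*
 * Value a 32-bit two's-complement signed counter would show if only the
 * low 32 bits of `count` were kept. Computed in 64-bit arithmetic so the
 * sketch itself never relies on signed overflow.
 */
static int64_t wrap_to_int32(uint64_t count)
{
    uint32_t low = (uint32_t)count;

    if (low > (uint32_t)INT32_MAX)
        return (int64_t)low - ((int64_t)1 << 32);
    return (int64_t)low;
}

/* Hypothetical demo of the helpers on sizes mentioned in the report. */
static void demo_page_count_overflow(void)
{
    uint64_t pages_11t = pages_for_bytes(11ULL << 40);   /* 11 TiB */

    assert(pages_11t == 2952790016ULL);
    assert(!fits_in_int32(pages_11t));
    assert(wrap_to_int32(pages_11t) == -1342177280LL);

    /* 4 TiB of 4 KiB pages still fits in a signed int; 8 TiB does not. */
    assert(fits_in_int32(pages_for_bytes(4ULL << 40)));
    assert(!fits_in_int32(pages_for_bytes(8ULL << 40)));
}
```

Under the same single-wrap assumption, the value from the crash dump (-1506317746) corresponds to the unsigned 32-bit pattern 2788649550; whether that was the intended count is not established by the report. A counter typed as unsigned long, as the kernel uses for page counts, would avoid the truncation in this sketch.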

          People

            Assignee: Jian Yu (yujian)
            Reporter: Stephen Champion (schamp)