Lustre / LU-5138

hang in osc_lru_reserve despite recoverable state


Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • None
    • None
    • 3
    • 14168

    Description

      This process has been waiting in osc_lru_reserve for a very long time:

      PID: 22025 TASK: ffff88017aeba480 CPU: 5 COMMAND: "reads"
      #0 [ffff88017b2ff838] schedule at ffffffff8145ec7b
      #1 [ffff88017b2ff980] osc_lru_reserve at ffffffffa0e16ee5 [osc]
      #2 [ffff88017b2ffa00] osc_page_init at ffffffffa0e1710d [osc]
      #3 [ffff88017b2ffa40] lov_page_init_raid0 at ffffffffa0ea48b0 [lov]
      #4 [ffff88017b2ffaa0] cl_page_alloc at ffffffffa0aae632 [obdclass]
      #5 [ffff88017b2ffae0] cl_page_find at ffffffffa0aae91b [obdclass]
      #6 [ffff88017b2ffb30] ll_write_begin at ffffffffa0f96f8d [lustre]
      #7 [ffff88017b2ffb90] generic_perform_write at ffffffff810f8242
      #8 [ffff88017b2ffc10] generic_file_buffered_write at ffffffff810f83a1
      #9 [ffff88017b2ffc60] __generic_file_aio_write at ffffffff810fb336
      #10 [ffff88017b2ffd10] generic_file_aio_write at ffffffff810fb57c
      #11 [ffff88017b2ffd50] vvp_io_write_start at ffffffffa0faae48 [lustre]
      #12 [ffff88017b2ffda0] cl_io_start at ffffffffa0ab65f9 [obdclass]
      #13 [ffff88017b2ffdd0] cl_io_loop at ffffffffa0aba123 [obdclass]
      #14 [ffff88017b2ffe00] ll_file_io_generic at ffffffffa0f46af1 [lustre]
      #15 [ffff88017b2ffe70] ll_file_aio_write at ffffffffa0f47037 [lustre]
      #16 [ffff88017b2ffec0] ll_file_write at ffffffffa0f47a00 [lustre]
      #17 [ffff88017b2fff10] vfs_write at ffffffff8115aeae
      #18 [ffff88017b2fff40] sys_write at ffffffff8115b023
      #19 [ffff88017b2fff80] system_call_fastpath at ffffffff81468d92
      RIP: 00002aaaaad99630 RSP: 00007fffffffc568 RFLAGS: 00010246
      RAX: 0000000000000001 RBX: ffffffff81468d92 RCX: 00007fffffffc510
      RDX: 0000000000010000 RSI: 0000000000603040 RDI: 0000000000000003
      RBP: 0000000000010000 R8: 0000000000000000 R9: 0101010101010101
      R10: 00007fffffffc3b0 R11: 0000000000000246 R12: 0000000000010000
      R13: 0000000000000001 R14: 00000000063b0000 R15: 00000000063c0000
      ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

      While testing for LU-4856, the bug described in LU-5123 caused sanity test 101a to run with ccc_lru_max = 32 (pages). I have not tried it, but it should be possible to reproduce this on master by modifying 101a to set max_dirty_mb to 128k.
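As a sketch of that reproduction idea (this is not the actual 101a change, and the exact value to use is only what the report mentions), the tunable could be adjusted around the test run with lctl:

```shell
# Hedged sketch: save the current per-OSC dirty-page budget, raise it to
# the value mentioned in the report ("128k"), run the test, then restore.
OLD=$(lctl get_param -n osc.*.max_dirty_mb | head -1)
lctl set_param osc.*.max_dirty_mb=131072   # "128k" MB, per the report
# ... run sanity test 101a here ...
lctl set_param osc.*.max_dirty_mb="$OLD"
```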

      This is a pathological condition, but I think it exposed a real bug. Namely, it appears that the wakeup from the sleep in osc_lru_reserve can be incidental - caused by another process that just happens to do something that triggers an osc_lru_shrink - rather than a deliberate wakeup issued once the condition that caused the sleep has actually been addressed and it becomes possible to proceed.

      I have a core and a debug log from a system in this state; I will attach the debug log and paste my notes in a comment.

      Attachments

        1. debug.txt
          0.2 kB
        2. notes.txt
          68 kB

        Activity

          People

            pjones Peter Jones
            schamp Stephen Champion
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: