LU-727: application hang waiting on page lock

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Environment: lustre 2.0.0.1, kernel 2.6.32-71, lustre client
    • Severity: 3

    Description

      The application hangs with the following stack:

      PID: 17906 TASK: ffff88063e3f94e0 CPU: 0 COMMAND: "Migration_Clien"
      #0 [ffff88060e017918] schedule at ffffffff8147b5d9
      #1 [ffff88060e0179e0] io_schedule at ffffffff8147bd83
      #2 [ffff88060e017a00] sync_page at ffffffff810facdd
      #3 [ffff88060e017a10] sync_page_killable at ffffffff810facfe
      #4 [ffff88060e017a20] __wait_on_bit_lock at ffffffff8147c48a
      #5 [ffff88060e017a70] __lock_page_killable at ffffffff810fac07
      #6 [ffff88060e017ad0] generic_file_aio_read at ffffffff810fc864
      #7 [ffff88060e017bb0] vvp_io_read_start at ffffffffa08a971b
      #8 [ffff88060e017c60] cl_io_start at ffffffffa04ec9c8
      #9 [ffff88060e017cc0] cl_io_loop at ffffffffa04f0ad0
      #10 [ffff88060e017d30] ll_file_io_generic at ffffffffa0853db2
      #11 [ffff88060e017dd0] ll_file_aio_read at ffffffffa085406c
      #12 [ffff88060e017e60] ll_file_read at ffffffffa085a751
      #13 [ffff88060e017ef0] vfs_read at ffffffff8115d565
      #14 [ffff88060e017f30] sys_pread64 at ffffffff8115d892
      #15 [ffff88060e017f80] system_call_fastpath at ffffffff81003172
      RIP: 0000003fef40ee13 RSP: 00007fff9037d5f8 RFLAGS: 00010246
      RAX: 0000000000000011 RBX: ffffffff81003172 RCX: 0000000000000000
      RDX: 0000000000000e10 RSI: 00007fff9037d620 RDI: 0000000000000016
      RBP: 0000000000000000 R8: 00007fff9037e508 R9: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000016
      R13: 00007fff9037d620 R14: 0000000000000000 R15: 00007fff90381c50
      ORIG_RAX: 0000000000000011 CS: 0033 SS: 002b
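
      The reader is parked waiting for the page lock: the page is not up to date
      (PG_uptodate is clear in the page structure dumped below), so the read path
      tries to lock the page before reading it in. A paraphrased sketch of the
      2.6.32-era mm/filemap.c code behind frames #0-#5, shown only to make the
      hang mechanism explicit (not the verbatim RHEL 6 source):

          /* Paraphrased from 2.6.32-era mm/filemap.c / pagemap.h; details may differ in RHEL 6. */
          static inline int lock_page_killable(struct page *page)
          {
                  might_sleep();
                  if (!trylock_page(page))        /* PG_locked already held by someone else */
                          return __lock_page_killable(page);
                  return 0;
          }

          int __lock_page_killable(struct page *page)
          {
                  DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

                  /*
                   * Sleeps in io_schedule() via sync_page_killable() until the
                   * current holder calls unlock_page().  TASK_KILLABLE means only
                   * a fatal signal can interrupt the wait, so if the holder never
                   * unlocks the page the reader stays blocked here.
                   */
                  return __wait_on_bit_lock(page_waitqueue(page), &wait,
                                            sync_page_killable, TASK_KILLABLE);
          }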

      The file being read is on a Lustre filesystem:
      crash> files 17906
      22 ffff8803370ccb00 ffff88033e20cec0 ffff880337f48930 REG /scratch_lustre/ossau/HPC2012_BENCH/data/Test_MIGAL/INPUT/Seismic/Line_3745.segy

      The page structure, whose address was retrieved from the process stack, shows the flags PG_locked and PG_lru are set.
      crash> p/x *(struct page *)0xffffea0004f06f18
      $4 = {
        flags = 0x800000000000021,      /* PG_locked | PG_lru */
        _count = {
          counter = 0x2
        },
        {
          _mapcount = {
            counter = 0xffffffff
          },
          {
            inuse = 0xffff,
            objects = 0xffff
          }
        },
        {
          {
            private = 0x0,
            mapping = 0xffff880337f48a50
          },
          ptl = {
            raw_lock = {
              slock = 0x0
            }
          },
          slab = 0x0,
          first_page = 0x0
        },
        {
          index = 0x0,
          freelist = 0x0
        },
        lru = {
          next = 0xffffea0005dc40c8,
          prev = 0xffffea0004f06f08
        }
      }
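
      For reference, decoding the flags value confirms the statement above. With
      the upstream 2.6.32 enum pageflags numbering (an assumption for this RHEL
      kernel), bit 0 is PG_locked and bit 5 is PG_lru, and the high bits only
      encode which node/zone the page belongs to. A minimal stand-alone decode:

          #include <stdio.h>
          #include <stdint.h>

          /* Bit numbers assumed from the upstream 2.6.32 enum pageflags. */
          #define PG_locked   0
          #define PG_uptodate 3
          #define PG_lru      5

          int main(void)
          {
                  uint64_t flags = 0x800000000000021ULL;  /* value from the crash dump above */

                  printf("PG_locked   = %llu\n", (unsigned long long)((flags >> PG_locked) & 1));
                  printf("PG_uptodate = %llu\n", (unsigned long long)((flags >> PG_uptodate) & 1));
                  printf("PG_lru      = %llu\n", (unsigned long long)((flags >> PG_lru) & 1));
                  /* The remaining high bits (0x0800000000000000) are the node/zone
                   * encoding stored at the top of page->flags, not page state. */
                  return 0;
          }

      So the page is locked and on the LRU but not up to date, and _count = 2
      indicates one reference besides the page cache's own.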

      Unfortunately, the page lock is never released. Looking at the dump information, I am not able to identify the current owner of the PG_locked page lock; struct page itself does not record which task locked it, so the holder would have to be found from the other tasks' stacks.

      The application is the MIGAL benchmark.
      I will attach the dump file to this ticket. The hang can easily be reproduced if more information is needed.

      The same problem has also been reproduced at CEA using DDT, a debugging environment. The stack is similar:
      PID: 7234 TASK: ffff881856477600 CPU: 14 COMMAND: "gdb"
      #0 [ffff88185ae558c8] schedule at ffffffff81467211
      #1 [ffff88185ae55990] io_schedule at ffffffff814679a3
      #2 [ffff88185ae559b0] sync_page at ffffffff810fa35d
      #3 [ffff88185ae559c0] __wait_on_bit_lock at ffffffff814680ba
      #4 [ffff88185ae55a10] __lock_page at ffffffff810fa2f7
      #5 [ffff88185ae55a70] filemap_fault at ffffffff810fc09f
      #6 [ffff88185ae55ae0] vvp_io_fault_start at ffffffffa080ab1f
      #7 [ffff88185ae55b80] cl_io_start at ffffffffa04949c8
      #8 [ffff88185ae55be0] cl_io_loop at ffffffffa0498ad0
      #9 [ffff88185ae55c50] ll_fault0 at ffffffffa07eea76
      #10 [ffff88185ae55ce0] ll_fault at ffffffffa07eee68
      #11 [ffff88185ae55d50] __do_fault at ffffffff811237b4
      #12 [ffff88185ae55de0] handle_pte_fault at ffffffff81123cfd
      #13 [ffff88185ae55eb0] handle_mm_fault at ffffffff811248dd
      #14 [ffff88185ae55f00] do_page_fault at ffffffff8146c3d3
      #15 [ffff88185ae55f50] page_fault at ffffffff81469e45
      RIP: 00000000005cff90 RSP: 00007fff2a3e0758 RFLAGS: 00010206
      RAX: 0000000000868820 RBX: 0000000000d8a370 RCX: 0000000000001030
      RDX: 00007fff2a3e083c RSI: 00002b672fc89052 RDI: 00002b672fc89052
      RBP: 00002b672fc89052 R8: 0000000000000003 R9: 00000000fffffff7
      R10: 0000000000000fe0 R11: 000000380217b448 R12: 00007fff2a3e083c
      R13: 00002b672f9d1f69 R14: 0000000000daebf0 R15: 00000000000003f2
      ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
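
      Note that this second stack waits through __lock_page() rather than
      __lock_page_killable(): filemap_fault() takes the page lock uninterruptibly
      in this kernel, so not even a fatal signal will get the gdb task out of the
      wait. Again paraphrased from the 2.6.32-era mm/filemap.c (same caveat as
      above):

          void __lock_page(struct page *page)
          {
                  DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

                  /* TASK_UNINTERRUPTIBLE: signals are ignored; the task stays
                   * blocked until unlock_page() wakes the page's waitqueue. */
                  __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
                                     TASK_UNINTERRUPTIBLE);
          }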

People

    Assignee: Jinshan Xiong
    Reporter: Gregoire Pichon
