Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1299

running truncated executable causes spewing of lock debug messages

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.3.0, Lustre 2.1.4
    • Lustre 2.1.1
    • None
    • 3
    • 4599

    Description

      A user was running an executable that appeared to be truncated. This caused the node to OOM and the Lustre client subsequently started spewing lock debugging messages:

      2012-04-10 09:32:58 LustreError: 26191:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -4
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lock@ffff8805b8374c18[3 5 -4 1 0 00000006] P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     vvp@ffff88058a495790: 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lov@ffff8803e7fd0e98: 2
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     0 0: lock@ffff880227a9ce78[1 5 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1003d0000:0x91707b:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c60: [0 ffff8803e7fd03a0 R(1):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7648: ffff88011fe84900 00101001 0x40e1ce45be5b325f 4 (null) size: 1048576 mtime: 1333818831 atime: 1334070919 ctime: 1333818831 blocks: 2048
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9ce78
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     1 0: lock@ffff880227a9cd98[2 3 0 2 2 00000000] R(1):[0, 18446744073709551615]@[0x101d10000:0x93426:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c20: [0 ffff8803e7fd00c8 R(1):[256, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7590: ffff88011fe84240 00101001 0x40e1ce45be5b3266 3 ffff8803364e9e48 size: 0 mtime: 1333818824 atime: 1333818666 ctime: 1333818824 blocks: 0
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9cd98
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff8805b8374c18
      

      These messages appear many times a second. The processes seem to be spinning in the page fault handler. If you catch it while it's not active the backtrace looks like this:

      PID: 22135  TASK: ffff880222c0e080  CPU: 1   COMMAND: "test"
       #0 [ffff880336539158] schedule at ffffffff814eeee0
       #1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e
       #2 [ffff880336539290] mutex_lock at ffffffff814f041b
       #3 [ffff8803365392b0] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
       #4 [ffff8803365392f0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
       #5 [ffff880336539330] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
       #6 [ffff880336539380] lovsub_lock_closure at ffffffffa095c19e [lov]
       #7 [ffff8803365393c0] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
       #8 [ffff880336539410] lov_sublock_lock at ffffffffa095485b [lov]
       #9 [ffff880336539470] lov_lock_enqueue at ffffffffa0955f02 [lov]
      #10 [ffff880336539500] cl_enqueue_try at ffffffffa05d66cc [obdclass]
      #11 [ffff880336539550] cl_enqueue_locked at ffffffffa05d7ced [obdclass]
      #12 [ffff880336539590] cl_lock_request at ffffffffa05d7fae [obdclass]
      #13 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
      #14 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
      #15 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
      #16 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
      #17 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
      #18 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
      #19 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
      #20 [ffff880336539860] __do_fault at ffffffff8113bad4
      #21 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
      #22 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
      #23 [ffff880336539a20] __do_page_fault at ffffffff81042b79
      #24 [ffff880336539b40] do_page_fault at ffffffff814f495e
      #25 [ffff880336539b70] page_fault at ffffffff814f1d15
          [exception RIP: __clear_user+63]
          RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
          RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
          RBP: ffff880336539c38   R8: 0000000000000000   R9: 0000000000000008
          R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
          R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
      #26 [ffff880336539c40] clear_user at ffffffff81279248
      #27 [ffff880336539c60] padzero at ffffffff811cc93d
      #28 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
      #29 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
      #30 [ffff880336539eb0] do_execve at ffffffff81180779
      #31 [ffff880336539f20] sys_execve at ffffffff810095ea
      #32 [ffff880336539f50] stub_execve at ffffffff8100b54a
          RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
          RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
          RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
          RBP: 000000000088ab08   R8: 0000000000000000   R9: 00007fffffffa598
          R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
          R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
          ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
      

      or this:

      PID: 22135  TASK: ffff880222c0e080  CPU: 8   COMMAND: "test"
       #0 [ffff880336539118] schedule at ffffffff814eeee0
       #1 [ffff8803365391e0] __mutex_lock_slowpath at ffffffff814f057e
       #2 [ffff880336539250] mutex_lock at ffffffff814f041b
       #3 [ffff880336539270] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
       #4 [ffff8803365392b0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
       #5 [ffff8803365392f0] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
       #6 [ffff880336539340] lovsub_lock_closure at ffffffffa095c19e [lov]
       #7 [ffff880336539380] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
       #8 [ffff8803365393d0] lov_sublock_lock at ffffffffa095485b [lov]
       #9 [ffff880336539430] lov_lock_delete at ffffffffa0954b43 [lov]
      #10 [ffff880336539490] cl_lock_delete0 at ffffffffa05d4425 [obdclass]
      #11 [ffff8803365394c0] cl_lock_delete at ffffffffa05d4713 [obdclass]
      #12 [ffff8803365394e0] cl_lock_finish at ffffffffa05d68c7 [obdclass]
      #13 [ffff880336539500] cl_lock_hold_mutex at ffffffffa05d6cd2 [obdclass]
      #14 [ffff880336539590] cl_lock_request at ffffffffa05d7f8e [obdclass]
      #15 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
      #16 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
      #17 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
      #18 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
      #19 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
      #20 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
      #21 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
      #22 [ffff880336539860] __do_fault at ffffffff8113bad4
      #23 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
      #24 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
      #25 [ffff880336539a20] __do_page_fault at ffffffff81042b79
      #26 [ffff880336539b40] do_page_fault at ffffffff814f495e
      #27 [ffff880336539b70] page_fault at ffffffff814f1d15
          [exception RIP: __clear_user+63]
          RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
          RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
          RBP: ffff880336539c38   R8: 0000000000000000   R9: 0000000000000008
          R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
          R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
      #28 [ffff880336539c40] clear_user at ffffffff81279248
      #29 [ffff880336539c60] padzero at ffffffff811cc93d
      #30 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
      #31 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
      #32 [ffff880336539eb0] do_execve at ffffffff81180779
      #33 [ffff880336539f20] sys_execve at ffffffff810095ea
      #34 [ffff880336539f50] stub_execve at ffffffff8100b54a
          RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
          RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
          RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
          RBP: 000000000088ab08   R8: 0000000000000000   R9: 00007fffffffa598
          R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
          R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
          ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
      

      Evidence of corrupted executable:

      # ansel6 /root > objdump -x <path_to_exe>
      objdump: <path_to_exe>: File truncated
      # ansel6 /root > ls -l <path_to_exe>
      -rwxrwx--- 1 <user> <group> 1048576 Apr  7 10:13 <path_to_exe>
      

      Attachments

        Activity

          People

            jay Jinshan Xiong (Inactive)
            nedbass Ned Bass (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: