[LU-1299] running truncated executable causes spewing of lock debug messages

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.4
    • Affects Version/s: Lustre 2.1.1
    • None
    • 3
    • 4599

    Description

      A user was running an executable that appeared to be truncated. This caused the node to OOM and the Lustre client subsequently started spewing lock debugging messages:

      2012-04-10 09:32:58 LustreError: 26191:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -4
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lock@ffff8805b8374c18[3 5 -4 1 0 00000006] P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     vvp@ffff88058a495790: 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lov@ffff8803e7fd0e98: 2
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     0 0: lock@ffff880227a9ce78[1 5 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1003d0000:0x91707b:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c60: [0 ffff8803e7fd03a0 R(1):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7648: ffff88011fe84900 00101001 0x40e1ce45be5b325f 4 (null) size: 1048576 mtime: 1333818831 atime: 1334070919 ctime: 1333818831 blocks: 2048
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9ce78
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     1 0: lock@ffff880227a9cd98[2 3 0 2 2 00000000] R(1):[0, 18446744073709551615]@[0x101d10000:0x93426:0x0] {
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c20: [0 ffff8803e7fd00c8 R(1):[256, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7590: ffff88011fe84240 00101001 0x40e1ce45be5b3266 3 ffff8803364e9e48 size: 0 mtime: 1333818824 atime: 1333818666 ctime: 1333818824 blocks: 0
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9cd98
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 
      2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff8805b8374c18
      

      These messages appear many times per second (the unuse return value -4 is -EINTR). The processes seem to be spinning in the page fault handler; if you catch one while it is off-CPU, the backtrace looks like this:

      PID: 22135  TASK: ffff880222c0e080  CPU: 1   COMMAND: "test"
       #0 [ffff880336539158] schedule at ffffffff814eeee0
       #1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e
       #2 [ffff880336539290] mutex_lock at ffffffff814f041b
       #3 [ffff8803365392b0] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
       #4 [ffff8803365392f0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
       #5 [ffff880336539330] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
       #6 [ffff880336539380] lovsub_lock_closure at ffffffffa095c19e [lov]
       #7 [ffff8803365393c0] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
       #8 [ffff880336539410] lov_sublock_lock at ffffffffa095485b [lov]
       #9 [ffff880336539470] lov_lock_enqueue at ffffffffa0955f02 [lov]
      #10 [ffff880336539500] cl_enqueue_try at ffffffffa05d66cc [obdclass]
      #11 [ffff880336539550] cl_enqueue_locked at ffffffffa05d7ced [obdclass]
      #12 [ffff880336539590] cl_lock_request at ffffffffa05d7fae [obdclass]
      #13 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
      #14 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
      #15 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
      #16 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
      #17 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
      #18 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
      #19 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
      #20 [ffff880336539860] __do_fault at ffffffff8113bad4
      #21 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
      #22 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
      #23 [ffff880336539a20] __do_page_fault at ffffffff81042b79
      #24 [ffff880336539b40] do_page_fault at ffffffff814f495e
      #25 [ffff880336539b70] page_fault at ffffffff814f1d15
          [exception RIP: __clear_user+63]
          RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
          RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
          RBP: ffff880336539c38   R8: 0000000000000000   R9: 0000000000000008
          R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
          R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
      #26 [ffff880336539c40] clear_user at ffffffff81279248
      #27 [ffff880336539c60] padzero at ffffffff811cc93d
      #28 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
      #29 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
      #30 [ffff880336539eb0] do_execve at ffffffff81180779
      #31 [ffff880336539f20] sys_execve at ffffffff810095ea
      #32 [ffff880336539f50] stub_execve at ffffffff8100b54a
          RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
          RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
          RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
          RBP: 000000000088ab08   R8: 0000000000000000   R9: 00007fffffffa598
          R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
          R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
          ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
      

      or this:

      PID: 22135  TASK: ffff880222c0e080  CPU: 8   COMMAND: "test"
       #0 [ffff880336539118] schedule at ffffffff814eeee0
       #1 [ffff8803365391e0] __mutex_lock_slowpath at ffffffff814f057e
       #2 [ffff880336539250] mutex_lock at ffffffff814f041b
       #3 [ffff880336539270] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
       #4 [ffff8803365392b0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
       #5 [ffff8803365392f0] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
       #6 [ffff880336539340] lovsub_lock_closure at ffffffffa095c19e [lov]
       #7 [ffff880336539380] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
       #8 [ffff8803365393d0] lov_sublock_lock at ffffffffa095485b [lov]
       #9 [ffff880336539430] lov_lock_delete at ffffffffa0954b43 [lov]
      #10 [ffff880336539490] cl_lock_delete0 at ffffffffa05d4425 [obdclass]
      #11 [ffff8803365394c0] cl_lock_delete at ffffffffa05d4713 [obdclass]
      #12 [ffff8803365394e0] cl_lock_finish at ffffffffa05d68c7 [obdclass]
      #13 [ffff880336539500] cl_lock_hold_mutex at ffffffffa05d6cd2 [obdclass]
      #14 [ffff880336539590] cl_lock_request at ffffffffa05d7f8e [obdclass]
      #15 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
      #16 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
      #17 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
      #18 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
      #19 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
      #20 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
      #21 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
      #22 [ffff880336539860] __do_fault at ffffffff8113bad4
      #23 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
      #24 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
      #25 [ffff880336539a20] __do_page_fault at ffffffff81042b79
      #26 [ffff880336539b40] do_page_fault at ffffffff814f495e
      #27 [ffff880336539b70] page_fault at ffffffff814f1d15
          [exception RIP: __clear_user+63]
          RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
          RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
          RBP: ffff880336539c38   R8: 0000000000000000   R9: 0000000000000008
          R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
          R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
      #28 [ffff880336539c40] clear_user at ffffffff81279248
      #29 [ffff880336539c60] padzero at ffffffff811cc93d
      #30 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
      #31 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
      #32 [ffff880336539eb0] do_execve at ffffffff81180779
      #33 [ffff880336539f20] sys_execve at ffffffff810095ea
      #34 [ffff880336539f50] stub_execve at ffffffff8100b54a
          RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
          RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
          RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
          RBP: 000000000088ab08   R8: 0000000000000000   R9: 00007fffffffa598
          R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
          R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
          ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
      

      Evidence of the corrupted executable:

      # ansel6 /root > objdump -x <path_to_exe>
      objdump: <path_to_exe>: File truncated
      # ansel6 /root > ls -l <path_to_exe>
      -rwxrwx--- 1 <user> <group> 1048576 Apr  7 10:13 <path_to_exe>
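
      The ticket does not include a reproducer. A minimal sketch of one, assuming the bug triggers on any truncated ELF binary on a Lustre client (the mount point /mnt/lustre and the victim binary /bin/ls below are placeholders, not from the ticket):

      /*
       * Hypothetical reproducer: truncate an ELF binary on a Lustre
       * mount and exec it. padzero()/__clear_user() should then fault
       * on pages past the shortened file, as in the backtraces above.
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(void)
      {
              const char *path = "/mnt/lustre/truncated_exe"; /* assumed mount */
              struct stat st;

              /* Copy a known-good binary onto the Lustre file system. */
              if (system("cp /bin/ls /mnt/lustre/truncated_exe") != 0)
                      return 1;
              if (stat(path, &st) != 0) {
                      perror("stat");
                      return 1;
              }
              /* Cut the image roughly in half so the ELF is incomplete. */
              if (truncate(path, st.st_size / 2) != 0) {
                      perror("truncate");
                      return 1;
              }
              execl(path, "truncated_exe", (char *)NULL);
              perror("execl"); /* reached only if the exec fails outright */
              return 1;
      }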
      

Attachments

Activity


          adilger Andreas Dilger added a comment - b2_1 patch was landed for 2.1.4, master patch was landed for 2.3.0.

          nedbass Ned Bass (Inactive) added a comment - Patch for 2.1 still in review. Reopening.

          jlevi Jodi Levi (Inactive) added a comment - Please reopen or create new tickets as appropriate if more work is needed.

          nedbass Ned Bass (Inactive) added a comment - There seems to be significant churn in the b2_1 patch at http://review.whamcloud.com/2574 . Since the master patch has already been merged, please consider creating new JIRA issues and review requests for ongoing 2.1 development. This would help us follow your work more easily. Thanks.

          jay Jinshan Xiong (Inactive) added a comment - Hi Hongchao, it looks like the problem you mentioned was introduced by commit 60b5c3e464d6b4b333506e6db6b0635bb5a06577 (job stats). It clearly deadlocked on mmap_sem because do_mmap_pgoff() already held the write lock, so I don't think this is the same problem described by Ned. Can you please file a new ticket and ask Niu to take a look?
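
          The failure mode described here is a single task self-deadlocking on a reader-writer lock: do_mmap_pgoff() takes mmap_sem for write, and a fault on a user address in the same task then tries to take mmap_sem for read. A minimal userspace sketch of that pattern, using a pthread rwlock as a stand-in for mmap_sem (illustrative only; this is not kernel or Lustre code):

          #include <pthread.h>
          #include <stdio.h>

          /* Stand-in for mm->mmap_sem; a pthread rwlock behaves like a
           * kernel rw-semaphore for this purpose. */
          static pthread_rwlock_t fake_mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

          /* Plays the role of the page fault handler, which takes the
           * lock for read on entry. */
          static void fault_handler(void)
          {
                  pthread_rwlock_rdlock(&fake_mmap_sem); /* blocks forever */
                  /* ... resolve the fault ... */
                  pthread_rwlock_unlock(&fake_mmap_sem);
          }

          int main(void)
          {
                  /* Plays the role of do_mmap_pgoff(): write lock held... */
                  pthread_rwlock_wrlock(&fake_mmap_sem);

                  /* ...and then the same task touches an address that
                   * faults, re-entering the lock for read: self-deadlock. */
                  fault_handler();

                  pthread_rwlock_unlock(&fake_mmap_sem);
                  puts("never reached");
                  return 0;
          }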

          hongchao.zhang Hongchao Zhang added a comment - Hi Jay, the following two URLs are also suspected to be deadlocks on mmap_sem; could you please help check? Thanks!

          https://maloo.whamcloud.com/test_sets/d52af4d8-b67f-11e1-9e3f-52540035b04c
          https://maloo.whamcloud.com/test_logs/9773fdb4-b680-11e1-9e3f-52540035b04c

          jay Jinshan Xiong (Inactive) added a comment - clio has limited access to mmap_sem, so I can't find any cases that cause a deadlock this way. However, the kernel acquires mmap_sem on entry to the fault routine, so I'll take a look at the kernel code to see whether it is possible for mmap_sem to be left held and eventually deadlock the task against itself.

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 8
