Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.1
-
None
-
3
-
4599
Description
A user was running an executable that appeared to be truncated. This caused the node to OOM and the Lustre client subsequently started spewing lock debugging messages:
2012-04-10 09:32:58 LustreError: 26191:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -4 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lock@ffff8805b8374c18[3 5 -4 1 0 00000006] P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0] { 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) vvp@ffff88058a495790: 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lov@ffff8803e7fd0e98: 2 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 0 0: lock@ffff880227a9ce78[1 5 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1003d0000:0x91707b:0x0] { 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lovsub@ffff8801f0d30c60: [0 ffff8803e7fd03a0 R(1):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) osc@ffff8801e0ae7648: ffff88011fe84900 00101001 0x40e1ce45be5b325f 4 (null) size: 1048576 mtime: 1333818831 atime: 1334070919 ctime: 1333818831 blocks: 2048 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9ce78 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 1 0: lock@ffff880227a9cd98[2 3 0 2 2 00000000] R(1):[0, 18446744073709551615]@[0x101d10000:0x93426:0x0] { 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lovsub@ffff8801f0d30c20: [0 ffff8803e7fd00c8 R(1):[256, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) osc@ffff8801e0ae7590: ffff88011fe84240 00101001 0x40e1ce45be5b3266 3 ffff8803364e9e48 size: 0 mtime: 1333818824 atime: 1333818666 ctime: 1333818824 blocks: 0 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9cd98 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff8805b8374c18
These messages appear many times a second. The processes seem to be spinning in the page fault handler. If you catch it while it's not active the backtrace looks like this:
PID: 22135 TASK: ffff880222c0e080 CPU: 1 COMMAND: "test" #0 [ffff880336539158] schedule at ffffffff814eeee0 #1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e #2 [ffff880336539290] mutex_lock at ffffffff814f041b #3 [ffff8803365392b0] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass] #4 [ffff8803365392f0] cl_lock_enclosure at ffffffffa05d789f [obdclass] #5 [ffff880336539330] cl_lock_closure_build at ffffffffa05d7963 [obdclass] #6 [ffff880336539380] lovsub_lock_closure at ffffffffa095c19e [lov] #7 [ffff8803365393c0] cl_lock_closure_build at ffffffffa05d79a0 [obdclass] #8 [ffff880336539410] lov_sublock_lock at ffffffffa095485b [lov] #9 [ffff880336539470] lov_lock_enqueue at ffffffffa0955f02 [lov] #10 [ffff880336539500] cl_enqueue_try at ffffffffa05d66cc [obdclass] #11 [ffff880336539550] cl_enqueue_locked at ffffffffa05d7ced [obdclass] #12 [ffff880336539590] cl_lock_request at ffffffffa05d7fae [obdclass] #13 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre] #14 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre] #15 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre] #16 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass] #17 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass] #18 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre] #19 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre] #20 [ffff880336539860] __do_fault at ffffffff8113bad4 #21 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087 #22 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4 #23 [ffff880336539a20] __do_page_fault at ffffffff81042b79 #24 [ffff880336539b40] do_page_fault at ffffffff814f495e #25 [ffff880336539b70] page_fault at ffffffff814f1d15 [exception RIP: __clear_user+63] RIP: ffffffff812791df RSP: ffff880336539c28 RFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000df44a8 RCX: 000000000000016b RDX: 0000000000000000 RSI: 0000000000000b58 RDI: 0000000000df44a8 RBP: ffff880336539c38 R8: 0000000000000000 R9: 0000000000000008 R10: ffff8801a44c7318 R11: 0000000000000000 R12: 0000000000000b58 R13: 0000000000000000 R14: ffff880336539f58 R15: 0000000000df44a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 #26 [ffff880336539c40] clear_user at ffffffff81279248 #27 [ffff880336539c60] padzero at ffffffff811cc93d #28 [ffff880336539c70] load_elf_binary at ffffffff811ce96e #29 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb #30 [ffff880336539eb0] do_execve at ffffffff81180779 #31 [ffff880336539f20] sys_execve at ffffffff810095ea #32 [ffff880336539f50] stub_execve at ffffffff8100b54a RIP: 0000003d6f6ab4a7 RSP: 00007fffffffa628 RFLAGS: 00000203 RAX: 000000000000003b RBX: 00002aaaaaae05a8 RCX: ffffffffffffffff RDX: 0000000000868c08 RSI: 000000000088afc8 RDI: 0000000000869808 RBP: 000000000088ab08 R8: 0000000000000000 R9: 00007fffffffa598 R10: 00007fffffffa54c R11: 0000000000000203 R12: 0000000000869808 R13: 000000000087c628 R14: 000000000088afc8 R15: 000000000087ce48 ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
or this:
PID: 22135 TASK: ffff880222c0e080 CPU: 8 COMMAND: "test" #0 [ffff880336539118] schedule at ffffffff814eeee0 #1 [ffff8803365391e0] __mutex_lock_slowpath at ffffffff814f057e #2 [ffff880336539250] mutex_lock at ffffffff814f041b #3 [ffff880336539270] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass] #4 [ffff8803365392b0] cl_lock_enclosure at ffffffffa05d789f [obdclass] #5 [ffff8803365392f0] cl_lock_closure_build at ffffffffa05d7963 [obdclass] #6 [ffff880336539340] lovsub_lock_closure at ffffffffa095c19e [lov] #7 [ffff880336539380] cl_lock_closure_build at ffffffffa05d79a0 [obdclass] #8 [ffff8803365393d0] lov_sublock_lock at ffffffffa095485b [lov] #9 [ffff880336539430] lov_lock_delete at ffffffffa0954b43 [lov] #10 [ffff880336539490] cl_lock_delete0 at ffffffffa05d4425 [obdclass] #11 [ffff8803365394c0] cl_lock_delete at ffffffffa05d4713 [obdclass] #12 [ffff8803365394e0] cl_lock_finish at ffffffffa05d68c7 [obdclass] #13 [ffff880336539500] cl_lock_hold_mutex at ffffffffa05d6cd2 [obdclass] #14 [ffff880336539590] cl_lock_request at ffffffffa05d7f8e [obdclass] #15 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre] #16 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre] #17 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre] #18 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass] #19 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass] #20 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre] #21 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre] #22 [ffff880336539860] __do_fault at ffffffff8113bad4 #23 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087 #24 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4 #25 [ffff880336539a20] __do_page_fault at ffffffff81042b79 #26 [ffff880336539b40] do_page_fault at ffffffff814f495e #27 [ffff880336539b70] page_fault at ffffffff814f1d15 [exception RIP: __clear_user+63] RIP: ffffffff812791df RSP: ffff880336539c28 RFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000df44a8 RCX: 000000000000016b RDX: 0000000000000000 RSI: 0000000000000b58 RDI: 0000000000df44a8 RBP: ffff880336539c38 R8: 0000000000000000 R9: 0000000000000008 R10: ffff8801a44c7318 R11: 0000000000000000 R12: 0000000000000b58 R13: 0000000000000000 R14: ffff880336539f58 R15: 0000000000df44a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 #28 [ffff880336539c40] clear_user at ffffffff81279248 #29 [ffff880336539c60] padzero at ffffffff811cc93d #30 [ffff880336539c70] load_elf_binary at ffffffff811ce96e #31 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb #32 [ffff880336539eb0] do_execve at ffffffff81180779 #33 [ffff880336539f20] sys_execve at ffffffff810095ea #34 [ffff880336539f50] stub_execve at ffffffff8100b54a RIP: 0000003d6f6ab4a7 RSP: 00007fffffffa628 RFLAGS: 00000203 RAX: 000000000000003b RBX: 00002aaaaaae05a8 RCX: ffffffffffffffff RDX: 0000000000868c08 RSI: 000000000088afc8 RDI: 0000000000869808 RBP: 000000000088ab08 R8: 0000000000000000 R9: 00007fffffffa598 R10: 00007fffffffa54c R11: 0000000000000203 R12: 0000000000869808 R13: 000000000087c628 R14: 000000000088afc8 R15: 000000000087ce48 ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
Evidence of corrupted executable:
# ansel6 /root > objdump -x <path_to_exe> objdump: <path_to_exe>: File truncated # ansel6 /root > ls -l <path_to_exe> -rwxrwx--- 1 <user> <group> 1048576 Apr 7 10:13 <path_to_exe>