Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.1.1
-
None
-
3
-
4599
Description
A user was running an executable that appeared to be truncated. This caused the node to OOM and the Lustre client subsequently started spewing lock debugging messages:
2012-04-10 09:32:58 LustreError: 26191:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -4
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lock@ffff8805b8374c18[3 5 -4 1 0 00000006] P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) vvp@ffff88058a495790:
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lov@ffff8803e7fd0e98: 2
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 0 0: lock@ffff880227a9ce78[1 5 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1003d0000:0x91707b:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lovsub@ffff8801f0d30c60: [0 ffff8803e7fd03a0 R(1):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]]
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) osc@ffff8801e0ae7648: ffff88011fe84900 00101001 0x40e1ce45be5b325f 4 (null) size: 1048576 mtime: 1333818831 atime: 1334070919 ctime: 1333818831 blocks: 2048
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9ce78
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) 1 0: lock@ffff880227a9cd98[2 3 0 2 2 00000000] R(1):[0, 18446744073709551615]@[0x101d10000:0x93426:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lovsub@ffff8801f0d30c20: [0 ffff8803e7fd00c8 R(1):[256, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]]
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) osc@ffff8801e0ae7590: ffff88011fe84240 00101001 0x40e1ce45be5b3266 3 ffff8803364e9e48 size: 0 mtime: 1333818824 atime: 1333818666 ctime: 1333818824 blocks: 0
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff880227a9cd98
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff8805b8374c18
These messages appear many times a second. The processes seem to be spinning in the page fault handler. If you catch it while it's not active the backtrace looks like this:
PID: 22135 TASK: ffff880222c0e080 CPU: 1 COMMAND: "test"
#0 [ffff880336539158] schedule at ffffffff814eeee0
#1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e
#2 [ffff880336539290] mutex_lock at ffffffff814f041b
#3 [ffff8803365392b0] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
#4 [ffff8803365392f0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
#5 [ffff880336539330] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
#6 [ffff880336539380] lovsub_lock_closure at ffffffffa095c19e [lov]
#7 [ffff8803365393c0] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
#8 [ffff880336539410] lov_sublock_lock at ffffffffa095485b [lov]
#9 [ffff880336539470] lov_lock_enqueue at ffffffffa0955f02 [lov]
#10 [ffff880336539500] cl_enqueue_try at ffffffffa05d66cc [obdclass]
#11 [ffff880336539550] cl_enqueue_locked at ffffffffa05d7ced [obdclass]
#12 [ffff880336539590] cl_lock_request at ffffffffa05d7fae [obdclass]
#13 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
#14 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
#15 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
#16 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
#17 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
#18 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
#19 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
#20 [ffff880336539860] __do_fault at ffffffff8113bad4
#21 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
#22 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
#23 [ffff880336539a20] __do_page_fault at ffffffff81042b79
#24 [ffff880336539b40] do_page_fault at ffffffff814f495e
#25 [ffff880336539b70] page_fault at ffffffff814f1d15
[exception RIP: __clear_user+63]
RIP: ffffffff812791df RSP: ffff880336539c28 RFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000df44a8 RCX: 000000000000016b
RDX: 0000000000000000 RSI: 0000000000000b58 RDI: 0000000000df44a8
RBP: ffff880336539c38 R8: 0000000000000000 R9: 0000000000000008
R10: ffff8801a44c7318 R11: 0000000000000000 R12: 0000000000000b58
R13: 0000000000000000 R14: ffff880336539f58 R15: 0000000000df44a8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#26 [ffff880336539c40] clear_user at ffffffff81279248
#27 [ffff880336539c60] padzero at ffffffff811cc93d
#28 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
#29 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
#30 [ffff880336539eb0] do_execve at ffffffff81180779
#31 [ffff880336539f20] sys_execve at ffffffff810095ea
#32 [ffff880336539f50] stub_execve at ffffffff8100b54a
RIP: 0000003d6f6ab4a7 RSP: 00007fffffffa628 RFLAGS: 00000203
RAX: 000000000000003b RBX: 00002aaaaaae05a8 RCX: ffffffffffffffff
RDX: 0000000000868c08 RSI: 000000000088afc8 RDI: 0000000000869808
RBP: 000000000088ab08 R8: 0000000000000000 R9: 00007fffffffa598
R10: 00007fffffffa54c R11: 0000000000000203 R12: 0000000000869808
R13: 000000000087c628 R14: 000000000088afc8 R15: 000000000087ce48
ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
or this:
PID: 22135 TASK: ffff880222c0e080 CPU: 8 COMMAND: "test"
#0 [ffff880336539118] schedule at ffffffff814eeee0
#1 [ffff8803365391e0] __mutex_lock_slowpath at ffffffff814f057e
#2 [ffff880336539250] mutex_lock at ffffffff814f041b
#3 [ffff880336539270] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
#4 [ffff8803365392b0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
#5 [ffff8803365392f0] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
#6 [ffff880336539340] lovsub_lock_closure at ffffffffa095c19e [lov]
#7 [ffff880336539380] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
#8 [ffff8803365393d0] lov_sublock_lock at ffffffffa095485b [lov]
#9 [ffff880336539430] lov_lock_delete at ffffffffa0954b43 [lov]
#10 [ffff880336539490] cl_lock_delete0 at ffffffffa05d4425 [obdclass]
#11 [ffff8803365394c0] cl_lock_delete at ffffffffa05d4713 [obdclass]
#12 [ffff8803365394e0] cl_lock_finish at ffffffffa05d68c7 [obdclass]
#13 [ffff880336539500] cl_lock_hold_mutex at ffffffffa05d6cd2 [obdclass]
#14 [ffff880336539590] cl_lock_request at ffffffffa05d7f8e [obdclass]
#15 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
#16 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
#17 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
#18 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
#19 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
#20 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
#21 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
#22 [ffff880336539860] __do_fault at ffffffff8113bad4
#23 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
#24 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
#25 [ffff880336539a20] __do_page_fault at ffffffff81042b79
#26 [ffff880336539b40] do_page_fault at ffffffff814f495e
#27 [ffff880336539b70] page_fault at ffffffff814f1d15
[exception RIP: __clear_user+63]
RIP: ffffffff812791df RSP: ffff880336539c28 RFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000df44a8 RCX: 000000000000016b
RDX: 0000000000000000 RSI: 0000000000000b58 RDI: 0000000000df44a8
RBP: ffff880336539c38 R8: 0000000000000000 R9: 0000000000000008
R10: ffff8801a44c7318 R11: 0000000000000000 R12: 0000000000000b58
R13: 0000000000000000 R14: ffff880336539f58 R15: 0000000000df44a8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#28 [ffff880336539c40] clear_user at ffffffff81279248
#29 [ffff880336539c60] padzero at ffffffff811cc93d
#30 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
#31 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
#32 [ffff880336539eb0] do_execve at ffffffff81180779
#33 [ffff880336539f20] sys_execve at ffffffff810095ea
#34 [ffff880336539f50] stub_execve at ffffffff8100b54a
RIP: 0000003d6f6ab4a7 RSP: 00007fffffffa628 RFLAGS: 00000203
RAX: 000000000000003b RBX: 00002aaaaaae05a8 RCX: ffffffffffffffff
RDX: 0000000000868c08 RSI: 000000000088afc8 RDI: 0000000000869808
RBP: 000000000088ab08 R8: 0000000000000000 R9: 00007fffffffa598
R10: 00007fffffffa54c R11: 0000000000000203 R12: 0000000000869808
R13: 000000000087c628 R14: 000000000088afc8 R15: 000000000087ce48
ORIG_RAX: 000000000000003b CS: 0033 SS: 002b
Evidence of corrupted executable:
# ansel6 /root > objdump -x <path_to_exe> objdump: <path_to_exe>: File truncated # ansel6 /root > ls -l <path_to_exe> -rwxrwx--- 1 <user> <group> 1048576 Apr 7 10:13 <path_to_exe>