Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.1.1
- None
- 3
- 4599
Description
A user was running an executable that appeared to be truncated. This caused the node to OOM and the Lustre client subsequently started spewing lock debugging messages:
2012-04-10 09:32:58 LustreError: 26191:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -4
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) lock@ffff8805b8374c18[3 5 -4 1 0 00000006] P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     vvp@ffff88058a495790:
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lov@ffff8803e7fd0e98: 2
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     0 0: lock@ffff880227a9ce78[1 5 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1003d0000:0x91707b:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c60: [0 ffff8803e7fd03a0 R(1):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [0 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]]
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7648: ffff88011fe84900 00101001 0x40e1ce45be5b325f 4 (null) size: 1048576 mtime: 1333818831 atime: 1334070919 ctime: 1333818831 blocks: 2048
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     } lock@ffff880227a9ce78
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     1 0: lock@ffff880227a9cd98[2 3 0 2 2 00000000] R(1):[0, 18446744073709551615]@[0x101d10000:0x93426:0x0] {
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     lovsub@ffff8801f0d30c20: [0 ffff8803e7fd00c8 R(1):[256, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e7fd0e98 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]] [1 ffff8803e14c1f68 P(0):[0, 18446744073709551615]@[0x22c36cd0b:0x1:0x0]]
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     osc@ffff8801e0ae7590: ffff88011fe84240 00101001 0x40e1ce45be5b3266 3 ffff8803364e9e48 size: 0 mtime: 1333818824 atime: 1333818666 ctime: 1333818824 blocks: 0
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())     } lock@ffff880227a9cd98
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked())
2012-04-10 09:32:58 LustreError: 24081:0:(cl_lock.c:1426:cl_unuse_locked()) } lock@ffff8805b8374c18
These messages appear many times a second. The processes seem to be spinning in the page fault handler. If you catch one while it is not active, the backtrace looks like this:
PID: 22135  TASK: ffff880222c0e080  CPU: 1  COMMAND: "test"
 #0 [ffff880336539158] schedule at ffffffff814eeee0
 #1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e
 #2 [ffff880336539290] mutex_lock at ffffffff814f041b
 #3 [ffff8803365392b0] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
 #4 [ffff8803365392f0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
 #5 [ffff880336539330] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
 #6 [ffff880336539380] lovsub_lock_closure at ffffffffa095c19e [lov]
 #7 [ffff8803365393c0] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
 #8 [ffff880336539410] lov_sublock_lock at ffffffffa095485b [lov]
 #9 [ffff880336539470] lov_lock_enqueue at ffffffffa0955f02 [lov]
#10 [ffff880336539500] cl_enqueue_try at ffffffffa05d66cc [obdclass]
#11 [ffff880336539550] cl_enqueue_locked at ffffffffa05d7ced [obdclass]
#12 [ffff880336539590] cl_lock_request at ffffffffa05d7fae [obdclass]
#13 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
#14 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
#15 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
#16 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
#17 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
#18 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
#19 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
#20 [ffff880336539860] __do_fault at ffffffff8113bad4
#21 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
#22 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
#23 [ffff880336539a20] __do_page_fault at ffffffff81042b79
#24 [ffff880336539b40] do_page_fault at ffffffff814f495e
#25 [ffff880336539b70] page_fault at ffffffff814f1d15
    [exception RIP: __clear_user+63]
    RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
    RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
    RBP: ffff880336539c38  R8: 0000000000000000  R9: 0000000000000008
    R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
    R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#26 [ffff880336539c40] clear_user at ffffffff81279248
#27 [ffff880336539c60] padzero at ffffffff811cc93d
#28 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
#29 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
#30 [ffff880336539eb0] do_execve at ffffffff81180779
#31 [ffff880336539f20] sys_execve at ffffffff810095ea
#32 [ffff880336539f50] stub_execve at ffffffff8100b54a
    RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
    RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
    RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
    RBP: 000000000088ab08  R8: 0000000000000000  R9: 00007fffffffa598
    R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
    R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
    ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
or this:
PID: 22135  TASK: ffff880222c0e080  CPU: 8  COMMAND: "test"
 #0 [ffff880336539118] schedule at ffffffff814eeee0
 #1 [ffff8803365391e0] __mutex_lock_slowpath at ffffffff814f057e
 #2 [ffff880336539250] mutex_lock at ffffffff814f041b
 #3 [ffff880336539270] cl_lock_mutex_get at ffffffffa05d4d45 [obdclass]
 #4 [ffff8803365392b0] cl_lock_enclosure at ffffffffa05d789f [obdclass]
 #5 [ffff8803365392f0] cl_lock_closure_build at ffffffffa05d7963 [obdclass]
 #6 [ffff880336539340] lovsub_lock_closure at ffffffffa095c19e [lov]
 #7 [ffff880336539380] cl_lock_closure_build at ffffffffa05d79a0 [obdclass]
 #8 [ffff8803365393d0] lov_sublock_lock at ffffffffa095485b [lov]
 #9 [ffff880336539430] lov_lock_delete at ffffffffa0954b43 [lov]
#10 [ffff880336539490] cl_lock_delete0 at ffffffffa05d4425 [obdclass]
#11 [ffff8803365394c0] cl_lock_delete at ffffffffa05d4713 [obdclass]
#12 [ffff8803365394e0] cl_lock_finish at ffffffffa05d68c7 [obdclass]
#13 [ffff880336539500] cl_lock_hold_mutex at ffffffffa05d6cd2 [obdclass]
#14 [ffff880336539590] cl_lock_request at ffffffffa05d7f8e [obdclass]
#15 [ffff8803365395f0] cl_glimpse_lock at ffffffffa0a0e5fb [lustre]
#16 [ffff880336539650] ccc_prep_size at ffffffffa0a11b8a [lustre]
#17 [ffff8803365396b0] vvp_io_fault_start at ffffffffa0a165db [lustre]
#18 [ffff880336539720] cl_io_start at ffffffffa05d8a5a [obdclass]
#19 [ffff880336539760] cl_io_loop at ffffffffa05dc094 [obdclass]
#20 [ffff8803365397b0] ll_fault0 at ffffffffa0a00164 [lustre]
#21 [ffff880336539810] ll_fault at ffffffffa0a00428 [lustre]
#22 [ffff880336539860] __do_fault at ffffffff8113bad4
#23 [ffff8803365398f0] handle_pte_fault at ffffffff8113c087
#24 [ffff8803365399d0] handle_mm_fault at ffffffff8113ccc4
#25 [ffff880336539a20] __do_page_fault at ffffffff81042b79
#26 [ffff880336539b40] do_page_fault at ffffffff814f495e
#27 [ffff880336539b70] page_fault at ffffffff814f1d15
    [exception RIP: __clear_user+63]
    RIP: ffffffff812791df  RSP: ffff880336539c28  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000df44a8  RCX: 000000000000016b
    RDX: 0000000000000000  RSI: 0000000000000b58  RDI: 0000000000df44a8
    RBP: ffff880336539c38  R8: 0000000000000000  R9: 0000000000000008
    R10: ffff8801a44c7318  R11: 0000000000000000  R12: 0000000000000b58
    R13: 0000000000000000  R14: ffff880336539f58  R15: 0000000000df44a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
#28 [ffff880336539c40] clear_user at ffffffff81279248
#29 [ffff880336539c60] padzero at ffffffff811cc93d
#30 [ffff880336539c70] load_elf_binary at ffffffff811ce96e
#31 [ffff880336539e40] search_binary_handler at ffffffff8117f5eb
#32 [ffff880336539eb0] do_execve at ffffffff81180779
#33 [ffff880336539f20] sys_execve at ffffffff810095ea
#34 [ffff880336539f50] stub_execve at ffffffff8100b54a
    RIP: 0000003d6f6ab4a7  RSP: 00007fffffffa628  RFLAGS: 00000203
    RAX: 000000000000003b  RBX: 00002aaaaaae05a8  RCX: ffffffffffffffff
    RDX: 0000000000868c08  RSI: 000000000088afc8  RDI: 0000000000869808
    RBP: 000000000088ab08  R8: 0000000000000000  R9: 00007fffffffa598
    R10: 00007fffffffa54c  R11: 0000000000000203  R12: 0000000000869808
    R13: 000000000087c628  R14: 000000000088afc8  R15: 000000000087ce48
    ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
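Side note (not part of the original report): with two long crash(8) backtraces like these, the divergence point stands out more easily if you extract just the function names from each frame; here the traces split at lov_lock_enqueue versus lov_lock_delete, then rejoin at cl_lock_request. A minimal sketch; `frames` is a hypothetical helper, not a crash(8) feature:

```python
import re

def frames(backtrace):
    """Extract function names, in frame order, from crash(8)-style
    backtrace text of the form '#N [addr] func at addr [module]'."""
    return re.findall(r"#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\b", backtrace)

bt = ('#0 [ffff880336539158] schedule at ffffffff814eeee0 '
      '#1 [ffff880336539220] __mutex_lock_slowpath at ffffffff814f057e')
# frames(bt) yields ['schedule', '__mutex_lock_slowpath']
```

Feeding both backtraces through this and diffing the resulting lists makes it quick to see which frames the two hung threads share.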
Evidence of corrupted executable:
# ansel6 /root > objdump -x <path_to_exe>
objdump: <path_to_exe>: File truncated
# ansel6 /root > ls -l <path_to_exe>
-rwxrwx--- 1 <user> <group> 1048576 Apr 7 10:13 <path_to_exe>
Hi Jinshan,
We received this report of deadlocked processes on Lustre clients that do not have your patch applied. The stack traces make me think it is this bug, since they involve clio, page faults, and signals. However, we have not seen this exact symptom before, where the threads are deadlocked on a mutex, so I wanted to confirm with you. Thanks