Description
This process has been waiting in osc_lru_reserve for a very long time:
PID: 22025 TASK: ffff88017aeba480 CPU: 5 COMMAND: "reads"
#0 [ffff88017b2ff838] schedule at ffffffff8145ec7b
#1 [ffff88017b2ff980] osc_lru_reserve at ffffffffa0e16ee5 [osc]
#2 [ffff88017b2ffa00] osc_page_init at ffffffffa0e1710d [osc]
#3 [ffff88017b2ffa40] lov_page_init_raid0 at ffffffffa0ea48b0 [lov]
#4 [ffff88017b2ffaa0] cl_page_alloc at ffffffffa0aae632 [obdclass]
#5 [ffff88017b2ffae0] cl_page_find at ffffffffa0aae91b [obdclass]
#6 [ffff88017b2ffb30] ll_write_begin at ffffffffa0f96f8d [lustre]
#7 [ffff88017b2ffb90] generic_perform_write at ffffffff810f8242
#8 [ffff88017b2ffc10] generic_file_buffered_write at ffffffff810f83a1
#9 [ffff88017b2ffc60] __generic_file_aio_write at ffffffff810fb336
#10 [ffff88017b2ffd10] generic_file_aio_write at ffffffff810fb57c
#11 [ffff88017b2ffd50] vvp_io_write_start at ffffffffa0faae48 [lustre]
#12 [ffff88017b2ffda0] cl_io_start at ffffffffa0ab65f9 [obdclass]
#13 [ffff88017b2ffdd0] cl_io_loop at ffffffffa0aba123 [obdclass]
#14 [ffff88017b2ffe00] ll_file_io_generic at ffffffffa0f46af1 [lustre]
#15 [ffff88017b2ffe70] ll_file_aio_write at ffffffffa0f47037 [lustre]
#16 [ffff88017b2ffec0] ll_file_write at ffffffffa0f47a00 [lustre]
#17 [ffff88017b2fff10] vfs_write at ffffffff8115aeae
#18 [ffff88017b2fff40] sys_write at ffffffff8115b023
#19 [ffff88017b2fff80] system_call_fastpath at ffffffff81468d92
RIP: 00002aaaaad99630 RSP: 00007fffffffc568 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffffffff81468d92 RCX: 00007fffffffc510
RDX: 0000000000010000 RSI: 0000000000603040 RDI: 0000000000000003
RBP: 0000000000010000 R8: 0000000000000000 R9: 0101010101010101
R10: 00007fffffffc3b0 R11: 0000000000000246 R12: 0000000000010000
R13: 0000000000000001 R14: 00000000063b0000 R15: 00000000063c0000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
While testing for LU-4856, the bug described in LU-5123 caused sanity test 101a to run with ccc_lru_max = 32 (pages). I have not tried it, but it should be possible to reproduce this on master by modifying 101a to set max_dirty_mb to 128k.
This is a pathological condition, but I think it exposed a real bug. It appears that the wakeup from the sleep in osc_lru_reserve can be incidental: it is caused by another process that just happens to do something that triggers an osc_lru_shrink, rather than being issued deliberately once the condition that forced the sleep has actually been addressed.
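To make the suspected failure mode concrete, here is a minimal, hypothetical sketch of that sleep/wakeup pattern in kernel style. This is not the Lustre code: the names demo_lru_left, demo_lru_waitq, demo_lru_reserve and demo_lru_shrink are invented stand-ins for the per-client LRU page budget, the wait queue slept on in osc_lru_reserve, and the shrink path that performs the wakeup.

/*
 * Hypothetical sketch only -- not the actual osc code.  It shows a
 * reserve path that sleeps until the LRU budget has room, where the
 * only wakeup source is whichever thread happens to run the shrink
 * path.  If no such thread comes along, the sleep is unbounded, which
 * matches the stack trace above.
 */
#include <linux/atomic.h>
#include <linux/wait.h>

static atomic_t demo_lru_left = ATOMIC_INIT(32); /* tiny budget, as in the failing test */
static DECLARE_WAIT_QUEUE_HEAD(demo_lru_waitq);

/* Reserve one LRU slot, sleeping until one becomes available. */
static void demo_lru_reserve(void)
{
        /* Decrement the budget unless it is already zero. */
        while (atomic_add_unless(&demo_lru_left, -1, 0) == 0) {
                /*
                 * We are only woken when some other thread happens to
                 * call demo_lru_shrink(); nothing guarantees that a
                 * slot is being freed on our behalf.
                 */
                wait_event(demo_lru_waitq,
                           atomic_read(&demo_lru_left) > 0);
        }
}

/* Return @nr slots and wake any waiters -- the "incidental" wakeup source. */
static void demo_lru_shrink(int nr)
{
        atomic_add(nr, &demo_lru_left);
        wake_up_all(&demo_lru_waitq);
}

The point of the sketch is that demo_lru_reserve() relies entirely on someone else calling demo_lru_shrink(); with a 32-page budget and no dedicated mechanism to free pages when a waiter exists, the writer in the backtrace can sit in that loop indefinitely.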
I have a core and a debug log from a system in this state. I will attach the debug log and paste my notes in a comment.