Details
Bug
Resolution: Fixed
Major
Lustre 2.8.0
3
Description
Running Lustre 2.8.0_0.0.llnlpreview.18 on the clients (see lustre-release-fe-llnl), we are regularly seeing the /etc/slurm/prolog script hang when it triggers drop_caches. This script runs before each job to clear out the cache left by any previous jobs.
In particular it hangs here:
# Flush slab cache entries
echo 2 >/proc/sys/vm/drop_caches
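For context, the value written to drop_caches acts as a bitmask: per the kernel's sysctl documentation, 1 frees the page cache, 2 frees reclaimable slab objects (dentries and inodes), and 3 frees both. The following is only an illustrative sketch of that decoding; the helper name `describe_drop_caches` is hypothetical and not part of the prolog script.

```python
from typing import List

# Illustrative helper (hypothetical name): decode the value written to
# /proc/sys/vm/drop_caches, following the kernel sysctl documentation.
PAGECACHE = 1  # bit 0: free the page cache
SLAB = 2       # bit 1: free reclaimable slab objects (dentries, inodes)


def describe_drop_caches(value: int) -> List[str]:
    """Return the caches that a given drop_caches value asks the kernel to free."""
    targets = []
    if value & PAGECACHE:
        targets.append("pagecache")
    if value & SLAB:
        targets.append("slab (dentries and inodes)")
    return targets


print(describe_drop_caches(2))
```

Writing 2 therefore requests slab reclaim only, which is consistent with the hung process going through shrink_slab and prune_icache_sb into inode eviction.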
And this is the backtrace for where it is getting stuck:
crash> bt -xs 1386
PID: 1386 TASK: ffff88201b0a5080 CPU: 10 COMMAND: "prolog"
#0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
#1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
#2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
#3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
#4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
#5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
#6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
#7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
#8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
#9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
#10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
#11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
#12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
#13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
#14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
#15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
RIP: 00007ffff76d3500 RSP: 00007fffffffe180 RFLAGS: 00010206
RAX: 0000000000000001 RBX: ffffffff8165d709 RCX: 0000000000000400
RDX: 0000000000000002 RSI: 00007ffff7ff8000 RDI: 0000000000000001
RBP: 00007ffff7ff8000 R8: 000000000000000a R9: 00007ffff7fbd740
R10: 00007fffffffe670 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000002 R14: 00007ffff79a7400 R15: 0000000000000002
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
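When comparing many hung prolog processes, it can help to pull the innermost Lustre frame out of each backtrace automatically. The sketch below is a minimal example that assumes frame lines have the same layout as the `bt -xs` output above; the names `parse_frames` and `innermost_module_frame` are hypothetical, and the sample is copied from the first six frames of this trace.

```python
import re

# Matches one frame line of crash `bt -xs` output, for example:
#   #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
# The trailing "[module]" part is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r"#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)\s+at\s+(?P<addr>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>[^\]]+)\])?"
)


def parse_frames(text):
    """Return a list of dicts, one per stack frame found in the text."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "index": int(m["idx"]),
                "symbol": m["sym"],
                "offset": int(m["off"], 16),
                "module": m["mod"],  # None for core kernel frames
            })
    return frames


def innermost_module_frame(frames, module):
    """Frames are listed innermost first, so the first match is the innermost."""
    for frame in frames:
        if frame["module"] == module:
            return frame
    return None


SAMPLE = """\
#0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
#1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
#2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
#3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
#4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
#5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
"""

frames = parse_frames(SAMPLE)
hit = innermost_module_frame(frames, "lustre")
print(hit["symbol"], hex(hit["offset"]))  # prints: cl_inode_fini 0x1ac
```

For this trace the innermost Lustre frame is cl_inode_fini+0x1ac, the point from which the process calls schedule() and does not return.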