Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.8.0
- 3
Description
Running Lustre 2.8.0_0.0.llnlpreview.18 on the clients (see lustre-release-fe-llnl), we regularly see the /etc/slurm/prolog script hang when it triggers drop_caches. The script runs before each job to clear out cached data left behind by previous jobs.
In particular it hangs here:
# Flush slab cache entries
echo 2 >/proc/sys/vm/drop_caches
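For context, the value written to drop_caches selects what the kernel tries to reclaim. The helper below is only an illustrative sketch: `describe_drop_caches_mode` is a hypothetical name, not part of the prolog script, and it only prints a description without touching /proc.

```shell
#!/bin/sh
# Hypothetical helper: describe what each drop_caches value asks the
# kernel to reclaim. It only prints text and never writes to /proc.
describe_drop_caches_mode() {
    case "$1" in
        1) echo "page cache" ;;
        2) echo "reclaimable slab objects (dentries and inodes)" ;;
        3) echo "page cache and reclaimable slab objects" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# The prolog writes 2, so only the slab (dentry/inode) path is exercised.
describe_drop_caches_mode 2
```

Mode 2 is the one that walks the inode cache shrinker, which is consistent with a stall during inode eviction rather than during page cache release.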
This is the backtrace showing where it gets stuck:
crash> bt -xs 1386
PID: 1386   TASK: ffff88201b0a5080  CPU: 10  COMMAND: "prolog"
 #0 [ffff882011bd3af8] __schedule+0x295 at ffffffff81651975
 #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049
 #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]
 #3 [ffff882011bd3c10] ll_clear_inode+0x21c at ffffffffa0c377ec [lustre]
 #4 [ffff882011bd3c38] ll_delete_inode+0x58 at ffffffffa0c39048 [lustre]
 #5 [ffff882011bd3c60] evict+0xa7 at ffffffff81204077
 #6 [ffff882011bd3c88] dispose_list+0x3e at ffffffff8120417e
 #7 [ffff882011bd3cb0] prune_icache_sb+0x163 at ffffffff81205113
 #8 [ffff882011bd3d18] prune_super+0x143 at ffffffff811ea343
 #9 [ffff882011bd3d50] shrink_slab+0x175 at ffffffff81183a25
#10 [ffff882011bd3e08] drop_caches_sysctl_handler+0x283 at ffffffff8124a743
#11 [ffff882011bd3e90] proc_sys_call_handler+0xd3 at ffffffff81260f03
#12 [ffff882011bd3ee8] proc_sys_write+0x14 at ffffffff81260f34
#13 [ffff882011bd3ef8] vfs_write+0xbd at ffffffff811e7bfd
#14 [ffff882011bd3f38] sys_write+0x7f at ffffffff811e869f
#15 [ffff882011bd3f80] system_call_fastpath+0x16 at ffffffff8165d709
    RIP: 00007ffff76d3500  RSP: 00007fffffffe180  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8165d709  RCX: 0000000000000400
    RDX: 0000000000000002  RSI: 00007ffff7ff8000  RDI: 0000000000000001
    RBP: 00007ffff7ff8000   R8: 000000000000000a   R9: 00007ffff7fbd740
    R10: 00007fffffffe670  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000002  R14: 00007ffff79a7400  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
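When comparing several such traces, it can help to pull out just the frames that belong to the Lustre module. The following is a minimal sketch, assuming the backtrace has been saved with one frame per line in the usual crash layout; `lustre_frames` is a hypothetical helper name, and the filter relies only on the trailing `[lustre]` tag that crash prints for module symbols.

```shell
#!/bin/sh
# Hypothetical helper: read a crash backtrace on stdin (one frame per
# line) and print the function name of every frame tagged [lustre].
lustre_frames() {
    awk '/\[lustre\]$/ {
        # Field 3 looks like "cl_inode_fini+0x1ac"; strip the offset.
        sub(/\+0x[0-9a-f]+$/, "", $3)
        print $3
    }'
}

# Example with two frames taken from the trace above.
printf '%s\n' \
    ' #1 [ffff882011bd3b60] schedule+0x29 at ffffffff81652049' \
    ' #2 [ffff882011bd3b70] cl_inode_fini+0x1ac at ffffffffa0c6b3ac [lustre]' \
    | lustre_frames
```

Applied to the full trace, this would list cl_inode_fini, ll_clear_inode and ll_delete_inode, with cl_inode_fini being the innermost Lustre frame before the task goes to sleep in schedule.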