Details
Bug
Resolution: Unresolved
Minor
Lustre 2.14.0
Description
Running 2.14.0_2_gb280f22 on Ubuntu 18.04.
Mounting Lustre with localflock can seriously hurt both Lustre and non-Lustre performance.
We noticed that our compute nodes sometimes hit very bad performance even when Lustre was not in the path.
This is an strace of ls /:
08:30:54 execve("/bin/ls", ["ls"], 0x7fff5a2ab698 /* 88 vars */) = 0
08:30:54 brk(NULL) = 0x55fa6a4e9000
08:30:54 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
08:30:54 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fae19898000
08:30:54 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/usr/lib/oracle/12.1/client64/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib", {st_mode=S_IFDIR|0755, st_size=2523, ...}) = 0
08:30:57 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
08:30:57 fstat(3, {st_mode=S_IFREG|0644, st_size=146843, ...}) = 0
08:30:57 mmap(NULL, 146843, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fae19874000
08:30:57 close(3) = 0
What we discovered was that a system mounting Lustre with localflock could degrade the "inode cache" so that stat results were no longer cached, on any filesystem.
In the strace above, LD_LIBRARY_PATH had been extended to include a directory on a relatively slow NFS server, so every uncached lookup under /software went over the network; the timestamps advance only during those lookups.
We discovered that
echo 2 > /proc/sys/vm/drop_caches
temporarily fixed the issue, and we ended up with a script that runs from cron:
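To see at a glance where the failed lookups in a trace like the one above land, the log can be grouped by top-level directory. This is only a sketch: the function name enoent_by_root is made up for illustration, and it assumes the log has the same layout as the strace output shown in this ticket, with the path as the first double-quoted string on each line.

```shell
#!/bin/bash
# Count failed (ENOENT) lookups in an strace log, grouped by the first
# path component. Reads the log on stdin.
# enoent_by_root is a hypothetical helper name, not part of any tool.
enoent_by_root() {
    awk -F'"' '/ENOENT/ && NF >= 3 {
        # $2 is the first quoted string, i.e. the path being looked up.
        split($2, parts, "/")
        count["/" parts[2]]++
    }
    END { for (root in count) print count[root], root }' | sort -rn
}
```

Saving the trace to a file and running `enoent_by_root < strace.log` prints one line per top-level directory, most failures first. Each of those failures has to go back to the filesystem (NFS for /software) when negative lookups are not being cached.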
#!/bin/bash
# Put the LSF lib directory (on NFS) on the library search path so that ls has to look there.
export LD_LIBRARY_PATH=$(find /software/lsf-*/10.1/linux3.10-glibc2.17-x86_64/ -maxdepth 1 -type d -name lib)
# If a plain ls / fails or takes longer than 2 seconds, log state and drop caches.
if ! timeout 2s ls / > /dev/null; then
    # Record state before dropping caches.
    logger -t _lustre_before_slabinfo < /proc/slabinfo
    logger -t _lustre_before_meminfo < /proc/meminfo
    grep -c localflock /proc/mounts | logger -t _lustre_before_localflock
    lctl get_param 'ldlm.namespaces.lus*-OST*/lru_size' | logger -t _lustre_before_lru_size
    echo 2 > /proc/sys/vm/drop_caches

    # The blank line above here is important as template
    logger -t _lustre_after_did_we_do_it "LUSTRE drop cache 2 "
    # Record state after dropping caches.
    logger -t _lustre_after_slabinfo < /proc/slabinfo
    logger -t _lustre_after_meminfo < /proc/meminfo
    lctl get_param 'ldlm.namespaces.lus*-OST*/lru_size' | logger -t _lustre_after_lru_size
fi
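Writing 2 to /proc/sys/vm/drop_caches asks the kernel to free reclaimable slab objects, which include dentries and inodes. A rough way to watch the effect is to read /proc/sys/fs/dentry-state before and after. The sketch below is not part of the cron script; the function names are made up for illustration, and it assumes the first two fields of dentry-state are the total and unused dentry counts.

```shell
#!/bin/bash
# Print "nr_dentry nr_unused" (the first two fields) from a dentry-state file.
# Defaults to the live kernel file; a path can be passed for testing.
# dentry_counts and drop_slab_and_report are hypothetical helper names.
dentry_counts() {
    local state_file="${1:-/proc/sys/fs/dentry-state}"
    awk '{ print $1, $2 }' "$state_file"
}

# Drop reclaimable slab objects and report dentry counts before and after.
# Needs root, because it writes to /proc/sys/vm/drop_caches.
drop_slab_and_report() {
    echo "before: $(dentry_counts)"
    echo 2 > /proc/sys/vm/drop_caches
    echo "after:  $(dentry_counts)"
}
```

Calling `drop_slab_and_report` as root on an affected client shows how many dentries the drop released, which could be logged alongside the slabinfo and meminfo dumps.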
We found that after removing localflock from the mount options, this script no longer fires.
If this can be reproduced, the option should either be fixed or a warning should be emitted when localflock is used.
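Until something like that exists in Lustre itself, a client-side check could flag affected mounts. The following is a minimal sketch, assuming (as the grep in the cron script does) that localflock appears in the options column of /proc/mounts and that the filesystem type is reported as lustre. The function names are hypothetical.

```shell
#!/bin/bash
# List mount points of Lustre filesystems mounted with localflock.
# Reads /proc/mounts by default; another file in the same format can be
# passed for testing. Both function names are hypothetical.
localflock_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    # Fields: device mountpoint fstype options dump pass
    awk '$3 == "lustre" && $4 ~ /(^|,)localflock(,|$)/ { print $2 }' "$mounts_file"
}

# Print a warning on stderr for each affected mount point.
warn_localflock() {
    local mount_point
    localflock_mounts "$@" | while read -r mount_point; do
        echo "warning: $mount_point is mounted with localflock" >&2
    done
}
```

Running `warn_localflock` from the same cron job would at least make the option visible in the logs on clients that use it.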