Lustre / LU-16769

localflock can perform really badly


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Running 2.14.0_2_gb280f22 on Ubuntu 18.04.

      Mounting Lustre with localflock can seriously hurt both Lustre and non-Lustre performance.

      We noticed that our compute nodes were sometimes hitting very bad performance even when Lustre was not in the path.

      This is an strace of ls /:

      08:30:54 execve("/bin/ls", ["ls"], 0x7fff5a2ab698 /* 88 vars */) = 0
      08:30:54 brk(NULL)                      = 0x55fa6a4e9000
      08:30:54 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
      08:30:54 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fae19898000
      08:30:54 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/tls/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/tls", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/usr/lib/oracle/12.1/client64/lib/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/usr/lib/oracle/12.1/client64/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
      08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:54 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:54 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:55 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:55 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/tls", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:56 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:56 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/haswell", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/avx512_1", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/x86_64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/x86_64", 0x7ffe359be700) = -1 ENOENT (No such file or directory)
      08:30:57 openat(AT_FDCWD, "/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib/libselinux.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
      08:30:57 stat("/software/lsf-cgp3/10.1/linux3.10-glibc2.17-x86_64/lib", {st_mode=S_IFDIR|0755, st_size=2523, ...}) = 0
      08:30:57 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
      08:30:57 fstat(3, {st_mode=S_IFREG|0644, st_size=146843, ...}) = 0
      08:30:57 mmap(NULL, 146843, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fae19874000
      08:30:57 close(3)                       = 0

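      We discovered that a system mounting Lustre with localflock could get its inode cache into a state where stat results were no longer cached on any filesystem, not just on Lustre.

      In the strace above, LD_LIBRARY_PATH had been extended to include a directory on a relatively slow NFS server. With nothing cached, every failed lookup goes to that server, which is why the search for libselinux.so.1 under /software takes about three seconds (08:30:54 to 08:30:57).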
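      To put a number on a trace like the one above, the failed lookups can be counted directly. This is a minimal sketch, not part of the original report: summarise_strace is a hypothetical helper name, and it assumes the log was produced with strace -t (one wall-clock timestamp per line) and is fed on stdin.

```shell
#!/bin/bash
# summarise_strace (hypothetical helper): read an `strace -t` log on stdin,
# count openat()/stat() calls that failed with ENOENT, and print the first
# and last timestamps seen so the total elapsed time is easy to read off.
summarise_strace() {
  awk '
    NF == 0      { next }            # skip blank lines
    first == ""  { first = $1 }      # timestamp of the first call
                 { last = $1 }       # timestamp of the most recent call
    /= -1 ENOENT/ && $2 ~ /^(openat|stat)\(/ { misses++ }
    END { printf "failed_lookups=%d first=%s last=%s\n", misses + 0, first, last }
  '
}

# Example: strace -t -o ls.trace ls / ; summarise_strace < ls.trace
```

      A large failed_lookups count combined with a gap of several seconds between first and last matches the behaviour described here: many library-path probes, none of them answered from cache.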

       

      We discovered that

      echo 2 > /proc/sys/vm/drop_caches

      temporarily fixed the issue, and we ended up with a script that runs from cron:

      #!/bin/bash
      # Watchdog run from cron: if a trivial "ls /" is slow, log diagnostics and drop caches.
      # Put the LSF lib directory on the library search path so lookups hit the NFS server.
      export LD_LIBRARY_PATH=$(find /software/lsf-*/10.1/linux3.10-glibc2.17-x86_64/ -maxdepth 1 -type d -name lib)
      # timeout exits non-zero if "ls /" has not finished within 2 seconds.
      if ! timeout 2s ls / > /dev/null; then
        # Record state before dropping caches.
        cat /proc/slabinfo | logger -t _lustre_before_slabinfo
        cat /proc/meminfo  | logger -t _lustre_before_meminfo
        grep localflock /proc/mounts | wc -l | logger -t _lustre_before_localflock
        lctl get_param 'ldlm.namespaces.lus*-OST*/lru_size' | logger -t _lustre_before_lru_size
        # Free reclaimable slab objects (dentries and inodes).
        echo 2 > /proc/sys/vm/drop_caches
        # The blankline above here is important as template
        logger -t _lustre_after_did_we_do_it  "LUSTRE drop cache 2 "
        # Record state after dropping caches.
        cat /proc/slabinfo | logger -t _lustre_after_slabinfo
        cat /proc/meminfo  | logger -t _lustre_after_meminfo
        lctl get_param 'ldlm.namespaces.lus*-OST*/lru_size' | logger -t _lustre_after_lru_size
      fi
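      The script logs the whole of /proc/slabinfo before and after the cache drop. When comparing the two snapshots it may be easier to pull out a single cache at a time. The following is a sketch under that assumption; slab_active is a hypothetical helper name, and it relies only on the standard slabinfo layout in which the first column is the cache name and the second is the active object count.

```shell
#!/bin/bash
# slab_active (hypothetical helper): print the active-object count for one
# slab cache. Reads /proc/slabinfo-formatted text on stdin.
# Column 1 is the cache name, column 2 is <active_objs>.
# Exits non-zero if the named cache is not present in the input.
slab_active() {
  awk -v cache="$1" '
    $1 == cache { print $2; found = 1 }
    END { exit !found }
  '
}

# Example (reading /proc/slabinfo normally requires root):
#   slab_active dentry      < /proc/slabinfo
#   slab_active inode_cache < /proc/slabinfo
```

      Running it on the saved before and after snapshots shows how many dentry and inode objects the cache drop actually released.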

      After we removed localflock from the mount options, this script no longer fires.
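      To see which clients are still affected, the mount table can be checked for the option. This is a minimal sketch rather than anything from the original report: localflock_mounts is a hypothetical helper name, and it assumes the option appears literally as localflock in the options field of /proc/mounts, as the grep in the script above also assumes.

```shell
#!/bin/bash
# localflock_mounts (hypothetical helper): print the mount point of every
# Lustre mount whose options include localflock. Reads /proc/mounts-formatted
# text on stdin (fields: device, mount point, fstype, options, dump, pass).
localflock_mounts() {
  awk '$3 == "lustre" {
         n = split($4, opts, ",")
         for (i = 1; i <= n; i++)
           if (opts[i] == "localflock") { print $2; break }
       }'
}

# Example: localflock_mounts < /proc/mounts
```

      Matching on the comma-separated option list, rather than grepping the whole line, avoids false hits from a device or path that happens to contain the string.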

       

      If this can be reproduced, then either the option should be fixed or a warning should be emitted when localflock is used.

       


          People

            Assignee: WC Triage
            Reporter: James Beal