
LU-1673: Locking issue with 1.8.x clients talking to 2.2 Servers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0
    • Environment: client 1.8.x, server 2.2

    Description

      We noticed that clients running Lustre 1.8.x seem to have trouble locking files hosted on 2.2 servers.

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Yangsheng is working on creating patched RPMs


          ethz.support ETHz Support (Inactive) added a comment -

          Could you provide us with patched RPMs?

          Thanks in advance.
          green Oleg Drokin added a comment -

          flocks are only taken on the MDS, so updating just the MDS is fine.
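
          (An editorial aside, a minimal sketch of that point rather than anything from this ticket; the shared file name and the two-node setup are assumptions. Because flock state is coordinated through the MDS, an exclusive lock held on one client should block a non-blocking request from any other client of the same filesystem; running this on two nodes against the same file illustrates the cluster-wide semantics.)

          #include <stdio.h>
          #include <errno.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/file.h>

          int main(void) {
              int fd = open("locktest.txt", O_RDWR);   /* hypothetical shared file */
              if (fd < 0) { perror("open"); return 1; }

              if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
                  /* EWOULDBLOCK here means the other node already holds the lock */
                  printf("lock busy (errno %d)\n", errno);
                  close(fd);
                  return 1;
              }
              printf("lock acquired; holding for 30s, run this on the second node now\n");
              sleep(30);                /* window for the second node to contend */
              flock(fd, LOCK_UN);
              close(fd);
              return 0;
          }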


          ethz.support ETHz Support (Inactive) added a comment -

          Would we have to patch only the MDS, or all of the OSTs as well?
          Upgrading all OSTs requires quite some downtime :-/

          Could you provide us with RPMs patched for the 2.2 servers?

          Thanks in advance.
          green Oleg Drokin added a comment -

          Hm, I was under the impression that the fix for this landed in time for 2.2, but alas.

          The patch that fixes this can be found here: http://review.whamcloud.com/#change,2193

          pjones Peter Jones added a comment -

          Oleg is looking into this one.


          ethz.support ETHz Support (Inactive) added a comment -

          This simple C program is enough to reproduce the problem:

          $ cat test.c
          #include <stdio.h>
          #include <stdlib.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/file.h>
          #include <errno.h>

          int main(void) {
              int fd;
              int r;

              printf("-- starting --\n");

              /* open the test file on the Lustre mount */
              fd = open("locktest.txt", O_RDWR);
              if (fd < 0) {
                  printf("open failed\n");
                  exit(1);
              }

              /* take a non-blocking exclusive flock */
              r = flock(fd, LOCK_EX | LOCK_NB);
              if (r == -1) {
                  printf("Error in flock: %d\n", errno);
                  exit(1);
              }

              /* release the lock and clean up */
              flock(fd, LOCK_UN);
              close(fd);
              return 0;
          }

          Creating 'locktest.txt' on a 2.2 server (while using the 1.8 client) and starting the application two or three times causes flock() to fail:

          $ rm -f locktest.txt ; touch locktest.txt ; for x in {0..5} ; do ./a.out ; sleep 1 ; done
          -- starting --
          -- starting --
          Error in flock: 11
          -- starting --
          Error in flock: 11
          -- starting --
          Error in flock: 11
          -- starting --
          Error in flock: 11
          -- starting --
          Error in flock: 11

          The EAGAIN error (errno 11) goes away after a couple of seconds (I suppose that is when the leaked lock times out).
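
          (Editorial aside, a hedged client-side workaround sketch rather than a fix from this ticket; the helper name flock_retry is hypothetical. It simply retries the non-blocking flock once per second until the leaked server-side lock expires, giving up after a bound.)

          #include <errno.h>
          #include <unistd.h>
          #include <sys/file.h>

          /* Retry LOCK_EX|LOCK_NB once per second while the server still
           * reports the (leaked) conflicting lock; give up after max_secs. */
          static int flock_retry(int fd, int max_secs) {
              int i;
              for (i = 0; i < max_secs; i++) {
                  if (flock(fd, LOCK_EX | LOCK_NB) == 0)
                      return 0;                  /* lock acquired */
                  if (errno != EAGAIN && errno != EWOULDBLOCK)
                      return -1;                 /* real failure, do not retry */
                  sleep(1);                      /* wait for the stale lock to expire */
              }
              errno = EAGAIN;
              return -1;
          }

          Calling flock_retry(fd, 30) in place of the bare flock() in the reproducer would only mask the symptom; the patch linked above is of course the real fix.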

          Note that exactly the same code works fine on:

          • A 1.8.x client talking to 1.8.x servers
          • A 2.2 client talking to 2.2 servers

          The 1.8 client in my test is running:
          $ cat /proc/fs/lustre/version
          lustre: 1.8.7.80
          kernel: patchless_client
          build: ../lustre/scripts--PRISTINE-2.6.18-308.1.1.el5

          The 2.2 servers are on:

          bash-4.1$ uname -r
          2.6.32-220.4.2.el6_lustre.x86_64

          bash-4.1$ cat /proc/fs/lustre/version
          lustre: 2.2.0
          kernel: patchless_client
          build: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64

          The filesystem is mounted via:
          $ grep _xl /etc/fstab
          10.201.62.13@o2ib:10.201.62.14@o2ib:/nero /cluster/scratch_xl lustre flock,_netdev 0 0
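
          (One more editorial aside, assuming the documented Lustre behaviour that flock() fails with ENOSYS when the filesystem is mounted without the flock or localflock option; this small probe distinguishes "flock disabled on the mount" from the EAGAIN failure above.)

          #include <stdio.h>
          #include <errno.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/file.h>

          /* Probe whether flock is usable on this mount at all. */
          int main(void) {
              int fd = open("locktest.txt", O_RDWR);
              if (fd < 0) { perror("open"); return 1; }
              if (flock(fd, LOCK_EX | LOCK_NB) == -1 && errno == ENOSYS)
                  printf("flock not enabled on this mount (remount with -o flock)\n");
              else
                  flock(fd, LOCK_UN);
              close(fd);
              return 0;
          }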


          People

            Assignee: Yang Sheng (ys)
            Reporter: ETHz Support (Inactive) (ethz.support)
            Votes: 0
            Watchers: 3
