[LU-1673] Locking issue with 1.8.x clients talking to 2.2 Servers Created: 25/Jul/12  Updated: 19/Nov/12  Resolved: 19/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: ETHz Support (Inactive) Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: client, locking, server
Environment:

client 1.8.x server 2.2


Severity: 2
Epic: client, locking, server
Rank (Obsolete): 4007

 Description   

We noticed that clients running Lustre 1.8.x have trouble taking flock() locks on files hosted on 2.2 servers.



 Comments   
Comment by ETHz Support (Inactive) [ 25/Jul/12 ]

This simple C code is enough to reproduce the problem:

$ cat test.c
#include <stdio.h>     /* printf */
#include <stdlib.h>    /* exit */
#include <errno.h>
#include <fcntl.h>     /* open, O_RDWR */
#include <unistd.h>    /* close */
#include <sys/file.h>  /* flock */

int main(void) {
    int fd;
    int r;

    printf("-- starting --\n");

    fd = open("locktest.txt", O_RDWR);
    if (fd < 0) {
        printf("open failed\n");
        exit(1);
    }

    r = flock(fd, LOCK_EX | LOCK_NB);
    if (r == -1) {
        printf("Error in flock: %d\n", errno);
        exit(1);
    }

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}
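
Built with a plain gcc invocation; the default a.out output is what the loop below runs:

$ gcc test.c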

Creating 'locktest.txt' on a 2.2 server (while using the 1.8 client) and starting the application 2-3 times causes flock() to fail:

$ rm -f locktest.txt ; touch locktest.txt ; for x in {0..5} ; do ./a.out ; sleep 1 ; done
-- starting --
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11

The 'EAGAIN' error will be gone after a couple of seconds (I suppose that's when the leaked lock times out).
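
As a rough sketch (assuming the stale lock really does expire on its own), the non-blocking flock() can be retried once per second to measure how long the leaked lock lingers:

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(void) {
    int fd = open("locktest.txt", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Retry once per second until the (presumably leaked) lock expires. */
    int tries = 0;
    int r;
    while ((r = flock(fd, LOCK_EX | LOCK_NB)) == -1 && errno == EAGAIN) {
        tries++;
        sleep(1);
    }
    if (r == 0)
        printf("lock acquired after %d retries (~%d s)\n", tries, tries);
    else
        printf("flock failed: %d\n", errno);

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}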

Note that exactly the same code works fine on:

  • A 1.8.x client talking to 1.8.x servers
  • A 2.2 client talking to 2.2 servers

The 1.8 client in my test is running:
$ cat /proc/fs/lustre/version
lustre: 1.8.7.80
kernel: patchless_client
build: ../lustre/scripts--PRISTINE-2.6.18-308.1.1.el5

The 2.2 servers are on:

bash-4.1$ uname -r
2.6.32-220.4.2.el6_lustre.x86_64

bash-4.1$ cat /proc/fs/lustre/version
lustre: 2.2.0
kernel: patchless_client
build: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64

The filesystem is mounted via:
$ grep _xl /etc/fstab
10.201.62.13@o2ib:10.201.62.14@o2ib:/nero /cluster/scratch_xl lustre flock,_netdev 0 0

Comment by Peter Jones [ 25/Jul/12 ]

Oleg is looking into this one.

Comment by Oleg Drokin [ 25/Jul/12 ]

Hm, I was under the impression that the fix for this landed in time for 2.2, but alas.

The patch that fixes this can be found here: http://review.whamcloud.com/#change,2193

Comment by ETHz Support (Inactive) [ 26/Jul/12 ]

Would we have to patch only the MDS or also all OSTs?
Upgrading all OSTs requires quite some downtime :-/

Could you provide us the patched RPMs for 2.2 servers?

Thanks in advance.

Comment by Oleg Drokin [ 27/Jul/12 ]

flocks are only taken on the MDS, so updating just the MDS is fine.

Comment by ETHz Support (Inactive) [ 27/Jul/12 ]

Could you provide us the patched RPMs?

Thanks in advance

Comment by Peter Jones [ 27/Jul/12 ]

Yang Sheng is working on creating patched RPMs.

Comment by James A Simmons [ 27/Jul/12 ]

Don't forget the patch from http://review.whamcloud.com/#change,3008 as well, since it is needed for 2.3 <-> 2.2 interop testing.

Comment by Peter Jones [ 27/Jul/12 ]

http://review.whamcloud.com/#change,3486

Comment by Oleg Drokin [ 28/Jul/12 ]

James, as you correctly mention, that change is only needed for 2.3 interop, which is not the case here, so there is no rush to get it included, especially since there is no official 2.3 release and won't be for some time.

You can get RPMs here: http://build.whamcloud.com/job/lustre-reviews/7975/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/RPMS/

Comment by Yang Sheng [ 19/Nov/12 ]

As Iurii commented in Gerrit:

Iurii Golovach		Nov 2

Patch Set 1: I would prefer that you didn't submit this

This patch looks obsolete since there are already a number of patches which cover this issue:

http://review.whamcloud.com/#change,3722
http://review.whamcloud.com/#change,3202
http://review.whamcloud.com/#change,3203
http://review.whamcloud.com/#change,3725
http://review.whamcloud.com/#change,3727

So closing this bug.
