This simple C code is enough to reproduce the problem:
$ cat test.c
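#include <sys/file.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd;
    int r;

    printf("-- starting --\n");
    fd = open("locktest.txt", O_RDWR);
    if(fd < 0)
    {
        printf("open failed\n");
        exit(1);
    }
    /* Non-blocking exclusive lock: fails with EAGAIN (11) if the lock is held. */
    r = flock(fd, LOCK_EX|LOCK_NB);
    if(r == -1)
    {
        printf("Error in flock: %d\n", errno);
        exit(1);
    }
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}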
Creating 'locktest.txt' on a 2.2 server (while using the 1.8 client) and starting the application ~2-3 times causes flock() to fail:
rm -f locktest.txt ; touch locktest.txt ; for x in {0..5} ; do ./a.out ; sleep 1 ; done
-- starting --
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
The 'EAGAIN' error goes away after a couple of seconds (I suppose that's when the leaked lock times out).
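To put a number on "a couple of seconds", the reproducer could poll for the lock instead of exiting on the first EAGAIN. The following is only a rough sketch: the helper name flock_retry and the one-second polling interval are my own assumptions and are not part of the test program above.

```c
#include <sys/file.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not part of the original reproducer): try to take
 * an exclusive, non-blocking flock() once per second.
 * Returns the number of seconds waited on success, or -1 with errno set
 * if the lock is still busy after max_wait seconds or on any other error. */
int flock_retry(int fd, int max_wait)
{
    int waited;

    for (waited = 0; ; waited++) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return waited;          /* lock acquired */
        if (errno != EWOULDBLOCK)   /* same value as EAGAIN (11) on Linux */
            return -1;              /* unrelated failure, give up */
        if (waited >= max_wait)
            return -1;              /* still busy after max_wait seconds */
        sleep(1);
    }
}
```

Calling flock_retry(fd, 30) in place of the plain flock() call and printing the return value would show roughly how long the stale lock lingers on the affected client/server combination.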
Note that exactly the same code works fine on:
- A 1.8.x client talking to 1.8.x servers
- A 2.2 client talking to 2.2 servers
The 1.8 client in my test is running:
$ cat /proc/fs/lustre/version
lustre: 1.8.7.80
kernel: patchless_client
build: ../lustre/scripts--PRISTINE-2.6.18-308.1.1.el5
The 2.2 servers are on:
bash-4.1$ uname -r
2.6.32-220.4.2.el6_lustre.x86_64
bash-4.1$ cat /proc/fs/lustre/version
lustre: 2.2.0
kernel: patchless_client
build: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
The filesystem is mounted via:
$ grep _xl /etc/fstab
10.201.62.13@o2ib:10.201.62.14@o2ib:/nero /cluster/scratch_xl lustre flock,_netdev 0 0
James, as you correctly mention, that change is only needed for 2.3 interop, which is not the case here, so there is no rush to get it included, especially since there is no official 2.3 release and there won't be one for some time.
You can get RPMs here: http://build.whamcloud.com/job/lustre-reviews/7975/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/RPMS/