This simple C code is enough to reproduce the problem:
$ cat test.c
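#include <sys/file.h>   /* flock(), LOCK_* */
#include <errno.h>
#include <fcntl.h>      /* open(), O_RDWR */
#include <stdio.h>      /* printf() */
#include <stdlib.h>     /* exit() */
#include <unistd.h>     /* close() */

int main(void)
{
    int fd;
    int r;

    printf("-- starting --\n");
    fd = open("locktest.txt", O_RDWR);
    if (fd < 0)
    {
        printf("open failed\n");
        exit(1);
    }
    /* Non-blocking exclusive lock: fails immediately if the lock is held. */
    r = flock(fd, LOCK_EX|LOCK_NB);
    if (r == -1)
    {
        printf("Error in flock: %d\n", errno);
        exit(1);
    }
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}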
Creating 'locktest.txt' on a 2.2 server (while using the 1.8 client) and starting the application ~2-3 times causes flock() to fail:
$ rm -f locktest.txt ; touch locktest.txt ; for x in {0..5} ; do ./a.out ; sleep 1 ; done
-- starting --
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
-- starting --
Error in flock: 11
The 'EAGAIN' error (errno 11) goes away after a couple of seconds (I suppose that's when the leaked
lock times out).
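To put a number on "a couple of seconds", something like the following could be used. This is only a sketch: `flock_retry` and `seconds_until_lockable` are hypothetical helpers that are not part of the original test program, and the one-second polling interval is an arbitrary choice. It retries the same non-blocking exclusive flock() for as long as it fails with EWOULDBLOCK (the same value as EAGAIN on Linux) and reports how many retries were needed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: try flock(LOCK_EX|LOCK_NB) up to max_attempts times,
 * sleeping delay_sec seconds between tries while it fails with EWOULDBLOCK.
 * Returns the number of retries that were needed (0 = first try succeeded),
 * or -1 with errno set if the lock could not be taken.
 */
int flock_retry(int fd, int max_attempts, unsigned int delay_sec)
{
    int attempt;

    assert(max_attempts > 0);
    for (attempt = 0; attempt < max_attempts; attempt++) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return attempt;
        if (errno != EWOULDBLOCK)
            return -1;              /* a different failure: do not retry */
        if (attempt + 1 < max_attempts)
            sleep(delay_sec);
    }
    errno = EWOULDBLOCK;            /* still held after the last attempt */
    return -1;
}

/*
 * Hypothetical helper: open an existing file, poll once per second until the
 * lock can be taken, print the result, then unlock and close.
 * Returns the number of one-second waits, or -1 on failure.
 */
int seconds_until_lockable(const char *path, int max_attempts)
{
    int waited;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    waited = flock_retry(fd, max_attempts, 1);
    if (waited >= 0) {
        printf("lock acquired after %d retries\n", waited);
        flock(fd, LOCK_UN);
    } else {
        printf("flock still failing: %s\n", strerror(errno));
    }
    close(fd);
    return waited;
}
```

Calling `seconds_until_lockable("locktest.txt", 30)` right after a failing run of the test program should then show roughly how long the lock stays stuck, assuming it is released within 30 seconds.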
Note that exactly the same code works fine on:
- A 1.8.x client talking to 1.8.x servers
- A 2.2 client talking to 2.2 servers
The 1.8 client in my test is running:
$ cat /proc/fs/lustre/version
lustre: 1.8.7.80
kernel: patchless_client
build: ../lustre/scripts--PRISTINE-2.6.18-308.1.1.el5
The 2.2 servers are on:
bash-4.1$ uname -r
2.6.32-220.4.2.el6_lustre.x86_64
bash-4.1$ cat /proc/fs/lustre/version
lustre: 2.2.0
kernel: patchless_client
build: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
The filesystem is mounted via:
$ grep _xl /etc/fstab
10.201.62.13@o2ib:10.201.62.14@o2ib:/nero /cluster/scratch_xl lustre flock,_netdev 0 0
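To rule out a client that was mounted without the flock option, the active mount table can be checked as well as /etc/fstab. A minimal sketch using the glibc mntent interface follows; `mount_has_option` is a hypothetical helper name, and passing "/proc/mounts" as the table is an assumption about where the client exposes its live mounts.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up mount point `dir` in the mount table file
 * `table` (e.g. "/proc/mounts" or "/etc/fstab").
 * Returns 1 if the entry carries option `opt`, 0 if it does not,
 * and -1 if the table cannot be read or `dir` is not listed.
 */
int mount_has_option(const char *table, const char *dir, const char *opt)
{
    FILE *fp;
    struct mntent *m;
    int result = -1;

    assert(table != NULL && dir != NULL && opt != NULL);
    fp = setmntent(table, "r");
    if (fp == NULL)
        return -1;
    while ((m = getmntent(fp)) != NULL) {
        /* Keep scanning: a later entry for the same directory wins. */
        if (strcmp(m->mnt_dir, dir) == 0)
            result = (hasmntopt(m, opt) != NULL) ? 1 : 0;
    }
    endmntent(fp);
    return result;
}
```

For the mount above, `mount_has_option("/proc/mounts", "/cluster/scratch_xl", "flock")` would be expected to return 1 on the client.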
Yangsheng is working on creating patched RPMs.