[LU-11069] ifort lseek returns wrong position on lustre 2.10.3 Created: 01/Jun/18 Updated: 15/May/19 Resolved: 19/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
sles12 lustre 2.10.3 |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 2 | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
On lustre2.10.3 ifort lseek will return the wrong position when using append ">>" operation. This can be reproduced on client: sles12 lustre2.103 <–> server: lustre2.7.3 client: sles12 lustre2.103 <–> server: lustre2.10.3 hello.f write(6,*) "hello world"
end
Working on sles11 and lustre2.7.3 > date > tt > date >> tt > cat tt Wed May 30 13:32:58 PDT 2018 Wed May 30 13:33:04 PDT 2018 > strace ./a.out >> tt ..... ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffffffbf80) = -1 ENOTTY (Inappropriate ioctl for device) fstat(1, {st_mode=S_IFREG|0600, st_size=58, ...}) = 0 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffffffbf80) = -1 ENOTTY (Inappropriate ioctl for device) write(1, " hello world\n", 13) = 13 lseek(1, 0, SEEK_CUR) = 71 ftruncate(1, 71) = 0 > cat tt Wed May 30 13:32:58 PDT 2018 Wed May 30 13:33:04 PDT 2018 hello world
Not working on lustre2.10.3. > rm tt
> date > tt
> date >> tt
> cat tt
Wed May 30 13:41:39 PDT 2018
Wed May 30 13:41:44 PDT 2018
> strace ./a.out >> tt
....
fstat(1, {st_mode=S_IFREG|0600, st_size=58, ...}) = 0
ioctl(1, TCGETS, 0x7fffffffbf80) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " hello world\n", 13) = 13
lseek(1, 0, SEEK_CUR) = 13
ftruncate(1, 13) = 0
pfe27:/nobackup/jchang> cat tt
Wed May 30 13pfe27:/nobackup/jchang>
*Note that it's lseek that is getting the **wrong* information about current position after the append. |
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 02/Jun/18 ] |
|
Looks similar to |
| Comment by John Hammond [ 02/Jun/18 ] |
|
This is not limited to ifort and this is on master as well. n:~# cat test.c
#include <unistd.h>
int main(int argc, char *argv[])
{
write(1, "Hello world!\n", 13);
lseek(1, 0, SEEK_CUR);
ftruncate(1, 13);
return 0;
}
n:~# gcc -Wall test.c -o test
n:~# cd /mnt/lustre
n:lustre# rm -f f0
n:lustre# date >> f0
n:lustre# date >> f0
n:lustre# cat f0
Sat Jun 2 12:20:11 CDT 2018
Sat Jun 2 12:20:12 CDT 2018
n:lustre# strace ~/test >> f0
execve("/root/test", ["/root/test"], [/* 36 vars */]) = 0
brk(NULL) = 0x1c4f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6506da000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=46571, ...}) = 0
mmap(NULL, 46571, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fd6506ce000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\35\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2127336, ...}) = 0
mmap(NULL, 3940800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fd6500f7000
mprotect(0x7fd6502af000, 2097152, PROT_NONE) = 0
mmap(0x7fd6504af000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x7fd6504af000
mmap(0x7fd6504b5000, 16832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd6504b5000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6506cd000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6506cb000
arch_prctl(ARCH_SET_FS, 0x7fd6506cb740) = 0
mprotect(0x7fd6504af000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7fd6506db000, 4096, PROT_READ) = 0
munmap(0x7fd6506ce000, 46571) = 0
write(1, "Hello world!\n", 13) = 13
lseek(1, 0, SEEK_CUR) = 13
ftruncate(1, 13) = 0
exit_group(0) = ?
+++ exited with 0 +++
n:lustre# cat f0
Sat Jun 2 12n:lustre#
|
| Comment by Andreas Dilger [ 03/Jun/18 ] |
|
Note that the same behaviour with John's test program is seen on both HFS+ and ext4, so it isn't clear if this is really a Lustre bug. |
| Comment by Andreas Dilger [ 03/Jun/18 ] |
|
Mahmoud, can you please attach your original test program, since I don't think John's is exercising this problem correctly. |
| Comment by John Hammond [ 04/Jun/18 ] |
|
> Note that the same behaviour with John's test program is seen on both HFS+ and ext4, so it isn't clear if this is really a Lustre bug. When I run my test program on ext4 (3.10.0-693.11.6.el7.lustre.x86_64) I see the correct behavior: n:~# cd /tmp
n:tmp# rm -f f0
n:tmp# date >> f0
n:tmp# date >> f0
n:tmp# cat f0
Mon Jun 4 08:28:09 CDT 2018
Mon Jun 4 08:28:10 CDT 2018
n:tmp# strace ~/test >> f0
execve("/root/test", ["/root/test"], [/* 36 vars */]) = 0
brk(NULL) = 0x1ccc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff77014e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=46571, ...}) = 0
mmap(NULL, 46571, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff770142000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\35\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2127336, ...}) = 0
mmap(NULL, 3940800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ff76fb6b000
mprotect(0x7ff76fd23000, 2097152, PROT_NONE) = 0
mmap(0x7ff76ff23000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x7ff76ff23000
mmap(0x7ff76ff29000, 16832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ff76ff29000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff770141000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff77013f000
arch_prctl(ARCH_SET_FS, 0x7ff77013f740) = 0
mprotect(0x7ff76ff23000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7ff77014f000, 4096, PROT_READ) = 0
munmap(0x7ff770142000, 46571) = 0
write(1, "Hello world!\n", 13) = 13
lseek(1, 0, SEEK_CUR) = 71
ftruncate(1, 13) = 0
exit_group(0) = ?
+++ exited with 0 +++
Note that lseek(1, 0, SEEK_CUR) returns 71 (which is correct) rather than 13 as on Lustre. |
| Comment by Nathan Dauchy (Inactive) [ 04/Jun/18 ] |
|
Andreas, the fortran reproducer is the two line "hello.f" in the original description. |
| Comment by John Hammond [ 04/Jun/18 ] |
|
Bisection shows that this was introduced by https://review.whamcloud.com/26468 |
| Comment by Andreas Dilger [ 04/Jun/18 ] |
|
My confusion is that I was looking at the file output at the end, which was always truncated, due to the test program calling ftruncate(1, 13) each time. Was that supposed to be ftruncate(1, <lseek_return>) or similar? In any case, just checking the return of lseek() shows that this is indeed a Lustre-specific issue. It would be good to make the test tool check the file size before and after the append and make that clear. It seems possible to reproduce this problem with multiop instead of making a dedicated test binary for the regression test. |
| Comment by Mahmoud Hanafi [ 05/Jun/18 ] |
|
Can we expect a patch soon? Or should we pull the offending patch, in our build, until a fix is available. |
| Comment by John Hammond [ 05/Jun/18 ] |
|
Hi Mahmoud, I have a patch which passes my version of your test case. (I expect that it will pass yours as well.) I am validating it now. |
| Comment by Gerrit Updater [ 06/Jun/18 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/32641 |
| Comment by Jay Lan (Inactive) [ 06/Jun/18 ] |
|
I have a conflict applying this patch to 2.10.3 at lustre/tests/multiop.c: int main(int argc, char **argv) If you agree, I will keep fields dv, rc and last_rc. The 'dv' field might have been removed in master branch in commits that 2.10.3 (b2_10) does not have. |
| Comment by Andreas Dilger [ 06/Jun/18 ] |
|
Yes, this is no problem. dv was moved to a different part of the file in another patch, but it is harmless to leave this as-is. |
| Comment by John Hammond [ 06/Jun/18 ] |
|
Hi Jay, I encourage you to test the patch but please be aware that it has not completed full testing and review. In fact Jinshan pointed out an issue with the first patch set and I have submitted a second patch set. So we would not advise you to put this change into production just yet. |
| Comment by Jay Lan (Inactive) [ 07/Jun/18 ] |
|
More problems in the patch when cherry-picked to b2_10 without being flagged as conflicts. It then caused compilation errors. Could you please back port to b2_10? Thanks! |
| Comment by Gerrit Updater [ 07/Jun/18 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/32661 |
| Comment by Jay Lan (Inactive) [ 07/Jun/18 ] |
|
multiop.c: In function 'main': |
| Comment by John Hammond [ 07/Jun/18 ] |
|
Hi Jay, A port of the change to b2_10 is available at https://review.whamcloud.com/#/c/32661/2. |
| Comment by Mahmoud Hanafi [ 08/Jun/18 ] |
|
The patch fixed the lseek issue in our fortran test. When can we expect for it to land. $ date > tt $ date >> tt $ strace ./a.out >> tt .... ioctl(1, TCGETS, 0x7fffffff77a0) = -1 ENOTTY (Inappropriate ioctl for device) ioctl(1, TCGETS, 0x7fffffffbd00) = -1 ENOTTY (Inappropriate ioctl for device) fstat(1, {st_mode=S_IFREG|0600, st_size=58, ...}) = 0 ioctl(1, TCGETS, 0x7fffffffbd00) = -1 ENOTTY (Inappropriate ioctl for device) write(1, " hello world\n", 13) = 13 lseek(1, 0, SEEK_CUR) = 71 ftruncate(1, 71) = 0 ..... $ cat tt Fri Jun 8 16:00:58 PDT 2018 Fri Jun 8 16:01:01 PDT 2018 hello world |
| Comment by Peter Jones [ 09/Jun/18 ] |
|
Mahmoud If all goes well it should land to master mid-next week. Peter |
| Comment by Gerrit Updater [ 14/Jun/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/32641/ |
| Comment by Peter Jones [ 19/Jun/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 03/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32661/ |