[LU-5704] lfs getstripe hangs on fifo files Created: 02/Oct/14 Updated: 16/Sep/16 Resolved: 02/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mahmoud Hanafi | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 15976 | ||||
| Description |
|
lfs getstripe on FIFO files hangs in the open() system call. This worked in 2.1.x.
mhanafi@pfe23:/nobackupp8/mhanafi> rm testfifo
mhanafi@pfe23:/nobackupp8/mhanafi> mkfifo testfifo
mhanafi@pfe23:/nobackupp8/mhanafi> strace lfs getstripe testfifo
execve("/usr/bin/lfs", ["lfs", "getstripe", "testfifo"], [/* 35 vars */]) = 0
brk(0) = 0x6c1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedb02000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=270183, ...}) = 0
mmap(NULL, 270183, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fffedac0000
close(3) = 0
open("/lib64/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200Z\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=135764, ...}) = 0
mmap(NULL, 2212784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed6c9000
fadvise64(3, 0, 2212784, POSIX_FADV_WILLNEED) = 0
mprotect(0x7fffed6e0000, 2097152, PROT_NONE) = 0
mmap(0x7fffed8e0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7fffed8e0000
mmap(0x7fffed8e2000, 13232, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed8e2000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\355\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1775524, ...}) = 0
mmap(NULL, 3639480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed350000
fadvise64(3, 0, 3639480, POSIX_FADV_WILLNEED) = 0
mprotect(0x7fffed4c0000, 2093056, PROT_NONE) = 0
mmap(0x7fffed6bf000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x7fffed6bf000
mmap(0x7fffed6c4000, 18616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed6c4000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabf000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabd000
arch_prctl(ARCH_SET_FS, 0x7fffedabe700) = 0
mprotect(0x7fffed6bf000, 16384, PROT_READ) = 0
mprotect(0x7fffed8e0000, 4096, PROT_READ) = 0
mprotect(0x66a000, 4096, PROT_READ) = 0
mprotect(0x7fffedb04000, 4096, PROT_READ) = 0
munmap(0x7fffedac0000, 270183) = 0
set_tid_address(0x7fffedabe9d0) = 61077
set_robust_list(0x7fffedabe9e0, 0x18) = 0
futex(0x7fffffffe7dc, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fffffffe7dc, 0x189 /* FUTEX_??? */, 1, NULL, 7fffedabe700) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x7fffed6ce8f0, [], SA_RESTORER|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7fffed6ce980, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=300000*1024, rlim_max=7340032*1024}) = 0
shmget(IPC_PRIVATE, 65680, 0600) = 438992901
shmat(438992901, 0, 0) = ?
shmctl(438992901, IPC_RMID, 0) = 0
brk(0) = 0x6c1000
brk(0x6e2000) = 0x6e2000
open("testfifo", O_RDONLY^
|
| Comments |
| Comment by Peter Jones [ 03/Oct/14 ] |
|
Bob is trying to repro this issue |
| Comment by Bob Glossman (Inactive) [ 03/Oct/14 ] |
|
This reproduces 100% in every version I have easy access to, up through current master. I think this is expected behavior. It behaves the same way whether or not the test fifo file is in a Lustre filesystem. Looking at the kernel stack trace in crash shows that it is waiting in pipe_wait() in the kernel:
PID: 110276 TASK: ffff88003a7b4300 CPU: 0 COMMAND: "lfs"
I think it's normal behavior for a process that opens a fifo for read to just wait until somebody writes. I note that if I echo a few chars into the test fifo file, the lfs command wakes up and says:
error: can't get lov name.: Invalid argument (22)
then exits. I can't say why this ever worked in 2.1 unless the lfs code there somehow avoided doing an open on the file. |
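The blocking behavior described above can be reproduced outside Lustre. The following is a minimal sketch, assuming a Linux host with Python 3; the helper name `open_blocks_until_writer` and the timeout values are illustrative and do not come from the lfs source. It performs a plain O_RDONLY open of a FIFO in a background thread, checks that the open is still stuck after a short wait, and then opens the write end to release it.

```python
import os
import tempfile
import threading


def open_blocks_until_writer(timeout=0.5):
    """Return (blocked, finished) for a plain O_RDONLY open of a fresh FIFO.

    blocked  -- True if open() had not returned after `timeout` seconds
    finished -- True if open() returned once a writer opened the FIFO
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "testfifo")
        os.mkfifo(path)
        opened = threading.Event()

        def reader():
            # Same flags as in the strace output: O_RDONLY only.
            fd = os.open(path, os.O_RDONLY)
            opened.set()
            os.close(fd)

        thread = threading.Thread(target=reader, daemon=True)
        thread.start()

        # With no writer present, the reader should still be inside open().
        blocked = not opened.wait(timeout)

        # Opening the write end lets the reader's open() complete.
        wfd = os.open(path, os.O_WRONLY)
        finished = opened.wait(5)
        os.close(wfd)
        thread.join(5)
        return blocked, finished


if __name__ == "__main__":
    print(open_blocks_until_writer())
```

Strictly speaking, the open() returns as soon as another process opens the FIFO for writing, which is what `echo` into the FIFO does before it writes anything.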
| Comment by Paul Kolano [ 10/Mar/16 ] |
|
Regardless of whether this ever worked, it's still a bug. This isn't "normal behavior" of any command that gets file attributes. I can do a stat on a fifo without hanging. I can do a getfattr on a fifo without hanging. I can do a getfacl on a fifo without hanging. Getting stripe info is no different than these. The implementation is flawed if it hangs and could be easily fixed with a fifo check that immediately prints the exact message shown after writing to the pipe. I can't imagine the fix would take more than a few minutes?!? |
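The FIFO check suggested here could look roughly like the sketch below. It is an illustration only: `stripe_target_error` is a hypothetical helper, not an lfs or llapi function, and returning EINVAL merely mirrors the "Invalid argument (22)" message quoted earlier in the thread. It relies on stat() inspecting the inode without opening the file, which is why it cannot block on a FIFO.

```python
import errno
import os
import stat
import tempfile


def stripe_target_error(path):
    """Return EINVAL if `path` is a FIFO, else 0 (hypothetical helper).

    os.stat() follows symlinks, like open() would, and never opens the
    file, so it returns immediately even when the FIFO has no writer.
    """
    mode = os.stat(path).st_mode
    if stat.S_ISFIFO(mode):
        return errno.EINVAL
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "testfifo")
        os.mkfifo(fifo)
        err = stripe_target_error(fifo)
        print(err, os.strerror(err))
```

One caveat of a check-then-open sequence is that the path could be replaced between the two calls, so a check like this narrows the problem without fully closing it.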
| Comment by Bob Glossman (Inactive) [ 10/Mar/16 ] |
|
The fact that getfattr/getfacl/stat on a fifo don't hang isn't relevant IMHO. You can do all of those on a fifo in Lustre too. lfs is a Lustre-specific tool. It does an open on its target(s) prior to doing anything else. I strongly disagree that it's a bug for it to hang in certain unlikely situations, such as trying to do Lustre stripe operations on files where that concept doesn't apply. A read of a fifo that nobody ever writes to will hang in precisely the same way. |
| Comment by Andreas Dilger [ 18/Mar/16 ] |
|
The reason those other tools do not hang when opening a FIFO is that they pass open flags to prevent this. It appears that our open() call in llapi_file_get_stripe() passes only O_RDONLY; it should also include O_NONBLOCK to avoid exactly this problem. |
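The effect of adding O_NONBLOCK can be seen with a small sketch, assuming a Linux host with Python 3. The function name `open_for_inspection` is made up for illustration and is not part of llapi. On a FIFO with no writer, an O_RDONLY | O_NONBLOCK open returns a descriptor immediately instead of waiting.

```python
import os
import stat
import tempfile


def open_for_inspection(path):
    """Open `path` read-only without waiting for a FIFO writer."""
    return os.open(path, os.O_RDONLY | os.O_NONBLOCK)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "testfifo")
        os.mkfifo(fifo)
        fd = open_for_inspection(fifo)  # no writer exists, yet this returns
        try:
            print("is FIFO:", stat.S_ISFIFO(os.fstat(fd).st_mode))
        finally:
            os.close(fd)
```

With the descriptor in hand, a Lustre-specific request on a FIFO would presumably still fail, as in the "Invalid argument (22)" message reported above, but the command would report that right away rather than hang.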
| Comment by Bob Glossman (Inactive) [ 21/Mar/16 ] |
|
'lfs getstripe' doesn't call llapi_file_get_stripe() at all. It does, however, use a different call sequence that has an open() call with only O_RDONLY in the flags. I can work up a patch to add O_NONBLOCK, but this particular call sequence is used for so many other things that I'm worried doing so will cause bad side effects. |
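One general way to limit side effects when a shared open path gains O_NONBLOCK is to clear the flag again once open() has returned, so that later reads on the descriptor behave as before. The sketch below shows that pattern under the assumption of a Linux host with Python 3; it is not a description of what patch 19039 does, and `open_then_restore_blocking` is a hypothetical name.

```python
import fcntl
import os
import tempfile


def open_then_restore_blocking(path):
    """Open without blocking on a FIFO, then switch back to blocking I/O."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    # O_NONBLOCK was only needed to get past open(); drop it afterwards.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "testfifo")
        os.mkfifo(fifo)
        fd = open_then_restore_blocking(fifo)
        try:
            still_nonblocking = bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)
            print("O_NONBLOCK still set:", still_nonblocking)
        finally:
            os.close(fd)
```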
| Comment by Gerrit Updater [ 21/Mar/16 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19039 |
| Comment by Jay Lan (Inactive) [ 23/Mar/16 ] |
|
Bob, if you are worried that this patch "will cause bad side effects," we are not comfortable applying this patch to our production systems. |
| Comment by Bob Glossman (Inactive) [ 23/Mar/16 ] |
|
Jay, I think it's safe. |
| Comment by Jay Lan (Inactive) [ 23/Mar/16 ] |
|
Thanks for clarification, Bob. |
| Comment by Gerrit Updater [ 28/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19039/ |
| Comment by Peter Jones [ 02/Apr/16 ] |
|
Landed for 2.9 |