[LU-5704] lfs getstripe hangs on fifo files
Created: 02/Oct/14  Updated: 16/Sep/16  Resolved: 02/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Minor
Reporter: Mahmoud Hanafi
Assignee: Bob Glossman (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 15976

 Description   

lfs getstripe on a fifo file hangs in the open() system call. This worked in the 2.1.x versions.

mhanafi@pfe23:/nobackupp8/mhanafi> rm testfifo
mhanafi@pfe23:/nobackupp8/mhanafi> mkfifo testfifo
mhanafi@pfe23:/nobackupp8/mhanafi> strace lfs getstripe testfifo
execve("/usr/bin/lfs", ["lfs", "getstripe", "testfifo"], [/* 35 vars */]) = 0
brk(0)                                  = 0x6c1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedb02000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=270183, ...}) = 0
mmap(NULL, 270183, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fffedac0000
close(3)                                = 0
open("/lib64/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200Z\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=135764, ...}) = 0
mmap(NULL, 2212784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed6c9000
fadvise64(3, 0, 2212784, POSIX_FADV_WILLNEED) = 0
mprotect(0x7fffed6e0000, 2097152, PROT_NONE) = 0
mmap(0x7fffed8e0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7fffed8e0000
mmap(0x7fffed8e2000, 13232, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed8e2000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\355\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1775524, ...}) = 0
mmap(NULL, 3639480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed350000
fadvise64(3, 0, 3639480, POSIX_FADV_WILLNEED) = 0
mprotect(0x7fffed4c0000, 2093056, PROT_NONE) = 0
mmap(0x7fffed6bf000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x7fffed6bf000
mmap(0x7fffed6c4000, 18616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed6c4000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabf000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabd000
arch_prctl(ARCH_SET_FS, 0x7fffedabe700) = 0
mprotect(0x7fffed6bf000, 16384, PROT_READ) = 0
mprotect(0x7fffed8e0000, 4096, PROT_READ) = 0
mprotect(0x66a000, 4096, PROT_READ)     = 0
mprotect(0x7fffedb04000, 4096, PROT_READ) = 0
munmap(0x7fffedac0000, 270183)          = 0
set_tid_address(0x7fffedabe9d0)         = 61077
set_robust_list(0x7fffedabe9e0, 0x18)   = 0
futex(0x7fffffffe7dc, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fffffffe7dc, 0x189 /* FUTEX_??? */, 1, NULL, 7fffedabe700) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x7fffed6ce8f0, [], SA_RESTORER|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7fffed6ce980, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=300000*1024, rlim_max=7340032*1024}) = 0
shmget(IPC_PRIVATE, 65680, 0600)        = 438992901
shmat(438992901, 0, 0)                  = ?
shmctl(438992901, IPC_RMID, 0)          = 0
brk(0)                                  = 0x6c1000
brk(0x6e2000)                           = 0x6e2000
open("testfifo", O_RDONLY^


 Comments   
Comment by Peter Jones [ 03/Oct/14 ]

Bob is trying to repro this issue

Comment by Bob Glossman (Inactive) [ 03/Oct/14 ]

This reproduces 100% in every version I have easy access to up through current master.

I think this is expected behavior. It behaves the same way whether or not the test fifo file is in a Lustre filesystem. The kernel stack trace in crash shows that the process is waiting in pipe_wait():

PID: 110276 TASK: ffff88003a7b4300 CPU: 0 COMMAND: "lfs"
#0 [ffff88003a747b48] schedule at ffffffff81461beb
#1 [ffff88003a747c90] pipe_wait at ffffffff81164252
#2 [ffff88003a747ce0] fifo_open at ffffffff8116f4d6
#3 [ffff88003a747d10] __dentry_open at ffffffff81158f68
#4 [ffff88003a747d60] do_last at ffffffff811685c2
#5 [ffff88003a747dc0] path_openat at ffffffff81169839
#6 [ffff88003a747e50] do_filp_open at ffffffff81169cbc
#7 [ffff88003a747f20] do_sys_open at ffffffff8115a90f
#8 [ffff88003a747f80] system_call_fastpath at ffffffff8146bd12
RIP: 00007f7848bb0fd0 RSP: 00007fff3f449fd8 RFLAGS: 00010206
RAX: 0000000000000002 RBX: ffffffff8146bd12 RCX: 6f66696674736574
RDX: 0000000000000050 RSI: 0000000000000000 RDI: 0000000000624030
RBP: 000000000040fde0 R8: 00007fff3f44c447 R9: 0000000000000000
R10: 00007fff3f449d80 R11: 0000000000000246 R12: 000000000040a320
R13: 00007fff3f44a070 R14: 0000000000624030 R15: 00007fff3f44a000
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

I think it's normal behavior for a process that opens a fifo for read to just wait until somebody writes. I note if I echo a few chars into the test fifo file the lfs command wakes up and says:

error: can't get lov name.: Invalid argument (22)
testfifo has no stripe info

then exits.

I can't say why this ever worked in 2.1 unless the lfs code there somehow avoided doing an open on the file.

Comment by Paul Kolano [ 10/Mar/16 ]

Regardless of whether this ever worked, it's still a bug. This isn't "normal behavior" of any command that gets file attributes. I can do a stat on a fifo without hanging. I can do a getfattr on a fifo without hanging. I can do a getfacl on a fifo without hanging. Getting stripe info is no different than these. The implementation is flawed if it hangs and could be easily fixed with a fifo check that immediately prints the exact message shown after writing to the pipe. I can't imagine the fix would take more than a few minutes?!?
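
The fifo check suggested above could look roughly like the following C sketch. It is not taken from the Lustre source: path_is_fifo() and skip_if_fifo() are hypothetical helper names, and the printed text simply mirrors the "has no stripe info" message quoted earlier in this ticket. It relies only on the standard stat() call and the S_ISFIFO() macro.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper (not a Lustre API): returns 1 if path names a FIFO,
 * 0 if it names anything else, and -1 if stat() fails. stat() is used
 * rather than lstat() so that symlinks are followed, as open() would. */
int path_is_fifo(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}

/* Hypothetical caller: report a FIFO without ever opening it.
 * Returns 1 if the path was a FIFO and was skipped, 0 otherwise. */
int skip_if_fifo(const char *path)
{
    if (path_is_fifo(path) == 1) {
        printf("%s has no stripe info\n", path);
        return 1;
    }
    return 0;
}
```

A check followed by a separate open() is inherently racy, since the path could be replaced between the two calls, so a check like this is best treated as a fast path rather than a guarantee.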

Comment by Bob Glossman (Inactive) [ 10/Mar/16 ]

The fact that getfattr/getfacl/stat on a fifo don't hang isn't relevant IMHO. You can do all those on a fifo in Lustre too.

lfs is a Lustre-specific tool. It does an open on its target(s) prior to doing anything else. I strongly disagree that it's a bug for it to hang in certain unlikely situations, like for instance trying to do Lustre-related stripe operations on files where that concept doesn't apply. A read of a fifo that nobody ever writes to will hang in precisely the same way.

Comment by Andreas Dilger [ 18/Mar/16 ]

The reason that those other tools do not hang when opening a FIFO is because they pass open flags to prevent this.

It appears that our open() call in llapi_file_get_stripe() is only passing O_RDONLY and it should also include O_NONBLOCK to avoid exactly this problem.
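
As a minimal illustration of the O_NONBLOCK suggestion above, the C sketch below opens a path read-only and non-blocking, then uses fstat() on the descriptor to see what was opened. It is not the Lustre code or the eventual patch: probe_path() is a hypothetical name. POSIX specifies that opening a FIFO with O_RDONLY | O_NONBLOCK returns without waiting for a writer.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not a Lustre API): open path without blocking and
 * report whether it is a FIFO. Returns 0 on success and stores 1 or 0 in
 * *is_fifo; returns -1 with errno set if open() or fstat() fails. */
int probe_path(const char *path, int *is_fifo)
{
    struct stat st;
    int fd;

    assert(path != NULL && is_fifo != NULL);

    /* O_NONBLOCK makes a read-only open of a FIFO return at once,
     * even when no process has the FIFO open for writing. */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) != 0) {
        int saved = errno;      /* keep fstat()'s errno across close() */

        close(fd);
        errno = saved;
        return -1;
    }

    *is_fifo = S_ISFIFO(st.st_mode) ? 1 : 0;
    close(fd);
    return 0;
}
```

Because the type check is done on the already-open descriptor, there is no window between checking and opening. For regular files and directories, O_NONBLOCK generally has no effect on the open itself.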

Comment by Bob Glossman (Inactive) [ 21/Mar/16 ]

'lfs getstripe' doesn't call llapi_file_get_stripe() at all. It does however do a different call sequence that has an open() call with only O_RDONLY in the flags.

I can work up a patch to add O_NONBLOCK, but this particular call sequence is used for so many other things that I'm worried doing so will cause bad side effects.

Comment by Gerrit Updater [ 21/Mar/16 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19039
Subject: LU-5704 utils: stop open hangs on fifo files
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bbb6139de7761d43a5b86e1ee14550d5f7ba2c02

Comment by Jay Lan (Inactive) [ 23/Mar/16 ]

Bob, if you are worried that this patch "will cause bad side effects," we are not comfortable applying this patch to our production systems.

Comment by Bob Glossman (Inactive) [ 23/Mar/16 ]

Jay,
At this point I believe my worry was unfounded. That comment was from before I created and submitted my patch. It has since undergone a complete set of review tests as well as passed inspection by reviewers.

I think it's safe.

Comment by Jay Lan (Inactive) [ 23/Mar/16 ]

Thanks for the clarification, Bob.

Comment by Gerrit Updater [ 28/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19039/
Subject: LU-5704 utils: stop open hangs on fifo files
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fab6073165dfbd107f54057e33363c77978af6cb

Comment by Peter Jones [ 02/Apr/16 ]

Landed for 2.9
