Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.4.3
    • None
    • 3
    • 15976

    Description

      lfs getstripe on FIFO files hangs in the open() system call. This worked in the 2.1.x versions.

      mhanafi@pfe23:/nobackupp8/mhanafi> rm testfifo
      mhanafi@pfe23:/nobackupp8/mhanafi> mkfifo testfifo
      mhanafi@pfe23:/nobackupp8/mhanafi> strace lfs getstripe testfifo
      execve("/usr/bin/lfs", ["lfs", "getstripe", "testfifo"], [/* 35 vars */]) = 0
      brk(0)                                  = 0x6c1000
      mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedb02000
      access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
      open("/etc/ld.so.cache", O_RDONLY)      = 3
      fstat(3, {st_mode=S_IFREG|0644, st_size=270183, ...}) = 0
      mmap(NULL, 270183, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fffedac0000
      close(3)                                = 0
      open("/lib64/libpthread.so.0", O_RDONLY) = 3
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200Z\0\0\0\0\0\0"..., 832) = 832
      fstat(3, {st_mode=S_IFREG|0755, st_size=135764, ...}) = 0
      mmap(NULL, 2212784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed6c9000
      fadvise64(3, 0, 2212784, POSIX_FADV_WILLNEED) = 0
      mprotect(0x7fffed6e0000, 2097152, PROT_NONE) = 0
      mmap(0x7fffed8e0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7fffed8e0000
      mmap(0x7fffed8e2000, 13232, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed8e2000
      close(3)                                = 0
      open("/lib64/libc.so.6", O_RDONLY)      = 3
      read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\355\1\0\0\0\0\0"..., 832) = 832
      fstat(3, {st_mode=S_IFREG|0755, st_size=1775524, ...}) = 0
      mmap(NULL, 3639480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fffed350000
      fadvise64(3, 0, 3639480, POSIX_FADV_WILLNEED) = 0
      mprotect(0x7fffed4c0000, 2093056, PROT_NONE) = 0
      mmap(0x7fffed6bf000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x7fffed6bf000
      mmap(0x7fffed6c4000, 18616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fffed6c4000
      close(3)                                = 0
      mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabf000
      mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabe000
      mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fffedabd000
      arch_prctl(ARCH_SET_FS, 0x7fffedabe700) = 0
      mprotect(0x7fffed6bf000, 16384, PROT_READ) = 0
      mprotect(0x7fffed8e0000, 4096, PROT_READ) = 0
      mprotect(0x66a000, 4096, PROT_READ)     = 0
      mprotect(0x7fffedb04000, 4096, PROT_READ) = 0
      munmap(0x7fffedac0000, 270183)          = 0
      set_tid_address(0x7fffedabe9d0)         = 61077
      set_robust_list(0x7fffedabe9e0, 0x18)   = 0
      futex(0x7fffffffe7dc, FUTEX_WAKE_PRIVATE, 1) = 0
      futex(0x7fffffffe7dc, 0x189 /* FUTEX_??? */, 1, NULL, 7fffedabe700) = -1 EAGAIN (Resource temporarily unavailable)
      rt_sigaction(SIGRTMIN, {0x7fffed6ce8f0, [], SA_RESTORER|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
      rt_sigaction(SIGRT_1, {0x7fffed6ce980, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fffed6d8810}, NULL, 8) = 0
      rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
      getrlimit(RLIMIT_STACK, {rlim_cur=300000*1024, rlim_max=7340032*1024}) = 0
      shmget(IPC_PRIVATE, 65680, 0600)        = 438992901
      shmat(438992901, 0, 0)                  = ?
      shmctl(438992901, IPC_RMID, 0)          = 0
      brk(0)                                  = 0x6c1000
      brk(0x6e2000)                           = 0x6e2000
      open("testfifo", O_RDONLY^
      

        Activity

          [LU-5704] lfs getstripe hangs on fifo files

          jaylan Jay Lan (Inactive) added a comment -

          Thanks for clarification, Bob.

          bogl Bob Glossman (Inactive) added a comment -

          Jay,
          At this point I believe my worry was unfounded. That comment was from before I created and submitted my patch. It has since undergone a complete set of review tests as well as passed inspection by reviewers.

          I think it's safe.

          jaylan Jay Lan (Inactive) added a comment -

          Bob, if you are worried that this patch "will cause bad side effects," we are not comfortable applying this patch to our production systems.

          gerrit Gerrit Updater added a comment -

          Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19039
          Subject: LU-5704 utils: stop open hangs on fifo files
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bbb6139de7761d43a5b86e1ee14550d5f7ba2c02

          bogl Bob Glossman (Inactive) added a comment - edited

          'lfs getstripe' doesn't call llapi_file_get_stripe() at all. It does, however, use a different call sequence that includes an open() call with only O_RDONLY in the flags.

          I can work up a patch to add O_NONBLOCK, but this call sequence is used for so many other things that I'm worried doing so will cause bad side effects.

          adilger Andreas Dilger added a comment -

          Those other tools do not hang when opening a FIFO because they pass open flags that prevent this.

          It appears that our open() call in llapi_file_get_stripe() passes only O_RDONLY; it should also include O_NONBLOCK to avoid exactly this problem.
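
A minimal sketch of the O_NONBLOCK suggestion above. This is not the Lustre code or the submitted patch; `open_rdonly_nonblock` and `demo_nonblocking_fifo_open` are hypothetical names used only for illustration. It shows that a read-only open of a FIFO with no writer returns promptly once O_NONBLOCK is added to the flags.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of liblustreapi): open a path read-only
 * without blocking if the path turns out to be a FIFO with no writer.
 * Returns a file descriptor, or -1 with errno set.
 */
int open_rdonly_nonblock(const char *path)
{
	return open(path, O_RDONLY | O_NONBLOCK);
}

/*
 * Demo: create a scratch FIFO that nobody writes to and open it.
 * With O_NONBLOCK the open returns immediately instead of hanging.
 * Returns 0 on success, -1 on any failure.
 */
int demo_nonblocking_fifo_open(void)
{
	char dir[] = "/tmp/fifo-demo-XXXXXX";
	char path[64];
	int fd;
	int rc = -1;

	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(path, sizeof(path), "%s/testfifo", dir);

	if (mkfifo(path, 0600) == 0) {
		fd = open_rdonly_nonblock(path);
		if (fd >= 0) {
			rc = 0;
			close(fd);
		}
		unlink(path);
	}
	rmdir(dir);
	return rc;
}
```

Whether O_NONBLOCK is safe for every caller of a shared open path is a separate question, which is the concern raised later in this thread.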

          bogl Bob Glossman (Inactive) added a comment -

          The fact that getfattr/getfacl/stat on a fifo don't hang isn't relevant IMHO. You can do all those on a fifo in Lustre too.

          lfs is a Lustre-specific tool. It does an open on its target(s) prior to doing anything else. I strongly disagree that it's a bug for it to hang in certain unlikely situations, such as trying to do Lustre-related stripe operations on files where that concept doesn't apply. A read of a fifo that nobody ever writes to will hang in precisely the same way.

          kolano Paul Kolano (Inactive) added a comment -

          Regardless of whether this ever worked, it's still a bug. This isn't "normal behavior" of any command that gets file attributes. I can do a stat on a fifo without hanging. I can do a getfattr on a fifo without hanging. I can do a getfacl on a fifo without hanging. Getting stripe info is no different than these. The implementation is flawed if it hangs and could be easily fixed with a fifo check that immediately prints the exact message shown after writing to the pipe. I can't imagine the fix would take more than a few minutes?!?
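
The "fifo check" proposed above could look roughly like the following sketch. It is an illustration only, not what lfs does; `path_is_fifo` and `stripe_precheck` are hypothetical names, and the message text is the one quoted elsewhere in this ticket.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: report whether path names a FIFO.
 * stat() follows symlinks, matching what a later open() would do.
 * Returns 1 for a FIFO, 0 for anything else, -1 if stat() fails.
 */
int path_is_fifo(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return S_ISFIFO(st.st_mode) ? 1 : 0;
}

/*
 * Hypothetical pre-check to run before a blocking open():
 * returns 1 (and prints the "no stripe info" message) when the
 * target is a FIFO and should be skipped, 0 when it is safe to
 * proceed, -1 when the path cannot be examined.
 */
int stripe_precheck(const char *path)
{
	int rc = path_is_fifo(path);

	if (rc == 1)
		fprintf(stderr, "%s has no stripe info\n", path);
	return rc;
}
```

A check followed by a separate open() leaves a small window in which the path can be replaced, so this approach is best seen as a convenience rather than a guarantee; opening with O_NONBLOCK avoids that window.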

          bogl Bob Glossman (Inactive) added a comment -

          This reproduces 100% in every version I have easy access to up through current master.

          I think this is expected behavior. It behaves the same way if the test fifo file is in a lustre filesystem or not. Looking at the kernel stack trace in crash shows that it is waiting in pipe_wait() in the kernel:

          PID: 110276 TASK: ffff88003a7b4300 CPU: 0 COMMAND: "lfs"
          #0 [ffff88003a747b48] schedule at ffffffff81461beb
          #1 [ffff88003a747c90] pipe_wait at ffffffff81164252
          #2 [ffff88003a747ce0] fifo_open at ffffffff8116f4d6
          #3 [ffff88003a747d10] __dentry_open at ffffffff81158f68
          #4 [ffff88003a747d60] do_last at ffffffff811685c2
          #5 [ffff88003a747dc0] path_openat at ffffffff81169839
          #6 [ffff88003a747e50] do_filp_open at ffffffff81169cbc
          #7 [ffff88003a747f20] do_sys_open at ffffffff8115a90f
          #8 [ffff88003a747f80] system_call_fastpath at ffffffff8146bd12
          RIP: 00007f7848bb0fd0 RSP: 00007fff3f449fd8 RFLAGS: 00010206
          RAX: 0000000000000002 RBX: ffffffff8146bd12 RCX: 6f66696674736574
          RDX: 0000000000000050 RSI: 0000000000000000 RDI: 0000000000624030
          RBP: 000000000040fde0 R8: 00007fff3f44c447 R9: 0000000000000000
          R10: 00007fff3f449d80 R11: 0000000000000246 R12: 000000000040a320
          R13: 00007fff3f44a070 R14: 0000000000624030 R15: 00007fff3f44a000
          ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

          I think it's normal behavior for a process that opens a fifo for read to just wait until somebody writes. I note if I echo a few chars into the test fifo file the lfs command wakes up and says:

          error: can't get lov name.: Invalid argument (22)
          testfifo has no stripe info

          then exits.

          I can't say why this ever worked in 2.1 unless the lfs code there somehow avoided doing an open on the file.

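
The blocking behavior described above can be reproduced outside of lfs. The sketch below is an illustration under stated assumptions, not Lustre code: `blocking_fifo_open_wait_ms` is a hypothetical name, and a forked child stands in for the "echo into the fifo" step. The parent's plain O_RDONLY open only returns once the child opens the FIFO for writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Demo: measure how long a blocking read-only open() of a FIFO waits
 * when the only writer shows up after writer_delay_ms milliseconds.
 * Returns the wait in milliseconds, or -1 on error.
 */
long blocking_fifo_open_wait_ms(unsigned int writer_delay_ms)
{
	char dir[] = "/tmp/fifo-wait-XXXXXX";
	char path[64];
	struct timespec t0, t1, delay;
	long waited = -1;
	pid_t pid;
	int fd;

	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(path, sizeof(path), "%s/testfifo", dir);
	if (mkfifo(path, 0600) != 0) {
		rmdir(dir);
		return -1;
	}

	pid = fork();
	if (pid == 0) {
		/* Child: act as the late writer, then exit. */
		delay.tv_sec = writer_delay_ms / 1000;
		delay.tv_nsec = (long)(writer_delay_ms % 1000) * 1000000L;
		nanosleep(&delay, NULL);
		fd = open(path, O_WRONLY);
		if (fd >= 0)
			close(fd);
		_exit(0);
	}
	if (pid > 0) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* Blocks in the kernel until the child opens for write. */
		fd = open(path, O_RDONLY);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		if (fd >= 0) {
			waited = (long)(t1.tv_sec - t0.tv_sec) * 1000L +
				 (t1.tv_nsec - t0.tv_nsec) / 1000000L;
			close(fd);
		}
		waitpid(pid, NULL, 0);
	}
	unlink(path);
	rmdir(dir);
	return waited;
}
```

Calling `blocking_fifo_open_wait_ms(300)` should report a wait close to the writer's delay, and with no writer at all the open would never return, which matches the hang in the strace output.
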
          pjones Peter Jones added a comment -

          Bob is trying to repro this issue

          People

            Assignee: bogl Bob Glossman (Inactive)
            Reporter: mhanafi Mahmoud Hanafi