  1. Lustre
  2. LU-305

utime() fails with EINTR: does not conform to the POSIX standard

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • Lustre 2.0.0, Lustre 2.1.0
    • Lustre 2.0.0
    • None
    • RHEL 6.0 GA, Lustre 2.0.0.1

    Description

      When an archive is uncompressed in a Lustre filesystem on a client node, the tar command fails. The failure comes from the utime() system call, which returns EINTR when tar updates the modification time of the extracted file. However, EINTR is not listed as a possible error code for utime() in the POSIX standard.

      The problem can be reproduced quite easily on the client node, but it does not reproduce when Lustre debug logs are enabled (echo "-1" > /proc/sys/lnet/debug).

      $ pwd
      /scratch_lustre/xtmp

      $ tar xvfoz netcdf-3.6.1.tar.gz netcdf-3.6.1/src/win32/NET/examples/Form1.resX
      netcdf-3.6.1/src/win32/NET/examples/Form1.resX
      tar: netcdf-3.6.1/src/win32/NET/examples/Form1.resX: Cannot utime: Interrupted system call
      tar: Exiting with failure status due to previous errors

      Here is the output of 'strace' with the same command.
      $ strace -f tar xvfoz netcdf-3.6.1.tar.gz netcdf-3.6.1/src/win32/NET/examples/Form1.resX
      ...
      [pid 3086] open("netcdf-3.6.1/src/win32/NET/examples/Form1.resX", O_WRONLY|O_CREAT|O_EXCL, 0755) = -1 EEXIST (File exists)
      [pid 3086] unlink("netcdf-3.6.1/src/win32/NET/examples/Form1.resX") = 0
      [pid 3086] open("netcdf-3.6.1/src/win32/NET/examples/Form1.resX", O_WRONLY|O_CREAT|O_EXCL, 0755) = 4
      [pid 3086] write(4, "<?xml version=\"1.0\" encoding=\"ut"..., 4608) = 4608
      [pid 3086] read(3, "System.Resources.ResXResourceWri"..., 10240) = 10240
      [pid 3087] <... write resumed> ) = 32768
      [pid 3086] write(4, "System.Resources.ResXResourceWri"..., 2289 <unfinished ...>
      [pid 3087] write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096 <unfinished ...>
      [pid 3086] <... write resumed> ) = 2289
      [pid 3087] <... write resumed> ) = 4096
      [pid 3086] close(4 <unfinished ...>
      [pid 3087] close(0 <unfinished ...>
      [pid 3086] <... close resumed> ) = 0
      [pid 3086] utimensat(AT_FDCWD, "netcdf-3.6.1/src/win32/NET/examples/Form1.resX", {{1303910452, 237591535}, {1085494499, 0}}, 0 <unfinished ...>
      [pid 3087] <... close resumed> ) = 0
      [pid 3087] close(1) = 0
      [pid 3087] close(2) = 0
      [pid 3087] exit_group(0) = ?
      Process 3087 detached
      <... utimensat resumed> ) = -1 EINTR (Interrupted system call)
      --- SIGCHLD (Child exited) @ 0 (0) ---
      ...

      The tar command forks a child process to uncompress the archive (gunzip), while the parent process creates the extracted files, writes their data, and restores their original attributes (here, the modification time).

      When the child process exits, the parent process receives a SIGCHLD signal. Note that tar sets the SIGCHLD disposition to SIG_DFL, and the default action for SIGCHLD is to ignore the signal. Even so, the signal may interrupt the utime() implementation in Lustre.
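The race described above can be sketched in C as follows. This is not the attached utime_sigchild.c, whose contents are not shown here. It is a minimal illustration that assumes the same pattern as tar: SIGCHLD left at SIG_DFL, and a child that exits while the parent updates the file timestamps. The helper name count_utime_eintr is hypothetical, and the timestamps are copied from the strace output.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: runs `rounds` fork/utimensat races on `path`.
 * Returns the number of utimensat() calls that failed with EINTR,
 * or -1 if fork() or utimensat() failed for any other reason.
 */
static int count_utime_eintr(const char *path, int rounds)
{
    /* {atime, mtime}, values taken from the strace output in this report. */
    const struct timespec times[2] = {
        { .tv_sec = 1303910452, .tv_nsec = 237591535 },
        { .tv_sec = 1085494499, .tv_nsec = 0 },
    };
    int eintr = 0;

    assert(path != NULL && rounds >= 0);
    signal(SIGCHLD, SIG_DFL);   /* same disposition as tar */

    for (int i = 0; i < rounds; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);           /* child exits at once; parent gets SIGCHLD */

        /* Race the timestamp update against the child's exit. */
        if (utimensat(AT_FDCWD, path, times, 0) < 0) {
            if (errno != EINTR) {
                waitpid(pid, NULL, 0);
                return -1;
            }
            eintr++;
        }

        /* Reap the child before the next round. */
        while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
            ;
    }
    return eintr;
}
```

Calling count_utime_eintr() on a file inside the Lustre mount with a few thousand rounds, and printing the result from main(), would be enough to drive it. A result of 0 on a local filesystem and a non-zero result on Lustre would match the behaviour reported here, though whether this simple loop hits the window as reliably as tar is an assumption.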

      I have been able to reproduce a similar EINTR with a test program on one of our test clusters, with the Lustre debug logs enabled. The error occurs during a write() system call (where POSIX allows EINTR) and comes from the cl_lock_state_wait() routine in lustre/obdclass/cl_lock.c. This routine makes the thread wait on a wait queue; when the thread wakes up, the routine checks for pending signals with cfs_signal_pending().

      Is the cl_lock_state_wait() routine part of the call path of the utime() system call on Lustre?
      Are there other places in this call path where EINTR might be returned?

      Maybe Lustre should avoid any interruptible wait in the utime() call path?
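Until the call path is clarified, a caller could work around the symptom in user space by retrying the timestamp update when it fails with EINTR. The sketch below is only that, a workaround on the application side and not a fix for Lustre. The helper name utimensat_retry and the retry bound are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: calls utimensat() on `path` and retries while it
 * fails with EINTR, up to `max_retries` extra attempts.
 * Returns 0 on success, -1 on failure with errno set by utimensat().
 */
static int utimensat_retry(const char *path, const struct timespec times[2],
                           int max_retries)
{
    int rc;
    int retries = 0;

    assert(path != NULL);
    do {
        rc = utimensat(AT_FDCWD, path, times, 0);
    } while (rc < 0 && errno == EINTR && retries++ < max_retries);

    return rc;
}
```

The retry count is bounded so that a persistently interrupted call still returns an error instead of looping forever.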

      Attached are:

      • the test program I wrote to reproduce the issue independently of tar,
      • the lctl debug_kernel log captured when the error was reproduced in the write() system call.

      Attachments

        1. utime_sigchild.c
          2 kB
          Gregoire Pichon
        2. utime_sigchild.dk.log
          1.82 MB
          Gregoire Pichon

            People

              niu Niu Yawei (Inactive)
              pichong Gregoire Pichon