
IO execvp errors 2.10 client/EE3.1.1 server

Details

    Description

      Our users see consistent I/O errors as soon as they begin to scale jobs that run executables stored on Lustre volumes. The errors hang the entire job. They occur with 2.10 clients against EE 3.1.1 servers, and I have been able to reproduce them on several different filesystems. They do not occur when the same jobs are run with the EE 3.1.1/2.7 client stack. Is the 2.10 client expected to be compatible with EE 3.1.1 servers?

      Example job:

      $ mpirun -perhost 1 -np 12 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      Hello world from processor ekf087, rank 2 out of 12 processors
      Hello world from processor ekf067, rank 11 out of 12 processors
      Hello world from processor ekf082, rank 1 out of 12 processors
      Hello world from processor ekf195, rank 5 out of 12 processors
      Hello world from processor ekf095, rank 3 out of 12 processors
      Hello world from processor ekf194, rank 4 out of 12 processors
      Hello world from processor ekf355, rank 6 out of 12 processors
      Hello world from processor ekf358, rank 7 out of 12 processors
      Hello world from processor ekf364, rank 10 out of 12 processors
      Hello world from processor ekf361, rank 9 out of 12 processors
      Hello world from processor ekf359, rank 8 out of 12 processors
      Hello world from processor ekf067, rank 0 out of 12 processors
      
      $ mpirun -perhost 2 -np 24 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      [proxy:0:5@ekf195] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file /lfs/lfs11/tgeerdes/hello (Input/output error)
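
      The failure above appears only once two processes per host exec the same file. The following is a minimal sketch of how the same pattern might be exercised on a single client without MPI. It assumes, without confirmation from this report, that two simultaneous execs of one file on one client are enough to hit the problem. The function names `try_exec` and `count_exec_errors` are hypothetical helpers, not part of Lustre or MPI.

```python
import errno
import subprocess
from concurrent.futures import ThreadPoolExecutor


def try_exec(path):
    """Run `path` once; return the errno if the exec itself failed, else None."""
    try:
        subprocess.run([path], stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
    except OSError as exc:
        # Raised when the exec call fails (e.g. ENOENT, EACCES, EIO).
        return exc.errno
    return None


def count_exec_errors(path, concurrency=2, rounds=1):
    """Start `concurrency` execs of `path` at once, `rounds` times.

    Returns a dict mapping errno -> number of failed exec attempts.
    Assumption: concurrent execs on one client approximate `-perhost 2`.
    """
    tally = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(rounds):
            for err in pool.map(try_exec, [path] * concurrency):
                if err is not None:
                    tally[err] = tally.get(err, 0) + 1
    return tally
```

      For example, `count_exec_errors("/lfs/lfs11/tgeerdes/hello", concurrency=2, rounds=100)` run on a 2.10 client would show whether `errno.EIO` appears in the returned tally. An empty result does not prove the problem is absent, since cached state on the client may change the outcome between runs.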
      

        Activity

          [LU-10132] IO execvp errors 2.10 client/EE3.1.1 server
          ys Yang Sheng made changes -
          Link New: This issue is related to HUAW-52 [ HUAW-52 ]
          mdiep Minh Diep made changes -
          Link Original: This issue is related to JFC-26 [ JFC-26 ]
          pjones Peter Jones made changes -
          Labels Original: LTS

          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29795/
          Subject: LU-10132 llite: handle xattr cache refill race
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: 78a5d681932e30797775aa10d22fc25b20aa58f7

          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.11.0 [ 13091 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.11


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29654/
          Subject: LU-10132 llite: handle xattr cache refill race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3dcb7d098759614ae7deb532e1555bd82dac7936

          pjones Peter Jones made changes -
          Link New: This issue is related to HP-131 [ HP-131 ]
          pjones Peter Jones made changes -
          Link New: This issue is related to JFC-26 [ JFC-26 ]
          mdiep Minh Diep made changes -
          Fix Version/s New: Lustre 2.10.2 [ 13494 ]

          People

            jhammond John Hammond
            tgeerdes Trent Geerdes (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: