
IO execvp errors 2.10 client/EE3.1.1 server

Details


    Description

      Our users are seeing consistent IO errors in very early attempts to scale up jobs that access executables on Lustre volumes. The errors hang the entire job. These instances involve 2.10 clients and EE 3.1.1 servers. I've been able to replicate the errors on several different filesystems. The errors do not occur when loading the EE 3.1.1/2.7 client stack to run the same jobs. Is there any expectation of 2.10 client compatibility with EE 3.1.1 servers?

      Example job:

      $ mpirun -perhost 1 -np 12 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      Hello world from processor ekf087, rank 2 out of 12 processors
      Hello world from processor ekf067, rank 11 out of 12 processors
      Hello world from processor ekf082, rank 1 out of 12 processors
      Hello world from processor ekf195, rank 5 out of 12 processors
      Hello world from processor ekf095, rank 3 out of 12 processors
      Hello world from processor ekf194, rank 4 out of 12 processors
      Hello world from processor ekf355, rank 6 out of 12 processors
      Hello world from processor ekf358, rank 7 out of 12 processors
      Hello world from processor ekf364, rank 10 out of 12 processors
      Hello world from processor ekf361, rank 9 out of 12 processors
      Hello world from processor ekf359, rank 8 out of 12 processors
      Hello world from processor ekf067, rank 0 out of 12 processors
      
      $ mpirun -perhost 2 -np 24 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      [proxy:0:5@ekf195] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file /lfs/lfs11/tgeerdes/hello (Input/output error)
      

      Attachments

        Activity


          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29654
          Subject: LU-10132 llite: handle xattr cache refill race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 9cc8c3a20c547ec75325dde3dd17f4b1dcc66348
          jhammond John Hammond added a comment -

          This is from:

          static int ll_xattr_cache_refill(struct inode *inode)
          {
                  ...
                  /* Matched but no cache? Cancelled on error by a parallel refill. */
                  if (unlikely(req == NULL)) {
                          CDEBUG(D_CACHE, "cancelled by a parallel getxattr\n");
                          ll_intent_drop_lock(&oit);
                          GOTO(err_unlock, rc = -EIO);
                  }
                  ...
          }
          

          Looking at the code here, we should be returning -EAGAIN instead of -EIO so that ll_getxattr_common() will handle the race.

          This affects master as well.
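          The retry behavior described above can be illustrated with a small standalone sketch. The function and variable names here are hypothetical stand-ins for the real ll_xattr_cache_refill()/ll_getxattr_common() pair; the point is only the contrast between -EAGAIN (caller retries, race is invisible to userspace) and -EIO (error propagates to execvp()):

          ```c
          #include <errno.h>
          #include <stdio.h>

          /* Simulated cache refill: the first calls lose the race with a
           * parallel refill (the req == NULL case in the real code) and
           * return -EAGAIN rather than -EIO. */
          static int fake_refill(int *races_left)
          {
                  if (*races_left > 0) {
                          (*races_left)--;
                          return -EAGAIN; /* lost the race; caller should retry */
                  }
                  return 0; /* cache filled */
          }

          /* Caller in the spirit of ll_getxattr_common(): retry on -EAGAIN,
           * fail for anything else. With -EIO there is nothing to retry and
           * the error would surface as the execvp() IO error seen above. */
          static int getxattr_common(int *races_left)
          {
                  int rc;

                  do {
                          rc = fake_refill(races_left);
                  } while (rc == -EAGAIN);
                  return rc;
          }

          int main(void)
          {
                  int races = 2; /* lose the race twice before succeeding */
                  int rc = getxattr_common(&races);

                  printf("rc=%d\n", rc);
                  return rc ? 1 : 0;
          }
          ```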


          tgeerdes Trent Geerdes (Inactive) added a comment -

          Sure, go ahead. Thank you.
          jhammond John Hammond added a comment -

          Trent, do you mind if I move this to the LU project?

          Thanks to your logs I can reproduce it locally and now understand where this is coming from.


          tgeerdes Trent Geerdes (Inactive) added a comment -

          I've set the params and captured the log. Attached.
          jhammond John Hammond added a comment -

          Could you run the following on each client:

          lctl set_param debug_mb=256
          lctl set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
          lctl clear
          

          Then run your reproducer:

          date +%s
          mpirun -perhost 2 -np 24 -host ...
          

          Then run lctl dk > INTL-313.log on one of the clients where execvp() fails and attach the log file here.

          You will probably want to restore the original debug and debug_mb settings afterwards.
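          Taken together, the capture steps above can be wrapped in one script that also restores the previous settings. This is only a sketch: LCTL defaults to a dry run that prints the lctl commands instead of applying them (set LCTL=lctl to run them for real), and the log name and mpirun line are taken from the comments above:

          ```shell
          #!/bin/sh
          # Dry run by default: LCTL="echo lctl" just prints each command.
          # Set LCTL=lctl to actually change the debug settings.
          LCTL="${LCTL:-echo lctl}"

          # save the current settings so they can be restored afterwards
          old_debug=$($LCTL get_param -n debug)
          old_mb=$($LCTL get_param -n debug_mb)

          $LCTL set_param debug_mb=256
          $LCTL set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
          $LCTL clear

          date +%s
          # ... run the reproducer here, e.g.:
          # mpirun -perhost 2 -np 24 -host ... -PSM2 /lfs/lfs11/tgeerdes/hello

          # dump the kernel debug log on a client where execvp() failed
          $LCTL dk > INTL-313.log

          # restore the previous settings
          $LCTL set_param debug="$old_debug"
          $LCTL set_param debug_mb="$old_mb"
          ```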


          tgeerdes Trent Geerdes (Inactive) added a comment -

          Pretty easy. Many of our customers have hit it, and I didn't have trouble reproducing it when trying.
          jhammond John Hammond added a comment -

          How easy is it to reproduce this?

          tgeerdes Trent Geerdes (Inactive) added a comment - - edited

          The client, MDT, and OSTs don't log any messages related to the failures, just the typical, unrelated client disconnects and reconnects.

          jhammond John Hammond added a comment -

          Hi Trent, can you attach logs from the clients that experienced the error, the MDT(s), and any OSTs that contained stripes of the executable?

          tgeerdes Trent Geerdes (Inactive) added a comment - - edited

          Hi Peter,
          These are 2.10.0 clients, and this is on Endeavour.
          Also, the only FS I'm not able to replicate it on is our highest-performing SSD-based EE 3.1.1 FS. Smaller SSD-based and large HDD-based filesystems all exhibit the issue.


          People

            jhammond John Hammond
            tgeerdes Trent Geerdes (Inactive)
            Votes:
            0
            Watchers:
            8
