[LU-10132] IO execvp errors 2.10 client/EE3.1.1 server Created: 13/Oct/17 Updated: 15/Nov/17 Resolved: 01/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Trent Geerdes (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Our users consistently see IO errors as soon as they start to scale jobs that access executables on Lustre volumes. The errors hang the entire job. All instances involve 2.10 clients with EE 3.1.1 servers. I've been able to replicate the errors on several different filesystems. The errors do not occur when the EE 3.1.1/2.7 client stack is loaded to run the same jobs. Is there any expectation of 2.10 client compatibility with EE 3.1.1 servers?

Example job:

    $ mpirun -perhost 1 -np 12 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
    Hello world from processor ekf087, rank 2 out of 12 processors
    Hello world from processor ekf067, rank 11 out of 12 processors
    Hello world from processor ekf082, rank 1 out of 12 processors
    Hello world from processor ekf195, rank 5 out of 12 processors
    Hello world from processor ekf095, rank 3 out of 12 processors
    Hello world from processor ekf194, rank 4 out of 12 processors
    Hello world from processor ekf355, rank 6 out of 12 processors
    Hello world from processor ekf358, rank 7 out of 12 processors
    Hello world from processor ekf364, rank 10 out of 12 processors
    Hello world from processor ekf361, rank 9 out of 12 processors
    Hello world from processor ekf359, rank 8 out of 12 processors
    Hello world from processor ekf067, rank 0 out of 12 processors

    $ mpirun -perhost 2 -np 24 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
    [proxy:0:5@ekf195] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file /lfs/lfs11/tgeerdes/hello (Input/output error) |
| Comments |
| Comment by Peter Jones [ 13/Oct/17 ] |
|
Hi Trent, we would expect this combination to interoperate, and it is included in our regular release testing. When you say 2.10, do you mean 2.10.0 or 2.10.1 clients? Peter |
| Comment by Trent Geerdes (Inactive) [ 13/Oct/17 ] |
|
Hi Peter, |
| Comment by John Hammond [ 13/Oct/17 ] |
|
Hi Trent, can you attach logs from the clients that experienced the error, the MDT(s), and any OSTs that contained stripes from the executable? |
| Comment by Trent Geerdes (Inactive) [ 13/Oct/17 ] |
|
The client, MDT, and OSTs don't log any messages related to the failures, just the typical, unrelated client disconnects and reconnects. |
| Comment by John Hammond [ 16/Oct/17 ] |
|
How easy is it to reproduce this? |
| Comment by Trent Geerdes (Inactive) [ 16/Oct/17 ] |
|
Pretty easy. Many of our customers have hit it and I didn't have trouble reproducing when trying. |
| Comment by John Hammond [ 16/Oct/17 ] |
|
Could you run the following on each client:

    lctl set_param debug_mb=256
    lctl set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
    lctl clear

Then run your reproducer:

    date +%s
    mpirun -perhost 2 -np 24 -host ...

Then run:

    lctl dk >

You will probably want to restore the debug and debug_mb settings afterwards. |
| Comment by Trent Geerdes (Inactive) [ 17/Oct/17 ] |
|
I've set the params and captured the log. Attached. |
| Comment by John Hammond [ 17/Oct/17 ] |
|
Trent, do you mind if I move this to the LU project? Thanks to your logs I can reproduce it locally and now understand where this is coming from. |
| Comment by Trent Geerdes (Inactive) [ 17/Oct/17 ] |
|
Sure, go ahead. Thank you. |
| Comment by John Hammond [ 17/Oct/17 ] |
|
This is from:

    static int ll_xattr_cache_refill(struct inode *inode)
    {
        ...
        /* Matched but no cache? Cancelled on error by a parallel refill. */
        if (unlikely(req == NULL)) {
            CDEBUG(D_CACHE, "cancelled by a parallel getxattr\n");
            ll_intent_drop_lock(&oit);
            GOTO(err_unlock, rc = -EIO);
        }
        ...
    }

Looking at the code here, we should be returning -EAGAIN instead of -EIO so that ll_getxattr_common() will handle the race. This affects master as well. |
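To illustrate why the choice of errno matters, here is a minimal user-space sketch. It is not Lustre code: every name in it (toy_xattr_cache, toy_cache_get, toy_getxattr, toy_copy_value) is hypothetical, and it assumes that the caller reacts to -EAGAIN by falling back to an uncached fetch, which the ticket does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a client-side xattr cache (not a Lustre type). */
struct toy_xattr_cache {
    int race_errno;      /* 0: cache usable; otherwise the errno reported
                            when a parallel refill cancelled ours */
    const char *cached;  /* value held in the cache */
};

/* Copy a NUL-terminated value into buf; return its length or -ERANGE. */
static int toy_copy_value(const char *value, char *buf, size_t size)
{
    size_t len = strlen(value);

    if (len + 1 > size)
        return -ERANGE;
    memcpy(buf, value, len + 1);
    return (int)len;
}

/* Cached lookup: reports the lost race with the configured errno. */
static int toy_cache_get(const struct toy_xattr_cache *cache,
                         char *buf, size_t size)
{
    assert(cache != NULL && cache->cached != NULL && buf != NULL);
    if (cache->race_errno != 0)
        return cache->race_errno;
    return toy_copy_value(cache->cached, buf, size);
}

/*
 * Caller: only -EAGAIN triggers the (assumed) uncached fallback.
 * Any other error, including -EIO, is passed straight up.
 */
static int toy_getxattr(const struct toy_xattr_cache *cache,
                        const char *server_value, char *buf, size_t size)
{
    int rc = toy_cache_get(cache, buf, size);

    if (rc == -EAGAIN)
        rc = toy_copy_value(server_value, buf, size); /* stands in for an RPC */
    return rc;
}
```

In this model, a cache that reports the lost race as -EIO makes toy_getxattr() fail with -EIO, which is analogous to the Input/output error in the report, while reporting it as -EAGAIN lets the caller recover and return the value.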
| Comment by Gerrit Updater [ 17/Oct/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29654 |
| Comment by Gerrit Updater [ 26/Oct/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29795 |
| Comment by Gerrit Updater [ 01/Nov/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29654/ |
| Comment by Peter Jones [ 01/Nov/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 01/Nov/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29795/ |