[LU-10132] IO execvp errors 2.10 client/EE3.1.1 server Created: 13/Oct/17  Updated: 15/Nov/17  Resolved: 01/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Minor
Reporter: Trent Geerdes (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File INTL-313.log    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Our users are seeing consistent IO errors as soon as they attempt to scale up jobs that access executables on Lustre volumes; the errors hang the entire job. These instances involve 2.10 clients and EE 3.1.1 servers. I've been able to replicate the errors on several different filesystems. The errors do not occur when loading the EE 3.1.1/2.7 client stack to run the same jobs. Is there any expectation of 2.10 client compatibility with EE 3.1.1 servers?

Example job:

$ mpirun -perhost 1 -np 12 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
Hello world from processor ekf087, rank 2 out of 12 processors
Hello world from processor ekf067, rank 11 out of 12 processors
Hello world from processor ekf082, rank 1 out of 12 processors
Hello world from processor ekf195, rank 5 out of 12 processors
Hello world from processor ekf095, rank 3 out of 12 processors
Hello world from processor ekf194, rank 4 out of 12 processors
Hello world from processor ekf355, rank 6 out of 12 processors
Hello world from processor ekf358, rank 7 out of 12 processors
Hello world from processor ekf364, rank 10 out of 12 processors
Hello world from processor ekf361, rank 9 out of 12 processors
Hello world from processor ekf359, rank 8 out of 12 processors
Hello world from processor ekf067, rank 0 out of 12 processors
$ mpirun -perhost 2 -np 24 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
[proxy:0:5@ekf195] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file /lfs/lfs11/tgeerdes/hello (Input/output error)
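
Note that the failing invocation differs from the passing one only in running two ranks per host, which suggests the failure requires two processes on the same client exec'ing the same Lustre-resident binary concurrently. Below is a minimal MPI-free sketch of such a reproducer; the path is the one from the job above, and everything else is illustrative rather than taken from the ticket.

/* Sketch of a reproducer: fork two children that execvp() the same
 * Lustre-resident binary at the same time. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/lfs/lfs11/tgeerdes/hello";
        int i;

        for (i = 0; i < 2; i++) {
                pid_t pid = fork();

                if (pid == 0) {
                        char *argv[] = { (char *)path, NULL };

                        execvp(path, argv);
                        /* Only reached if execvp() fails; EIO here would
                         * match the HYDU_create_process error above. */
                        fprintf(stderr, "execvp %s: %s\n", path,
                                strerror(errno));
                        _exit(1);
                }
        }

        while (wait(NULL) > 0)
                ;
        return 0;
}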


 Comments   
Comment by Peter Jones [ 13/Oct/17 ]

Hi Trent

We would expect this combination to interoperate, and it is included in our regular release testing. When you say 2.10, do you mean 2.10.0 or 2.10.1 clients?

Peter

Comment by Trent Geerdes (Inactive) [ 13/Oct/17 ]

Hi Peter,
2.10.0 clients, and this is on Endeavour.
The only filesystem I'm not able to replicate it on is our highest-performing SSD-based EE 3.1.1 filesystem; the smaller SSD-based and large HDD-based ones all exhibit the issue.

Comment by John Hammond [ 13/Oct/17 ]

Hi Trent, can you attach logs from the clients that experienced the error, the MDT(s), and any OSTs that contained stripes from the executable?

Comment by Trent Geerdes (Inactive) [ 13/Oct/17 ]

The client, MDT, and OSTs don't log any messages related to the failures, just the typical, unrelated client disconnects and reconnects.

Comment by John Hammond [ 16/Oct/17 ]

How easy is it to reproduce this?

Comment by Trent Geerdes (Inactive) [ 16/Oct/17 ]

Pretty easy. Many of our customers have hit it, and I had no trouble reproducing it when I tried.

Comment by John Hammond [ 16/Oct/17 ]

Could you run the following on each client:

lctl set_param debug_mb=256
lctl set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
lctl clear

Then run your reproducer:

date +%s
mpirun -perhost 2 -np 24 -host ...

Then run lctl dk > INTL-313.log on one of the clients where execvp() fails and attach the log file here.

You will probably want to restore your previous debug and debug_mb settings afterwards.

Comment by Trent Geerdes (Inactive) [ 17/Oct/17 ]

I've set the params and captured the log. Attached.

Comment by John Hammond [ 17/Oct/17 ]

Trent, do you mind if I move this to the LU project?

Thanks to your logs, I can reproduce it locally and now understand where this is coming from.

Comment by Trent Geerdes (Inactive) [ 17/Oct/17 ]

Sure, go ahead. Thank you.

Comment by John Hammond [ 17/Oct/17 ]

This is from:

static int ll_xattr_cache_refill(struct inode *inode)
{
        ...
        /* Matched but no cache? Cancelled on error by a parallel refill. */
        if (unlikely(req == NULL)) {
                CDEBUG(D_CACHE, "cancelled by a parallel getxattr\n");
                ll_intent_drop_lock(&oit);
                GOTO(err_unlock, rc = -EIO);
        }
        ...
}

Looking at the code here, we should be returning -EAGAIN instead of -EIO so that ll_getxattr_common() will handle the race.

This affects master as well.
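
For illustration, a minimal sketch of the change described above; the authoritative fix is the patch linked in the comments below, and the caller-side handling is shown only as one plausible shape, with the getxattr_nocache label being illustrative.

        /* In ll_xattr_cache_refill(): report losing the race to a
         * parallel refill as retryable rather than as an I/O error. */
        if (unlikely(req == NULL)) {
                CDEBUG(D_CACHE, "cancelled by a parallel getxattr\n");
                ll_intent_drop_lock(&oit);
                GOTO(err_unlock, rc = -EAGAIN);
        }

        /* In ll_getxattr_common(): on -EAGAIN, fall back to an
         * uncached fetch (one plausible shape) instead of returning
         * the error to the exec path as EIO. */
        rc = ll_xattr_cache_get(inode, name, buffer, size, valid);
        if (rc == -EAGAIN)
                goto getxattr_nocache;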

Comment by Gerrit Updater [ 17/Oct/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29654
Subject: LU-10132 llite: handle xattr cache refill race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9cc8c3a20c547ec75325dde3dd17f4b1dcc66348

Comment by Gerrit Updater [ 26/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29795
Subject: LU-10132 llite: handle xattr cache refill race
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: f9a2f8bc817829416646dc7d3ea3add16055cefe

Comment by Gerrit Updater [ 01/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29654/
Subject: LU-10132 llite: handle xattr cache refill race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3dcb7d098759614ae7deb532e1555bd82dac7936

Comment by Peter Jones [ 01/Nov/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 01/Nov/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29795/
Subject: LU-10132 llite: handle xattr cache refill race
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 78a5d681932e30797775aa10d22fc25b20aa58f7
