
IO execvp errors 2.10 client/EE3.1.1 server

Details


    Description

      Our users are seeing consistent IO errors in very early attempts to scale up jobs that access executables on Lustre volumes. The errors hang the entire job. These instances involve 2.10 clients and EE 3.1.1 servers. I've been able to replicate the errors on several different filesystems. The errors do not occur when loading the EE 3.1.1/2.7 client stack to run the same jobs. Is there any expectation of 2.10 client compatibility with EE 3.1.1 servers?

      Example job:

      $ mpirun -perhost 1 -np 12 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      Hello world from processor ekf087, rank 2 out of 12 processors
      Hello world from processor ekf067, rank 11 out of 12 processors
      Hello world from processor ekf082, rank 1 out of 12 processors
      Hello world from processor ekf195, rank 5 out of 12 processors
      Hello world from processor ekf095, rank 3 out of 12 processors
      Hello world from processor ekf194, rank 4 out of 12 processors
      Hello world from processor ekf355, rank 6 out of 12 processors
      Hello world from processor ekf358, rank 7 out of 12 processors
      Hello world from processor ekf364, rank 10 out of 12 processors
      Hello world from processor ekf361, rank 9 out of 12 processors
      Hello world from processor ekf359, rank 8 out of 12 processors
      Hello world from processor ekf067, rank 0 out of 12 processors
      
      $ mpirun -perhost 2 -np 24 -host ekf067,ekf082,ekf087,ekf095,ekf194,ekf195,ekf355,ekf358,ekf359,ekf361,ekf364 -PSM2 /lfs/lfs11/tgeerdes/hello
      [proxy:0:5@ekf195] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file /lfs/lfs11/tgeerdes/hello (Input/output error)
      

      Attachments

        Activity


          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/29654
          Subject: LU-10132 llite: handle xattr cache refill race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 9cc8c3a20c547ec75325dde3dd17f4b1dcc66348
          jhammond John Hammond added a comment -

          This is from:

          static int ll_xattr_cache_refill(struct inode *inode)
          {
                  ...
                  /* Matched but no cache? Cancelled on error by a parallel refill. */
                  if (unlikely(req == NULL)) {
                          CDEBUG(D_CACHE, "cancelled by a parallel getxattr\n");
                          ll_intent_drop_lock(&oit);
                          GOTO(err_unlock, rc = -EIO);
                  }
                  ...
          }
          

          Looking at the code here, we should be returning -EAGAIN instead of -EIO so that ll_getxattr_common() will handle the race.

          This affects master as well.
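          The retry behavior described above can be illustrated with a small standalone sketch. The function and variable names here are hypothetical stand-ins for the real ll_xattr_cache_refill()/ll_getxattr_common() pair; the point is only the contrast between -EAGAIN (caller retries, race is invisible to userspace) and -EIO (error propagates to execvp()):

          ```c
          #include <errno.h>
          #include <stdio.h>

          /* Simulated cache refill: the first calls lose the race with a
           * parallel refill (the req == NULL case in the real code) and
           * return -EAGAIN rather than -EIO. */
          static int fake_refill(int *races_left)
          {
                  if (*races_left > 0) {
                          (*races_left)--;
                          return -EAGAIN; /* lost the race; caller should retry */
                  }
                  return 0; /* cache filled */
          }

          /* Caller in the spirit of ll_getxattr_common(): retry on -EAGAIN,
           * fail for anything else. With -EIO there is nothing to retry and
           * the error would surface as the execvp() IO error seen above. */
          static int getxattr_common(int *races_left)
          {
                  int rc;

                  do {
                          rc = fake_refill(races_left);
                  } while (rc == -EAGAIN);
                  return rc;
          }

          int main(void)
          {
                  int races = 2; /* lose the race twice before succeeding */
                  int rc = getxattr_common(&races);

                  printf("rc=%d\n", rc);
                  return rc ? 1 : 0;
          }
          ```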


          tgeerdes Trent Geerdes (Inactive) added a comment -

          Sure, go ahead. Thank you.
          jhammond John Hammond added a comment -

          Trent, do you mind if I move this to the LU project?

          Thanks to your logs I can reproduce it locally and now understand where this is coming from.


          tgeerdes Trent Geerdes (Inactive) added a comment -

          I've set the params and captured the log. Attached.
          jhammond John Hammond added a comment -

          Could you run the following on each client:

          lctl set_param debug_mb=256
          lctl set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
          lctl clear
          

          Then run your reproducer:

          date +%s
          mpirun -perhost 2 -np 24 -host ...
          

          Then run lctl dk > INTL-313.log on one of the clients where execvp() fails and attach the log file here.

          You will probably want to restore the original debug and debug_mb settings afterwards.
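          Taken together, the capture steps above can be wrapped in one script that also restores the previous settings. This is only a sketch: LCTL defaults to a dry run that prints the lctl commands instead of applying them (set LCTL=lctl to run them for real), and the log name and mpirun line are taken from the comments above:

          ```shell
          #!/bin/sh
          # Dry run by default: LCTL="echo lctl" just prints each command.
          # Set LCTL=lctl to actually change the debug settings.
          LCTL="${LCTL:-echo lctl}"

          # save the current settings so they can be restored afterwards
          old_debug=$($LCTL get_param -n debug)
          old_mb=$($LCTL get_param -n debug_mb)

          $LCTL set_param debug_mb=256
          $LCTL set_param debug="vfstrace rpctrace dlmtrace net neterror ha trace"
          $LCTL clear

          date +%s
          # ... run the reproducer here, e.g.:
          # mpirun -perhost 2 -np 24 -host ... -PSM2 /lfs/lfs11/tgeerdes/hello

          # dump the kernel debug log on a client where execvp() failed
          $LCTL dk > INTL-313.log

          # restore the previous settings
          $LCTL set_param debug="$old_debug"
          $LCTL set_param debug_mb="$old_mb"
          ```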


          tgeerdes Trent Geerdes (Inactive) added a comment -

          Pretty easy. Many of our customers have hit it, and I didn't have trouble reproducing it when trying.
          jhammond John Hammond added a comment -

          How easy is it to reproduce this?

          tgeerdes Trent Geerdes (Inactive) added a comment - - edited

          The client, MDT, and OSTs don't log any messages related to the failures, just the typical, unrelated client disconnects and reconnects.

          jhammond John Hammond added a comment -

          Hi Trent, can you attach logs from the clients that experienced the error, the MDT(s), and any OSTs that contained stripes of the executable?

          tgeerdes Trent Geerdes (Inactive) added a comment - - edited

          Hi Peter,
          These are 2.10.0 clients, and this is on Endeavour.
          Also, the only FS I'm not able to replicate it on is our highest-performing SSD-based EE 3.1.1 FS. Smaller SSD-based and large HDD-based filesystems all exhibit the issue.


          People

            jhammond John Hammond
            tgeerdes Trent Geerdes (Inactive)
            Votes:
            0
            Watchers:
            8
