Lustre / LU-4308

MPI job causes errors "binary changed while waiting for the page fault lock"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.5.3
    • Lustre 2.4.1
    • None
    • Environment: RHEL 6.4/MLNX OFED 2.0.2.6.8.10
    • Severity: 4
    • 11800

    Description

      When an MPI job is run, we see many of these messages: "binary X changed while waiting for the page fault lock." Is this normal behavior or not? It was also reported here:

      https://lists.01.org/pipermail/hpdd-discuss/2013-October/000560.html

      Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1a:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 1 previous similar message
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1b:0x0] changed while waiting for the page fault lock
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 2 previous similar messages
      Nov 25 13:47:07 rhea25 kernel: Lustre: 105942:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1d:0x0] changed while waiting for the page fault lock
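
The console lines above identify the affected files only by FID. On a client with the filesystem mounted, each FID can be mapped back to a pathname with `lfs fid2path`. A minimal sketch of the triage step, assuming the messages sit in a log excerpt and the mount point is /mnt/lustre (both are assumptions, not part of the report):

```shell
# Extract the unique FIDs from the "changed while waiting" console
# messages. The sample log lines are copied from the report above;
# in practice they would come from /var/log/messages or dmesg.
log='Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock'

printf '%s\n' "$log" |
    grep -o '\[0x[0-9a-f]*:0x[0-9a-f]*:0x[0-9a-f]*\]' |
    sort -u
# Each FID can then be resolved on a Lustre client, e.g.:
#   lfs fid2path /mnt/lustre '[0x20000f81c:0x18:0x0]'
```

This identifies which binaries are being modified while mapped, which is the question the discussion below turns on.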

      Attachments

        Issue Links

          Activity


            hilljjornl Jason Hill (Inactive) added a comment -

            Hate to reopen something we said to close last week, but we have another occurrence of this issue on the same cluster. We are consulting with the sysadmin for that cluster, and likely with the user, to discuss the question from Zhenyu.

            Mar 6 16:49:55 rhea101 kernel: Lustre: 24431:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20006c6eb:0xc:0x0] changed while waiting for the page fault lock

            -Jason
            pjones Peter Jones added a comment -

            As per ORNL ok to close

            bobijam Zhenyu Xu added a comment -

            Blake,

            This message means that an mmapped executable file is being read while other threads may be changing its contents. Is this what the MPI job is meant to do, i.e., generating executable files while other jobs read them?

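
The situation described in that comment can be sketched outside of Lustre with an ordinary shared file mapping: read through the mapping, rewrite the file in place, and the mapping reflects the new contents. A minimal Unix-only illustration, not Lustre code; the file name and sizes are arbitrary, and the inline rewrite stands in for the other thread or client:

```python
# Sketch of the access pattern: a file's contents change underneath an
# existing memory mapping. In the MPI case the rewrite would come from
# another process or client; here it is done inline so the effect is
# deterministic. Linux/Unix only (uses PROT_READ).
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mapped.bin")
with open(path, "wb") as f:
    f.write(b"A" * 4096)                  # initial "executable" image

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 4096, prot=mmap.PROT_READ) as m:
        before = bytes(m[:4])             # read through the mapping

        # Another writer rewrites the file in place while it is mapped.
        with open(path, "r+b") as w:
            w.write(b"B" * 4096)

        after = bytes(m[:4])              # the shared mapping now shows
                                          # the rewritten bytes

print(before, after)
```

On Linux the shared mapping and `write()` go through the same page cache, so the reader observes the rewrite mid-use; Lustre detects the analogous change during its page-fault handling and emits the console message above.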
            pjones Peter Jones added a comment -

            Thanks for clarifying Blake.

            Alex

            Could you please comment?

            Thanks

            Peter


            blakecaldwell Blake Caldwell added a comment -

            It should be severity 4.
            pjones Peter Jones added a comment -

            Hi Blake

            I will get an engineer to comment ASAP, but just to clarify - is this really an S1 situation (i.e. a production cluster is completely inoperable)?

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: blakecaldwell Blake Caldwell
              Votes: 5
              Watchers: 23
