LU-4308: MPI job causes errors "binary changed while waiting for the page fault lock"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version(s): Lustre 2.5.3, Lustre 2.4.1
    • Labels: None
    • Environment: RHEL 6.4/MLNX OFED 2.0.2.6.8.10
    • 4
    • 11800

    Description

      When an MPI job is run, we see many of these messages: "binary x changed while waiting for the page fault lock." Is this normal behavior or not? It was also reported here:

      https://lists.01.org/pipermail/hpdd-discuss/2013-October/000560.html

      Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1a:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 1 previous similar message
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1b:0x0] changed while waiting for the page fault lock
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 2 previous similar messages
      Nov 25 13:47:07 rhea25 kernel: Lustre: 105942:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1d:0x0] changed while waiting for the page fault lock
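
      For reference, the message is emitted by vvp_io_fault_start() in lustre/llite/vvp_io.c (the file and line number appear in each console line). Below is a paraphrased sketch of the check, not a verbatim copy of the Lustre source, with approximate identifier names: the fault handler records the executable's mtime when the page fault starts, and once the page fault lock is obtained it warns if the mtime changed while it was waiting. The "Skipped N previous similar messages" lines are ordinary kernel console rate limiting.

      ===============
      /* Paraphrased sketch of the check in vvp_io_fault_start()
       * (lustre/llite/vvp_io.c); identifiers are approximate, not a
       * verbatim copy of the Lustre source. If the binary backing an
       * executable mapping was modified while the fault waited for
       * the page fault lock, warn on the console. */
      if (fio->ft_executable &&
          LTIME_S(inode->i_mtime) != vio->u.fault.ft_mtime)
              CWARN("binary "DFID" changed while waiting for the "
                    "page fault lock\n",
                    PFID(lu_object_fid(&obj->co_lu)));
      ===============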

    Activity

            tomtervo Tommi Tervo added a comment -

            I applied the patch to a 2.5.2 client, but the problem persists. VASP MPI jobs triggered this error.

            Lustre: 62291:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x201c53daa:0x1d27:0x0] changed while waiting for the page fault lock
            Lustre: 62291:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 1 previous similar message
            Lustre: 62618:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x201c53daa:0x1d27:0x0] changed while waiting for the page fault lock

            parsonsa@bit-sys.com Aron Parsons added a comment -

            We've been running 2.5.1 client packages with this patch included in two separate environments for the past month, and it has eliminated these error messages.

            bobijam Zhenyu Xu added a comment -

            Patch for the b2_5 branch: http://review.whamcloud.com/11098

            lflis Lukasz Flis added a comment -

            @Zhenyu Xu: logs related to the following object have been uploaded to ftp:
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock

            Please find the logs here: ftp://ftp.whamcloud.com/uploads/lu-4308.cyfronet.log.gz

            lflis Lukasz Flis added a comment -

            Is there a patch for the 2.5.1 client?
            We are observing the same issues at Cyfronet with the 2.5.1 client and MPI jobs:

            Jul 14 19:24:56 n1043-amd kernel: Lustre: 21837:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:24:56 n1043-amd kernel: Lustre: 21837:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 54 previous similar messages
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21813:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21812:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21812:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 75 previous similar messages
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21813:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 75 previous similar messages
            Jul 14 19:36:47 n1043-amd kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff882834414c00: Communicating with 172.16.193.1@o2ib, operation mds_get_info failed with -1119251304.
            Jul 14 19:36:47 n1043-amd kernel: LustreError: Skipped 4 previous similar messages
            Jul 14 19:45:00 n1043-amd kernel: Lustre: 21825:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:45:00 n1043-amd kernel: Lustre: 21825:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 71 previous similar messages
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 375 previous similar messages

            bobijam Zhenyu Xu added a comment -

            Also, would you please try this patch: http://review.whamcloud.com/10483

            bobijam Zhenyu Xu added a comment -

            Is it easy to reproduce? Can you collect full debug logs (debug mask -1) with as simple a reproduction procedure as possible and upload them? Thank you.

            jstroik Jesse Stroik added a comment -

            I'm not confident that our issue, which I suspect stems from a bug in the Lustre client, has anything to do with the 'binary changed...' error.

            We did, however, revert our clients to 2.1.6 (still running server 2.5.1) and we no longer observe the same hanging file / hanging mount problem.

            jstroik Jesse Stroik added a comment -

            We left the clients in their broken state overnight; same deal. Then we tested deleting the file from a bad actor and observed this:

            April 29 @ 17:43
            172.17.1.251 - [0x200001435:0xad:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst
            172.17.1.251 - [0x200001435:0x12e:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/gfs_namelist

            (rm the file)

            April 30 @ 13:35
            172.17.1.251 - [0x200001435:0xad:0x0] - No file or directory

            The file handle is still stuck open on all nodes where it was unreadable. The stuck nodes cannot read the file, but they can delete it. However, even deleting it doesn't release their file handle.
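
            As an aside for anyone cross-checking listings like the one above: a FID can be resolved to a path (or shown to be unresolvable) with "lfs fid2path", or programmatically through liblustreapi. A minimal sketch, assuming liblustreapi is installed (build with -llustreapi); the mount point and FID are placeholders taken from the listing:

            ===============
            /* Minimal sketch, assuming liblustreapi is installed: resolve a
             * FID back to a path, the programmatic equivalent of
             * "lfs fid2path". The mount point and FID are placeholders. */
            #include <stdio.h>
            #include <lustre/lustreapi.h>

            int main(void)
            {
                    const char *mnt = "/scratch";               /* Lustre mount point */
                    const char *fid = "[0x200001435:0xad:0x0]"; /* FID from the listing */
                    char path[4096];
                    long long recno = -1;   /* start from any changelog record */
                    int linkno = 0;         /* which hard link to resolve */

                    int rc = llapi_fid2path(mnt, fid, path, sizeof(path),
                                            &recno, &linkno);
                    if (rc < 0) {
                            fprintf(stderr, "fid2path failed: rc = %d\n", rc);
                            return 1;
                    }
                    printf("%s/%s\n", mnt, path); /* path is relative to the mount root */
                    return 0;
            }
            ===============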

            jstroik Jesse Stroik added a comment - edited

            We are observing a similar issue and sometimes see the same error. It only happens when we run specific jobs that use MPI with OpenMP threads. Those same jobs complete when using pure MPI.

            Symptoms:

            • The file in question cannot be read on some specific clients (but can be on most other clients).
            • On the nodes where it cannot be read, processes that need to access /proc/PID/cmdline for the process executing that file block (w, ps, etc.).
            • The file system can no longer be unmounted.
            • Sometimes clients hold open FIDs that cannot be resolved to paths when we get the list of FIDs held open by the client on the MDS.

            Effectively, we have to power cycle or reboot -fn to get the nodes in question back up.

            The file in question can be read by other Lustre clients. We tested Lustre client and server versions 2.4.2 and 2.5.1, with the same results for both. To reduce the chance that this is an MPI bug, we also tested two versions of Intel MPI (4.1.0.024 and 4.1.3.049) and Intel compilers 13.1 and 14.0-1. To reduce the risk of race conditions, we pin MPI threads to a single socket.

            Here is a bit of strace for cp:

            ===============
            stat("/scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst", {st_mode=S_IFREG|0755, st_size=24534300, ...}) = 0
            stat("./global_fcst", 0x7fffb2d95eb0) = -1 ENOENT (No such file or directory)
            open("/scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst", O_RDONLY) = 3
            fstat(3, {st_mode=S_IFREG|0755, st_size=24534300, ...}) = 0
            open("./global_fcst", O_WRONLY|O_CREAT|O_EXCL, 0755) = 4
            fstat(4, {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
            mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5b721c5000
            read(3,
            ===============

            At that point, it hangs indefinitely.
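
            To isolate the hang from cp itself, a stripped-down check that just opens the file and issues a single read(), mirroring the syscalls above, can confirm whether a given client can read the file at all. This is a hypothetical sketch, not something from the original report:

            ===============
            /* Hypothetical minimal check (not from the original report): open
             * the suspect file and issue one read(), mirroring the syscalls
             * where cp hangs above. On an affected client this read() should
             * block indefinitely as well. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(int argc, char **argv)
            {
                    char buf[4096];

                    if (argc < 2) {
                            fprintf(stderr, "usage: %s <file>\n", argv[0]);
                            return 1;
                    }
                    int fd = open(argv[1], O_RDONLY);       /* e.g. .../global_fcst */
                    if (fd < 0) {
                            perror("open");
                            return 1;
                    }
                    ssize_t n = read(fd, buf, sizeof(buf)); /* blocks on a bad client */
                    printf("read returned %zd\n", n);
                    close(fd);
                    return 0;
            }
            ===============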

            For us, this may be a blocking issue. We're not sure exactly where the blame lies, but it does seem suspect that these systems cannot flush bad pages or read files that other Lustre clients can read. We can't go into production with this hardware until we resolve this.


            hilljjornl Jason Hill (Inactive) added a comment -

            Hate to reopen something we said to close last week, but we have another occurrence of this issue on the same cluster. We are consulting with the sysadmin for that cluster, and will likely discuss Zhenyu's question with the user.

            Mar 6 16:49:55 rhea101 kernel: Lustre: 24431:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20006c6eb:0xc:0x0] changed while waiting for the page fault lock


            -Jason


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: blakecaldwell Blake Caldwell
              Votes: 5
              Watchers: 23
