Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.4.1
- None
- RHEL 6.4/MLNX OFED 2.0.2.6.8.10
- 4
- 11800
Description
When an MPI job is run, we see many messages of the form "binary x changed while waiting for the page fault lock." Is this normal behavior or not? It was also reported here:
https://lists.01.org/pipermail/hpdd-discuss/2013-October/000560.html
Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock
Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1a:0x0] changed while waiting for the page fault lock
Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 1 previous similar message
Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1b:0x0] changed while waiting for the page fault lock
Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 2 previous similar messages
Nov 25 13:47:07 rhea25 kernel: Lustre: 105942:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1d:0x0] changed while waiting for the page fault lock
Issue Links
- is related to
  - LU-7198 vvp_io.c:701:vvp_io_fault_start()) binary changed while waiting for the page fault lock (Resolved)
We left the clients in their broken state overnight. Same deal. Then we tested deleting the file from a bad actor and observed this:
April 29 @ 17:43
172.17.1.251 - [0x200001435:0xad:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst
172.17.1.251 - [0x200001435:0x12e:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/gfs_namelist
(rm the file)
April 30 @ 13:35
172.17.1.251 - [0x200001435:0xad:0x0] - No file or directory
The file handle is still stuck open on all nodes where it was unreadable. The stuck nodes cannot read the file, but they can delete it. However, even deleting it doesn't release their file handle.