LU-4308

MPI job causes errors "binary changed while waiting for the page fault lock"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.3
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: RHEL 6.4/MLNX OFED 2.0.2.6.8.10
    • Severity: 4
    • Rank: 11800

    Description

      When an MPI job is run, we see many of these messages: "binary x changed while waiting for the page fault lock". Is this normal behavior or not? It was also reported here:

      https://lists.01.org/pipermail/hpdd-discuss/2013-October/000560.html

      Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1a:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 1 previous similar message
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1b:0x0] changed while waiting for the page fault lock
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 2 previous similar messages
      Nov 25 13:47:07 rhea25 kernel: Lustre: 105942:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1d:0x0] changed while waiting for the page fault lock

          Activity

            jstroik Jesse Stroik added a comment -

            We left the clients in their broken state overnight; same result. Then we tested deleting the file from one of the misbehaving ("bad actor") clients and observed this:

            April 29 @ 17:43
            172.17.1.251 - [0x200001435:0xad:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst
            172.17.1.251 - [0x200001435:0x12e:0x0] - /scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/gfs_namelist

            (rm the file)

            April 30 @ 13:35
            172.17.1.251 - [0x200001435:0xad:0x0] - No file or directory

            The file handle is still stuck open on all nodes where it was unreadable. The stuck nodes cannot read the file, but they can delete it. However, even deleting it doesn't release their file handle.
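
            For reference, the FID-to-path listing above is the kind of output produced by the fid2path interface (lfs fid2path does the same thing). Below is a minimal sketch of that resolution using llapi_fid2path(); the program name, argument handling, and error handling are mine, and it assumes the liblustreapi headers and library are installed (link with -llustreapi).

            ===============
            /* fid2path.c - hedged sketch: resolve a Lustre FID string such as
             * "[0x200001435:0xad:0x0]" back to a path, the same operation as
             * `lfs fid2path`.  Assumes liblustreapi is available; build with
             *   gcc fid2path.c -o fid2path -llustreapi
             */
            #include <limits.h>
            #include <stdio.h>
            #include <lustre/lustreapi.h>

            int main(int argc, char **argv)
            {
                char path[PATH_MAX];
                long long recno = -1;   /* recno/linkno as used by lfs fid2path */
                int linkno = 0;
                int rc;

                if (argc != 3) {
                    fprintf(stderr, "usage: %s <fsname|mount-point> <fid>\n", argv[0]);
                    return 1;
                }

                rc = llapi_fid2path(argv[1], argv[2], path, sizeof(path),
                                    &recno, &linkno);
                if (rc < 0) {
                    fprintf(stderr, "fid2path %s failed: rc = %d\n", argv[2], rc);
                    return 1;
                }
                printf("%s\n", path);
                return 0;
            }
            ===============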

            jstroik Jesse Stroik added a comment - edited

            We are observing a similar issue and sometimes see the same error. It only happens when we run specific jobs that use MPI with OpenMP threads. Those same jobs complete when using pure MPI.

            Symptoms:

            • The file in question cannot be read on some specific clients (but can be read on most other clients).
            • On the nodes where it cannot be read, any process that needs to read /proc/PID/cmdline for the process executing that file blocks (e.g., w, ps); a minimal reader is sketched after this list.
            • The file system can no longer be unmounted.
            • Clients sometimes hold open FIDs that cannot be resolved to paths when we get the list of FIDs held open by the client on the MDS.
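
            The sketch referenced above (the program and its error handling are illustrative; it simply performs the /proc read that w and ps do when printing a command line):

            ===============
            /* cmdline.c - sketch of reading /proc/<pid>/cmdline for a given PID.
             * On a healthy client this returns immediately; on an affected node
             * this is where tools like w and ps hang. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(int argc, char **argv)
            {
                char path[64], buf[4096];
                ssize_t n;
                int fd;

                if (argc != 2) {
                    fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                    return 1;
                }
                snprintf(path, sizeof(path), "/proc/%s/cmdline", argv[1]);

                fd = open(path, O_RDONLY);
                if (fd < 0) {
                    perror("open");
                    return 1;
                }
                /* The blocking, if any, happens in this read(). */
                n = read(fd, buf, sizeof(buf) - 1);
                if (n < 0) {
                    perror("read");
                    return 1;
                }
                /* cmdline is NUL-separated; this prints only the argv[0] part. */
                buf[n] = '\0';
                printf("%s\n", buf);
                close(fd);
                return 0;
            }
            ===============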

            Effectively, we have to power cycle or reboot -fn to get the nodes in question back up.

            The file in question can be read by other Lustre clients. We tested Lustre client and server versions 2.4.2 and 2.5.1, with the same results for both. To reduce the chance that this is an MPI bug, we also tested two versions of Intel MPI (4.1.0.024 and 4.1.3.049) and Intel compilers 13.1 and 14.0-1. To reduce the risk of race conditions, we pin MPI threads to a single socket.

            Here is a bit of strace for cp:

            ===============
            stat("/scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst", {st_mode=S_IFREG|0755, st_size=24534300, ...}) = 0
            stat("./global_fcst", 0x7fffb2d95eb0) = -1 ENOENT (No such file or directory)
            open("/scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst", O_RDONLY) = 3
            fstat(3, {st_mode=S_IFREG|0755, st_size=24534300, ...}) = 0
            open("./global_fcst", O_WRONLY|O_CREAT|O_EXCL, 0755) = 4
            fstat(4, {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
            mmap(NULL, 4202496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5b721c5000
            read(3,
            ===============

            At that point, it hangs indefinitely.
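
            To show that the hang is independent of cp, a minimal read-only reproducer mirroring the strace above (the path is the one from the trace; the program itself is only a sketch):

            ===============
            /* readhang.c - sketch: open the file and read it sequentially,
             * printing progress so it is obvious where the call blocks. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                const char *path =
                    "/scratch/short/jimj/s4_1534/stmp_2013062412_gdas_fcst1/global_fcst";
                char buf[1 << 20];
                ssize_t n;
                long long total = 0;
                int fd = open(path, O_RDONLY);

                if (fd < 0) {
                    perror("open");
                    return 1;
                }
                /* On a healthy client this loop finishes quickly; on a stuck
                 * client the first read() blocks and never returns. */
                while ((n = read(fd, buf, sizeof(buf))) > 0) {
                    total += n;
                    fprintf(stderr, "read %lld bytes so far\n", total);
                }
                if (n < 0)
                    perror("read");
                close(fd);
                return 0;
            }
            ===============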

            For us, this may be a blocking issue. We're not sure exactly where the blame lies, but it does seem suspicious that these systems can neither flush the bad pages nor read files that other Lustre clients can read. We can't go into production with this hardware until we resolve this.


            hilljjornl Jason Hill (Inactive) added a comment -

            Hate to reopen something we said to close last week, but we have another occurrence of this issue on the same cluster. We are consulting with the sysadmin for that cluster, and likely with the user, to discuss the question from Zhenyu.

            Mar 6 16:49:55 rhea101 kernel: Lustre: 24431:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20006c6eb:0xc:0x0] changed while waiting for the page fault lock


            -Jason

            pjones Peter Jones added a comment -

            As per ORNL, OK to close.

            bobijam Zhenyu Xu added a comment -

            Blake,

            This message means that a memory-mapped (mmapped) executable file is being read while other threads may be changing its contents. Is this what the MPI job is meant to do, i.e., generate executable files while other jobs are reading them? A rough illustration of that pattern is sketched below.
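
            A minimal sketch of that pattern, assuming a hypothetical file path and a fork-based writer (whether the console message actually fires depends on timing and on the writer being a separate thread or client):

            ===============
            /* mmap_race.c - hedged sketch: one process maps a file read-only and
             * private (the way the kernel maps an executable) and repeatedly
             * faults its pages in, while another process rewrites the same file
             * in place.  The path is hypothetical. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/mman.h>
            #include <sys/wait.h>
            #include <unistd.h>

            int main(void)
            {
                const char *path = "/scratch/short/demo_binary"; /* hypothetical */
                const size_t len = 4096;
                char junk[4096];
                char *map;
                volatile char sum = 0;
                int fd, i;

                fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0755);
                if (fd < 0 || ftruncate(fd, len) < 0) {
                    perror("setup");
                    return 1;
                }

                if (fork() == 0) {
                    /* child: keep rewriting the file in place while the parent maps it */
                    memset(junk, 0xAA, sizeof(junk));
                    for (i = 0; i < 1000; i++)
                        pwrite(fd, junk, sizeof(junk), 0);
                    _exit(0);
                }

                /* parent: map the file like an executable and keep faulting its
                 * first page back in; on a Lustre client these faults go through
                 * the vvp_io fault path where the message above is printed. */
                map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
                if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
                }
                for (i = 0; i < 1000; i++) {
                    sum += map[0];                    /* touch: faults if not resident */
                    madvise(map, len, MADV_DONTNEED); /* drop the page so the next touch refaults */
                }
                munmap(map, len);
                wait(NULL);
                close(fd);
                return 0;
            }
            ===============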

            pjones Peter Jones added a comment -

            Thanks for clarifying, Blake.

            Alex

            Could you please comment?

            Thanks

            Peter


            blakecaldwell Blake Caldwell added a comment -

            It should be severity 4.

            pjones Peter Jones added a comment -

            Hi Blake

            I will get an engineer to comment ASAP, but just to clarify - is this really an S1 situation (i.e., a production cluster is completely inoperable)?

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: blakecaldwell Blake Caldwell
              Votes: 5
              Watchers: 23
