LU-4308

MPI job causes errors "binary changed while waiting for the page fault lock"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.3
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: RHEL 6.4/MLNX OFED 2.0.2.6.8.10
    • 4
    • 11800

    Description

      When an MPI job is run, we see many of these messages: "binary X changed while waiting for the page fault lock." Is this normal behavior or not? It was also reported here:

      https://lists.01.org/pipermail/hpdd-discuss/2013-October/000560.html

      Nov 25 13:46:50 rhea25 kernel: Lustre: 105703:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x18:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:53 rhea25 kernel: Lustre: 105751:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x19:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1a:0x0] changed while waiting for the page fault lock
      Nov 25 13:46:57 rhea25 kernel: Lustre: 105803:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 1 previous similar message
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1b:0x0] changed while waiting for the page fault lock
      Nov 25 13:47:00 rhea25 kernel: Lustre: 105846:0:(vvp_io.c:699:vvp_io_fault_start()) Skipped 2 previous similar messages
      Nov 25 13:47:07 rhea25 kernel: Lustre: 105942:0:(vvp_io.c:699:vvp_io_fault_start()) binary [0x20000f81c:0x1d:0x0] changed while waiting for the page fault lock
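
      The FID in square brackets (e.g. [0x20000f81c:0x18:0x0]) identifies the file the page fault was taken against; on a client it can usually be mapped back to a pathname with lfs fid2path. A minimal sketch, assuming a hypothetical mount point /lustre:

      # resolve the FID from the console message to a path (mount point and FID are examples from above)
      lfs fid2path /lustre "[0x20000f81c:0x18:0x0]"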

      Attachments

        Issue Links

          Activity

            [LU-4308] MPI job causes errors "binary changed while waiting for the page fault lock"

            kjstrosahl Kurt J. Strosahl (Inactive) added a comment -

            We are seeing this issue on clients running 2.5.3:

            (vvp_io.c:694:vvp_io_fault_start()) binary [0x20000aaa6:0x1d668:0x0] changed while waiting for the page fault lock

            It manifested while the file system was under high load.

            jstroik Jesse Stroik added a comment -

            I wanted to report back on our issue because it may be related.

            Our original observation was that some of our Lustre clients would deadlock when running MPI+OpenMP executables. The executable on those clients could only be partially read, and would hang indefinitely on copy or access to /proc/<pid>/exe or /proc/<pid>/cmdline.

            We upgraded to 2.5.2 and applied the aforementioned patch and observed no change.

            We tested the following workarounds, some of which were entirely successful:

            (1) enabling 'localflock' as a mount flag for those clients was completely successful (see the command sketch at the end of this comment).

            (2) hosting the executable on an NFS mount was completely successful.

            (3) upgrading to Lustre 2.6.0 was completely successful.

            (4) running Lustre 2.1.6 with khugepaged disabled mitigated the issue to a large extent, with rare observed deadlocks.

            (5) we tried running on another Lustre file system (Lustre 2.4.2 servers running ZFS instead of Lustre 2.5.1 running ldiskfs) but did not notice any improvement to the client deadlock issue.

            We settled on running Lustre 2.6.0 on the clients because we also observed a performance increase when using it.

            NOTE: this issue may have been related to an irregularity we observed in slurmd, for which we have also found a workaround.
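
            As a rough sketch of what workarounds (1) and (4) look like on a client: the MGS NID, filesystem name, and mount point below are placeholders, and the transparent hugepage sysfs path varies by kernel (the RHEL 6 location is shown).

            # (1) mount the client with client-local flock semantics (NID, fsname, and mount point are placeholders)
            mount -t lustre -o localflock mgsnode@o2ib:/scratch /lustre/scratch

            # (4) disable transparent hugepages so khugepaged stops collapsing pages
            # (RHEL 6 path; other kernels use /sys/kernel/mm/transparent_hugepage/enabled)
            echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled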

            pjones Peter Jones added a comment -

            The patch that has reportedly been successful at a couple of sites has been landed for 2.5.3. It is not believed that this same issue exists in 2.6 or newer releases, so equivalent changes are not needed on master. If there are still residual issues affecting 2.5.x releases then please open a new ticket to track those - thanks!

            tomtervo Tommi Tervo added a comment -

            I applied the patch to a 2.5.2 client but the problem persists. It was VASP MPI jobs that triggered this error.

            Lustre: 62291:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x201c53daa:0x1d27:0x0] changed while waiting for the page fault lock
            Lustre: 62291:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 1 previous similar message
            Lustre: 62618:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x201c53daa:0x1d27:0x0] changed while waiting for the page fault lock

            parsonsa@bit-sys.com Aron Parsons added a comment -

            We've been running 2.5.1 client packages with this patch included in two separate environments for the past month. It has eliminated these error messages.

            bobijam Zhenyu Xu added a comment -

            Patch for the b2_5 branch: http://review.whamcloud.com/11098

            lflis Lukasz Flis added a comment -

            @Zhenyu Xu: logs related to the following object have been uploaded to FTP:
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock

            Please find the logs here: ftp://ftp.whamcloud.com/uploads/lu-4308.cyfronet.log.gz

            lflis Lukasz Flis added a comment -

            Is there a patch for the 2.5.1 client?
            We are observing the same issues at Cyfronet with 2.5.1 clients and MPI jobs:

            Jul 14 19:24:56 n1043-amd kernel: Lustre: 21837:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:24:56 n1043-amd kernel: Lustre: 21837:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 54 previous similar messages
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21813:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21812:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21812:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 75 previous similar messages
            Jul 14 19:34:59 n1043-amd kernel: Lustre: 21813:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 75 previous similar messages
            Jul 14 19:36:47 n1043-amd kernel: LustreError: 11-0: scratch-MDT0000-mdc-ffff882834414c00: Communicating with 172.16.193.1@o2ib, operation mds_get_info failed with -1119251304.
            Jul 14 19:36:47 n1043-amd kernel: LustreError: Skipped 4 previous similar messages
            Jul 14 19:45:00 n1043-amd kernel: Lustre: 21825:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:45:00 n1043-amd kernel: Lustre: 21825:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 71 previous similar messages
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) binary [0x20d545086:0x191e4:0x0] changed while waiting for the page fault lock
            Jul 14 19:55:02 n1043-amd kernel: Lustre: 21828:0:(vvp_io.c:692:vvp_io_fault_start()) Skipped 375 previous similar messages

            bobijam Zhenyu Xu added a comment -

            Also, would you please try this patch: http://review.whamcloud.com/10483 ?

            bobijam Zhenyu Xu added a comment -

            Is it easy to reproduce? Can you collect -1 debug logs with as simple a reproduction procedure as possible and upload the logs? Thank you.
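
            For anyone gathering these, collecting full ("-1") debug logs on an affected client would look roughly like the following; the dump file name is arbitrary and the reproduction step is whatever MPI job triggers the message:

            # enable all Lustre debug flags, clear the buffer, reproduce, then dump the kernel debug log
            lctl set_param debug=-1
            lctl clear
            # ... run the MPI job that triggers the vvp_io_fault_start() message ...
            lctl dk > /tmp/lu-4308-client-debug.log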


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: blakecaldwell Blake Caldwell
              Votes: 5
              Watchers: 23
