[LU-3027] Failure on test suite parallel-scale test_write_disjoint: invalid file size 140329 instead of 160376 = 20047 * 8

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.2
    • Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0
    • None
    • 3
    • 7390

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2ed1fef2-94bd-11e2-93c6-52540035b04c.

      The sub-test test_write_disjoint failed with the following error:

      write_disjoint failed! 1

      test log shows:

      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      librdmacm: Fatal: no RDMA devices found
      loop 0: chunk_size 103399
      [client-27vm6.lab.whamcloud.com:00935] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
      [client-27vm6.lab.whamcloud.com:00935] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
      loop 79: chunk_size 71702, file size was 573616
      rank 2, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
      loop 79: chunk_size 71702, file size was 573616
      rank 4, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
      loop 79: chunk_size 71702, file size was 573616
      rank 6, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
      loop 79: chunk_size 71702, file size was 573616
      rank 0, loop 80: invalid file size 140329 instead of 160376 = 20047 * 8
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD 
      with errorcode -1.
      

      LU-2453 looks like a similar issue, seen on the b2_1 branch.
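
The sizes in the test log can be checked directly. The sketch below assumes, as the `20047 * 8` in the error message suggests, that the test expects the file size to equal the chunk size times the number of ranks (8 in this run). The helper names are hypothetical and are not taken from write_disjoint itself. Under that assumption, the observed size 140329 is exactly 7 * 20047, one chunk short of the expected size. The ticket does not state a root cause, so this is only an arithmetic observation.

```python
# Hypothetical helpers for checking the sizes reported in the test log.
# Assumption: expected file size = chunk_size * number of ranks.

def expected_size(chunk_size: int, nprocs: int) -> int:
    """File size the test expects once every rank has written its chunk."""
    return chunk_size * nprocs


def missing_chunks(observed: int, chunk_size: int, nprocs: int):
    """Return how many whole chunks the observed size falls short by,
    or None if the shortfall is not a whole number of chunks."""
    shortfall = expected_size(chunk_size, nprocs) - observed
    count, remainder = divmod(shortfall, chunk_size)
    return count if remainder == 0 else None


if __name__ == "__main__":
    # Loop 79 passed: chunk_size 71702, file size was 573616.
    print(expected_size(71702, 8))
    # Loop 80 failed: observed 140329, expected 160376 = 20047 * 8.
    print(expected_size(20047, 8))
    print(missing_chunks(140329, 20047, 8))
```

For loop 80 the shortfall is a single chunk, while the loop 79 size matches the expected value exactly.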

          Activity

            paf Patrick Farrell (Inactive) added a comment - Interesting - Thanks, Jian.
            yujian Jian Yu added a comment - Hi Patrick, Lustre b2_4 build #48 contains those two patches: http://build.whamcloud.com/job/lustre-b2_4/48/. Since that build, parallel-scale test write_disjoint has passed:
              https://maloo.whamcloud.com/test_sets/35b41124-45a7-11e3-b22a-52540035b04c
              https://maloo.whamcloud.com/test_sets/484f9162-4590-11e3-8713-52540035b04c
              https://maloo.whamcloud.com/test_sets/c5f43484-4602-11e3-b5e8-52540035b04c
              https://maloo.whamcloud.com/test_sets/1c537816-47f3-11e3-bc81-52540035b04c

            paf Patrick Farrell (Inactive) added a comment - Jian - Are you saying that when you added 7569 and 7841 to b2_4 you no longer see the failure? I do not see those patches as landed to b2_4 in Gerrit, and Cray is still seeing the closely related LU-3889 with both patches landed in 2.4.
            yujian Jian Yu added a comment - Patches http://review.whamcloud.com/7569 and http://review.whamcloud.com/7841 landed on Lustre b2_4 branch. The failure was fixed.
            yujian Jian Yu added a comment - Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
              The failure still occurred regularly on Lustre b2_4 branch:
              https://maloo.whamcloud.com/test_sets/0039b106-4332-11e3-9490-52540035b04c
              https://maloo.whamcloud.com/test_sets/762409ba-4333-11e3-8676-52540035b04c
            pjones Peter Jones added a comment - Landed for 2.5
            sarah Sarah Liu added a comment - Here is the result of parallel-scale test_write_disjoint: https://maloo.whamcloud.com/test_sessions/5602b272-303b-11e3-b28a-52540035b04c

            patch is at: http://review.whamcloud.com/7841, please give it a try.

            jay Jinshan Xiong (Inactive) added a comment - Patch is at http://review.whamcloud.com/7841; please give it a try.
            sarah Sarah Liu added a comment - edited - Hmm, for LU-3027, tag-2.4.93 (http://build.whamcloud.com/job/lustre-master/1687/) should have the fix, but I still hit this error: https://maloo.whamcloud.com/test_sets/9a628942-272b-11e3-88c6-52540035b04c
            pjones Peter Jones added a comment - Patch landed for 2.5.0

            People

              green Oleg Drokin
              maloo Maloo