parallel-scale test_cascading_rw: cascading_rw failed! 1

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/45d48942-2507-11e7-9de9-5254006e85c2.

      The sub-test test_cascading_rw failed with the following error:

      cascading_rw failed! 1
      

      server/client: lustre-master #3558 ldiskfs el7

      test log

      + su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 4 /usr/lib64/lustre/tests/cascading_rw -g -d /mnt/lustre/d0.cascading_rw -n 300 "
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A deprecated MCA parameter value was specified in an MCA parameter
      file.  Deprecated MCA parameters should be avoided; they may disappear
      in future releases.
      
        Deprecated parameter: plm_rsh_agent
      --------------------------------------------------------------------------
      /usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
      23:47:45: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
      [trevis-55vm1:21694] *** Process received signal ***
      [trevis-55vm1:21694] Signal: Floating point exception (8)
      [trevis-55vm1:21694] Signal code: Integer divide-by-zero (1)
      [trevis-55vm1:21694] Failing at address: 0x4024c8
      [trevis-55vm1:21694] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7fdf9fad6370]
      [trevis-55vm1:21694] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
      [trevis-55vm1:21694] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
      [trevis-55vm1:21694] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
      [trevis-55vm1:21694] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fdf9f727b35]
      [trevis-55vm1:21694] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
      [trevis-55vm1:21694] *** End of error message ***
      [trevis-55vm1.trevis.hpdd.intel.com][[36239,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      [trevis-55vm2.trevis.hpdd.intel.com][[36239,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      --------------------------------------------------------------------------
      mpirun noticed that process rank 0 with PID 21694 on node trevis-55vm1.trevis.hpdd.intel.com exited on signal 8 (Floating point exception).
      --------------------------------------------------------------------------
       parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4905:error()
        = /usr/lib64/lustre/tests/functions.sh:734:run_cascading_rw()
        = /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
        = /usr/lib64/lustre/tests/test-framework.sh:5181:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5220:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5067:run_test()
        = /usr/lib64/lustre/tests/parallel-scale.sh:132:main()
      

          Activity

            [LU-9367] parallel-scale test_cascading_rw: cascading_rw failed! 1
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26915/
            Subject: LU-9367 llite: restore ll_file_getstripe in ll_lov_setstripe
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 364ec95f3688ac5cc3195f7f46d0d860844796f9


            jcasper James Casper (Inactive) added a comment -

            This might be related to LU-9429. The parallel_grouplock subtest started failing in tag testing after 2017-04-05. In each case, parallel_grouplock times out after cascading_rw fails.

            tag 55 test (b3550): 2017-04-05 (parallel_grouplock 100% passing)
            several PFL landings: 2017-04-08
            tag 56 test (b3565): 2017-04-23 (parallel_grouplock 100% failing)

            bobijam Zhenyu Xu added a comment -

            And

            #define LL_IOC_LOV_SETSTRIPE _IOW ('f', 154, long)
            #define LL_IOC_LOV_GETSTRIPE _IOW ('f', 155, long)

            shows that LL_IOC_LOV_SETSTRIPE should be a write-only ioctl(). It also seems that the LL_IOC_LOV_GETSTRIPE interface definition is in error, since it is declared with _IOW as well.

            bobijam Zhenyu Xu added a comment -

            I am fine with restoring the call to ll_file_getstripe(). I have always thought of get/set methods as having a one-directional data flow, and with that in mind, ll_lov_setstripe() passing lum back to the caller looks like a side effect of that function to me.


            adilger Andreas Dilger added a comment -

            I'm not sure why you consider this to be a side effect? There has been code to explicitly call getstripe and return this to userspace forever. Why not just restore the call to ll_file_getstripe()?


            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26915
            Subject: LU-9367 mpi: get rid of SETSTRIPE side effect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7a493a89ea6bc5ea9f7429d9507891741905d4d8


            People

              bobijam Zhenyu Xu
              maloo Maloo