Lustre / LU-9429

parallel-scale test_parallel_grouplock: test failed to respond and timed out

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • Fix Version/s: None
    • Severity: 3

      Description

      Seen in every tag 56 (2.9.56) test session that ran test_parallel_grouplock:
      https://testing.hpdd.intel.com/test_sessions/732dde3a-7e28-437b-8865-c350e9438ee4
      https://testing.hpdd.intel.com/test_sessions/f14b71e9-9eda-4053-814d-fdf644925d29
      https://testing.hpdd.intel.com/test_sessions/cb12c60c-613a-44b3-bfef-03c0651d2607
      https://testing.hpdd.intel.com/test_sessions/30cc75b6-594f-4255-accf-24fe11bdd565
      https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824
      https://testing.hpdd.intel.com/test_sessions/4f7e260b-bce2-4834-b77c-a1b47527d05a

      In tag 56 testing, two parallel-scale subtests (test_cascading_rw and test_parallel_grouplock) failed 6 of 6 times.

      In tag 52-55 testing, test_cascading_rw failed intermittently, but test_parallel_grouplock passed 100% of the time.

      All 6 failures showed the same sequence:

      test_cascading_rw: cascading_rw failed! 1 (covered by LU-9367):

      From test_log:

      /usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
      22:55:22: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
      [onyx-48vm1:12185] *** Process received signal ***
      [onyx-48vm1:12185] Signal: Floating point exception (8)
      [onyx-48vm1:12185] Signal code: Integer divide-by-zero (1)
      [onyx-48vm1:12185] Failing at address: 0x4024c8
      [onyx-48vm1:12185] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7f6060bb8370]
      [onyx-48vm1:12185] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
      [onyx-48vm1:12185] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
      [onyx-48vm1:12185] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
      [onyx-48vm1:12185] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6060809b35]
      [onyx-48vm1:12185] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
      [onyx-48vm1:12185] *** End of error message ***
      [onyx-48vm2.onyx.hpdd.intel.com][[59688,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      --------------------------------------------------------------------------
      mpirun noticed that process rank 0 with PID 12185 on node onyx-48vm1.onyx.hpdd.intel.com exited on signal 8 (Floating point exception).
      --------------------------------------------------------------------------
       parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
        = /usr/lib64/lustre/tests/functions.sh:740:run_cascading_rw()
        = /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
        = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
        = /usr/lib64/lustre/tests/parallel-scale.sh:132:main()
      

      The test_cascading_rw failure was then followed by:

      test_parallel_grouplock: test failed to respond and timed out

      From test_log:

      parallel_grouplock subtests -t 11 PASS
      

      Note: subtest 11 passed only in non-DNE configurations.

      Also from test_log:

      CMD: trevis-52vm1.trevis.hpdd.intel.com,trevis-52vm2,trevis-52vm7,trevis-52vm8 lctl clear
      + /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12
      + chmod 0777 /mnt/lustre
      drwxrwxrwx 5 root root 4096 Apr 25 13:00 /mnt/lustre
      + su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 5 /usr/lib64/lustre/tests/parallel_grouplock -g -v -d /mnt/lustre/d0.parallel_grouplock -t 12 "
      /usr/lib64/lustre/tests/parallel_grouplock is running with 5 task(es) in DEBUG mode
      23:38:55: Running test #/usr/lib64/lustre/tests/parallel_grouplock(iter 0)
      23:38:55:	Beginning subtest 12
      

      This was the last activity seen before the test_parallel_grouplock timeout. Nothing obvious was found in any of the console or dmesg logs.

    People

    • Assignee: Zhenyu Xu (bobijam)
    • Reporter: James Casper (casperjx) (Inactive)
    • Votes: 0
    • Watchers: 9
