Lustre / LU-701

parallel-scale test_write_disjoint fails due to invalid file size

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.4.0, Lustre 1.8.7, Lustre 2.5.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 5514

    Description

      During v2_1_0_0_RC2 testing, write_disjoint hit MPI_ABORT for an unknown reason. The report contains no console log or syslog at all (Maloo bug?).

      Report: https://maloo.whamcloud.com/test_sets/44dc4934-e440-11e0-9909-52540025f9af

      == parallel-scale test write_disjoint: write_disjoint == 14:43:05 (1316554985)
      OPTIONS:
      WRITE_DISJOINT=/usr/lib64/lustre/tests/write_disjoint
      clients=fat-intel-1vm1,fat-intel-1vm2
      wdisjoint_THREADS=4
      wdisjoint_REP=10000
      MACHINEFILE=/tmp/parallel-scale.machines
      fat-intel-1vm1
      fat-intel-1vm2
      + /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000
      UUID Inodes IUsed IFree IUse% Mounted on
      lustre-MDT0000_UUID 5000040 86 4999954 0% /mnt/lustre[MDT:0]
      lustre-OST0000_UUID 167552 10974 156578 7% /mnt/lustre[OST:0]
      lustre-OST0001_UUID 167552 11326 156226 7% /mnt/lustre[OST:1]
      lustre-OST0002_UUID 167552 3807 163745 2% /mnt/lustre[OST:2]
      lustre-OST0003_UUID 167552 4830 162722 3% /mnt/lustre[OST:3]
      lustre-OST0004_UUID 167552 3806 163746 2% /mnt/lustre[OST:4]
      lustre-OST0005_UUID 167552 3646 163906 2% /mnt/lustre[OST:5]
      lustre-OST0006_UUID 167552 3806 163746 2% /mnt/lustre[OST:6]

      filesystem summary: 5000040 86 4999954 0% /mnt/lustre

      + chmod 0777 /mnt/lustre
      drwxrwxrwx 7 root root 4096 Sep 20 14:43 /mnt/lustre
      + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca boot ssh -mca btl tcp,self -np 8 -machinefile /tmp/parallel-scale.machines /usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000 "
      loop 0: chunk_size 103399
      loop 1000: chunk_size 69125
      loop 2000: chunk_size 104360
      loop 3000: chunk_size 11295
      loop 4000: chunk_size 77918
      loop 5000: chunk_size 27295
      loop 6000: chunk_size 42065
      loop 7000: chunk_size 82749
      loop 8000: chunk_size 94370
      loop 9000: chunk_size 107226
      loop 9371: chunk_size 25301, file size was 202408
      rank 5, loop 9372: invalid file size 801136 instead of 915584 = 114448 * 8
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
      with errorcode -1.

      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 5 with PID 30944 on
      node fat-intel-1vm2 exiting without calling "finalize". This may
      have caused other processes in the application to be
      terminated by signals sent by mpirun (as reported here).
      --------------------------------------------------------------------------
      [fat-intel-1vm2.lab.whamcloud.com][[61908,1],7][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] [fat-intel-1vm1.lab.whamcloud.com][[61908,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      UUID Inodes IUsed IFree IUse% Mounted on
      lustre-MDT0000_UUID 5000040 87 4999953 0% /mnt/lustre[MDT:0]
      lustre-OST0000_UUID 167552 10974 156578 7% /mnt/lustre[OST:0]
      lustre-OST0001_UUID 167552 11326 156226 7% /mnt/lustre[OST:1]
      lustre-OST0002_UUID 167552 3806 163746 2% /mnt/lustre[OST:2]
      lustre-OST0003_UUID 167552 4830 162722 3% /mnt/lustre[OST:3]
      lustre-OST0004_UUID 167552 3806 163746 2% /mnt/lustre[OST:4]
      lustre-OST0005_UUID 167552 3646 163906 2% /mnt/lustre[OST:5]
      lustre-OST0006_UUID 167552 3806 163746 2% /mnt/lustre[OST:6]

      filesystem summary: 5000040 87 4999953 0% /mnt/lustre

      parallel-scale test_write_disjoint: @@@@@@ FAIL: write_disjoint failed! 1
      Dumping lctl log to /logdir/test_logs/2011-09-19/lustre-mixed-el6-x86_64_283_-7f6a2ad2c9e0/parallel-scale.test_write_disjoint.*.1316557553.log
      Resetting fail_loc on all nodes...done.
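
The size check in the failing line ("invalid file size 801136 instead of 915584 = 114448 * 8") can be reproduced by hand. The following is a minimal sketch, assuming from the message format and the `-np 8` in the mpirun command that each of the 8 ranks writes one chunk of chunk_size bytes and the test then expects the file to be chunk_size * 8 bytes long. The function names are illustrative and are not taken from write_disjoint itself.

```python
# Values copied from the failing log line:
#   rank 5, loop 9372: invalid file size 801136 instead of 915584 = 114448 * 8
CHUNK_SIZE = 114448
NUM_RANKS = 8            # from "mpirun ... -np 8"
OBSERVED_SIZE = 801136


def expected_size(chunk_size: int, num_ranks: int) -> int:
    """File size once every rank has written one chunk (assumed layout)."""
    return chunk_size * num_ranks


def missing_bytes(observed: int, chunk_size: int, num_ranks: int) -> int:
    """How many bytes short the observed size is of the expected size."""
    return expected_size(chunk_size, num_ranks) - observed


def whole_chunks_visible(observed: int, chunk_size: int):
    """Number of whole chunks the observed size covers, or None if it is
    not an exact multiple of chunk_size."""
    quotient, remainder = divmod(observed, chunk_size)
    return quotient if remainder == 0 else None


if __name__ == "__main__":
    print("expected size:", expected_size(CHUNK_SIZE, NUM_RANKS))
    print("missing bytes:", missing_bytes(OBSERVED_SIZE, CHUNK_SIZE, NUM_RANKS))
    print("whole chunks visible:", whole_chunks_visible(OBSERVED_SIZE, CHUNK_SIZE))
```

With the logged numbers, the observed size is exactly 7 chunks, so it is one full chunk (114448 bytes) short of the expected 915584. That would be consistent with rank 5 seeing a file size that did not yet include the final chunk, but this is an observation from the arithmetic only, not a diagnosis. The preceding line (loop 9371, chunk_size 25301, file size 202408) passes the same check, since 25301 * 8 = 202408.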

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Minh Diep (mdiep)