  1. Lustre
  2. LU-11818

ERROR in aiori-POSIX.c (line 256): transfer failed


Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.0
    • soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6
    • 3
    • 9223372036854775807

    Description

      Soak ran for over 15 hours with no hard crash, but many applications failed.

      IOR testing hit many failures like:
      IOR version: IOR-2.10.3: MPI Coordinated Test of Parallel I/O

      Summary:
              api                = POSIX
              test filename      = /mnt/soaked/soaktest/test/iorssf/219922/ssf
              access             = single-shared-file
              pattern            = segmented (1 segment)
              ordering in a file = sequential offsets
              ordering inter file=random task offsets >= 1, seed=0
              clients            = 23 (2 per node)
              repetitions        = 1
              xfersize           = 31.49 MiB
              blocksize          = 27.34 GiB
              aggregate filesize = 628.83 GiB
      
      ParseCommandLine: unknown option `--'.
      task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 1 requested transfer of 33021952 bytes,
               but transferred 7806976 bytes at offset 29356515328
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes
      Only transferred 7806976 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      ParseCommandLine: unknown option `--'.
      task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 3 requested transfer of 33021952 bytes,
               but transferred 15032320 bytes at offset 88069545984
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-43: task 22: Killed
      srun: Terminating job step 219922.0
      WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes
      Only transferred 15032320 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
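
The IOR output above is consistent with a write() that first returned fewer bytes than requested (a short write) and then failed outright with EIO. As a minimal sketch of how a POSIX caller usually copes with short writes, the loop below keeps writing the remainder and reports how far the transfer got when the underlying call fails. This is not IOR's code: `write_all` and its `pwrite` parameter are hypothetical names introduced for illustration, and the default `os.pwrite` assumes a Unix-like platform.

```python
import errno
import os


def write_all(fd, data, offset, pwrite=os.pwrite):
    """Write all of `data` at `offset`, continuing after short writes.

    Returns the number of bytes written. Raises OSError (for example EIO)
    if the underlying call fails; the message records how many bytes had
    been transferred before the failure.
    """
    view = memoryview(data)
    done = 0
    while done < len(view):
        try:
            # A short write returns fewer bytes than were offered.
            n = pwrite(fd, view[done:], offset + done)
        except OSError as exc:
            raise OSError(
                exc.errno,
                f"transfer failed after {done} of {len(view)} bytes: {exc.strerror}",
            ) from exc
        if n == 0:
            # No progress at all; treat as an I/O error instead of spinning.
            raise OSError(errno.EIO, "write returned 0 bytes")
        done += n
    return done
```

Retrying only helps with genuine short writes. When the client has lost its lock or connection and the next call returns EIO, as in the log above, the loop can only surface the error and the offset reached.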
      

      mdtest also hit many failures like:

      /mnt/soaked/soaktest/test/mdtestfpp/220334
        lcm_layout_gen:    0
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 0
          lcme_extent.e_end:   1048576
            stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1
      
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 1048576
          lcme_extent.e_end:   EOF
            stripe_count:  -1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
      
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           3           [0x2c001aa0c:0x1fd24:0x0]              
           0           [0x20001d918:0x1fd22:0x0]              
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
      srun: error: soak-22: task 5: Exited with exit code 2
      srun: Terminating job step 220334.0
      srun: Job step 220334.0 aborted before step completely launched.
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
      srun: error: soak-40: task 18: Exited with exit code 2
      srun: error: soak-20: task 3: Killed
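
The `execve(): mdtest: No such file or directory` lines say that the job step could not find an `mdtest` executable on some nodes, so at least part of this failure may lie in the test environment and not in the file system. A pre-flight check along the following lines, run on each node before launching, would make that visible early. It is only a sketch: `missing_executables` is a hypothetical helper, and it assumes the binary is expected to be found through `PATH`.

```python
import shutil


def missing_executables(names, path=None):
    """Return the names that cannot be resolved to an executable.

    `path` is an optional PATH-style string; when None, the current
    environment's PATH is searched.
    """
    return [name for name in names if shutil.which(name, path=path) is None]
```

For example, `missing_executables(["mdtest", "IOR"])` returns an empty list only when both commands resolve on the node where it runs.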
      

      On the OSS, the following was found after an OSS restart:

      Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
      Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
      Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
      Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
      Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib  ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
      Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
      Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib  ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
      Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
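
The OSS log above shows several clients being evicted after lock callback timers expired, a few minutes before the IOR step was cancelled at 10:48:58. To see which clients and OSTs are involved across a longer log, the messages can be grouped by client NID. The sketch below assumes the message layout shown in this excerpt; `EVICT_RE` and `summarize_evictions` are hypothetical names, not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Matches the "lock callback timer expired" messages as they appear above.
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: filter-(?P<target>\S+)_UUID"
)


def summarize_evictions(lines):
    """Map each evicted client NID to a list of (target, seconds) pairs."""
    evictions = defaultdict(list)
    for line in lines:
        match = EVICT_RE.search(line)
        if match:
            evictions[match["nid"]].append((match["target"], int(match["secs"])))
    return dict(evictions)
```

Applied to the excerpt above, this groups both evictions of 192.168.1.117@o2ib (from soaked-OST0009 and soaked-OST0001) under one key. The "Skipped N previous similar messages" lines are not counted, so the result is a lower bound.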
      

            People

              bobijam Zhenyu Xu
              sarah Sarah Liu