
ERROR in aiori-POSIX.c (line 256): transfer failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: soak runs on 2.12-RC3 lustre-master-ib #177 EL7.6
    • Severity: 3
    • 9223372036854775807

    Description

      Soak ran for over 15 hours with no hard crash, but many applications failed.

      IOR testing hit many failures like:
      IOR version: IOR-2.10.3: MPI Coordinated Test of Parallel I/O

      Summary:
              api                = POSIX
              test filename      = /mnt/soaked/soaktest/test/iorssf/219922/ssf
              access             = single-shared-file
              pattern            = segmented (1 segment)
              ordering in a file = sequential offsets
              ordering inter file=random task offsets >= 1, seed=0
              clients            = 23 (2 per node)
              repetitions        = 1
              xfersize           = 31.49 MiB
              blocksize          = 27.34 GiB
              aggregate filesize = 628.83 GiB
      
      ParseCommandLine: unknown option `--'.
      task 1 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 1 requested transfer of 33021952 bytes,
               but transferred 7806976 bytes at offset 29356515328
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      WARNING: Requested xfer of 33021952 bytes, but xferred 7806976 bytes
      Only transferred 7806976 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
      with errorcode -1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      ParseCommandLine: unknown option `--'.
      task 3 writing /mnt/soaked/soaktest/test/iorssf/219922/ssf
      WARNING: Task 3 requested transfer of 33021952 bytes,
               but transferred 15032320 bytes at offset 88069545984
      WARNING: This file system requires support of partial write()s, in aiori-POSIX.c (line 272).
      slurmstepd: error: *** STEP 219922.0 ON soak-17 CANCELLED AT 2018-12-20T10:48:58 ***
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      srun: error: soak-43: task 22: Killed
      srun: Terminating job step 219922.0
      WARNING: Requested xfer of 33021952 bytes, but xferred 15032320 bytes
      Only transferred 15032320 of 33021952 bytes
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      

      For mdtest, also hit many failures like:

      /mnt/soaked/soaktest/test/mdtestfpp/220334
        lcm_layout_gen:    0
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 0
          lcme_extent.e_end:   1048576
            stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1
      
          lcme_id:             N/A
          lcme_mirror_id:      N/A
          lcme_flags:          0
          lcme_extent.e_start: 1048576
          lcme_extent.e_end:   EOF
            stripe_count:  -1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
      
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           3           [0x2c001aa0c:0x1fd24:0x0]              
           0           [0x20001d918:0x1fd22:0x0]              
      lmv_stripe_count: 2 lmv_stripe_offset: 3 lmv_hash_type: fnv_1a_64
      srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 22 with the number of requested nodes 21. Ignoring --ntasks-per-node.
      srun: error: soak-22: task 5: Exited with exit code 2
      srun: Terminating job step 220334.0
      srun: Job step 220334.0 aborted before step completely launched.
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: execve(): mdtest: No such file or directory
      slurmstepd: error: *** STEP 220334.0 ON soak-18 CANCELLED AT 2018-12-20T18:06:52 ***
      srun: error: soak-40: task 18: Exited with exit code 2
      srun: error: soak-20: task 3: Killed
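
      The layout dump above is a composite (PFL) layout: a Data-on-MDT first component covering [0, 1MiB) and a raid0 component from 1MiB to EOF striped across all OSTs, on a directory striped over two MDTs. For reference, a hedged sketch of the lfs commands that would create and display such a file layout (the file path here is illustrative, not taken from the soak configuration):

          # create a DoM + PFL layout: first 1MiB on the MDT, the remainder
          # striped over all OSTs (stripe_count -1); the path is hypothetical
          lfs setstripe -E 1M -L mdt -E -1 -c -1 /mnt/soaked/example_file
          # print the composite layout (lcm_*/lcme_* fields as in the dump above)
          lfs getstripe /mnt/soaked/example_file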
      

      On the OSS, found the following after the OSS restart:

      Dec 20 10:43:14 soak-5 systemd: Started Session 28 of user root.
      Dec 20 10:43:14 soak-5 systemd-logind: Removed session 28.
      Dec 20 10:43:14 soak-5 systemd: Removed slice User Slice of root.
      Dec 20 10:43:22 soak-5 kernel: Lustre: soaked-OST0001: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:43:22 soak-5 kernel: Lustre: Skipped 10 previous similar messages
      Dec 20 10:44:23 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 192.168.1.122@o2ib  ns: filter-soaked-OST000d_UUID lock: ffff935683ba1680/0xdb15693e77e2b726 lrc: 3/0,0 mode: PW/PW res: [0x81a037:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.122@o2ib remote: 0xc6594446a74d696b expref: 6 pid: 43079 timeout: 498 lvb_type: 0
      Dec 20 10:44:28 soak-5 kernel: Lustre: soaked-OST000d: Connection restored to 7c73b0d7-5f12-596e-5cb7-6efa2cd15b4c (at 192.168.1.126@o2ib)
      Dec 20 10:44:28 soak-5 kernel: Lustre: Skipped 24 previous similar messages
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0009_UUID lock: ffff93569b9e9200/0xdb15693e77de431b lrc: 3/0,0 mode: PW/PW res: [0x812696:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a02b1c expref: 11 pid: 25268 timeout: 502 lvb_type: 0
      Dec 20 10:44:31 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
      Dec 20 10:44:37 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 153s: evicting client at 192.168.1.140@o2ib  ns: filter-soaked-OST0005_UUID lock: ffff9356b8cb2880/0xdb15693e77e2b7f1 lrc: 3/0,0 mode: PW/PW res: [0x81e6fa:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.140@o2ib remote: 0x5f346f6b7e7eb6b6 expref: 6 pid: 35334 timeout: 508 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 164s: evicting client at 192.168.1.117@o2ib  ns: filter-soaked-OST0001_UUID lock: ffff93524f16e780/0xdb15693e77e079e3 lrc: 3/0,0 mode: PW/PW res: [0x80bd1d:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.117@o2ib remote: 0xb9982a1ed8a024a1 expref: 7 pid: 28096 timeout: 502 lvb_type: 0
      Dec 20 10:44:42 soak-5 kernel: LustreError: 25135:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 3 previous similar messages
      Dec 20 10:44:43 soak-5 sshd[222751]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      

      Attachments

      Issue Links

      Activity

            [LU-11818] ERROR in aiori-POSIX.c (line 256): transfer failed
            sarah Sarah Liu added a comment -

            Andreas, I just opened LU-11943 to describe the issue I saw on 2.10, which may be related to this one. From that ticket, soak ran well for several hours (there were MDS/OSS failovers during that time), then applications started to fail a lot. Checking around the time things went bad, I found OSS lock callback errors like the ones in this ticket, and the OSS had just finished restarting.

            To see if this issue is only related to failover, I need to try disabling failover and running again.


            adilger Andreas Dilger added a comment -

            Sarah, are the IOR and mdtest failures related to MDS/OSS failover, or are there errors even when there are no failovers active? Are the errors always related to ldlm timeouts/eviction?

            The mdtest errors look to be related to a configuration problem and not related to the filesystem.

            sarah Sarah Liu added a comment -

            Moved soak to lustre-master-ib build #183, with only regular stripes (no PFL nor DoM). It ran for over 3 days with no hard crash, but saw similar IOR and mdtest failures; the failure rate is about 50%.

            sarah Sarah Liu added a comment - - edited

            A quick update: in the latest run (no PFL/DoM), I also fixed the IOR parameter by adding the missing space.

            And this is the wiki page of the soak quick start:
            https://wiki.whamcloud.com/display/Releases/System+Administration+Update


            adilger Andreas Dilger added a comment -

            Sarah wrote:

            So far soak has been running for 14 hours (without PFL/DoM stripe), completed multiple MDS failover and restart, multiple OSS failover, no IOR nor mdtest failures.

            My theory is that this happens when the client is writing to the file under load and needs to instantiate the PFL components later in the file. Something like the client is in the middle of an MDS layout intent RPC to instantiate a component, and the MDS is waiting for an offline OST to create an object, then a second OST sends a lock callback, the client may be blocked holding an extent lock with dirty data that it can't flush while it is waiting on the layout lock.

            I think what is needed to debug this is some dlmtrace and rpctrace logs from the client, OSS, and MDS, with dump_on_eviction and dump_on_timeout set, and a larger debug buffer to hold enough logs. We may need to get a crash dump of the client at the time of the eviction to see what it is doing. Then, when the client is evicted, trace the OST DLM callback to the client, and see why it isn't cancelling the lock in time.

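            For reference, a hedged sketch of lctl settings matching the debug collection Andreas describes above; the debug buffer size is an illustrative value, not one taken from the soak configuration:

                # run on the clients, OSS, and MDS nodes (values illustrative)
                lctl set_param debug_mb=1024          # larger debug buffer
                lctl set_param debug=+dlmtrace        # DLM lock tracing
                lctl set_param debug=+rpctrace        # RPC tracing
                lctl set_param dump_on_eviction=1     # dump debug log when a client is evicted
                lctl set_param dump_on_timeout=1      # dump debug log on RPC timeout
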
            pjones Peter Jones added a comment -

            Bobijam

            The current theory is that this may be a PFL issue

            Peter


            jamesanunez James Nunez (Inactive) added a comment -

            From Sarah: Command line used: /mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q-R -W -b 3043736k -t 7534k -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file

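            The "-q-R" in that command line appears to be the missing space Sarah mentions fixing above; presumably the corrected invocation simply separates the two options, for example:

                /mnt/soaked/bin/IOR -vvv -a POSIX -F -w -W -r -Z -q -R -W \
                    -b 3043736k -t 7534k \
                    -o /mnt/soaked/soaktest/test/iorfpp/219975/iorfpp_file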

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated: