Details

    Description

      The error occurred during soak testing of build '20160309' (b2_8 RC5); see also https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309. DNE is enabled. MDTs were formatted with ldiskfs, OSTs with zfs. The MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration, see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)

      The issue occurs during failover of MDS nodes. A typical sequence of events is:
      Server event history:

      • mds_failover : 2016-03-09 23:54:41,330 - 2016-03-10 00:03:51,040 lola-9
      • Secondary node lola-8 evicts all clients:
        lola-8.log:Mar 10 00:03:38 lola-8 kernel: Lustre: soaked-MDT0003: Recovery over after 1:11, of 16 clients 0 recovered and 16 were evicted.
        lola-8.log:Mar 10 00:03:49 lola-8 kernel: Lustre: soaked-MDT0002: Recovery over after 0:32, of 16 clients 0 recovered and 16 were evicted.
        
      • Primary node lola-9 evicts some of the clients:
        lola-9.log:Mar 10 00:07:03 lola-9 kernel: Lustre: soaked-MDT0002: Recovery over after 3:04, of 16 clients 11 recovered and 5 were evicted.
        lola-9.log:Mar 10 00:10:51 lola-9 kernel: Lustre: soaked-MDT0003: Recovery over after 6:55, of 16 clients 14 recovered and 2 were evicted.
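
      The recovery summaries above can be tabulated with a small script. This is a minimal sketch, not part of the soak test framework: the function name, the dictionary keys, and the reading of the duration field as minutes:seconds are assumptions made for illustration.

```python
import re

# Pattern for the "Recovery over after ..." lines quoted above.
# The group names are labels chosen for this sketch.
RECOVERY_RE = re.compile(
    r"(?P<target>\S+-MDT[0-9a-fA-F]{4}): Recovery over after "
    r"(?P<duration>\d+:\d+), of (?P<total>\d+) clients "
    r"(?P<recovered>\d+) recovered and (?P<evicted>\d+) (?:was|were) evicted"
)


def parse_recovery(line):
    """Return the recovery counters from one log line, or None if no match."""
    m = RECOVERY_RE.search(line)
    if m is None:
        return None
    # Assumption: the duration is printed as minutes:seconds.
    minutes, seconds = (int(part) for part in m.group("duration").split(":"))
    return {
        "target": m.group("target"),
        "seconds": minutes * 60 + seconds,
        "total": int(m.group("total")),
        "recovered": int(m.group("recovered")),
        "evicted": int(m.group("evicted")),
    }


# The four server-side lines from this report.
LINES = [
    "lola-8.log:Mar 10 00:03:38 lola-8 kernel: Lustre: soaked-MDT0003: Recovery over after 1:11, of 16 clients 0 recovered and 16 were evicted.",
    "lola-8.log:Mar 10 00:03:49 lola-8 kernel: Lustre: soaked-MDT0002: Recovery over after 0:32, of 16 clients 0 recovered and 16 were evicted.",
    "lola-9.log:Mar 10 00:07:03 lola-9 kernel: Lustre: soaked-MDT0002: Recovery over after 3:04, of 16 clients 11 recovered and 5 were evicted.",
    "lola-9.log:Mar 10 00:10:51 lola-9 kernel: Lustre: soaked-MDT0003: Recovery over after 6:55, of 16 clients 14 recovered and 2 were evicted.",
]

for line in LINES:
    print(parse_recovery(line))
```

      Run against the full server logs, the same pattern would show whether every failover ends with evictions or only some of them.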
        

      Client events:

      • Job crashes with:
        03/10/2016 00:14:49: Process 0(): FAILED in show_file_system_size, unable to statfs() file system: Input/output error
        --------------------------------------------------------------------------
        MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
        with errorcode 1.
        
        NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
        You may or may not see output from other processes, depending on
        exactly when Open MPI kills them.
        --------------------------------------------------------------------------
        In: PMI_Abort(1, N/A)
        srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
        slurmd[lola-27]: *** STEP 451370.0 KILLED AT 2016-03-10T00:14:49 WITH SIGNAL 9 ***
        slurmd[lola-27]: *** STEP 451370.0 KILLED AT 2016-03-10T00:14:49 WITH SIGNAL 9 ***
        srun: error: lola-27: task 0: Exited with exit code 1
        
      • Lustre errors on lola-27 and lola-30 read as:
        lola-27.log:Mar 10 00:14:04 lola-27 kernel: Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1457596992/real 1457596992]  req@ffff8807278556c0 x1528367917668784/t0(0) o400->soaked-MDT0003-mdc-ffff88081f7c1800@192.168.1.108@o2ib10:12/10 lens 224/224 e 0 to 1 dl 1457597644 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
        lola-27.log:Mar 10 00:14:04 lola-27 kernel: Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
        lola-27.log:Mar 10 00:14:04 lola-27 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffff88081f7c1800: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
        lola-27.log:Mar 10 00:14:04 lola-27 kernel: Lustre: soaked-MDT0003-mdc-ffff88081f7c1800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff88081f7c1800: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: LustreError: 36067:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: LustreError: 59437:0:(lmv_obd.c:1467:lmv_statfs()) can't stat MDS #2 (soaked-MDT0002-mdc-ffff88081f7c1800), error -5
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: LustreError: 59437:0:(llite_lib.c:1752:ll_statfs_internal()) md_statfs fails: rc = -5
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: Lustre: soaked-MDT0002-mdc-ffff88081f7c1800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
        lola-27.log:Mar 10 00:14:49 lola-27 kernel: LustreError: 36067:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 2 previous similar messages
        lola-30.log:Mar 10 00:14:46 lola-30 kernel: LustreError: 167-0: soaked-MDT0002-mdc-ffff88086534ec00: This client was evicted by soaked-MDT0002; in progress operations using this service will fail.
        lola-30.log:Mar 10 00:14:46 lola-30 kernel: LustreError: 42688:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -5
        lola-30.log:Mar 10 00:14:46 lola-30 kernel: Lustre: soaked-MDT0002-mdc-ffff88086534ec00: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
        lola-30.log:Mar 10 00:14:46 lola-30 kernel: LustreError: 42688:0:(llite_lib.c:2309:ll_prep_inode()) Skipped 1 previous similar message
        
      • Other jobs crash and leave orphaned files behind:
        451145:
        ls: cannot access 451145/pct-createunlink-0-412: No such file or directory
        total 9856
        d????????? ? ?        ?            ?            ? pct-createunlink-0-412
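
      To relate the timed-out request on lola-27 to the failover window, the `sent` and `dl` fields of the ptlrpc_expire_one_request() message can be compared. The sketch below assumes both fields are Unix epoch seconds; the helper name is hypothetical and the log line is abbreviated to the fields that are used.

```python
import re
from datetime import datetime, timezone

# Abbreviated copy of the client message quoted above (lola-27, 00:14:04).
LINE = (
    "Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request "
    "sent has timed out for slow reply: [sent 1457596992/real 1457596992]  "
    "req@ffff8807278556c0 x1528367917668784/t0(0) "
    "o400->soaked-MDT0003-mdc-ffff88081f7c1800@192.168.1.108@o2ib10:12/10 "
    "lens 224/224 e 0 to 1 dl 1457597644 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1"
)


def request_times(line):
    """Return (sent, real, deadline) as integers, or None if not found."""
    m = re.search(r"\[sent (\d+)/real (\d+)\].*? dl (\d+) ", line)
    if m is None:
        return None
    sent, real, deadline = (int(g) for g in m.groups())
    return sent, real, deadline


sent, real, deadline = request_times(LINE)

# Assumption: the values are seconds since the Unix epoch.
print("sent (UTC):    ", datetime.fromtimestamp(sent, tz=timezone.utc).isoformat())
print("deadline (UTC):", datetime.fromtimestamp(deadline, tz=timezone.utc).isoformat())
print("seconds between send and deadline:", deadline - sent)
```

      Under that assumption the request was sent about 652 seconds before its deadline, which would place the send time within the mds_failover window listed in the server event history.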
        

      We currently don't have debug logs for these events. I'll prepare the client and server nodes so that debug logs are created when the error occurs again.

          People

            laisiyao Lai Siyao
            heckes Frank Heckes (Inactive)