
replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Configuration: 4 Node (1 MDS/1 OSS/2 Clients)_dne_singlemds
      Release
      2.6.32_431.29.2.el6_lustremaster_9267_2_g959f8f7 Build Date: Sat 02 Jan 2016 05:21:40 PM UTC
      Server 2.7.64
      Client 2.7.64
    • Severity: 3

    Description

      == replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 04:57:43 (1452229063)
      Filesystem          1K-blocks  Used Available Use% Mounted on
      fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
      Filesystem          1K-blocks  Used Available Use% Mounted on
      fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
      Failing mds1 on fre1225
      Stopping /mnt/mds1 (opts:) on fre1225
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Failing mds2 on fre1225
      Stopping /mnt/mds2 (opts:) on fre1225
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to fre1225
      04:57:56 (1452229076) waiting for fre1225 network 900 secs ...
      04:57:56 (1452229076) network interface is UP
      mount facets: mds1
      Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Started lustre-MDT0000
      reboot facets: mds2
      Failover mds2 to fre1225
      04:58:07 (1452229087) waiting for fre1225 network 900 secs ...
      04:58:07 (1452229087) network interface is UP
      mount facets: mds2
      Starting mds2: -o rw,user_xattr  /dev/vdc /mnt/mds2
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Started lustre-MDT0001
      fre1228: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      fre1227: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      fre1228: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      fre1227: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      /mnt/lustre/d110f.replay-single/striped_dir has type dir OK
       replay-single test_110f: @@@@@@ FAIL: 1 != 2 after recovery 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
        = /usr/lib64/lustre/tests/replay-single.sh:3600:check_striped_dir_110()
        = /usr/lib64/lustre/tests/replay-single.sh:3725:test_110f()
        = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4899:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:3731:main()
      Dumping lctl log to /tmp/test_logs/1452229057/replay-single.test_110f.*.1452229095.log
      fre1228: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      fre1225: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      fre1226: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      FAIL 110f (33s)
      

      Attachments

        1. 110f__0.console.MDS.log
          149 kB
        2. 110f__0.messages.MDS.log
          234 kB
        3. 110f__0.stdout.log
          7 kB
        4. 110f__PTLDEBUG.lctl.tgz
          1.69 MB
        5. 110f.lctl.tgz
          958 kB

        Activity

          [LU-7653] replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery

          jgmitter Joseph Gmitter (Inactive) added a comment -

          This actually landed on master for 2.11.0.
          (Fixing the fixVersion.)

          ys Yang Sheng added a comment -

          Landed to 2.10.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27940/
          Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c245e87fad249622b1974dd64f3e497653269ee6

          ys Yang Sheng added a comment -

          Hi, ZhangWei,

          As I understand it, in DNE getdirstripe tries to get the status from mds1. If mds1 has finished failover but mds2 has not, we may get a stripe count less than MDSCOUNT. So making mds1 finish failover last can fix this issue, especially when mds1 and mds2 are set up on the same node.

          Thanks,
          YangSheng

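The ordering argument in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: `stripes_seen_after` is a hypothetical helper, not part of test-framework.sh or Lustre, and it assumes (following Yang Sheng's reading) that the stripe-count check runs as soon as mds1 is back and counts only stripes on MDTs that have already finished recovery.

```shell
#!/bin/bash
# Toy model (hypothetical, not Lustre code): the striped directory has one
# stripe per MDT, and the check fires the moment mds1 has recovered.
# A stripe is counted only if its MDT has already recovered by then.
stripes_seen_after() {
	local order=$1		# comma-separated recovery order, e.g. "mds1,mds2"
	local recovered=0
	local facet
	local IFS=,
	for facet in $order; do
		recovered=$((recovered + 1))
		# The check runs as soon as mds1 comes back.
		if [ "$facet" = "mds1" ]; then
			echo "$recovered"
			return 0
		fi
	done
	echo 0
}

stripes_seen_after "mds1,mds2"	# mds1 recovers first
stripes_seen_after "mds2,mds1"	# mds1 recovers last
```

In this model the first call prints 1 and the second prints 2, which matches the "1 != 2 after recovery" failure and the effect of making mds1 recover last. The real recovery path is more involved than this model.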

          Red ZhangWei (Inactive) added a comment -

          But why can this issue be fixed just by changing the fail order of mds1 and mds2? Can someone explain this?


          gerrit Gerrit Updater added a comment -

          Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: https://review.whamcloud.com/27940
          Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 10e2e6ab2a1c449ccd2cd5cacac1efcbefed29fb

          Red ZhangWei (Inactive) added a comment - - edited

          We reproduced this issue and, as Mr. Yang said, it can be fixed just by changing the script as follows:

          test_110f() {
                  ...
                  mkdir -p $DIR/$tdir
                  replay_barrier mds1
                  replay_barrier mds2
                  $LFS mkdir -i1 -c$MDSCOUNT $DIR/$tdir/striped_dir
                  fail mds2,mds1    # changed line: fail mds2 first, mds1 last

                  check_striped_dir_110 || error "check striped_dir failed"

                  rm -rf $DIR/$tdir || error "rmdir failed"

                  return 0
          }

          Our Lustre version is 2.9. Can someone help with this issue?

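For comparison, a different way to tolerate a late-recovering MDT would be to poll the stripe count instead of checking it once. This is not the change discussed in this ticket, which only reorders the failover. It is a minimal sketch, and `wait_for_output` is a hypothetical helper that does not exist in test-framework.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): run a command once per
# second until its output equals the expected value, or give up after
# max_tries attempts.
wait_for_output() {
	local expected=$1
	local max_tries=$2
	shift 2
	local i out
	for ((i = 1; i <= max_tries; i++)); do
		out=$("$@" 2>/dev/null)
		[ "$out" = "$expected" ] && return 0
		sleep 1
	done
	echo "got '$out', expected '$expected' after $max_tries tries" >&2
	return 1
}
```

Assuming `$LFS getdirstripe -c` prints the directory's stripe count, a test could call it as `wait_for_output "$MDSCOUNT" 30 $LFS getdirstripe -c $DIR/$tdir/striped_dir`. Polling hides the ordering dependency rather than removing it, which may be why reordering the failover was preferred here.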
          ys Yang Sheng added a comment -

          This issue is very easy to reproduce when mds1 and mds2 are set up on the same node. It looks like it can be fixed just by changing the failover order.

          Thanks,
          YangSheng


          parinay parinay v kondekar (Inactive) added a comment -

          James,
          It's very consistent. I am attaching PTLDEBUG=-1 logs here.

          Thanks.


          jamesanunez James Nunez (Inactive) added a comment -

          Parinay - We don't see this test fail in our testing. How often do you see this failure and does it fail consistently for you?

          People

            Assignee: aboyko Alexander Boyko
            Reporter: parinay parinay v kondekar (Inactive)