  1. Lustre
  2. LU-7653

replay-single/test_110f test failed Lustre: DEBUG MARKER: replay-single test_110f: FAIL: 1 != 2 after recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: Configuration : 4 Node - ( 1 MDS/1 OSS/2 Clients)_dne_singlemds
      Release
      2.6.32_431.29.2.el6_lustremaster_9267_2_g959f8f7 Build Date: Sat 02 Jan 2016 05:21:40 PM UTC
      Server 2.7.64
      Client 2.7.64
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      == replay-single test 110f: DNE: create striped dir, fail MDT1/MDT2 == 04:57:43 (1452229063)
      Filesystem          1K-blocks  Used Available Use% Mounted on
      fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
      Filesystem          1K-blocks  Used Available Use% Mounted on
      fre1225@tcp:/lustre   1377952 68464   1236816   6% /mnt/lustre
      Failing mds1 on fre1225
      Stopping /mnt/mds1 (opts:) on fre1225
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Failing mds2 on fre1225
      Stopping /mnt/mds2 (opts:) on fre1225
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to fre1225
      04:57:56 (1452229076) waiting for fre1225 network 900 secs ...
      04:57:56 (1452229076) network interface is UP
      mount facets: mds1
      Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/mds1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Started lustre-MDT0000
      reboot facets: mds2
      Failover mds2 to fre1225
      04:58:07 (1452229087) waiting for fre1225 network 900 secs ...
      04:58:07 (1452229087) network interface is UP
      mount facets: mds2
      Starting mds2: -o rw,user_xattr  /dev/vdc /mnt/mds2
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      pdsh@fre1227: fre1225: ssh exited with exit code 1
      Started lustre-MDT0001
      fre1228: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      fre1227: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 5 sec
      fre1228: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      fre1227: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 0 sec
      /mnt/lustre/d110f.replay-single/striped_dir has type dir OK
       replay-single test_110f: @@@@@@ FAIL: 1 != 2 after recovery 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4767:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4798:error()
        = /usr/lib64/lustre/tests/replay-single.sh:3600:check_striped_dir_110()
        = /usr/lib64/lustre/tests/replay-single.sh:3725:test_110f()
        = /usr/lib64/lustre/tests/test-framework.sh:5045:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5082:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4899:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:3731:main()
      Dumping lctl log to /tmp/test_logs/1452229057/replay-single.test_110f.*.1452229095.log
      fre1228: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      fre1225: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      fre1226: Warning: Permanently added 'fre1227,192.168.112.27' (RSA) to the list of known hosts.
      
      FAIL 110f (33s)
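
The trace above ends in check_striped_dir_110(), and the message "1 != 2 after recovery" suggests a comparison between the stripe count seen for the striped directory and the number of MDTs. The following is a minimal sketch of such a comparison, not the actual test-framework.sh or replay-single.sh code. The function name check_stripe_count and its two arguments are hypothetical; on a real client the first value would presumably come from something like `lfs getdirstripe -c <dir>`.

```shell
#!/bin/bash
# Hypothetical helper (NOT the real replay-single.sh code): compare the
# stripe count reported for a striped directory with the number of MDTs.
check_stripe_count() {
	local actual=$1    # assumed to come from e.g. `lfs getdirstripe -c <dir>`
	local expected=$2  # number of MDTs in the file system

	if [ "$actual" -ne "$expected" ]; then
		echo "FAIL: $actual != $expected after recovery"
		return 1
	fi
	echo "OK: stripe count $actual"
}

# Values taken from the log above: 1 stripe seen, 2 MDTs configured.
check_stripe_count 1 2 || true
```

With the values from this report the sketch prints the same "1 != 2 after recovery" text as the failing test.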
      

      Attachments

        1. 110f__0.console.MDS.log
          149 kB
        2. 110f__0.messages.MDS.log
          234 kB
        3. 110f__0.stdout.log
          7 kB
        4. 110f__PTLDEBUG.lctl.tgz
          1.69 MB
        5. 110f.lctl.tgz
          958 kB

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.14


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38137/
          Subject: LU-7653 lod: fix stripe allocation during recovery
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 858c0c6959c1319e83a18be5ef6cb50251542052

          gerrit Gerrit Updater added a comment -

          Alexander Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/38137
          Subject: LU-7653 lod: fix stripe allocation during recovery
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 990d9d8cf607b035d2b341588212b77faf99f309

          adilger Andreas Dilger added a comment -

          Still seeing this failure very frequently on Oleg's test system:
          Test session: http://testing.linuxhacker.ru:3333/lustre-reports/3524/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
          Subtest log: http://testing.linuxhacker.ru:3333/lustre-reports/3524/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/replay-single.test_110f.test_log.oleg264-client.log

          jgmitter Joseph Gmitter (Inactive) added a comment -

          This is actually landed to master for 2.11.0.
          (fixing the fixVersion)

          ys Yang Sheng added a comment -

          Landed to 2.10.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27940/
          Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: c245e87fad249622b1974dd64f3e497653269ee6

          ys Yang Sheng added a comment -

          Hi, ZhangWei,

          As I understand it, in DNE getdirstripe gets the directory's status from mds1. If mds1 has finished failover but mds2 has not, we may get a stripe count that is less than the MDS count. So making mds1 finish failover last fixes this issue, especially when mds1 and mds2 are set up on the same node.

          Thanks,
          YangSheng
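
The ordering argument in the comment above can be illustrated with a toy model. This is only a sketch of the reasoning as stated there: it runs no Lustre commands, the function name visible_stripes is hypothetical, and it assumes each MDT contributes one stripe once it has finished failover. Real recovery timing is more involved.

```shell
#!/bin/bash
# Toy model only: no Lustre commands are run here.
# visible_stripes ORDER... prints how many stripes a query answered by mds1
# would see right after mds1 finishes failover, given the failover order.
visible_stripes() {
	local recovered=0
	local facet
	for facet in "$@"; do
		recovered=$((recovered + 1))
		if [ "$facet" = "mds1" ]; then
			# mds1 answers as soon as it is back; only MDTs that have
			# already finished failover contribute a stripe (assumption).
			echo "$recovered"
			return 0
		fi
	done
	echo 0   # mds1 never came back in this order
}

visible_stripes mds1 mds2   # mds1 finishes first: prints 1
visible_stripes mds2 mds1   # mds1 finishes last:  prints 2
```

Under this model, having mds1 finish failover last means the stripe count matches the MDT count, which is the behaviour the reordered test relies on.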

          Red ZhangWei (Inactive) added a comment -

          But why can this issue be fixed just by changing the fail order of mds1 and mds2? Can someone explain this?

          gerrit Gerrit Updater added a comment -

          Parinay Kondekar (parinay.kondekar@seagate.com) uploaded a new patch: https://review.whamcloud.com/27940
          Subject: LU-7653 tests: replay-single/110f fails for mdts on same MDS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 10e2e6ab2a1c449ccd2cd5cacac1efcbefed29fb

          People

            Assignee: aboyko Alexander Boyko
            Reporter: parinay parinay v kondekar (Inactive)
            Votes: 0
            Watchers: 8
