Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12753

sanity test_300a: FAIL: 1:stripe_count is 1, expect 2

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0
    • None
    • None
    • olegtest
    • 3
    • 9223372036854775807

    Description

      Seeing regular failures from Oleg's test environment:

      Initial testing failed:
      
          sanity2@ldiskfs+DNE Failure(6653s)
          - 300a(1:stripe_count is 1, expect 2)
      

      It looks like the MDS is still recovering from the shutdown in test_278 when test_300a is started, so it isn't creating a directory stripe on the missing MDT:

      [ 5346.863984] Lustre: DEBUG MARKER: == sanity test 278: Race starting MDS between MDTs stop/start ======================================== 13:15:41 (1568222141)
      [ 5347.936754] Lustre: Failing over lustre-MDT0001
      [ 5347.942260] Lustre: Skipped 11 previous similar messages
      [ 5347.946645] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_disconnect to node 192.168.203.134@tcp failed: rc = -19
      [ 5347.949716] LustreError: 16840:0:(osp_dev.c:485:osp_disconnect()) lustre-MDT0000-osp-MDT0001: can't disconnect: rc = -19
      [ 5347.954701] LustreError: 16840:0:(lod_dev.c:267:lod_sub_process_config()) lustre-MDT0001-mdtlov: error cleaning up LOD index 0: cmd 0xcf031 : rc = -19
      [ 5348.090212] LustreError: 16840:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 60c sleeping
      [ 5348.901928] Lustre: server umount lustre-MDT0000 complete
      [ 5348.905261] Lustre: Skipped 10 previous similar messages
      [ 5351.974023] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 5352.048229] LustreError: 17417:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 60c waking
      [ 5352.049878] LustreError: 16840:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 60c awake: rc=0
      [ 5352.050393] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 192.168.203.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 5352.050394] Lustre: Skipped 1 previous similar message
      [ 5352.051035] LustreError: 166-1: MGC192.168.203.134@tcp: Connection to MGS (at 192.168.203.134@tcp) was lost; in progress operations using this service will fail
      [ 5352.052430] Lustre: Evicted from MGS (at 192.168.203.134@tcp) after server handle changed from 0xb4695c9561748b31 to 0xb4695c9561c72f3e
      [ 5354.166243] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      [ 5355.558035] Lustre: DEBUG MARKER: oleg334-server.virtnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 8
      [ 5359.121998] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 5359.290980] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.203.134@tcp failed: rc = -114
      [ 5359.294022] LustreError: Skipped 1 previous similar message
      [ 5360.745846] Lustre: DEBUG MARKER: oleg334-server.virtnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 8
      [ 5363.943859] Lustre: DEBUG MARKER: == sanity test 300a: basic striped dir sanity test =================================================== 13:15:58 (1568222158)
      [ 5364.351961] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
      [ 5364.355319] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) Skipped 2 previous similar messages
      [ 5365.197447] Lustre: lustre-MDT0000: Recovery over after 0:08, of 2 clients 2 recovered and 0 were evicted.
      [ 5365.202538] Lustre: Skipped 2 previous similar messages
      [ 5365.217245] Lustre: lustre-OST0001: deleting orphan objects from 0x0:53071 to 0x0:53089
      [ 5365.217254] Lustre: lustre-OST0000: deleting orphan objects from 0x0:53209 to 0x0:53281
      [ 5365.246825] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
      [ 5365.268393] Lustre: lustre-OST0000: deleting orphan objects from 0x280000400:8777 to 0x280000400:8897
      [ 5365.268407] Lustre: lustre-OST0001: deleting orphan objects from 0x2c0000400:8931 to 0x2c0000400:8993
      [ 5365.795157] Lustre: DEBUG MARKER: sanity test_300a: @@@@@@ FAIL: 1:stripe_count is 1, expect 2
      

      It seems likely that we need something like "wait_recovery_complete mds2" at the end of test_278() instead of letting it bleed into the next test.

      Attachments

        Activity

          People

            adilger Andreas Dilger
            adilger Andreas Dilger
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: