Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Severity: 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a660d22a-e346-11e9-a0ba-52540065bddc

      test_413b failed with the following error:

      weight diff=-25% must be > 50% ...Fill MDT0 with 45712 files
      weight diff=-25% must be > 50% ...Fill MDT0 with 45670 files
      weight diff=-25% must be > 50% ...Fill MDT0 with 45712 files
      Timeout occurred after 447 mins, last suite running was sanity, restarting cluster to continue tests
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_413b - Timeout occurred after 447 mins, last suite running was sanity, restarting cluster to continue tests
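
      The "weight diff ... must be > 50%" lines above come from the test repeatedly filling MDT0 with files until the MDT selection weights diverge past the threshold; in the failing runs the diff stays at -25%, so the loop keeps creating files until the overall timeout fires. A rough sketch of that kind of loop, using lfs df -i free-inode counts as the weight proxy (hypothetical helper, not the actual sanity.sh code; $MOUNT, the batch size, and the 50% threshold are assumptions taken from the error message):

      create_on_mdt0_until_imbalanced() {
              local dir=$MOUNT/d413b.mdt0
              local batch=10000
              local created=0
              local diff=0

              # pin the working directory to MDT index 0
              lfs mkdir -i 0 $dir || return 1

              while [ "$diff" -lt 50 ]; do
                      # fill MDT0 with another batch of empty files
                      local i
                      for i in $(seq $batch); do
                              touch $dir/f.$((created + i)) || return 1
                      done
                      created=$((created + batch))

                      # compare free inodes on MDT0000 vs MDT0001 as a rough weight proxy
                      local free0=$(lfs df -i $MOUNT | awk '/MDT0000/ { print $4 }')
                      local free1=$(lfs df -i $MOUNT | awk '/MDT0001/ { print $4 }')
                      diff=$(( (free1 - free0) * 100 / free1 ))
                      echo "weight diff=$diff% must be > 50% ...Fill MDT0 with $batch files"
              done
      }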


          Activity

            [LU-12831] sanity test_413b timed out

            adilger Andreas Dilger added a comment:

            +1 on b2_12

            https://testing.whamcloud.com/test_sets/3d7196ad-cecb-420d-8a02-a79f02ebc7ca

            [ 9848.724723] Lustre: DEBUG MARKER: == sanity test 409: Large amount of cross-MDTs hard links on the same file ==== 14:53:05 (1589035985)
            [ 9904.040345] LDISKFS-fs error (device dm-16) in ldiskfs_free_blocks:5463: IO failure
            [ 9904.062336] Aborting journal on device dm-16-8.
            [ 9904.063288] LDISKFS-fs error (device dm-16) in ldiskfs_orphan_add:3370: Journal has aborted
            [ 9904.063461] LDISKFS-fs (dm-16): Remounting filesystem read-only
            [ 9904.161794] LustreError: 28847:0:(ofd_dev.c:1804:ofd_destroy_hdl()) lustre-OST0005: error destroying object [0x440000402:0x11e:0x0]: -30
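            The -30 at the end of the ofd_destroy_hdl() line is -EROFS, which matches the read-only remount a few lines earlier: once the journal aborts and ldiskfs flips the OST read-only, later object destroys fail the same way. A quick way to confirm the errno mapping on a test node (generic kernel UAPI headers, not part of the test suite; the header path may vary by distribution):

            # EROFS is errno 30 ("Read-only file system")
            grep -n 'EROFS' /usr/include/asm-generic/errno-base.h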
            

            adilger Andreas Dilger added a comment:

            In recent failures, this doesn't look like a timeout caused by creating too many inodes, but rather a problem on the OST from an earlier test (sanity test_409 in the logs below) that causes the filesystem to be remounted read-only:

            [ 9991.380259] Lustre: DEBUG MARKER: == sanity test 409: Large amount of cross-MDTs hard links on the same file =========================== 23:26:58 (1588807618)
            [10021.751428] LDISKFS-fs error (device dm-17) in ldiskfs_free_blocks:5463: IO failure
            [10021.755577] Aborting journal on device dm-17-8.
            [10021.757692] LDISKFS-fs error (device dm-17): ldiskfs_journal_check_start:56: [10021.757918] LDISKFS-fs (dm-17): Remounting filesystem read-only
            [10021.757922] LDISKFS-fs error (device dm-17) in ldiskfs_reserve_inode_write:5332: Journal has aborted
            [10021.758024] LDISKFS-fs error (device dm-17) in ldiskfs_reserve_inode_write:5332: Journal has aborted
            

            This happened on two separate test runs on b2_12:
            https://testing.whamcloud.com/test_sets/9d1b5d16-9c7e-4138-9ed6-679db5cd60cc
            https://testing.whamcloud.com/test_sets/dab2eaa6-de05-4385-87de-9d9e3ebbcdbc
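
            A quick way to tell whether a given test_413b timeout is this failure mode is to check the OSS console log for the journal abort before the hang; if the read-only remount shows up during an earlier test (409 in the runs above), the timeout is just a symptom. A rough grep sketch (the log path is a placeholder, not the actual Maloo layout):

            # look for the ldiskfs failure that precedes the test_413b hang
            grep -nE 'LDISKFS-fs error|Aborting journal|Remounting filesystem read-only' \
                    /path/to/console.oss.log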


            adilger Andreas Dilger added a comment:

            There should probably be a limit on the number of files created to make the MDTs imbalanced. That limit could be increased for SLOW=yes test runs; see the sketch below.
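
            A minimal sketch of what such a cap could look like inside the test (hypothetical variable names, not an actual sanity.sh patch):

            # cap however many files the test computes it needs to create (nfiles)
            # to imbalance MDT0; allow a larger cap when SLOW=yes
            max_files=20000
            [ "$SLOW" = "yes" ] && max_files=200000
            [ "$nfiles" -gt "$max_files" ] && nfiles=$max_files
            echo "Fill MDT0 with $nfiles files"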


            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 2
