LU-10686: sanity-pfl test 9 fails with “[0x100010000:0x6025:0x0] != ”

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • Lustre 2.12.0
    • Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7
    • None
    • 3

    Description

      Since lustre-master build #3703 (2.10.57.57) on 2018-01-31, sanity-pfl test_9 has been failing to get and compare the FID of the file’s second component, with the error

      [0x100010000:0x6025:0x0] !=  
      

      The FID of the second component of the file after MDS failover should be on the right hand side of the “!=”.

      Looking at the suite_log, we see that there is some issue writing to the file

      dd: error writing '/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl': No data available
      1+0 records in
      0+0 records out
      0 bytes copied, 0.000605975 s, 0.0 kB/s
      

      We know the file system isn’t full because, earlier in the test, ‘lfs df’ is printed and shows the file system only 2% full. I see this 'No data available' message when trying to reproduce this issue outside of autotest even without the replay-barrier and when the test succeeds. So, this is probably not the cause of the failure.

      Right after the failed write and prior to the MDS failover, we get the FID of the second component

      before MDS recovery, the ost fid of 2nd component is [0x100010000:0x6025:0x0]
      

      We then fail over the MDS and, although it appears to be back online, we can’t get the FID of the second component

      onyx-32vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
      onyx-32vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
      after MDS recovery, the ost fid of 2nd component is 
       sanity-pfl test_9: @@@@@@ FAIL: [0x100010000:0x6025:0x0] !=  
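
      The comparison that fails here can be sketched as follows. This is only an illustration of the check, not the test's actual code: the helper names `extract_fids` and `check` are hypothetical, and the log wording is copied from the excerpts above. It shows how an empty FID after recovery produces a failure message with nothing on the right-hand side of the "!=".

```python
import re

# Wording taken from the suite_log excerpts in this ticket.
FID_LINE = re.compile(
    r"(before|after) MDS recovery, the ost fid of 2nd component is\s*(\S*)"
)


def extract_fids(log_text):
    """Return {'before': fid, 'after': fid}; a missing FID is an empty string."""
    fids = {}
    for line in log_text.splitlines():
        match = FID_LINE.search(line)
        if match:
            fids[match.group(1)] = match.group(2)
    return fids


def check(log_text):
    """Compare the FID seen before and after MDS recovery (hypothetical helper)."""
    fids = extract_fids(log_text)
    before = fids.get("before", "")
    after = fids.get("after", "")
    if before != after:
        return f"FAIL: {before} != {after}"
    return "PASS"


SAMPLE = "\n".join([
    "before MDS recovery, the ost fid of 2nd component is [0x100010000:0x6025:0x0]",
    "after MDS recovery, the ost fid of 2nd component is ",
])

print(check(SAMPLE))
```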
      

      There doesn’t seem to be anything enlightening in the console and dmesg logs. Looking at the MDS log, we see the second component created

      00020000:01000000:0.0:1518721834.723830:0:29881:0:(lod_pool.c:919:lod_find_pool()) lustre-MDT0000-osd: request for an unknown pool (test_85b)
      00000004:00080000:0.0:1518721834.723866:0:29881:0:(osp_object.c:1546:osp_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x6025:0x0], index 1: 0
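
      As a sanity check on the FID in that log line, the sketch below splits it into sequence, object ID and version, then recovers the OST index from the sequence. It assumes the IDIF encoding (sequence range 0x100000000 to 0x1ffffffff, OST index in bits 16-31); that assumption and the helper names are mine, not stated in this ticket. Under that assumption the result agrees with the "lustre-OST0001" and "index 1" in the log.

```python
# Assumed bounds of the IDIF sequence range (not stated in this ticket).
FID_SEQ_IDIF = 0x100000000
FID_SEQ_IDIF_MAX = 0x1FFFFFFFF


def parse_fid(text):
    """Split '[seq:oid:ver]' into three integers."""
    seq, oid, ver = (int(part, 16) for part in text.strip("[]").split(":"))
    return seq, oid, ver


def idif_ost_index(seq):
    """Return the OST index held in bits 16-31 of an IDIF sequence, else None."""
    if not FID_SEQ_IDIF <= seq <= FID_SEQ_IDIF_MAX:
        return None
    return (seq >> 16) & 0xFFFF


seq, oid, ver = parse_fid("[0x100010000:0x6025:0x0]")
print(idif_ost_index(seq), hex(oid), ver)
```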
      

      Logs for this failure are at
      https://testing.hpdd.intel.com/test_sets/2df1628e-0736-11e8-a6ad-52540065bddc
      https://testing.hpdd.intel.com/test_sets/8c95d88c-0732-11e8-a7cd-52540065bddc

          Activity

            adilger Andreas Dilger added a comment -

            This test was removed from ALWAYS_EXCEPT by patch https://review.whamcloud.com/32847 "LU-11158 mdt: grow lvb buffer to hold layout".

            pjones Peter Jones added a comment -

            It sounds like this is believed to be a duplicate of LU-11158

            bobijam Zhenyu Xu added a comment - edited

            Even if it hits ENODATA, the MDS will still instantiate the available component. I don't mean that changing the 2nd component's end to EOF is wrong; the essential issue here is that the component instantiation replay has a bug, and the LU-11158 patch can fix it.

            paf Patrick Farrell (Inactive) added a comment -

            If the write returns ENODATA, how is it getting the component instantiated? Does it still hit with the first byte...?

            bobijam Zhenyu Xu added a comment -

            Yes, the write returns ENODATA, but the test only intends to instantiate the 2nd component and verify that recovery replays the 2nd component instantiation.

            paf Patrick Farrell (Inactive) added a comment -

            Have you confirmed the write is no longer returning ENODATA?

            Given that the write in the test is beyond the end of the specified layout, I think we've still got a problem.

            bobijam Zhenyu Xu added a comment -

            I've verified that LU-11158 fixed this issue.

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/33137
            Subject: LU-10686 tests: correct layout in sanity-pfl 9
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 18d21ed1e69ccdff0d87da8b5fa58fa188e673cb

            paf Patrick Farrell (Inactive) added a comment -

            Well, I'm not sure why it doesn't pass in the single MDT config, but the problem with the test is pretty simple: we're writing beyond the defined layout for the file. This test doesn't actually instantiate the layout. The layout goes to 2 MiB, but the dd write is at 2 MiB. That gets ENODATA because it's beyond the end of the layout.

            I'll push a patch.
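
            The off-by-one described in this comment can be sketched with half-open extents. This is a minimal illustration, not Lustre code: the function `find_component` is hypothetical, the 1 MiB boundary between the two components is an assumption, and only the 2 MiB layout end and the 2 MiB write offset come from the comment. The second layout reflects the "2nd component end to EOF" change mentioned elsewhere in this thread.

```python
MIB = 1024 * 1024
EOF = float("inf")  # stand-in for an extent that runs to end of file


def find_component(layout, offset):
    """Return the index of the component whose half-open extent
    [start, end) covers offset, or None if no component does."""
    for index, (start, end) in enumerate(layout):
        if start <= offset < end:
            return index
    return None


# Layout ending at 2 MiB; the 1 MiB split is assumed for illustration.
bounded_layout = [(0, 1 * MIB), (1 * MIB, 2 * MIB)]
# Same layout with the second component extended to EOF.
eof_layout = [(0, 1 * MIB), (1 * MIB, EOF)]

write_offset = 2 * MIB  # the dd write lands exactly at 2 MiB

print(find_component(bounded_layout, write_offset))      # no component covers it
print(find_component(bounded_layout, write_offset - 1))  # last byte of component 2
print(find_component(eof_layout, write_offset))          # covered once the end is EOF
```

            With the bounded layout, no component covers the write offset, which matches the ENODATA seen in the test; a write one byte earlier, or the same write against the EOF-terminated layout, falls in the second component.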

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32945/
            Subject: LU-10686 tests: stop running sanity-pfl test 9
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1ca1da79a9e6b2af9f89a6c237d40b0333f64965

            People

              bobijam Zhenyu Xu
              jamesanunez James Nunez (Inactive)