Lustre / LU-10686

sanity-pfl test 9 fails with “[0x100010000:0x6025:0x0] != ”

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7
    • Fix Version/s: Lustre 2.12.0
    • Labels:
      None
    • Severity:
      3

      Description

      Since lustre-master build # 3703 (2.10.57.57) on 2018-01-31, sanity-pfl test_9 has been failing to get and compare the FID of the file’s second component, with the error:

      [0x100010000:0x6025:0x0] !=  
      

      The right-hand side of the “!=” should be the FID of the file’s second component after MDS failover; here it is empty.

      Looking at the suite_log, we see a problem writing to the file:

      dd: error writing '/mnt/lustre/d9.sanity-pfl/f9.sanity-pfl': No data available
      1+0 records in
      0+0 records out
      0 bytes copied, 0.000605975 s, 0.0 kB/s
      

      We know the file system isn’t full because ‘lfs df’, printed earlier in the test, shows it only 2% full. I also see this 'No data available' message when reproducing the issue outside of autotest, even without the replay-barrier and even when the test succeeds, so it is probably not the cause of the failure.
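
      As a cross-check on the reasoning above, the dd message can be mapped back to an errno name. On a typical Linux/glibc system 'No data available' is the strerror text for ENODATA, whereas a full file system would normally produce the ENOSPC text instead. The helper below is a hypothetical sketch, not part of the test suite, and its result depends on the C library of the machine it runs on.

```python
import errno
import os


def errno_names_for_message(message):
    """Return the symbolic errno names whose strerror() text equals message."""
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


if __name__ == "__main__":
    # Message copied from the dd output quoted in this report.
    print(errno_names_for_message("No data available"))
    # For comparison: the text a full file system would normally produce.
    print(os.strerror(errno.ENOSPC))
```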

      Right after the failed write and prior to the MDS failover, we get the FID of the second component:

      before MDS recovery, the ost fid of 2nd component is [0x100010000:0x6025:0x0]
      

      We then fail over the MDS. Although it appears to be back online, we can’t get the FID of the second component:

      onyx-32vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
      onyx-32vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 16 sec
      after MDS recovery, the ost fid of 2nd component is 
       sanity-pfl test_9: @@@@@@ FAIL: [0x100010000:0x6025:0x0] !=  
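
      When triaging many suite_logs for this failure, the before/after lines quoted above can be extracted mechanically. The following is a minimal sketch that assumes the log contains exactly the two message formats shown in this report; the function names are hypothetical and not part of the Lustre test framework.

```python
import re

# Matches the two status lines printed by the test, as quoted in this report.
# [ \t]* keeps the match on one line so a FID on the next line is not picked up.
FID_LINE = re.compile(
    r"(?P<phase>before|after) MDS recovery, "
    r"the ost fid of 2nd component is[ \t]*(?P<fid>\[[^\]\n]*\])?"
)


def component_fids(log_text):
    """Map 'before'/'after' to the FID string on that line, or None if blank."""
    fids = {}
    for match in FID_LINE.finditer(log_text):
        fids[match.group("phase")] = match.group("fid")
    return fids


def diagnose(log_text):
    """Summarize whether the FID survived the failover, based on the log text."""
    fids = component_fids(log_text)
    before, after = fids.get("before"), fids.get("after")
    if before is None or after is None:
        return "missing FID (before=%s, after=%s)" % (before, after)
    if before == after:
        return "match"
    return "mismatch: %s != %s" % (before, after)
```

      Applied to the excerpt above, `diagnose` reports a missing FID rather than a mismatch, which separates this failure (the lookup returned nothing) from one where the component's object actually changed.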
      

      Nothing in the console or dmesg logs seems enlightening. The MDS log does show the second component being created:

      00020000:01000000:0.0:1518721834.723830:0:29881:0:(lod_pool.c:919:lod_find_pool()) lustre-MDT0000-osd: request for an unknown pool (test_85b)
      00000004:00080000:0.0:1518721834.723866:0:29881:0:(osp_object.c:1546:osp_create()) lustre-OST0001-osc-MDT0000: Wrote last used FID: [0x100010000:0x6025:0x0], index 1: 0
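
      To find every MDS debug-log line that mentions the FID from the failure message, something like the sketch below can help. It assumes only what is visible in the lines quoted above: a bracketed [seq:oid:ver] FID in hex and a source location of the form (file.c:line:function()). The names `Fid`, `parse_fid`, and `lines_mentioning` are hypothetical helpers, not Lustre tools.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    seq: int  # sequence number
    oid: int  # object ID within the sequence
    ver: int  # version


FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")
# Source location as it appears in the quoted lines: (file.c:line:function())
LOCATION_RE = re.compile(r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)")


def parse_fid(text):
    """Return the first FID found in text, or None if there is none."""
    match = FID_RE.search(text)
    if match is None:
        return None
    return Fid(*(int(part, 16) for part in match.groups()))


def lines_mentioning(log_lines, fid_text):
    """Return (function name, line) pairs for log lines whose first FID equals fid_text."""
    wanted = parse_fid(fid_text)
    hits = []
    if wanted is None:
        return hits
    for line in log_lines:
        if parse_fid(line) == wanted:
            location = LOCATION_RE.search(line)
            hits.append((location.group("func") if location else None, line))
    return hits
```

      Comparing parsed integers rather than raw strings means differences in zero padding or letter case between logs do not hide a match.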
      

      Logs for this failure are at:
      https://testing.hpdd.intel.com/test_sets/2df1628e-0736-11e8-a6ad-52540065bddc
      https://testing.hpdd.intel.com/test_sets/8c95d88c-0732-11e8-a7cd-52540065bddc

              People

              • Assignee:
                bobijam Zhenyu Xu
                Reporter:
                jamesanunez James Nunez
              • Votes:
                0
                Watchers:
                8
