
[LU-16720] large-scale test_3a osp_precreate_rollover_new_seq() ASSERTION( fid_seq(fid) != fid_seq(last_fid) ) failed: fid [0x240000bd0:0x1:0x0], last_fid [0x240000bd0:0x3fff:0x0]

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: Lustre 2.16.0
    • Severity: 3

    Description

      Starting from March 30, right after the landings that day, a new assertion crash appeared in large-scale test_3a (it only runs in full testing, I guess, so it flew under the radar):

      LustreError: 676976:0:(osp_precreate.c:488:osp_precreate_rollover_new_seq()) ASSERTION( fid_seq(fid) != fid_seq(last_fid) ) failed: fid [0x240000bd0:0x1:0x0], last_fid [0x240000bd0:0x3fff:0x0]
      LustreError: 676976:0:(osp_precreate.c:488:osp_precreate_rollover_new_seq()) LBUG
      Pid: 676976, comm: osp-pre-0-0 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Thu Mar 2 00:54:22 UTC 2023
      Call Trace TBD:
      [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
      [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      [<0>] osp_precreate_thread+0x121d/0x1230 [osp]
      [<0>] kthread+0x10b/0x130
      [<0>] ret_from_fork+0x35/0x40 
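
      For context, a minimal sketch of the invariant that trips (reconstructed from the assertion message alone; the real osp_precreate_rollover_new_seq() in osp_precreate.c differs in detail, and lu_fid_sketch is a stand-in for struct lu_fid):

      #include <assert.h>
      #include <stdint.h>

      struct lu_fid_sketch {        /* stand-in for struct lu_fid */
              uint64_t f_seq;       /* sequence number */
              uint32_t f_oid;       /* object id within the sequence */
              uint32_t f_ver;
      };

      /* After rollover the freshly allocated sequence must differ from the
       * one recorded in last_used_fid; here both carry 0x240000bd0, so the
       * check trips exactly like the LBUG above. */
      static void rollover_invariant(const struct lu_fid_sketch *fid,
                                     const struct lu_fid_sketch *last_fid)
      {
              assert(fid->f_seq != last_fid->f_seq);
      }

      int main(void)
      {
              struct lu_fid_sketch fid      = { 0x240000bd0, 0x1,    0 };
              struct lu_fid_sketch last_fid = { 0x240000bd0, 0x3fff, 0 };

              rollover_invariant(&fid, &last_fid);  /* aborts, mirroring the LBUG */
              return 0;
      }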


      Example crashes:

      https://testing.whamcloud.com/test_sets/5173c0c5-ff80-4f5b-aec2-d6e1419cbd85

      https://testing.whamcloud.com/test_sets/68c90481-1450-4526-a659-b6d5d6b97f0a

      https://testing.whamcloud.com/test_sets/20a4a76a-e1bf-4f46-985c-b8cbed94e51b

      I suspect this is due to the LU-11912 patch landing; the timing checks out.

      Activity

            pjones Peter Jones added a comment -

            Believed to have been a duplicate of LU-11912

            dongyang Dongyang Li added a comment - edited

            I think I know what's going on.
            Before large-scale, the previous test was replay-ost-single, which does a replay_barrier on ost1, and from the logs the MDT0 OSP got a new SEQ after the replay_barrier on ost1:

            [ 9541.509199] Lustre: DEBUG MARKER: == replay-ost-single test 12b: write after OST failover to a missing object ========================================================== 03:08:10 (1680059290)
            [ 9545.683083] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n debug
            [ 9546.092712] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n debug=0
            [ 9546.795469] Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x240000401 to 0x240000bd0
            

            The replay_barrier on ost1 drops writes, so we lost the SEQ range update. After that, as we progress to large-scale, when we need to allocate a new SEQ from the OFD we still get the old one, because the range update never reached disk.
            Forcing a new SEQ on all MDTs in replay-ost-single should fix this; I've updated https://review.whamcloud.com/c/fs/lustre-release/+/50478
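
            To illustrate the failure mode, a toy model (all names invented for illustration; this is not the OFD code): the OST persists the last sequence it handed out, and if that write is dropped behind replay_barrier, the next allocation re-issues a sequence that was already given out.

            #include <inttypes.h>
            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            /* on-disk record of the last SEQ handed out (value from the log) */
            static uint64_t disk_last_seq = 0x240000401;

            static uint64_t seq_alloc(bool write_survives)
            {
                    uint64_t new_seq = disk_last_seq + 0x400; /* hypothetical step */

                    if (write_survives)
                            disk_last_seq = new_seq; /* range update reaches disk */
                    /* else: replay_barrier drops the write and the disk
                     * still holds the old value */
                    return new_seq;
            }

            int main(void)
            {
                    uint64_t a = seq_alloc(false); /* update lost behind replay_barrier */
                    uint64_t b = seq_alloc(true);  /* later allocation from large-scale */

                    printf("first 0x%" PRIx64 ", second 0x%" PRIx64 "%s\n",
                           a, b, a == b ? " <- same SEQ handed out twice" : "");
                    return 0;
            }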

            dongyang Dongyang Li added a comment - edited

            This is a different issue from LU-16692. It looks like the LASSERT fired in osp_precreate_rollover_new_seq().
            During SEQ rollover we get a new SEQ, which must differ from the previously used SEQ saved in last_used_fid. Note the object id in last_used_fid is 0x3fff (the reduced SEQ width), which means the SEQ is used up and due to be changed.
            I feel this is actually a bug exposed by changing the SEQ more frequently; maybe a race when changing the SEQ?
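
            To spell out the trigger (a hedged sketch; the macro name is invented and 0x3fff comes from the reduced SEQ width this test run uses):

            #include <stdint.h>

            #define OSP_SEQ_WIDTH_SKETCH 0x3fffULL /* reduced SEQ width in testing */

            /* Once the object id in last_used_fid reaches the sequence width,
             * the SEQ is exhausted and the precreate thread must roll over to
             * a new one; the assertion then demands the new SEQ really is new. */
            static int seq_exhausted(uint64_t last_oid)
            {
                    return last_oid >= OSP_SEQ_WIDTH_SKETCH;
            }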


            adilger Andreas Dilger added a comment -

            Dongyang, this shouldn't only happen with replay_barrier, but also when just creating a lot of files. It isn't exactly the same as LU-16692, since this one has an LASSERT that the sequences are different, while that ticket has an LASSERT that they are the same.

            It seems like there is an off-by-one in the rollover? Also, we may need to replace the LASSERT with error handling, since these assertions seem too easy to hit.
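
            Something along these lines, perhaps (an untested sketch of the error-handling idea, not an actual patch; dev_name is a placeholder for the OSP device name):

            /* untested sketch: degrade the LBUG to an error return so a bad
             * rollover fails precreate instead of panicking the server */
            if (unlikely(fid_seq(fid) == fid_seq(last_fid))) {
                    CERROR("%s: new SEQ "DFID" matches last used SEQ "DFID"\n",
                           dev_name, PFID(fid), PFID(last_fid));
                    RETURN(-EINVAL);
            }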


            People

              Assignee: Dongyang Li (dongyang)
              Reporter: Oleg Drokin (green)
              Votes: 0
              Watchers: 4
