Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7822

import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time

Details

    • 3
    • 9223372036854775807

    Description

      Error happens during soak testing of build '20160224' (b2_8 RC2) (see:
      https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola& spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
      MDSes had been formatted using ldiskfs, OSTs using zfs. MDSes are configured in active-active HA failover configuration.

      Sequence of events:

      • 2016-02-26 08:46:16,864:fsmgmt.fsmgmt:INFO triggering fault mds_failover
        2016-02-26 08:46:16,865:fsmgmt.fsmgmt:INFO reseting MDS node lola-8
      • 2016-02-26 09:01:59,696:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-9
      • 2016-02-26 09:02:52,797:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9
      • 2016-02-26 09:03:13,043:fsmgmt.fsmgmt:INFO Don't wait for recovery to complete. Failback MDT's immediately
      • 2016-02-26 09:03:13,044:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0001 ...
      • 2016-02-26 09:04:04,707:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8 (--> failed back)
      • 2016-02-26 09:04:04,708:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ...
      • 2016-02-26 09:04:25,446:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-8 (--> failed back)
      • 2016-02- 26 09:15:57 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 11:45, of 16 clients 15 recovered and 1 was evicte
        d.
      • 201602-26 09:16:48 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 12:16, of 16 clients 16 recovered and 0 were evicted.

      All Lustre clients show the following error message:

      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456506319 ref 2 fl Interpret:RP/4/0 rc 301/301
      

      The corresponding error message on MGS/MDT node lola-8 is:

      lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0
      

      After recovery completed the errors appear in messages file of lola-8

      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806786, ql: 7, comp: 9, conn: 16, next: 120260806788, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806789, ql: 7, comp: 9, conn: 16, next: 120260806887, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806888, ql: 7, comp: 9, conn: 16, next: 120260806892, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806893, ql: 7, comp: 9, conn: 16, next: 120260806903, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806904, ql: 7, comp: 9, conn: 16, next: 120260806946, next_update 120260806960 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806947, ql: 7, comp: 9, conn: 16, next: 120260806958, next_update 120260806960 last_committed: 120260806785)
      ...
      ... etc etc
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-soaked-MDT0000_UUID lock: ffff8803ecf60dc0/0x73ce4df547baebbb lrc: 3/0,0 mode: EX/EX res: [0x200010da0:0x74:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x40000001000020 nid: 0@lo remote: 0x73ce4df547baebb4 expref: 95 pid: 4836 timeout: 0 lvb_type: 0
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) Skipped 3 previous similar messages
      ...
      ...
      

      Attached console, message files of lola-8,9,27 (27--> as example client)

      Attachments

        1. console-lola-27.log.bz2
          45 kB
        2. console-lola-8.log.bz2
          476 kB
        3. console-lola-9.log.bz2
          520 kB
        4. messages-lola-27.log.bz2
          239 kB
        5. messages-lola-8.log.bz2
          186 kB
        6. messages-lola-9.log.bz2
          228 kB

        Activity

          [LU-7822] import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time
          di.wang Di Wang added a comment -

          looks like the open replay data is already deleted from the replay request. So I do not know why this request still exist in the replay list. I wonder if this related with the recent xid change? Niu, please have a look? thanks.

          LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456470692 ref 2 fl Interpret:RP/4/0 rc 301/301
          

          And also this is probably not a critical ticket here, because I did not see this broke the consistency of the namespace. Only some noise message caused by this mysterious replay request?

          di.wang Di Wang added a comment - looks like the open replay data is already deleted from the replay request. So I do not know why this request still exist in the replay list. I wonder if this related with the recent xid change? Niu, please have a look? thanks. LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data. req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456470692 ref 2 fl Interpret:RP/4/0 rc 301/301 And also this is probably not a critical ticket here, because I did not see this broke the consistency of the namespace. Only some noise message caused by this mysterious replay request?
          di.wang Di Wang added a comment -

          hmm, it looks like the object is missing for this replay(open_no_create) request.

          lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
          

          This is interesting. lola-27 seems keeps the same replay request to lola-8 in every recovery.

          messages-lola-8.log:Feb 25 16:53:17 lola-8 kernel: LustreError: 4536:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8804346976c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:209/0 lens 896/3512 e 0 to 0 dl 1456448059 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 17:15:44 lola-8 kernel: LustreError: 4451:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88082b319cc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:738/0 lens 896/3512 e 0 to 0 dl 1456450098 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 23:10:46 lola-8 kernel: LustreError: 4401:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880816d100c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:191/0 lens 896/3512 e 0 to 0 dl 1456470691 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 23:54:27 lola-8 kernel: LustreError: 4450:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8807b679d0c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:512/0 lens 896/3512 e 0 to 0 dl 1456473277 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 00:30:11 lola-8 kernel: LustreError: 4389:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880821640080 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:391/0 lens 896/3512 e 0 to 0 dl 1456475421 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 00:56:32 lola-8 kernel: LustreError: 4543:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880824a71980 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:498/0 lens 896/3512 e 0 to 0 dl 1456477038 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 02:40:43 lola-8 kernel: LustreError: 4449:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88081e04dcc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:668/0 lens 896/3512 e 0 to 0 dl 1456483248 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 08:00:34 lola-8 kernel: LustreError: 4408:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808119ea9c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:229/0 lens 896/3512 e 0 to 0 dl 1456502439 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0
          

          I think client should throw such replay request once it is failed, since it is only open, no create.

          di.wang Di Wang added a comment - hmm, it looks like the object is missing for this replay(open_no_create) request. lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646 This is interesting. lola-27 seems keeps the same replay request to lola-8 in every recovery. messages-lola-8.log:Feb 25 16:53:17 lola-8 kernel: LustreError: 4536:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8804346976c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:209/0 lens 896/3512 e 0 to 0 dl 1456448059 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 17:15:44 lola-8 kernel: LustreError: 4451:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff88082b319cc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:738/0 lens 896/3512 e 0 to 0 dl 1456450098 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 23:10:46 lola-8 kernel: LustreError: 4401:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880816d100c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:191/0 lens 896/3512 e 0 to 0 dl 1456470691 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 23:54:27 lola-8 kernel: LustreError: 4450:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8807b679d0c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:512/0 lens 896/3512 e 0 to 0 dl 1456473277 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 00:30:11 lola-8 kernel: LustreError: 4389:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880821640080 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:391/0 lens 896/3512 e 0 to 0 dl 1456475421 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 00:56:32 lola-8 kernel: LustreError: 4543:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880824a71980 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:498/0 lens 896/3512 e 0 to 0 dl 1456477038 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 02:40:43 lola-8 kernel: LustreError: 4449:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff88081e04dcc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:668/0 lens 896/3512 e 0 to 0 dl 1456483248 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 08:00:34 lola-8 kernel: LustreError: 4408:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8808119ea9c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:229/0 lens 896/3512 e 0 to 0 dl 1456502439 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0 I think client should throw such replay request once it is failed, since it is only open, no create.

          Hi Di,
          Can you take a look at this issue?
          Thanks.
          Joe

          jgmitter Joseph Gmitter (Inactive) added a comment - Hi Di, Can you take a look at this issue? Thanks. Joe

          People

            di.wang Di Wang
            heckes Frank Heckes (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: