Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7822

import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time

Details

    • 3
    • 9223372036854775807

    Description

      Error happens during soak testing of build '20160224' (b2_8 RC2) (see:
      https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola& spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
      MDSes had been formatted using ldiskfs, OSTs using zfs. MDSes are configured in active-active HA failover configuration.

      Sequence of events:

      • 2016-02-26 08:46:16,864:fsmgmt.fsmgmt:INFO triggering fault mds_failover
        2016-02-26 08:46:16,865:fsmgmt.fsmgmt:INFO reseting MDS node lola-8
      • 2016-02-26 09:01:59,696:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-9
      • 2016-02-26 09:02:52,797:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9
      • 2016-02-26 09:03:13,043:fsmgmt.fsmgmt:INFO Don't wait for recovery to complete. Failback MDT's immediately
      • 2016-02-26 09:03:13,044:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0001 ...
      • 2016-02-26 09:04:04,707:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8 (--> failed back)
      • 2016-02-26 09:04:04,708:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ...
      • 2016-02-26 09:04:25,446:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-8 (--> failed back)
      • 2016-02- 26 09:15:57 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 11:45, of 16 clients 15 recovered and 1 was evicte
        d.
      • 201602-26 09:16:48 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 12:16, of 16 clients 16 recovered and 0 were evicted.

      All Lustre clients show the following error message:

      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456506319 ref 2 fl Interpret:RP/4/0 rc 301/301
      

      The corresponding error message on MGS/MDT node lola-8 is:

      lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0
      

      After recovery completed the errors appear in messages file of lola-8

      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806786, ql: 7, comp: 9, conn: 16, next: 120260806788, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806789, ql: 7, comp: 9, conn: 16, next: 120260806887, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806888, ql: 7, comp: 9, conn: 16, next: 120260806892, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806893, ql: 7, comp: 9, conn: 16, next: 120260806903, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806904, ql: 7, comp: 9, conn: 16, next: 120260806946, next_update 120260806960 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806947, ql: 7, comp: 9, conn: 16, next: 120260806958, next_update 120260806960 last_committed: 120260806785)
      ...
      ... etc etc
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-soaked-MDT0000_UUID lock: ffff8803ecf60dc0/0x73ce4df547baebbb lrc: 3/0,0 mode: EX/EX res: [0x200010da0:0x74:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x40000001000020 nid: 0@lo remote: 0x73ce4df547baebb4 expref: 95 pid: 4836 timeout: 0 lvb_type: 0
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) Skipped 3 previous similar messages
      ...
      ...
      

      Attached console, message files of lola-8,9,27 (27--> as example client)

      Attachments

        1. console-lola-27.log.bz2
          45 kB
        2. console-lola-8.log.bz2
          476 kB
        3. console-lola-9.log.bz2
          520 kB
        4. messages-lola-27.log.bz2
          239 kB
        5. messages-lola-8.log.bz2
          186 kB
        6. messages-lola-9.log.bz2
          228 kB

        Activity

          [LU-7822] import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time

          looks like the open replay data is already deleted from the replay request. So I do not know why this request still exist in the replay list. I wonder if this related with the recent xid change? Niu, please have a look? thanks.

          LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456470692 ref 2 fl Interpret:RP/4/0 rc 301/301
          

          And also this is probably not a critical ticket here, because I did not see this broke the consistency of the namespace. Only some noise message caused by this mysterious replay request?

          di.wang Di Wang (Inactive) added a comment - looks like the open replay data is already deleted from the replay request. So I do not know why this request still exist in the replay list. I wonder if this related with the recent xid change? Niu, please have a look? thanks. LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data. req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456470692 ref 2 fl Interpret:RP/4/0 rc 301/301 And also this is probably not a critical ticket here, because I did not see this broke the consistency of the namespace. Only some noise message caused by this mysterious replay request?

          hmm, it looks like the object is missing for this replay(open_no_create) request.

          lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
          

          This is interesting. lola-27 seems keeps the same replay request to lola-8 in every recovery.

          messages-lola-8.log:Feb 25 16:53:17 lola-8 kernel: LustreError: 4536:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8804346976c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:209/0 lens 896/3512 e 0 to 0 dl 1456448059 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 17:15:44 lola-8 kernel: LustreError: 4451:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88082b319cc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:738/0 lens 896/3512 e 0 to 0 dl 1456450098 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 23:10:46 lola-8 kernel: LustreError: 4401:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880816d100c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:191/0 lens 896/3512 e 0 to 0 dl 1456470691 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 25 23:54:27 lola-8 kernel: LustreError: 4450:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8807b679d0c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:512/0 lens 896/3512 e 0 to 0 dl 1456473277 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 00:30:11 lola-8 kernel: LustreError: 4389:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880821640080 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:391/0 lens 896/3512 e 0 to 0 dl 1456475421 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 00:56:32 lola-8 kernel: LustreError: 4543:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880824a71980 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:498/0 lens 896/3512 e 0 to 0 dl 1456477038 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 02:40:43 lola-8 kernel: LustreError: 4449:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88081e04dcc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:668/0 lens 896/3512 e 0 to 0 dl 1456483248 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 08:00:34 lola-8 kernel: LustreError: 4408:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808119ea9c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:229/0 lens 896/3512 e 0 to 0 dl 1456502439 ref 1 fl Interpret:/4/0 rc 0/0
          messages-lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0
          

          I think client should throw such replay request once it is failed, since it is only open, no create.

          di.wang Di Wang (Inactive) added a comment - hmm, it looks like the object is missing for this replay(open_no_create) request. lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646 This is interesting. lola-27 seems keeps the same replay request to lola-8 in every recovery. messages-lola-8.log:Feb 25 16:53:17 lola-8 kernel: LustreError: 4536:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8804346976c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:209/0 lens 896/3512 e 0 to 0 dl 1456448059 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 17:15:44 lola-8 kernel: LustreError: 4451:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff88082b319cc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:738/0 lens 896/3512 e 0 to 0 dl 1456450098 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 23:10:46 lola-8 kernel: LustreError: 4401:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880816d100c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:191/0 lens 896/3512 e 0 to 0 dl 1456470691 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 25 23:54:27 lola-8 kernel: LustreError: 4450:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8807b679d0c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:512/0 lens 896/3512 e 0 to 0 dl 1456473277 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 00:30:11 lola-8 kernel: LustreError: 4389:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880821640080 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:391/0 lens 896/3512 e 0 to 0 dl 1456475421 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 00:56:32 lola-8 kernel: LustreError: 4543:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff880824a71980 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:498/0 lens 896/3512 e 0 to 0 dl 1456477038 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 02:40:43 lola-8 kernel: LustreError: 4449:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff88081e04dcc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:668/0 lens 896/3512 e 0 to 0 dl 1456483248 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 08:00:34 lola-8 kernel: LustreError: 4408:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8808119ea9c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:229/0 lens 896/3512 e 0 to 0 dl 1456502439 ref 1 fl Interpret:/4/0 rc 0/0 messages-lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid. req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0 I think client should throw such replay request once it is failed, since it is only open, no create.

          Hi Di,
          Can you take a look at this issue?
          Thanks.
          Joe

          jgmitter Joseph Gmitter (Inactive) added a comment - Hi Di, Can you take a look at this issue? Thanks. Joe

          People

            di.wang Di Wang (Inactive)
            heckes Frank Heckes (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: