Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7822

import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time

Details

    • 3
    • 9223372036854775807

    Description

      Error happens during soak testing of build '20160224' (b2_8 RC2) (see:
      https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola& spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
      MDSes had been formatted using ldiskfs, OSTs using zfs. MDSes are configured in active-active HA failover configuration.

      Sequence of events:

      • 2016-02-26 08:46:16,864:fsmgmt.fsmgmt:INFO triggering fault mds_failover
        2016-02-26 08:46:16,865:fsmgmt.fsmgmt:INFO reseting MDS node lola-8
      • 2016-02-26 09:01:59,696:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-9
      • 2016-02-26 09:02:52,797:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9
      • 2016-02-26 09:03:13,043:fsmgmt.fsmgmt:INFO Don't wait for recovery to complete. Failback MDT's immediately
      • 2016-02-26 09:03:13,044:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0001 ...
      • 2016-02-26 09:04:04,707:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8 (--> failed back)
      • 2016-02-26 09:04:04,708:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ...
      • 2016-02-26 09:04:25,446:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-8 (--> failed back)
      • 2016-02- 26 09:15:57 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 11:45, of 16 clients 15 recovered and 1 was evicte
        d.
      • 201602-26 09:16:48 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 12:16, of 16 clients 16 recovered and 0 were evicted.

      All Lustre clients show the following error message:

      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
      lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456506319 ref 2 fl Interpret:RP/4/0 rc 301/301
      

      The corresponding error message on MGS/MDT node lola-8 is:

      lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0
      

      After recovery completed the errors appear in messages file of lola-8

      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806786, ql: 7, comp: 9, conn: 16, next: 120260806788, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806789, ql: 7, comp: 9, conn: 16, next: 120260806887, next_update 120260806887 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806888, ql: 7, comp: 9, conn: 16, next: 120260806892, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806893, ql: 7, comp: 9, conn: 16, next: 120260806903, next_update 120260806903 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806904, ql: 7, comp: 9, conn: 16, next: 120260806946, next_update 120260806960 last_committed: 120260806785)
      lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806947, ql: 7, comp: 9, conn: 16, next: 120260806958, next_update 120260806960 last_committed: 120260806785)
      ...
      ... etc etc
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-soaked-MDT0000_UUID lock: ffff8803ecf60dc0/0x73ce4df547baebbb lrc: 3/0,0 mode: EX/EX res: [0x200010da0:0x74:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x40000001000020 nid: 0@lo remote: 0x73ce4df547baebb4 expref: 95 pid: 4836 timeout: 0 lvb_type: 0
      lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) Skipped 3 previous similar messages
      ...
      ...
      

      Attached console, message files of lola-8,9,27 (27--> as example client)

      Attachments

        1. console-lola-27.log.bz2
          45 kB
        2. console-lola-8.log.bz2
          476 kB
        3. console-lola-9.log.bz2
          520 kB
        4. messages-lola-27.log.bz2
          239 kB
        5. messages-lola-8.log.bz2
          186 kB
        6. messages-lola-9.log.bz2
          228 kB

        Activity

          People

            di.wang Di Wang
            heckes Frank Heckes (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: