[LU-7822] import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0001_UUID went back in time Created: 26/Feb/16  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Cannot Reproduce Votes: 1
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-b2_8/8/


Attachments: File console-lola-27.log.bz2     File console-lola-8.log.bz2     File console-lola-9.log.bz2     File messages-lola-27.log.bz2     File messages-lola-8.log.bz2     File messages-lola-9.log.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20160224' (b2_8 RC2) (see:
https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
The MDSes had been formatted using ldiskfs, the OSTs using zfs. The MDSes are configured in an active-active HA failover configuration.

Sequence of events:

  • 2016-02-26 08:46:16,864:fsmgmt.fsmgmt:INFO triggering fault mds_failover
    2016-02-26 08:46:16,865:fsmgmt.fsmgmt:INFO reseting MDS node lola-8
  • 2016-02-26 09:01:59,696:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-9
  • 2016-02-26 09:02:52,797:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-9
  • 2016-02-26 09:03:13,043:fsmgmt.fsmgmt:INFO Don't wait for recovery to complete. Failback MDT's immediately
  • 2016-02-26 09:03:13,044:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0001 ...
  • 2016-02-26 09:04:04,707:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8 (--> failed back)
  • 2016-02-26 09:04:04,708:fsmgmt.fsmgmt:INFO Failing back soaked-MDT0000 ...
  • 2016-02-26 09:04:25,446:fsmgmt.fsmgmt:INFO ... soaked-MDT0000 mounted successfully on lola-8 (--> failed back)
  • 2016-02-26 09:15:57 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 11:45, of 16 clients 15 recovered and 1 was evicted.
  • 2016-02-26 09:16:48 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 12:16, of 16 clients 16 recovered and 0 were evicted.

All Lustre clients show the following error message:

lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456506319 ref 2 fl Interpret:RP/4/0 rc 301/301
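For context, the "went back in time" message is emitted when the last_committed transno the server reports at reconnect is lower than a transno the client has already seen committed, meaning committed updates appear to have been lost. A minimal standalone sketch of that comparison (stub types and simplified logic, not the actual import.c code):

#include <stdio.h>
#include <stdint.h>

/* Simplified sketch of the "went back in time" check done in
 * ptlrpc_connect_interpret(); types here are stubs, not Lustre's. */
struct import_state {
        uint64_t peer_committed;  /* highest transno the client saw committed */
};

static void check_server_transno(struct import_state *imp,
                                 uint64_t srv_last_committed,
                                 const char *target)
{
        /* If the server now claims a smaller committed transno than the
         * client previously observed, committed updates were lost. */
        if (srv_last_committed < imp->peer_committed)
                fprintf(stderr,
                        "%s went back in time (transno %llu was previously "
                        "committed, server now claims %llu)!\n",
                        target,
                        (unsigned long long)imp->peer_committed,
                        (unsigned long long)srv_last_committed);
}

int main(void)
{
        /* Values taken from the lola-27 log above. */
        struct import_state imp = { .peer_committed = 120267673197ULL };
        check_server_transno(&imp, 120267671307ULL, "soaked-MDT0000_UUID");
        return 0;
}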

The corresponding error message on MGS/MDT node lola-8 is:

lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0

After recovery completed, the following errors appear in the messages file of lola-8:

lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806786, ql: 7, comp: 9, conn: 16, next: 120260806788, next_update 120260806887 last_committed: 120260806785)
lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806789, ql: 7, comp: 9, conn: 16, next: 120260806887, next_update 120260806887 last_committed: 120260806785)
lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806888, ql: 7, comp: 9, conn: 16, next: 120260806892, next_update 120260806903 last_committed: 120260806785)
lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806893, ql: 7, comp: 9, conn: 16, next: 120260806903, next_update 120260806903 last_committed: 120260806785)
lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806904, ql: 7, comp: 9, conn: 16, next: 120260806946, next_update 120260806960 last_committed: 120260806785)
lola-8.log:Feb 26 09:15:42 lola-8 kernel: LustreError: 4664:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0001: waking for gap in transno, VBR is OFF (skip: 120260806947, ql: 7, comp: 9, conn: 16, next: 120260806958, next_update 120260806960 last_committed: 120260806785)
...
... etc etc
lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) ### BUG 6063: lock collide during recovery ns: mdt-soaked-MDT0000_UUID lock: ffff8803ecf60dc0/0x73ce4df547baebbb lrc: 3/0,0 mode: EX/EX res: [0x200010da0:0x74:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x40000001000020 nid: 0@lo remote: 0x73ce4df547baebb4 expref: 95 pid: 4836 timeout: 0 lvb_type: 0
lola-8.log:Feb 26 09:15:51 lola-8 kernel: LustreError: 4583:0:(ldlm_lockd.c:858:ldlm_server_blocking_ast()) Skipped 3 previous similar messages
...
...
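To make the fields in the "waking for gap in transno" lines easier to read: during replay the server expects transnos in strict sequence, and when the next queued request jumps past the expected value while VBR is off, it logs the gap and proceeds anyway. A simplified sketch of that decision (field names mirror the log above; the real check_for_next_transno() also weighs queue length, connected and completed clients):

#include <stdio.h>
#include <stdint.h>

/* Rough sketch of the gap check in check_for_next_transno();
 * not the Lustre source. */
static int wake_for_gap(uint64_t next_transno, uint64_t last_committed)
{
        uint64_t expected = last_committed + 1;

        if (next_transno > expected) {
                /* "skip" in the log is the first missing transno. */
                fprintf(stderr,
                        "waking for gap in transno, VBR is OFF "
                        "(skip: %llu, next: %llu, last_committed: %llu)\n",
                        (unsigned long long)expected,
                        (unsigned long long)next_transno,
                        (unsigned long long)last_committed);
                return 1;  /* proceed despite the gap */
        }
        return 0;
}

int main(void)
{
        /* Values from the first lola-8 log line above. */
        wake_for_gap(120260806788ULL, 120260806785ULL);
        return 0;
}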

Attached are the console and messages files of lola-8, lola-9 and lola-27 (lola-27 as an example client).



 Comments   
Comment by Joseph Gmitter (Inactive) [ 26/Feb/16 ]

Hi Di,
Can you take a look at this issue?
Thanks.
Joe

Comment by Di Wang [ 26/Feb/16 ]

Hmm, it looks like the object is missing for this replay (open_no_create) request.

lola-27.log:Feb 26 09:04:37 lola-27 kernel: LustreError: 3684:0:(import.c:1249:ptlrpc_connect_interpret()) soaked-MDT0000_UUID went back in time (transno 120267673197 was previously committed, server now claims 120267671307)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646

This is interesting: lola-27 seems to keep sending the same replay request to lola-8 in every recovery.

messages-lola-8.log:Feb 25 16:53:17 lola-8 kernel: LustreError: 4536:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8804346976c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:209/0 lens 896/3512 e 0 to 0 dl 1456448059 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 25 17:15:44 lola-8 kernel: LustreError: 4451:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88082b319cc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:738/0 lens 896/3512 e 0 to 0 dl 1456450098 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 25 23:10:46 lola-8 kernel: LustreError: 4401:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880816d100c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:191/0 lens 896/3512 e 0 to 0 dl 1456470691 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 25 23:54:27 lola-8 kernel: LustreError: 4450:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8807b679d0c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:512/0 lens 896/3512 e 0 to 0 dl 1456473277 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 26 00:30:11 lola-8 kernel: LustreError: 4389:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880821640080 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:391/0 lens 896/3512 e 0 to 0 dl 1456475421 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 26 00:56:32 lola-8 kernel: LustreError: 4543:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff880824a71980 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:498/0 lens 896/3512 e 0 to 0 dl 1456477038 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 26 02:40:43 lola-8 kernel: LustreError: 4449:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff88081e04dcc0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:668/0 lens 896/3512 e 0 to 0 dl 1456483248 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 26 08:00:34 lola-8 kernel: LustreError: 4408:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808119ea9c0 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:229/0 lens 896/3512 e 0 to 0 dl 1456502439 ref 1 fl Interpret:/4/0 rc 0/0
messages-lola-8.log:Feb 26 09:04:37 lola-8 kernel: LustreError: 4561:0:(mdt_open.c:1251:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid.  req@ffff8808231ac680 x1527085335890828/t0(85900680419) o101->713987fa-97c6-9c64-e2ea-b436903a0b12@192.168.1.127@o2ib100:333/0 lens 896/3512 e 0 to 0 dl 1456506318 ref 1 fl Interpret:/4/0 rc 0/0

I think the client should drop such a replay request once it has failed, since it is an open only, with no create.
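A hypothetical sketch of the behaviour suggested here, i.e. unlinking an open-only replay request from the client's replay list once its replay has failed, so it is not retried on every recovery (structure and names are illustrative only, not the Lustre client code):

#include <stdlib.h>

/* Hypothetical per-request replay bookkeeping; illustrative only. */
struct replay_req {
        struct replay_req *next;
        int is_open_no_create;  /* open without O_CREAT */
        int replay_rc;          /* result of the last replay attempt */
};

/* Remove open-only requests whose replay failed from a singly
 * linked replay list. */
static void drop_failed_open_replays(struct replay_req **list)
{
        struct replay_req **pp = list;

        while (*pp != NULL) {
                struct replay_req *req = *pp;

                if (req->is_open_no_create && req->replay_rc < 0) {
                        *pp = req->next;  /* unlink and free the stale request */
                        free(req);
                } else {
                        pp = &req->next;
                }
        }
}

int main(void)
{
        /* Two-element list: one failed open-only replay, one healthy request. */
        struct replay_req *ok = calloc(1, sizeof(*ok));
        struct replay_req *bad = calloc(1, sizeof(*bad));

        bad->is_open_no_create = 1;
        bad->replay_rc = -2;  /* e.g. a server-side replay failure */
        bad->next = ok;

        struct replay_req *list = bad;
        drop_failed_open_replays(&list);  /* list now holds only "ok" */
        free(ok);
        return 0;
}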

Comment by Di Wang [ 26/Feb/16 ]

It looks like the open replay data has already been deleted from the replay request, so I do not know why this request still exists in the replay list. I wonder if this is related to the recent xid change? Niu, could you please have a look? Thanks.

LustreError: 3684:0:(mdc_request.c:554:mdc_replay_open()) @@@ Can't properly replay without open data.  req@ffff8806b1593980 x1527085335890828/t85900680419(85900680419) o101->soaked-MDT0000-mdc-ffff88082f749000@192.168.1.108@o2ib10:12/10 lens 896/544 e 0 to 0 dl 1456470692 ref 2 fl Interpret:RP/4/0 rc 301/301

Also, this is probably not a critical ticket, because I did not see it break the consistency of the namespace. Perhaps it is only noise messages caused by this mysterious replay request?
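For reference, the "Can't properly replay without open data" message is printed when the client's open-replay callback finds no open data attached to the request, which is exactly the state described above (open data freed, request still queued for replay). A simplified standalone sketch of that check (stub types, not the actual mdc_request.c code):

#include <stdio.h>

/* Stub for the per-open replay bookkeeping; opaque in this sketch. */
struct md_open_data;

struct replay_request {
        struct md_open_data *cb_data;  /* NULL once the open data is deleted */
};

/* Simplified sketch of the guard in mdc_replay_open(). */
static int replay_open(struct replay_request *req)
{
        if (req->cb_data == NULL) {
                /* Open data already freed, yet the request is still on
                 * the replay list: the situation discussed above. */
                fprintf(stderr,
                        "Can't properly replay without open data.\n");
                return 0;
        }
        /* ... normally the open handle in the reply would be updated ... */
        return 0;
}

int main(void)
{
        struct replay_request req = { .cb_data = NULL };  /* open data gone */
        return replay_open(&req);
}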
