[LU-5283] LU-617 reappears in 2.4.2 Created: 01/Jul/14  Updated: 18/Jul/17  Resolved: 18/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Haisong Cai (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Linux dolphin-mds-9-2.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Related
is related to LU-617 LBUG: (mdt_recovery.c:787:mdt_last_rc... Resolved
Severity: 3
Rank (Obsolete): 14744

 Description   

The MDS becomes sluggish and the following errors are logged:

Jul 1 14:21:13 dolphin-mds-9-2 kernel: LustreError: 13372:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40602062209, new: 40602062208 replay: 0. see LU-617.
Jul 1 14:45:04 dolphin-mds-9-2 kernel: LustreError: 3640:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40602786157, new: 40602786156 replay: 0. see LU-617.
Jul 1 15:02:12 dolphin-mds-9-2 kernel: LustreError: 19741:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40603454105, new: 40603454104 replay: 0. see LU-617.
Jul 1 15:03:03 dolphin-mds-9-2 kernel: LustreError: 6134:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40603487989, new: 40603487988 replay: 0. see LU-617.
Jul 1 15:03:08 dolphin-mds-9-2 kernel: LustreError: 13372:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40603492527, new: 40603492526 replay: 0. see LU-617.
Jul 1 15:19:00 dolphin-mds-9-2 kernel: LustreError: 19741:0:(mdt_recovery.c:418:mdt_last_rcvd_update()) Trying to overwrite bigger transno:on-disk: 40604082957, new: 40604082956 replay: 0. see LU-617.
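
For context: below is a minimal userspace sketch, not the actual mdt_recovery.c code, of the invariant mdt_last_rcvd_update() is enforcing when it prints these messages: a client's slot in the last_rcvd file must not be overwritten with a smaller transaction number unless the request is a replay. All struct and function names in the sketch are illustrative.

/* Illustrative model of the last_rcvd transno check that produces the
 * "Trying to overwrite bigger transno" message; fields and names are
 * simplified, not the real Lustre mdt_recovery.c code. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct client_slot {
    uint64_t last_transno;  /* highest transno recorded on disk for this client */
};

static int last_rcvd_update(struct client_slot *slot, uint64_t new_transno,
                            bool replay)
{
    /* A non-replay request must never move the on-disk transno
     * backwards; this is the condition behind the console error. */
    if (slot->last_transno > new_transno && !replay) {
        fprintf(stderr, "Trying to overwrite bigger transno: on-disk: %llu, "
                "new: %llu replay: %d. see LU-617.\n",
                (unsigned long long)slot->last_transno,
                (unsigned long long)new_transno, (int)replay);
        return -1;
    }
    slot->last_transno = new_transno;
    return 0;
}

int main(void)
{
    struct client_slot slot = { .last_transno = 40602062209ULL };

    /* Same numbers as the first log line above: the new transno is one
     * less than the on-disk value, so the update is rejected. */
    last_rcvd_update(&slot, 40602062208ULL, false);
    return 0;
}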



 Comments   
Comment by Peter Jones [ 02/Jul/14 ]

Niu

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 02/Jul/14 ]

Is there a particular workload being run that triggers this problem?

Comment by Haisong Cai (Inactive) [ 02/Jul/14 ]

= No particular workload was running, but the file system was in
production when the bug appeared.
We did notice that the MDS system load was relatively high, hovering
between 10 and 30.

= We decided to fail over to the standby MDS when the file system
became sluggish. After about 7 hours, the standby MDS crashed. No
LU-617 errors were logged.

= This morning we brought the MDS back from the crash, and in the
first hour it logged 5 instances of the LU-617 error.

= Another thing to note: our clients are running the 2.5.* client,
not for any particular reason; the admin of the cluster thought it
was the latest release.

thanks,
Haisong

Comment by Niu Yawei (Inactive) [ 03/Jul/14 ]

Are any clients running an older version than 2.5.*?

Comment by Haisong Cai (Inactive) [ 07/Jul/14 ]

Hi Yawei,

Yes, we are running both 1.8.9 (a few of them) and 2.5.3 clients.

thanks,
Haisong

Comment by Niu Yawei (Inactive) [ 30/Jul/14 ]

LU-617 was fixed in 2.1.2; I suggest you upgrade your clients to 2.1.2 or later.

Comment by Niu Yawei (Inactive) [ 18/Jul/17 ]

Upgrading the legacy clients to a later version (2.1.2 or above) fixes the problem.
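
To confirm which client nodes are still on a legacy release, the installed Lustre version can be read from the proc interface used by clients of this era; below is a minimal sketch, assuming /proc/fs/lustre/version is present (newer releases moved this file under /sys/fs/lustre). The same information is also available via lctl get_param version.

/* Minimal sketch: report the locally installed Lustre version so that
 * pre-2.1.2 clients can be identified before upgrading.
 * Assumes the 2.x-era /proc/fs/lustre/version interface. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/fs/lustre/version", "r");
    char line[128];

    if (f == NULL) {
        perror("fopen /proc/fs/lustre/version");
        return 1;
    }
    /* Prints e.g. "lustre: 1.8.9" on a legacy client. */
    while (fgets(line, sizeof(line), f) != NULL)
        fputs(line, stdout);
    fclose(f);
    return 0;
}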
