[LU-2257] eviction from MDT during recovery Created: 01/Nov/12  Updated: 22/May/13  Resolved: 22/May/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre/commits/2.3.54-llnl


Severity: 3
Rank (Obsolete): 5400

 Description   

The MDS evicted one client during recovery:

2012-10-30 23:31:31 Lustre: lstest-MDT0000: Recovery over after 1:11, of 448 clients 447 recovered and 1 was evicted.

The client had this to say:

00000100:00000400:1.0:1351665089.361406:0:4759:0:(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff8808156d5400 x1417272286137308/t60416147281(60416147281) o101->lstest-MDT0000-mdc-ffff88101c53c800@172.20.5.2@o2ib500:12/10 lens 784/544 e 0 to 0 dl 1351665195 ref 2 fl Interpret:R/4/0 rc -75/-75
00000100:00000100:1.0:1351665195.391323:0:4759:0:(client.c:1914:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1351665089/real 1351665089]  req@ffff8808037b2000 x1417272286147651/t0(0) o400->lstest-MDT0000-mdc-ffff88101c53c800@172.20.5.2@o2ib500:12/10 lens 224/224 e 0 to 1 dl 1351665195 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
00000100:00000400:1.0:1351665195.391332:0:4759:0:(import.c:1207:completed_replay_interpret()) lstest-MDT0000-mdc-ffff88101c53c800: version recovery fails, reconnecting
00000100:02020000:1.0:1351665195.404955:0:4759:0:(import.c:1325:ptlrpc_import_recovery_state_machine()) 167-0: lstest-MDT0000-mdc-ffff88101c53c800: This client was evicted by lstest-MDT0000; in progress operations using this service will fail.
00000080:00020000:8.0:1351665195.420248:0:21538:0:(file.c:155:ll_close_inode_openhandle()) inode 144115590443843479 mdc close failed: rc = -5
00000100:02000000:10.0:1351665195.420776:0:21623:0:(import.c:1403:ptlrpc_import_recovery_state_machine()) lstest-MDT0000-mdc-ffff88101c53c800: Connection restored to lstest-MDT0000 (at 172.20.5.2@o2ib500)
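
The client log lines above can be split into fields for easier correlation with the MDS console timestamps. The following is a minimal sketch only: the field names (`subsys`, `mask`, `cpu`, `stack`, `extpid`) are assumptions inferred from the layout of the lines quoted in this ticket, and `parse_debug_line` and `errno_name` are hypothetical helpers, not part of any Lustre tool. The errno table uses Linux values.

```python
import re
from datetime import datetime, timezone

# Assumed layout, inferred from the log lines quoted in this ticket:
# subsys:mask:cpu:epoch.usec:stack:pid:extpid:(file:line:function()) message
HEADER = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<secs>\d+)\.(?P<usec>\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

# Linux errno values for the return codes that appear in this ticket.
ERRNO_NAMES = {2: "ENOENT", 5: "EIO", 75: "EOVERFLOW"}


def parse_debug_line(text):
    """Split one debug log line into a dict, or return None if it does not match."""
    m = HEADER.match(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    # Whole seconds only; the microsecond part stays available as rec["usec"].
    rec["time_utc"] = datetime.fromtimestamp(int(rec["secs"]), tz=timezone.utc)
    return rec


def errno_name(rc):
    """Map a negative return code such as -75 to a symbolic name, if known."""
    return ERRNO_NAMES.get(abs(rc), "unknown")


SAMPLE = (
    "00000100:00000400:1.0:1351665089.361406:0:4759:0:"
    "(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay"
)

rec = parse_debug_line(SAMPLE)
print(rec["time_utc"].isoformat(), rec["func"], rec["msg"])
print(errno_name(-75))
```

Converting the epoch field to UTC makes it possible to line up the client-side replay failure with the "Recovery over" message on the MDS console, which is logged in local time.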

The application (IOR) failed with ENOENT during its write test:

Commencing write performance test: Tue Oct 30 23:11:39 2012
ior ERROR: stat() failed, errno 2, No such file or directory (aiori-POSIX.c:323)

We'd like to understand why this client was evicted and what the version recovery error messages mean.

LLNL-bug-id: bz1867



 Comments   
Comment by Peter Jones [ 02/Nov/12 ]

Alex

Could someone please look into this one?

Thanks

Peter

Comment by Alex Zhuravlev [ 15/Nov/12 ]

The import message was about a version mismatch:

00000100:00000400:1.0:1351665089.361406:0:4759:0:(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay

I think Mike can look at this.

Comment by Mikhail Pershin [ 22/Nov/12 ]

Are Lustre logs from the MDS available, or any other logs? I can't say exactly what happened there. In general, that message means the operation being replayed on the server expected a different version of the object than the one the server has.
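
To illustrate the general idea of a version check during replay, here is a toy model. It is a sketch under stated assumptions, not Lustre code: `ToyObject` and `replay` are hypothetical names, and the scenario (one client's change is lost, so a later client's replay no longer matches) is only one way a mismatch could arise. The failure code reuses 75 (EOVERFLOW on Linux) because that is the value shown as `rc -75/-75` in the client log.

```python
EOVERFLOW = 75  # Linux errno value; matches "rc -75/-75" in the client log


class ToyObject:
    """Hypothetical stand-in for a server-side object that carries a version."""

    def __init__(self, version):
        self.version = version


def replay(obj, pre_version, new_version):
    """Apply a replayed update only if the object is still at pre_version.

    pre_version is the version the client saw before its original operation.
    Returns 0 on success or -EOVERFLOW when the versions do not match.
    """
    if obj.version != pre_version:
        return -EOVERFLOW
    obj.version = new_version
    return 0


# After a restart the server only has version 10 on disk.
obj = ToyObject(version=10)

# Client A changed the object from 10 to 11 before the restart, but never
# replays that change. Client B's operation was made on top of version 11.
rc_b = replay(obj, pre_version=11, new_version=12)
print("client B replay rc:", rc_b)

# A replay whose expected version matches the object is applied normally.
rc_c = replay(obj, pre_version=10, new_version=13)
print("client C replay rc:", rc_c, "object version:", obj.version)
```

In this model a replay is rejected whenever the object's current version differs from the one recorded with the original operation, which is the condition described in the comment above.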

Comment by Ned Bass [ 22/May/13 ]

This issue is pretty stale. I don't think we've seen it for a long time, so I'm resolving it.

Generated at Sat Feb 10 01:23:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.