[LU-2257] eviction from MDT during recovery Created: 01/Nov/12 Updated: 22/May/13 Resolved: 22/May/13 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Ned Bass | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 5400 |
| Description |
|
The MDS evicted one client during recovery:

2012-10-30 23:31:31 Lustre: lstest-MDT0000: Recovery over after 1:11, of 448 clients 447 recovered and 1 was evicted.

The client had this to say:

00000100:00000400:1.0:1351665089.361406:0:4759:0:(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff8808156d5400 x1417272286137308/t60416147281(60416147281) o101->lstest-MDT0000-mdc-ffff88101c53c800@172.20.5.2@o2ib500:12/10 lens 784/544 e 0 to 0 dl 1351665195 ref 2 fl Interpret:R/4/0 rc -75/-75
00000100:00000100:1.0:1351665195.391323:0:4759:0:(client.c:1914:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1351665089/real 1351665089] req@ffff8808037b2000 x1417272286147651/t0(0) o400->lstest-MDT0000-mdc-ffff88101c53c800@172.20.5.2@o2ib500:12/10 lens 224/224 e 0 to 1 dl 1351665195 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
00000100:00000400:1.0:1351665195.391332:0:4759:0:(import.c:1207:completed_replay_interpret()) lstest-MDT0000-mdc-ffff88101c53c800: version recovery fails, reconnecting
00000100:02020000:1.0:1351665195.404955:0:4759:0:(import.c:1325:ptlrpc_import_recovery_state_machine()) 167-0: lstest-MDT0000-mdc-ffff88101c53c800: This client was evicted by lstest-MDT0000; in progress operations using this service will fail.
00000080:00020000:8.0:1351665195.420248:0:21538:0:(file.c:155:ll_close_inode_openhandle()) inode 144115590443843479 mdc close failed: rc = -5
00000100:02000000:10.0:1351665195.420776:0:21623:0:(import.c:1403:ptlrpc_import_recovery_state_machine()) lstest-MDT0000-mdc-ffff88101c53c800: Connection restored to lstest-MDT0000 (at 172.20.5.2@o2ib500)

The application (IOR) failed with ENOENT on a write:

Commencing write performance test: Tue Oct 30 23:11:39 2012
ior ERROR: stat() failed, errno 2, No such file or directory (aiori-POSIX.c:323)

We'd like to understand why this client was evicted and what the version recovery error messages mean.

LLNL-bug-id: bz1867 |
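To line the client-side entries up against the MDS console timestamp, it can help to convert the epoch seconds in the debug log prefix. The following is only a rough sketch: the field layout in the regular expression is inferred from the lines quoted in this ticket, and `parse_line` and the field names are hypothetical, not part of any Lustre tool.

```python
import re
from datetime import datetime, timezone

# Assumed prefix layout, inferred from the log lines in this ticket:
#   hex:hex:cpu:sec.usec:num:pid:num:(file:line:function()) message
# The group names below are illustrative labels, not official field names.
LINE_RE = re.compile(
    r"(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<sec>\d+)\.(?P<usec>\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)


def parse_line(text):
    """Split one debug log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the epoch seconds to an aware UTC datetime for easier comparison.
    rec["when"] = datetime.fromtimestamp(int(rec["sec"]), tz=timezone.utc)
    return rec


sample = (
    "00000100:00000400:1.0:1351665089.361406:0:4759:0:"
    "(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay"
)
rec = parse_line(sample)
print(rec["when"].isoformat(), rec["func"], rec["msg"])
```

The timestamps are printed in UTC; the MDS console message above is presumably in the site's local time, so an offset has to be applied before comparing the two.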
| Comments |
| Comment by Peter Jones [ 02/Nov/12 ] |
|
Alex, could someone please look into this one? Thanks, Peter |
| Comment by Alex Zhuravlev [ 15/Nov/12 ] |
|
The import message was about a version mismatch:

00000100:00000400:1.0:1351665089.361406:0:4759:0:(client.c:2702:ptlrpc_replay_interpret()) @@@ Version mismatch during replay

I think Mike can look at this. |
| Comment by Mikhail Pershin [ 22/Nov/12 ] |
|
Are Lustre logs from the MDS available, or any logs at all? I can't say exactly what happened there. In general, this message means that the operation being replayed on the server expected a different version of the object than the one the server actually has. |
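The general idea described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Lustre code: `ReplayRequest`, `replay`, and the dictionary of object versions are all hypothetical. The only detail taken from the ticket is the return code, since the client log shows `rc -75/-75` next to "Version mismatch during replay", and 75 is `EOVERFLOW` on Linux.

```python
from dataclasses import dataclass

EOVERFLOW = 75  # Linux errno value; matches "rc -75/-75" in the client log


@dataclass
class ReplayRequest:
    object_id: int    # hypothetical identifier of the object being modified
    pre_version: int  # version the client recorded before the original operation
    transno: int      # transaction number; used as the new version in this toy model


def replay(server_versions, req):
    """Apply a replayed request, or return -EOVERFLOW if the versions disagree."""
    current = server_versions.get(req.object_id, 0)
    if current != req.pre_version:
        # The object changed in a way this client did not expect,
        # so the replay cannot be applied safely.
        return -EOVERFLOW
    server_versions[req.object_id] = req.transno
    return 0


versions = {1: 100}
ok = replay(versions, ReplayRequest(object_id=1, pre_version=100, transno=101))
stale = replay(versions, ReplayRequest(object_id=1, pre_version=100, transno=102))
print(ok, stale)
```

In this model the second replay fails because the first one already moved the object to version 101. Why the versions differed in the reported case cannot be determined without the MDS logs.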
| Comment by Ned Bass [ 22/May/13 ] |
|
This issue is pretty stale. I don't think we've seen it for a long time, so I'm resolving it. |