[LU-6924] Remote regular files are missing after recovery. Created: 29/Jul/15 Updated: 26/Aug/15 Resolved: 13/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
In a 24-hour DNE failover test, I found this on one of the MDTs:

LustreError: 2758:0:(client.c:2869:ptlrpc_replay_interpret()) @@@ status -110, old was 0 req@ffff880feb148cc0 x1507974808149044/t25771723485(25771723485) o1000->lustre-MDT0003-osp-MDT0001@192.168.2.128@o2ib:24/4 lens 248/16576 e 1 to 0 dl 1438129486 ref 2 fl Interpret:R/4/0 rc -110/-110
Lustre: lustre-MDT0003-osp-MDT0001: Connection restored to lustre-MDT0003 (at 192.168.2.128@o2ib)
LustreError: 3117:0:(mdt_open.c:1171:mdt_cross_open()) lustre-MDT0001: [0x240000406:0x167f1:0x0] doesn't exist!: rc = -14
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27221 DURATION=86400 PERIOD=1800
Lustre: DEBUG MARKER: Client load failed on node c05, rc=1

Then on the client side, this causes dbench to fail:

2 7136 0.00 MB/sec execute 191 sec latency 272510.369 ms
2 7136 0.00 MB/sec execute 192 sec latency 273510.512 ms
2 7136 0.00 MB/sec execute 193 sec latency 274510.637 ms
2 7136 0.00 MB/sec execute 194 sec latency 275510.799 ms
2 7136 0.00 MB/sec execute 195 sec latency 276510.916 ms
2 7136 0.00 MB/sec execute 196 sec latency 277511.069 ms
2 7136 0.00 MB/sec execute 197 sec latency 278511.229 ms
2 7136 0.00 MB/sec execute 198 sec latency 279511.387 ms
2 7330 0.00 MB/sec execute 199 sec latency 280182.929 ms
[9431] open ./clients/client1/~dmtmp/EXCEL/RESULTS.XLS failed for handle 11887 (Bad address)
(9432) ERROR: handle 11887 was not found
Child failed with status 1

Then the test fails. |
| Comments |
| Comment by Di Wang [ 29/Jul/15 ] |
|
Hmm, I do not have enough debug to know exactly what happened, but it most likely went like this:

1. MDS02 does a remote unlink, so it destroys the local object, then deletes the remote name entry on MDS04.

Lustre: lustre-MDT0001: already connected client lustre-MDT0003-mdtlov_UUID (at 192.168.2.128@o2ib) with handle 0x2e45787e4dd12a1. Rejecting client with the same UUID trying to reconnect with handle 0xf0284dfd774c7787
Lustre: lustre-MDT0001: haven't heard from client lustre-MDT0003-mdtlov_UUID (at 192.168.2.128@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880ffa181c00, cur 1438129440 expire 1438129290 last 1438129212

5. MDS04 failed while waiting for the bulk transfer:

LustreError: 2792:0:(ldlm_lib.c:3041:target_bulk_io()) @@@ network error on bulk WRITE req@ffff880827864850 x1507974808149044/t0(25771723485) o1000->lustre-MDT0001-mdtlov_UUID@192.168.2.126@o2ib:219/0 lens 248/16608 e 1 to 0 dl 1438129504 ref 1 fl Complete:/4/0 rc 0/0

6. MDS02 failed on this unlink replay:

LustreError: 2758:0:(client.c:2869:ptlrpc_replay_interpret()) @@@ status -110, old was 0 req@ffff880feb148cc0 x1507974808149044/t25771723485(25771723485) o1000->lustre-MDT0003-osp-MDT0001@192.168.2.128@o2ib:24/4 lens 248/16576 e 1 to 0 dl 1438129486 ref 2 fl Interpret:R/4/0 rc -110/-110

7. Because MDS02 already got a reply for this replay (note: this is a bulk replay), it will not replay this request again (see ptlrpc_replay_interpret()). |
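To make step 7 concrete, here is a minimal sketch of that logic. It is not the actual Lustre source; the struct, its fields, and replay_needs_resend() are simplified assumptions used only to illustrate why a replayed bulk request is not resent once any reply has been received, even when its bulk transfer timed out.

#include <stdio.h>

/* Illustrative request state; field names are assumptions, not the
 * real ptlrpc request structure. */
struct replay_req {
	int rq_status;   /* status carried in the reply (-110 here, i.e. timed out) */
	int rq_bulk;     /* non-zero when the replay carries a bulk transfer */
	int rq_replied;  /* non-zero once any reply has been received */
};

/* Original behaviour described in step 7: a replay that already has a
 * reply is considered finished, even if its bulk transfer timed out. */
static int replay_needs_resend(const struct replay_req *req)
{
	if (req->rq_replied)
		return 0;	/* reply seen, so the replay is never resent */
	return 1;
}

int main(void)
{
	/* The failure case from the logs: bulk replay, reply received, rc = -110. */
	struct replay_req req = { .rq_status = -110, .rq_bulk = 1, .rq_replied = 1 };

	printf("resend? %d\n", replay_needs_resend(&req)); /* prints 0 */
	return 0;
}

In this sketch, the reply received in step 6 (rc = -110) is enough to mark the replay complete, so the update carried by the bulk transfer that timed out in step 5 is never sent again.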
| Comment by Di Wang [ 29/Jul/15 ] |
|
So the easiest fix might be in step 7: if it is a bulk replay, then even though the server got the request, if the bulk transfer timed out we will still resend the replay request. A sketch of this idea follows. |
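A minimal sketch of that idea, reusing the illustrative struct from the previous comment. This is only an assumption-level illustration, not the actual change; the real change is the patch referenced in the next comment (http://review.whamcloud.com/15793).

#include <stdio.h>

#define BULK_TIMEOUT (-110)	/* the -110 (timed out) status seen in the replay reply */

/* Same illustrative fields as before; an assumption, not the real
 * ptlrpc request structure. */
struct replay_req {
	int rq_status;   /* status carried in the reply */
	int rq_bulk;     /* non-zero when the replay carries a bulk transfer */
	int rq_replied;  /* non-zero once any reply has been received */
};

/* Proposed behaviour: a bulk replay whose bulk transfer timed out is
 * resent even though the server already replied to the request. */
static int replay_needs_resend(const struct replay_req *req)
{
	if (req->rq_bulk && req->rq_status == BULK_TIMEOUT)
		return 1;	/* server never received the bulk data, replay again */
	return !req->rq_replied;
}

int main(void)
{
	struct replay_req req = { .rq_status = -110, .rq_bulk = 1, .rq_replied = 1 };

	printf("resend? %d\n", replay_needs_resend(&req)); /* now prints 1 */
	return 0;
}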
| Comment by Gerrit Updater [ 29/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15793 |
| Comment by Gerrit Updater [ 13/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15793/ |