[LU-6924] remote regular files are missing after recovery. Created: 29/Jul/15  Updated: 26/Aug/15  Resolved: 13/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
is related to LU-6883 replay-single test 73a hang and timeo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During a 24-hour DNE failover test, I found this on one of the MDTs:

LustreError: 2758:0:(client.c:2869:ptlrpc_replay_interpret()) @@@ status -110, old was 0  req@ffff880feb148cc0 x1507974808149044/t25771723485(25771723485) o1000->lustre-MDT0003-osp-MDT0001@192.168.2.128@o2ib:24/4 lens 248/16576 e 1 to 0 dl 1438129486 ref 2 fl Interpret:R/4/0 rc -110/-110
Lustre: lustre-MDT0003-osp-MDT0001: Connection restored to lustre-MDT0003 (at 192.168.2.128@o2ib)
LustreError: 3117:0:(mdt_open.c:1171:mdt_cross_open()) lustre-MDT0001: [0x240000406:0x167f1:0x0] doesn't exist!: rc = -14
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=27221 DURATION=86400 PERIOD=1800
Lustre: DEBUG MARKER: Client load failed on node c05, rc=1

Then on the client side, this caused dbench to fail:

   2      7136     0.00 MB/sec  execute 191 sec  latency 272510.369 ms
   2      7136     0.00 MB/sec  execute 192 sec  latency 273510.512 ms
   2      7136     0.00 MB/sec  execute 193 sec  latency 274510.637 ms
   2      7136     0.00 MB/sec  execute 194 sec  latency 275510.799 ms
   2      7136     0.00 MB/sec  execute 195 sec  latency 276510.916 ms
   2      7136     0.00 MB/sec  execute 196 sec  latency 277511.069 ms
   2      7136     0.00 MB/sec  execute 197 sec  latency 278511.229 ms
   2      7136     0.00 MB/sec  execute 198 sec  latency 279511.387 ms
   2      7330     0.00 MB/sec  execute 199 sec  latency 280182.929 ms
[9431] open ./clients/client1/~dmtmp/EXCEL/RESULTS.XLS failed for handle 11887 (Bad address)
(9432) ERROR: handle 11887 was not found
Child failed with status 1

Then the test fails.



 Comments   
Comment by Di Wang [ 29/Jul/15 ]

Hmm, I do not have enough debug logs to know exactly what happened, but it most likely went like this:

1. MDS02 does a remote unlink: it destroys the local object, then deletes the remote name entry on MDS04.
2. But MDS04 restarts at that moment; after it restarts, it waits for all clients to reconnect, then collects the debug log.
3. After MDS02 reconnects to MDS04, it sends the unlink replay to MDS04; MDS04 receives the unlink request and waits for the bulk transfer.
4. At the same time, MDS02 evicts MDS04:

Lustre: lustre-MDT0001: already connected client lustre-MDT0003-mdtlov_UUID (at 192.168.2.128@o2ib) with handle 0x2e45787e4dd12a1. Rejecting client with the same UUID trying to reconnect with handle 0xf0284dfd774c7787
Lustre: lustre-MDT0001: haven't heard from client lustre-MDT0003-mdtlov_UUID (at 192.168.2.128@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880ffa181c00, cur 1438129440 expire 1438129290 last 1438129212

5. MDS04 fails while waiting for the bulk transfer (sketched after this list):

LustreError: 2792:0:(ldlm_lib.c:3041:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff880827864850 x1507974808149044/t0(25771723485) o1000->lustre-MDT0001-mdtlov_UUID@192.168.2.126@o2ib:219/0 lens 248/16608 e 1 to 0 dl 1438129504 ref 1 fl Complete:/4/0 rc 0/0

6. MDS02 fails this unlink replay:

LustreError: 2758:0:(client.c:2869:ptlrpc_replay_interpret()) @@@ status -110, old was 0  req@ffff880feb148cc0 x1507974808149044/t25771723485(25771723485) o1000->lustre-MDT0003-osp-MDT0001@192.168.2.128@o2ib:24/4 lens 248/16576 e 1 to 0 dl 1438129486 ref 2 fl Interpret:R/4/0 rc -110/-110

7. Because MDS02 already got a reply to this replay (note: this is a bulk replay), it will not resend the request (see ptlrpc_replay_interpret()).
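To make steps 3-5 more concrete, here is a minimal, self-contained sketch of how the server-side wait for the replay's bulk data ends in -110 once the peer has been evicted. All names here are hypothetical; this is not the real target_bulk_io() code, only an illustration under that assumption.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the server-side bulk wait (steps 3-5):
 * MDS04 has accepted the replayed unlink and now waits for the bulk
 * WRITE from MDS02.  After the eviction in step 4 the transfer can
 * never complete, so the wait ends with -ETIMEDOUT (-110), which is
 * the status seen on the replay reply in step 6.
 */
static int wait_for_bulk_write(bool bulk_completed, int seconds_waited,
                               int deadline_seconds)
{
	if (bulk_completed)
		return 0;                 /* bulk arrived, unlink can be replayed */
	if (seconds_waited >= deadline_seconds)
		return -ETIMEDOUT;        /* shows up as "rc -110" in the logs */
	return -EINPROGRESS;              /* still waiting */
}

int main(void)
{
	/* The peer was evicted, so the bulk never completes and the
	 * deadline expires. */
	int rc = wait_for_bulk_write(false, 300, 300);

	printf("bulk wait rc = %d\n", rc);        /* -110 on Linux */
	return 0;
}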

Comment by Di Wang [ 29/Jul/15 ]

So the easiest fix might be in step 7: if it is a bulk replay and the server got the request but the bulk transfer timed out, then we should still resend the replay request.
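A minimal sketch of that idea, using hypothetical field and function names rather than the actual ptlrpc structures touched by http://review.whamcloud.com/15793: a replayed request that carries a bulk transfer and came back with a timeout status is resent instead of being treated as done.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of a replayed request. */
struct replay_req {
	bool has_bulk;     /* the replay carries a bulk transfer (e.g. this unlink) */
	int  reply_status; /* status in the reply; -ETIMEDOUT (-110) in this ticket */
};

/*
 * Decide whether the replay must be sent again.  Before the fix, receiving
 * any reply ended the replay even if the server-side bulk had timed out, so
 * the unlink was never re-executed on the restarted MDT.  With the proposed
 * change, a bulk replay whose transfer timed out is resent.
 */
static bool replay_needs_resend(const struct replay_req *req)
{
	return req->has_bulk && req->reply_status == -ETIMEDOUT;
}

int main(void)
{
	struct replay_req req = { .has_bulk = true, .reply_status = -ETIMEDOUT };

	printf("resend replay: %s\n", replay_needs_resend(&req) ? "yes" : "no");
	return 0;
}

The real change lands in the ptlrpc replay path (see ptlrpc_replay_interpret()); the sketch only captures the shape of the condition.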

Comment by Gerrit Updater [ 29/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15793
Subject: LU-6924 ptlrpc: replay bulk request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 722bbb86fa5479bdc16b62b43863eff39a61df56

Comment by Gerrit Updater [ 13/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15793/
Subject: LU-6924 ptlrpc: replay bulk request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0addfa9fa1d48cc9fa5eb05026848e55382f81a8
