[LU-10958] brw rpc reordering causes data corruption when the writethrough cache is disabled Created: 26/Apr/18 Updated: 11/Feb/21 Resolved: 08/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andrew Perepechko | Assignee: | Andrew Perepechko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We ran IOR with LNet router failure simulation and encountered data corruption which seems to be reproducible on master. The following scenario happens: an initial BRW write RPC stalls on the OSS (in our case in the dm/mdraid layer, after the bulk transfer has already completed), the client resends the RPC, the resend is processed successfully and its reply clears PG_writeback on the client, the client then rewrites the same page and that later write RPC also completes, and finally the stalled original RPC resumes and overwrites the newer data on disk with its stale payload.
A reproducer will be uploaded shortly. |
| Comments |
| Comment by Gerrit Updater [ 26/Apr/18 ] |
|
Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/32165 |
| Comment by Oleg Drokin [ 26/Apr/18 ] |
|
hm... so I imagine this would also be the case in the old 1.x code before we got the writeback cache enabled, right? So... I imagine that while the client has a page in flight to the server, the page on the client is marked writeback, so no further modifications are possible until the write actually completes. How do you avoid this? directio? |
| Comment by Andrew Perepechko [ 26/Apr/18 ] |
We were able to reproduce the bug with Lustre 2.7, but we think earlier Lustre versions can be affected too. The successful reply to the resent request will clear PG_writeback on the client; however, the original request can still be being processed on the OSS. When caching is enabled, this scenario cannot happen: the stuck OSS thread either holds the page lock, which serializes writes, or it has not gotten as far as taking the page lock, but in that case it also has not yet transferred the stale data to the OSS, so even if it is reordered it writes the latest data. |
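To make the race described above easier to see, here is a minimal stand-alone sketch in C (all names are hypothetical; this is not Lustre code) of the writethrough-cache-disabled case: three server-side write handlers for the same page run concurrently, the original RPC is delayed below the OST, and its stale payload is committed last.

```c
/*
 * Hypothetical illustration of the LU-10958 race (not Lustre code).
 * "disk" plays the role of the OST object; each thread plays the role
 * of one server-side BRW write handler. Build with: cc -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char disk[16] = "empty";                 /* simulated on-disk content */
static pthread_mutex_t disk_lock = PTHREAD_MUTEX_INITIALIZER;

struct brw {
	const char *data;   /* payload carried by this (re)send            */
	unsigned delay;     /* seconds stuck below the OST (dm/mdraid etc.) */
};

static void *brw_write(void *arg)
{
	struct brw *b = arg;

	sleep(b->delay);                        /* the delay under discussion */
	pthread_mutex_lock(&disk_lock);
	strcpy(disk, b->data);                  /* commit this RPC's payload  */
	pthread_mutex_unlock(&disk_lock);
	return NULL;
}

int main(void)
{
	/* original RPC gets stuck for 2s below the OST */
	struct brw original = { .data = "new",   .delay = 2 };
	/* its resend carries the same data and is processed immediately;
	 * the reply lets the client clear PG_writeback */
	struct brw resend   = { .data = "new",   .delay = 0 };
	/* the client then rewrites the page; this later RPC carries newer data */
	struct brw rewrite  = { .data = "newer", .delay = 1 };
	pthread_t t1, t2, t3;

	pthread_create(&t1, NULL, brw_write, &original);
	pthread_create(&t2, NULL, brw_write, &resend);
	pthread_create(&t3, NULL, brw_write, &rewrite);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_join(t3, NULL);

	/* prints "new": the stale original RPC landed last (corruption) */
	printf("final disk content: %s\n", disk);
	return 0;
}
```

With the writethrough cache enabled this ordering is prevented, as described above, because the stuck handler either already holds the page lock or has not yet copied its stale payload into the cached page.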
| Comment by Oleg Drokin [ 26/Apr/18 ] |
|
Ah, I see there's a resend. Then it's even more strange! We have logic to catch resends so that the same request is not executed twice, so both copies should not be able to run at the same time, right? Something does not add up. |
| Comment by Andrew Perepechko [ 26/Apr/18 ] |
|
Perhaps such logic is only implemented for the MDS. What do you think? |
| Comment by Oleg Drokin [ 26/Apr/18 ] |
|
The logic should be generic enough. I have a vague memory of bulk xids being incremented on resend, but that should not affect detection of double request processing. We even have code to plug into the existing request processing when this is detected, so you don't wait exponentially longer on long operations (though it is possible this does not happen for bulk requests) |
| Comment by Andrew Perepechko [ 26/Apr/18 ] |
|
I uploaded the logs from my test box as 134.tar.bz2. Something like the following happens:

root@panda:/tmp/134# egrep "tgt_brw_write|tgt_brw_read" recovery-small.test_134.debug_log.panda-testbox.1524758128.log

(1) writing a file of 4096 zeros
00000020:00000001:1.0:1524758106.298943:0:28572:0:(tgt_handler.c:2270:tgt_brw_write()) Process entered
00000020:00000040:1.0:1524758106.299164:0:28572:0:(tgt_handler.c:2385:tgt_brw_write()) Client use short io for data transfer, size = 4096
00000020:00000040:1.0:1524758106.299205:0:28572:0:(tgt_handler.c:2446:tgt_brw_write()) Checksum 1 from 12345-0@lo OK: 10000001
00000020:00000001:1.0:1524758106.341298:0:28572:0:(tgt_handler.c:2514:tgt_brw_write()) Process leaving (rc=0 : 0 : 0)

(2) copying this file to /tmp for later cmp
00000020:00000001:1.0:1524758106.362982:0:28572:0:(tgt_handler.c:1956:tgt_brw_read()) Process entered
00000020:00008000:1.0:1524758106.378496:0:28572:0:(tgt_handler.c:2089:tgt_brw_read()) checksum at read origin: 10000001
00000020:00000001:1.0:1524758106.378764:0:28572:0:(tgt_handler.c:2174:tgt_brw_read()) Process leaving (rc=4096 : 4096 : 1000)
00000020:00000001:1.0:1524758106.381031:0:28572:0:(tgt_handler.c:1956:tgt_brw_read()) Process entered
00000020:00008000:1.0:1524758106.381160:0:28572:0:(tgt_handler.c:2089:tgt_brw_read()) checksum at read origin: 1
00000020:00000001:1.0:1524758106.381374:0:28572:0:(tgt_handler.c:2174:tgt_brw_read()) Process leaving (rc=0 : 0 : 0)

(3) initial brw rpc
00000020:00000001:1.0:1524758106.521557:0:28572:0:(tgt_handler.c:2270:tgt_brw_write()) Process entered
00000020:00000040:1.0:1524758106.521743:0:28572:0:(tgt_handler.c:2385:tgt_brw_write()) Client use short io for data transfer, size = 4096
00000020:00000040:1.0:1524758106.521785:0:28572:0:(tgt_handler.c:2446:tgt_brw_write()) Checksum 2 from 12345-0@lo OK: 4ebccd1f

(4) here things get stalled and the client resends this rpc
00000020:00000001:1.0:1524758126.481399:0:30161:0:(tgt_handler.c:2270:tgt_brw_write()) Process entered
00000020:00000020:1.0:1524758126.481435:0:30161:0:(tgt_handler.c:2361:tgt_brw_write()) @@@ clear resent/replay req grant info req@ffff880125623c50 x1598824716964176/t0(0) o4->60f13dec-9165-f217-961e-91740a2150d5@0@lo:0/0 lens 4704/448 e 0 to 0 dl 1524758146 ref 1 fl Interpret:/2/0 rc 0/0
00000020:00000040:1.0:1524758126.481634:0:30161:0:(tgt_handler.c:2385:tgt_brw_write()) Client use short io for data transfer, size = 4096
00000020:00000001:1.0:1524758126.482205:0:30161:0:(tgt_handler.c:2514:tgt_brw_write()) Process leaving (rc=0 : 0 : 0)

(5) the client is now able to rewrite old data
00000020:00000001:0.0:1524758126.606629:0:30161:0:(tgt_handler.c:2270:tgt_brw_write()) Process entered
00000020:00000040:0.0:1524758126.606824:0:30161:0:(tgt_handler.c:2385:tgt_brw_write()) Client use short io for data transfer, size = 4096
00000020:00000040:0.0:1524758126.606865:0:30161:0:(tgt_handler.c:2446:tgt_brw_write()) Checksum 4 from 12345-0@lo OK: 10000001
00000020:00000001:0.0:1524758126.612424:0:30161:0:(tgt_handler.c:2514:tgt_brw_write()) Process leaving (rc=0 : 0 : 0)

(6) the data from the initial brw rpc gets written to disk (corruption)
00000020:00000001:1.0:1524758128.533347:0:28572:0:(tgt_handler.c:2514:tgt_brw_write()) Process leaving (rc=0 : 0 : 0)

(7) the final data verification
00000020:00000001:1.0:1524758128.768109:0:28572:0:(tgt_handler.c:1956:tgt_brw_read()) Process entered
00000020:00008000:1.0:1524758128.778266:0:28572:0:(tgt_handler.c:2089:tgt_brw_read()) checksum at read origin: 4ebccd1f
00000020:00000001:1.0:1524758128.778548:0:28572:0:(tgt_handler.c:2174:tgt_brw_read()) Process leaving (rc=4096 : 4096 : 1000)
00000020:00000001:1.0:1524758128.780809:0:28572:0:(tgt_handler.c:1956:tgt_brw_read()) Process entered
00000020:00008000:1.0:1524758128.780932:0:28572:0:(tgt_handler.c:2089:tgt_brw_read()) checksum at read origin: 1
00000020:00000001:1.0:1524758128.781146:0:28572:0:(tgt_handler.c:2174:tgt_brw_read()) Process leaving (rc=0 : 0 : 0)
|
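Following the checksums in the log above: the final read in step (7) returns 4ebccd1f, which is the checksum of the stale data carried by the initial RPC in step (3), not the 10000001 written last in step (5), confirming that the delayed original RPC overwrote the newer data. A small filter like the one below (a hypothetical stand-alone helper, not part of Lustre) can pull exactly these checksum lines out of a debug log for comparison.

```c
/*
 * Hypothetical helper: print the BRW checksum lines from a Lustre
 * debug log so write and read checksums can be compared by eye.
 * Usage: ./brw_checksums < recovery-small.test_134.debug_log...
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];

	while (fgets(line, sizeof(line), stdin)) {
		/* server-side write checksum, e.g. "Checksum 2 from ... OK: 4ebccd1f" */
		if (strstr(line, "tgt_brw_write()) Checksum"))
			fputs(line, stdout);
		/* server-side read checksum, e.g. "checksum at read origin: 4ebccd1f" */
		else if (strstr(line, "checksum at read origin:"))
			fputs(line, stdout);
	}
	return 0;
}
```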
| Comment by Gerrit Updater [ 03/May/18 ] |
|
Andrew Perepechko (c17827@cray.com) uploaded a new patch: https://review.whamcloud.com/32281 |
| Comment by Andrew Perepechko [ 07/May/18 ] |
This topic was raised again during review. It seems that ptlrpc_server_check_resend_in_progress() was written in such a way as not to drop brw resends in progress, since a bulk transfer may fail after the match bits get changed. In our scenario the real delay happens in the dm/mdraid layer after the bulk transfer has succeeded. There is no -EBUSY from the ptlrpc_server_request_add() calls in the attached log:

root@panda:/tmp/134# egrep "server_request_add.*leaving" recovery-small.test_134.debug_log.panda-testbox.1524758128.log
00000100:00000001:1.0:1524758084.215027:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758084.296625:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.372998:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.374387:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.375747:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.377223:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.378622:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.379762:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.381124:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758084.618182:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
...
00000100:00000001:2.0:1524758109.428916:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758109.429419:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758109.429583:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758109.429676:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758109.429710:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758109.429816:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758109.429839:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758109.429948:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758109.430084:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758109.430352:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758109.430510:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758109.491963:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758111.940049:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758111.940090:0:28570:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.429713:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.429848:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758114.429912:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758114.430457:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.430567:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758114.430693:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.430713:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758114.431758:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.431789:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758114.432571:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758114.433376:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758114.433438:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758114.499932:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758116.948122:0:29382:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758116.948779:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758119.430075:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.430712:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.430857:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758119.430972:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758119.431110:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758119.431420:0:28566:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758119.432337:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758119.432462:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758119.432919:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.433354:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.433531:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.434360:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758119.508004:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758121.956131:0:29382:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758121.956785:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.431106:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.431686:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.431826:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.432540:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.432774:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.432916:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.433065:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.433192:0:28957:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758124.433610:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.433857:0:28958:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.435074:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758124.435867:0:28839:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758124.515970:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.464702:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.464840:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.465510:0:28256:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.465654:0:28256:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758126.465754:0:28255:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.465784:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.465793:0:28256:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.465928:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.466064:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.466192:0:28568:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.468126:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.468374:0:28252:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.481228:0:28573:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.483509:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.521548:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758126.591581:0:28259:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758126.599733:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758126.603148:0:29095:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758126.606446:0:30161:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758126.613821:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.629660:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758126.651194:0:28259:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758126.964226:0:29382:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1524758126.964243:0:30160:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758128.709591:0:29095:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758128.759534:0:28588:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758128.762873:0:29095:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:3.0:1524758128.763899:0:28567:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758128.767953:0:28572:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1524758128.780653:0:28572:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1524758128.785932:0:28258:0:(service.c:1732:ptlrpc_server_request_add()) Process leaving (rc=0 : 0 : 0)
|
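The resend handling Oleg mentioned earlier boils down to the server recognizing that an incoming request carries the xid of a request it is already processing. The sketch below is a simplified, hypothetical illustration of that idea only; it is not the actual ptlrpc_server_check_resend_in_progress() code, which by design does not reject BRW resends in progress because the bulk transfer may fail after the match bits change.

```c
/*
 * Simplified, hypothetical sketch of duplicate-resend detection by xid.
 * Illustrative only; the real ptlrpc code also has to cope with bulk
 * match bits changing on resend and with replies going to the newer copy.
 */
#include <errno.h>
#include <stdio.h>

#define MAX_IN_FLIGHT 16

struct in_flight {
	unsigned long long xid;    /* client-assigned request id */
	int used;
};

static struct in_flight table[MAX_IN_FLIGHT];

/* Return 0 if the request was queued, -EBUSY if the same xid is already
 * being processed (i.e. this is a resend of a request still in progress). */
static int request_add(unsigned long long xid)
{
	int i, free_slot = -1;

	for (i = 0; i < MAX_IN_FLIGHT; i++) {
		if (table[i].used && table[i].xid == xid)
			return -EBUSY;          /* duplicate: original still running */
		if (!table[i].used && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -ENOMEM;                 /* table full; not the point here */
	table[free_slot].used = 1;
	table[free_slot].xid = xid;
	return 0;
}

int main(void)
{
	printf("original: %d\n", request_add(0x123));  /* 0      */
	printf("resend:   %d\n", request_add(0x123));  /* -EBUSY */
	return 0;
}
```

The grep output above shows that none of the BRW requests in this log hit such an -EBUSY path in ptlrpc_server_request_add(), consistent with the resend check intentionally letting in-progress BRW resends through.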
| Comment by Colin Faber [X] (Inactive) [ 06/Jul/18 ] |
|
Hi Peter, can we get some attention on this bug? It's low-hanging fruit. -cf
|
| Comment by Andreas Dilger [ 09/Aug/18 ] |
Isn't that a problem of the DM/mdraid layer, that it is reordering writes incorrectly? If the OST thread submits writes to disk as A, A', B, but the disk writes A', B, A because A was blocked in the IO stack, then there isn't much we can do about it. |
| Comment by Andrew Perepechko [ 09/Aug/18 ] |
|
No, I don't think so. There's nothing wrong in the md layer except the delay itself, which makes it possible for the resent RPC and the RPC after it to complete before the initial delayed RPC. This delay is the analogue of OBD_FAIL_OST_BRW_PAUSE_BULK2 from https://review.whamcloud.com/#/c/32165/6/lustre/tests/recovery-small.sh. Such a delay can happen anywhere on a non-RTOS system. |
| Comment by Cory Spitz [ 04/Feb/21 ] |
|
Proposed for 2.14.0. With -RC1 already available, I realize that its candidacy might not hold. |
| Comment by Gerrit Updater [ 08/Feb/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32281/ |
| Comment by Peter Jones [ 08/Feb/21 ] |
|
Landed for 2.14 |
| Comment by Etienne Aujames [ 11/Feb/21 ] |
|
Hello, is a backport planned for the b2_12 branch for this issue?