Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
When kfilnd issues a bulk get, the originator posts a tagged receive buffer and then the BULK_GET TN transitions to TN_STATE_WAIT_COMP where it waits for the TN_EVENT_TX_OK and TN_EVENT_TAG_RX_OK events to complete the transaction.
These events may arrive at originator in any order. If the TX_OK arrives first, the BULK_GET TN is transitioned to TN_STATE_WAIT_TAG_COMP. If TAG_RX_OK arrives with a non-zero status then this is handled correctly.
If the TAG_RX_OK arrives first, then the status of this event is not recorded in the TN, and the BULK_GET TN is transitioned to TN_STATE_WAIT_SEND_COMP . When the TX_OK event arrives the message is finalized as though the TN completed successfully when in fact it failed. This can result in data corruption.
This shows the good/working case:
# TX_OK first with status 0 00000800:40000000:3.0:1699032572.037560:0:61649:0:(kfilnd_tn.c:299:kfilnd_tn_state_change()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_STATE_TAGGED_RECV_POSTED -> TN_STATE_WAIT_COMP state change 00000800:40000000:5.0:1699032572.037584:0:61651:0:(kfilnd_tn.c:1084:kfilnd_tn_state_wait_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TX_OK event status 0 <<<< TX_OK first # TAG_RX_OK second with status -61 00000800:40000000:5.0:1699032572.374710:0:61651:0:(kfilnd_tn.c:1241:kfilnd_tn_state_wait_tag_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TAG_RX_OK event status -61 00000800:40000000:5.0:1699032572.374712:0:61651:0:(kfilnd_tn.c:313:kfilnd_tn_status_update()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 0 -> -61 status change 00000800:40000000:5.0:1699032572.374713:0:61651:0:(kfilnd_tn.c:319:kfilnd_tn_status_update()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 0 -> 0 health status change 00000400:00000200:5.0:1699032572.374714:0:61651:0:(lib-msg.c:1046:lnet_is_health_check()) Msg 0000000039f6c737 is in inconsistent state, don't perform health checking (-61, 0) 00000400:00000200:5.0:1699032572.374715:0:61651:0:(lib-msg.c:1051:lnet_is_health_check()) health check = 0, status = -61, hstatus = 0 00000400:40000000:5.0:1699032572.374715:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000b5e76aa5 msg 0000000039f6c737 # Correct status is passed to upper layer 00000100:40000000:5.0:1699032572.374716:0:61651:0:(events.c:481:server_bulk_callback()) event type 5, status -61, desc 000000002a3c77d3
This shows the broken case:
# TAG_RX_OK first with status -61 00000800:40000000:5.0:1699032576.237177:0:61651:0:(kfilnd_tn.c:1084:kfilnd_tn_state_wait_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TAG_RX_OK event status -61 00000800:40000000:5.0:1699032576.237178:0:61651:0:(kfilnd_tn.c:299:kfilnd_tn_state_change()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_STATE_WAIT_COMP -> TN_STATE_WAIT_SEND_COMP state change # TX_OK second with status 0 00000800:40000000:5.0:1699032576.237242:0:61651:0:(kfilnd_tn.c:1178:kfilnd_tn_state_wait_send_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TX_OK event status 0 00000400:00000200:5.0:1699032576.237243:0:61651:0:(lib-msg.c:1051:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0 00000400:00000200:5.0:1699032576.237244:0:61651:0:(lib-msg.c:822:lnet_health_check()) health check: 17@kfi->0@kfi: GET: OK 00000400:40000000:5.0:1699032576.237247:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000c76d55f1 msg 00000000c54fcba7 00000400:40000000:5.0:1699032576.237248:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000c76d55f1 msg 000000003842aa4c 00000400:00000200:5.0:1699032576.237249:0:61651:0:(lib-md.c:68:lnet_md_unlink()) Unlinking md 00000000c76d55f1 00000800:40000000:5.0:1699032576.237251:0:61651:0:(kfilnd_tn.c:1473:kfilnd_tn_free()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 Transaction freed # Upper layer got status 0, corruption detected 00000001:00020000:4.0:1699032576.261312:0:58034:0:(brw_test.c:230:brw_check_page()) Bad data in page 000000008c1819d6: 0xbeefbeefbeefbeef, 0xeeb0eeb1eeb2eeb3 expected 00000001:00020000:4.0:1699032576.261312:0:58034:0:(brw_test.c:266:brw_check_bulk()) Bulk page 000000008c1819d6 (0/256) is corrupted! 00000001:00020000:4.0:1699032576.261313:0:58034:0:(brw_test.c:432:brw_bulk_ready()) Bulk data rpc 0000000007c0e22a from 12345-0@kfi is corrupted!