Lustre / LU-17270

TN status is lost if TN_EVENT_TAG_RX_OK occurs before TN_EVENT_TX_OK


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      When kfilnd issues a bulk get, the originator posts a tagged receive buffer, and the BULK_GET TN transitions to TN_STATE_WAIT_COMP, where it waits for both the TN_EVENT_TX_OK and TN_EVENT_TAG_RX_OK events before completing the transaction.

      These events may arrive at the originator in either order. If TX_OK arrives first, the BULK_GET TN transitions to TN_STATE_WAIT_TAG_COMP, and a subsequent TAG_RX_OK with a non-zero status is handled correctly.

      If TAG_RX_OK arrives first, the status of this event is not recorded in the TN, and the BULK_GET TN transitions to TN_STATE_WAIT_SEND_COMP. When the TX_OK event later arrives, the message is finalized as though the TN completed successfully when in fact it failed. This can result in data corruption.

      This shows the good/working case:

      # TX_OK first with status 0
      00000800:40000000:3.0:1699032572.037560:0:61649:0:(kfilnd_tn.c:299:kfilnd_tn_state_change()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_STATE_TAGGED_RECV_POSTED -> TN_STATE_WAIT_COMP state change
      00000800:40000000:5.0:1699032572.037584:0:61651:0:(kfilnd_tn.c:1084:kfilnd_tn_state_wait_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TX_OK event status 0 <<<< TX_OK first
      
      # TAG_RX_OK second with status -61
      00000800:40000000:5.0:1699032572.374710:0:61651:0:(kfilnd_tn.c:1241:kfilnd_tn_state_wait_tag_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TAG_RX_OK event status -61
      00000800:40000000:5.0:1699032572.374712:0:61651:0:(kfilnd_tn.c:313:kfilnd_tn_status_update()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 0 -> -61 status change
      00000800:40000000:5.0:1699032572.374713:0:61651:0:(kfilnd_tn.c:319:kfilnd_tn_status_update()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 0 -> 0 health status change
      00000400:00000200:5.0:1699032572.374714:0:61651:0:(lib-msg.c:1046:lnet_is_health_check()) Msg 0000000039f6c737 is in inconsistent state, don't perform health checking (-61, 0)
      00000400:00000200:5.0:1699032572.374715:0:61651:0:(lib-msg.c:1051:lnet_is_health_check()) health check = 0, status = -61, hstatus = 0
      00000400:40000000:5.0:1699032572.374715:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000b5e76aa5 msg 0000000039f6c737
      
      # Correct status is passed to upper layer
      00000100:40000000:5.0:1699032572.374716:0:61651:0:(events.c:481:server_bulk_callback()) event type 5, status -61, desc 000000002a3c77d3
      

      This shows the broken case:

      # TAG_RX_OK first with status -61
      00000800:40000000:5.0:1699032576.237177:0:61651:0:(kfilnd_tn.c:1084:kfilnd_tn_state_wait_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TAG_RX_OK event status -61
      00000800:40000000:5.0:1699032576.237178:0:61651:0:(kfilnd_tn.c:299:kfilnd_tn_state_change()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_STATE_WAIT_COMP -> TN_STATE_WAIT_SEND_COMP state change
      
      # TX_OK second with status 0
      00000800:40000000:5.0:1699032576.237242:0:61651:0:(kfilnd_tn.c:1178:kfilnd_tn_state_wait_send_comp()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 TN_EVENT_TX_OK event status 0
      00000400:00000200:5.0:1699032576.237243:0:61651:0:(lib-msg.c:1051:lnet_is_health_check()) health check = 1, status = 0, hstatus = 0
      00000400:00000200:5.0:1699032576.237244:0:61651:0:(lib-msg.c:822:lnet_health_check()) health check: 17@kfi->0@kfi: GET: OK
      00000400:40000000:5.0:1699032576.237247:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000c76d55f1 msg 00000000c54fcba7
      00000400:40000000:5.0:1699032576.237248:0:61651:0:(lib-msg.c:979:lnet_msg_detach_md()) md 00000000c76d55f1 msg 000000003842aa4c
      00000400:00000200:5.0:1699032576.237249:0:61651:0:(lib-md.c:68:lnet_md_unlink()) Unlinking md 00000000c76d55f1
      00000800:40000000:5.0:1699032576.237251:0:61651:0:(kfilnd_tn.c:1473:kfilnd_tn_free()) KFILND_MSG_BULK_GET_REQ Transaction ID 000000000105cad8: 17@kfi:1 -> 0@kfi(00000000eb641e7b):0x0 Transaction freed
      
      # Upper layer got status 0, corruption detected
      00000001:00020000:4.0:1699032576.261312:0:58034:0:(brw_test.c:230:brw_check_page()) Bad data in page 000000008c1819d6: 0xbeefbeefbeefbeef, 0xeeb0eeb1eeb2eeb3 expected
      00000001:00020000:4.0:1699032576.261312:0:58034:0:(brw_test.c:266:brw_check_bulk()) Bulk page 000000008c1819d6 (0/256) is corrupted!
      00000001:00020000:4.0:1699032576.261313:0:58034:0:(brw_test.c:432:brw_bulk_ready()) Bulk data rpc 0000000007c0e22a from 12345-0@kfi is corrupted!
      


          People

            hornc Chris Horn