Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-142

system hang when running replay-single or replay-dual with three clients

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • Lustre 2.1.0
    • Lustre 2.1.0
    • None
    • 3
    • 10067

    Description

      System hangs there when running replay-single or replay-dual with three clients.

      MDT syslogs

      Lustre: DEBUG MARKER: == replay-dual test 3: |X| mkdir adir, mkdir adir/bdir =============================================== 15:25:29 (1300400729)
      LustreError: 16978:0:(ldlm_request.c:88:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1300400765, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-ffff810634719000 lock: ffff810634d62000/0xc740c432497b7dd3 lrc: 3/0,1 mode: --/CW res: 64945/1800942150 bits 0x2 rrc: 6 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 16978 timeout: 0
      LustreError: 16860:0:(ldlm_request.c:88:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1300400765, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-ffff810634719000 lock: ffff81032829d000/0xc740c432497b7de8 lrc: 3/1,0 mode: --/PR res: 64945/1800942150 bits 0x2 rrc: 6 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 16860 timeout: 0
      LustreError: 16860:0:(ldlm_request.c:88:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      LustreError: dumping log to /tmp/lustre-log.1300401065.16859
      Lustre: 16997:0:(service.c:1000:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-175), not sending early reply
      req@ffff8102e0384800 x1363568821797227/t0(0) o-1->8274223f-8957-01db-a6c1-85c4a27e713c@NET_0x50000c0a80405_UUID:0/0 lens 544/1016 e 4 to 0 dl 1300401545 ref 2 fl Interpret:/ffffffff/ffffffff rc 0/-1
      Lustre: 16997:0:(service.c:1000:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-175), not sending early reply
      req@ffff8102dd62d850 x1363568698065665/t0(0) o-1->5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@NET_0x50000c0a80401_UUID:0/0 lens 544/1016 e 4 to 0 dl 1300401545 ref 2 fl Interpret:/ffffffff/ffffffff rc 0/-1
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@192.168.4.2@o2ib to 0xffff810325d28000/1
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff810636acc000 x1363568749446727/t0(0) o-1>e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@NET_0x50000c0a80402_UUID:0/0 lens 368/264 e 0 to 0 dl 1300401672 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 34 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 1 previous similar message
      Lustre: 16981:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@192.168.4.2@o2ib to 0xffff810325d28000/1
      Lustre: 16981:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@192.168.4.2@o2ib to 0xffff810325d28000/1
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 2 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@192.168.4.2@o2ib to 0xffff810325d28000/1
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 2 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@192.168.4.2@o2ib to 0xffff810325d28000/1
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 2 previous similar messages
      LustreError: 16997:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff8102bd915800 x1363568749446750/t0(0) o-1>e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@NET_0x50000c0a80402_UUID:0/0 lens 368/264 e 0 to 0 dl 1300401692 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16997:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 11 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 2 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 7 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 5 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16997:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 11 previous similar messages
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16997:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 11 previous similar messages
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff81032271b400 x1363568698066991/t0(0) o-1>5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@NET_0x50000c0a80401_UUID:0/0 lens 368/264 e 0 to 0 dl 1300401732 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 25 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 20 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 20 previous similar messages
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff8102e5f07800 x1363568749446911/t0(0) o-1>e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@NET_0x50000c0a80402_UUID:0/0 lens 368/264 e 0 to 0 dl 1300401807 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 44 previous similar messages
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 38 previous similar messages
      Lustre: 16981:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16981:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 38 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 77 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 77 previous similar messages
      LustreError: 16912:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff810636367000 x1363568698067306/t0(0) o-1>5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@NET_0x50000c0a80401_UUID:0/0 lens 368/264 e 0 to 0 dl 1300401957 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16912:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 89 previous similar messages
      Lustre: 16980:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16980:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 155 previous similar messages
      Lustre: 16980:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16980:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 155 previous similar messages
      LustreError: 17037:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff8106378aa800 x1363568698067733/t0(0) o-1>5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@NET_0x50000c0a80401_UUID:0/0 lens 368/264 e 0 to 0 dl 1300402262 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 17037:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 182 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf@192.168.4.1@o2ib to 0xffff810325d28400/1
      Lustre: 16912:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 308 previous similar messages
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 5765646f-b9c3-fbc2-d58c-fd81f8d1a4cf reconnecting
      Lustre: 16912:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 306 previous similar messages
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff810635532800 x1363568821800131/t0(0) o-1>8274223f-8957-01db-a6c1-85c4a27e713c@NET_0x50000c0a80405_UUID:0/0 lens 368/264 e 0 to 0 dl 1300402862 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1
      LustreError: 16981:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 359 previous similar messages
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) lustre-MDT0000: refuse reconnection from 8274223f-8957-01db-a6c1-85c4a27e713c@192.168.4.5@o2ib to 0xffff810325d28800/1
      Lustre: 16981:0:(ldlm_lib.c:846:target_handle_connect()) Skipped 359 previous similar messages
      Lustre: 16980:0:(ldlm_lib.c:606:target_handle_reconnect()) lustre-MDT0000: 8274223f-8957-01db-a6c1-85c4a27e713c reconnecting
      Lustre: 16980:0:(ldlm_lib.c:606:target_handle_reconnect()) Skipped 359 previous similar messages
      LustreError: 16912:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (16) req@ffff810631ed9400 x1363568749449228/t0(0) o-1>e0ef2add-a061-7cd8-cd0f-2a288e5fab8a@NET_0x50000c0a80402_UUID:0/0 lens 368/264 e 0 to 0 dl 1300403462 ref 1 fl Interpret:/ffffffff/ffffffff rc -16/-1

      client syslogs

      Lustre: DEBUG MARKER: == replay-dual test 3: |X| mkdir adir, mkdir adir/bdir =============================================== 15:25:29 (1300400729)
      LustreError: 21048:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81032cb66c00 x1363568698065666/t0(0) o-1->MGS@192.168.4.128@o2ib:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 21048:0:(client.c:1057:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
      LustreError: 21048:0:(client.c:1057:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81032cb66c00 x1363568698065673/t0(0) o-1->MGS@192.168.4.128@o2ib:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
      LustreError: 21048:0:(client.c:1057:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
      Lustre: 17200:0:(import.c:529:import_select_connection()) MGC192.168.4.128@o2ib: tried all connections, increasing latency to 11s
      Lustre: 17199:0:(import.c:885:ptlrpc_connect_interpret()) MGS@192.168.4.128@o2ib changed server handle from 0xc740c432497b7838 to 0xc740c432497b7e27
      Lustre: MGC192.168.4.128@o2ib: Reactivating import
      Lustre: MGC192.168.4.128@o2ib: Connection restored to service MGS using nid 192.168.4.128@o2ib.
      Lustre: Skipped 1 previous similar message
      Lustre: 21742:0:(client.c:1772:ptlrpc_expire_one_request()) @@@ Request x1363568698065665 sent from lustre-MDT0000-mdc-ffff81032b0c5400 to NID 192.168.4.128@o2ib has timed out for slow reply: [sent 1300400729] [real_sent 1300400729] [current 1300401616] [deadline 887s] [delay 0s] req@ffff810323b7a800 x1363568698065665/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.128@o2ib:12/10 lens 544/1016 e 4 to 1 dl 1300401616 ref 2 fl Rpc:X/ffffffff/ffffffff rc 0/-1
      Lustre: 21742:0:(client.c:1772:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
      Lustre: lustre-MDT0000-mdc-ffff81032b0c5400: Connection to service lustre-MDT0000 via nid 192.168.4.128@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: Skipped 1 previous similar message
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 3 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 1 previous similar message
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 3 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 6 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 12 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 25 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 51 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 102 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 120 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 120 previous similar messages
      LustreError: 11-0: an error occurred while communicating with 192.168.4.128@o2ib. The mds_connect operation failed with -16
      LustreError: Skipped 120 previous similar messages

      Here are the stack trace and debug log. With 1 client, it seems these tests can pass.

      Attachments

        Activity

          People

            yong.fan nasf (Inactive)
            sarah Sarah Liu
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: