  1. Lustre
  2. LU-1733

Intermittent issue with some clients not being able to access OSTs. LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)


Details

    • Bug
    • Resolution: Incomplete
    • Blocker
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Terascala Lustre appliance. Lustre version: 1.8.4. OS: CentOS 5.4
    • 2
    • 3985

    Description

      As I understand the issue, the user has clients (both 1g and 10g) that lose their connection to an OSS. An affected client cannot lctl ping the OSS, though I believe it can still reach it with an ordinary ping. Other clients on the same network can reach the OSS with both ping and lctl ping, and the OSS itself appears to be fine from what we've been able to see.
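The ping versus lctl ping comparison can be scripted so that it is run the same way on an affected client and on a healthy one. This is only a minimal sketch: the function names are made up for illustration, and the host and NID passed in are placeholders, since the ticket does not give the OSS addresses.

```python
import subprocess

def build_checks(host, nid):
    """Return the two commands being compared: ICMP ping and LNet ping."""
    return {
        "icmp": ["ping", "-c", "3", host],   # plain network reachability
        "lnet": ["lctl", "ping", nid],       # LNet-level reachability
    }

def run_checks(host, nid, timeout=30):
    """Run both checks and report True/False for each one."""
    results = {}
    for name, cmd in build_checks(host, nid).items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout)
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Command missing or hung: treat as a failed check.
            results[name] = False
    return results
```

Calling `run_checks("<oss-address>", "<oss-address>@tcp")` on a client that shows the problem would be expected to return a passing "icmp" entry and a failing "lnet" entry if the description above is accurate.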

      Logs are from ts-server-05 and 06 (because they're not 07 or 08).

      The UUID messages follow, with context; no intervening lines have been removed:

      Aug 6 19:24:59 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 5df98fca-2127-4997-2987-27812afa3ec8 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 6 19:26:46 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 474cbe0a-7ca0-b06e-994f-3765fd83b501 (at 192.168.255.244@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 6 19:31:48 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)
      Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23182:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8102a3b2a800 x1408051660878194/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344281608 ref 1 fl Interpret:/0/0 rc -19/0
      Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23182:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
      Aug 6 19:31:48 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
      Aug 6 19:31:48 ts-server-05 kernel: LustreError: 23230:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81041551b000 x1408051660878193/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344281608 ref 1 fl Interpret:/0/0 rc -19/0
      Aug 6 20:50:15 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client e867e149-84ad-a3a2-b7d3-fba60c7e6f6d (at 10.145.62.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 6 20:50:15 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
      Aug 7 17:05:47 ts-server-05 ts-log-date: Timestamp 1344359143
      Aug 7 17:06:01 ts-server-05 ts-log-date: Timestamp 1344359158
      Aug 7 17:26:38 ts-server-05 kernel: LustreError: 14619:0:(acceptor.c:435:lnet_acceptor()) Refusing connection from 165.140.72.11: insecure port 42301
      Aug 7 17:26:38 ts-server-05 rpc.statd[5109]: recv_rply: can't decode RPC message!
      Aug 7 17:28:48 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 29c8167a-8f77-18d6-9d6c-40740141bd9f (at 160.62.219.80@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 7 17:28:48 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
      Aug 7 17:28:48 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 29c8167a-8f77-18d6-9d6c-40740141bd9f (at 160.62.219.80@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 7 17:42:38 ts-server-05 kernel: Lustre: 23275:0:(ldlm_lib.c:572:target_handle_reconnect()) fs-OST0005: ccc0fafc-ecbd-ea39-5e31-bba746b6693e reconnecting
      Aug 7 17:42:38 ts-server-05 kernel: Lustre: 23173:0:(ldlm_lib.c:572:target_handle_reconnect()) fs-OST0004: ccc0fafc-ecbd-ea39-5e31-bba746b6693e reconnecting
      Aug 7 19:02:45 ts-server-05 kernel: schedule_timeout: wrong timeout value ffffffffffffff1d from ffffffff886c8354
      Aug 7 21:57:59 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 50ec0e03-4fbc-d138-c26d-6554dfa789ed (at 160.62.219.68@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 7 22:00:52 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0007_UUID' is not available for connect (no target)
      Aug 7 22:00:52 ts-server-05 kernel: LustreError: 23257:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810126baf800 x1409022347227196/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344376952 ref 1 fl Interpret:/0/0 rc -19/0
      Aug 7 22:00:52 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
      Aug 7 22:01:33 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0007_UUID' is not available for connect (no target)
      Aug 7 22:01:33 ts-server-05 kernel: LustreError: 23293:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810147de7c00 x1409022347230552/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344376993 ref 1 fl Interpret:/0/0 rc -19/0
      Aug 7 22:01:33 ts-server-05 kernel: LustreError: 23293:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
      Aug 7 22:01:33 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
      Aug 8 00:49:15 ts-server-05 kernel: Lustre: fs-OST0005: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 8 00:49:15 ts-server-05 kernel: Lustre: Skipped 1 previous similar message
      Aug 8 00:49:16 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug 8 00:50:52 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)
      Aug 8 00:50:52 ts-server-05 kernel: LustreError: 23239:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff810082de2400 x1409580924063913/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387152 ref 1 fl Interpret:/0/0 rc -19/0
      Aug 8 00:50:52 ts-server-05 kernel: LustreError: 23239:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
      Aug 8 00:50:52 ts-server-05 kernel: LustreError: Skipped 1 previous similar message
      

      It's interesting to note that all the UUID messages pertain to OSTs that are active on the other OSS in this pair.
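The observation that each node only complains about OSTs owned by its partner can be checked mechanically by tallying the 137-5 messages per host and target. The following is a rough sketch, not a tool used in this ticket; the regular expression and function name are my own, and the sample lines are copied from the excerpts in this description.

```python
import re
from collections import Counter

# Matches the "137-5 ... not available for connect" line and captures
# the logging host and the target UUID. \s+ tolerates doubled spaces.
NO_TARGET = re.compile(
    r"(?P<host>\S+) kernel: LustreError: 137-5: UUID '(?P<uuid>[^']+)'"
    r"\s+is not available\s+for connect"
)

def tally_no_target(lines):
    """Count 'no target' connect errors per (host, target UUID)."""
    counts = Counter()
    for line in lines:
        m = NO_TARGET.search(line)
        if m:
            counts[(m.group("host"), m.group("uuid"))] += 1
    return counts

SAMPLE = [
    "Aug 6 19:31:48 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0006_UUID' is not available for connect (no target)",
    "Aug 7 22:00:52 ts-server-05 kernel: LustreError: 137-5: UUID 'fs-OST0007_UUID' is not available for connect (no target)",
    "Aug  8 00:01:18 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available  for connect (no target)",
]

for (host, uuid), n in sorted(tally_no_target(SAMPLE).items()):
    print(host, uuid, n)
```

Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are lower bounds; the set of (host, target) pairs is the more useful output.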

      Looking at the other node in this pair for roughly the same time frame gives us the following messages:

      Aug  7 23:40:28 ts-server-06 kernel: LustreError: Skipped 23 previous similar messages
      Aug  7 23:50:33 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available  for connect (no target)
      Aug  7 23:50:33 ts-server-06 kernel: LustreError: 24143:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103c899d000 x1408051529009808/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344383533 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  7 23:50:33 ts-server-06 kernel: LustreError: 24143:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 27 previous similar messages
      Aug  7 23:50:33 ts-server-06 kernel: LustreError: Skipped 27 previous similar messages
      Aug  8 00:01:18 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available  for connect (no target)
      Aug  8 00:01:18 ts-server-06 kernel: LustreError: 24204:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102df7ee000 x1408051529010179/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344384178 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:01:18 ts-server-06 kernel: LustreError: 24204:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 29 previous similar messages
      Aug  8 00:01:18 ts-server-06 kernel: LustreError: Skipped 29 previous similar messages
      Aug  8 00:11:34 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available  for connect (no target)
      Aug  8 00:11:34 ts-server-06 kernel: LustreError: 24198:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8104138fa800 x1408051529010508/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344384794 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:11:34 ts-server-06 kernel: LustreError: 24198:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 19 previous similar messages
      Aug  8 00:11:34 ts-server-06 kernel: LustreError: Skipped 19 previous similar messages
      Aug  8 00:21:47 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available  for connect (no target)
      Aug  8 00:21:47 ts-server-06 kernel: LustreError: 24583:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8101d8259000 x1408051529010879/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344385407 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:21:47 ts-server-06 kernel: LustreError: 24583:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 35 previous similar messages
      Aug  8 00:21:47 ts-server-06 kernel: LustreError: Skipped 35 previous similar messages
      Aug  8 00:32:20 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available  for connect (no target)
      Aug  8 00:32:20 ts-server-06 kernel: LustreError: 24190:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff810414436800 x1408051529011224/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344386040 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:32:20 ts-server-06 kernel: LustreError: 24190:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 21 previous similar messages
      Aug  8 00:32:20 ts-server-06 kernel: LustreError: Skipped 21 previous similar messages
      Aug  8 00:38:29 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffe5 from ffffffff886c8354
      Aug  8 00:42:29 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0004_UUID' is not available  for connect (no target)
      Aug  8 00:42:29 ts-server-06 kernel: LustreError: 24668:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff81003d97dc00 x1408051529011582/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344386649 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:42:29 ts-server-06 kernel: LustreError: 24668:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 31 previous similar messages
      Aug  8 00:42:29 ts-server-06 kernel: LustreError: Skipped 31 previous similar messages
      Aug  8 00:48:14 ts-server-06 kernel: schedule_timeout: wrong timeout value ffffffffffffffb2 from ffffffff886c8354
      Aug  8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0006: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug  8 00:49:19 ts-server-06 kernel: Lustre: Skipped 1 previous similar message
      Aug  8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0007: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.
      Aug  8 00:52:05 ts-server-06 kernel: schedule_timeout: wrong timeout value fffffffffffffe82 from ffffffff886c8354
      Aug  8 00:53:06 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available  for connect (no target)
      Aug  8 00:53:06 ts-server-06 kernel: LustreError: 24591:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8103661ad400 x1408051529011939/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387286 ref 1 fl Interpret:/0/0 rc -19/0
      Aug  8 00:53:06 ts-server-06 kernel: LustreError: 24591:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 25 previous similar messages
      Aug  8 00:53:06 ts-server-06 kernel: LustreError: Skipped 25 previous similar messages
      Aug  8 01:02:10 ts-server-06 kernel: schedule_timeout: wrong timeout value fffffffffffffee6 from ffffffff886c8354
      Aug  8 01:03:20 ts-server-06 kernel: LustreError: 137-5: UUID 'fs-OST0005_UUID' is not available  for connect (no target)
      Aug  8 01:03:20 ts-server-06 kernel: LustreError: 24237:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8101ce3ba400 x1408051529012283/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1344387900 ref 1 fl Interpret:/0/0 rc -19/0
      

      It's interesting to note that the times on the two nodes do not line up. Without direct access to the servers I can't verify that NTP hasn't failed and that the clocks haven't drifted, but from what I can correlate within the log files, I'm certain the times are within a minute of each other, and very likely closer than that.
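One way to sanity-check a node's clock from the logs alone is to compare the syslog stamp on the ts-log-date lines with the epoch value those lines carry. The sketch below rests on assumptions the ticket does not confirm: that the Timestamp value is epoch seconds taken when the line was written, and that the syslog stamps are in UTC. The function name is hypothetical.

```python
import re
from datetime import datetime, timezone

TS_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"ts-log-date: Timestamp (?P<epoch>\d+)"
)

def stamp_vs_epoch(line):
    """Return (host, syslog stamp minus embedded epoch, in seconds)."""
    m = TS_LINE.match(line.strip())
    if not m:
        return None
    embedded = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
    # Syslog stamps carry no year, so borrow it from the embedded epoch.
    # Assumption: the syslog stamp is UTC.
    logged = datetime.strptime(
        f"{embedded.year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return m.group("host"), (logged - embedded).total_seconds()

print(stamp_vs_epoch(
    "Aug 7 17:05:47 ts-server-05 ts-log-date: Timestamp 1344359143"
))
```

Only ts-server-05 lines of this kind appear in the excerpts here, so a comparison between the two nodes would need the matching lines from the ts-server-06 log as well.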

      Client evictions seem to happen mostly in blocks, as you'd expect for a dead client. There is one exception in the current log: a single client evicted only from the 04 server and the 08 server, which I'm narrowing down to see whether it is due to its network issue (resolved on the 4th?).
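To see which evictions come in blocks and which are one-offs, the eviction messages can be grouped by client. This is a small sketch under my own naming, using lines copied from the excerpts in this description; it is not how the analysis here was actually done.

```python
import re
from collections import defaultdict

EVICT = re.compile(
    r"(?P<host>\S+) kernel: Lustre: (?P<ost>\S+): haven't heard from client "
    r"(?P<client>[0-9a-f-]+) \(at (?P<nid>[^)]+)\)"
)

def evictions_by_client(lines):
    """Map (client UUID, NID) to the set of (host, OST) that evicted it."""
    seen = defaultdict(set)
    for line in lines:
        m = EVICT.search(line)
        if m:
            key = (m.group("client"), m.group("nid"))
            seen[key].add((m.group("host"), m.group("ost")))
    return seen

SAMPLE = [
    "Aug 6 20:50:15 ts-server-05 kernel: Lustre: fs-OST0004: haven't heard from client e867e149-84ad-a3a2-b7d3-fba60c7e6f6d (at 10.145.62.25@tcp) in 227 seconds. I think it's dead, and I am evicting it.",
    "Aug  8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0006: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.",
    "Aug  8 00:49:19 ts-server-06 kernel: Lustre: fs-OST0007: haven't heard from client 3566e95c-981b-26a1-b84b-bb0125950925 (at 192.168.255.254@tcp) in 227 seconds. I think it's dead, and I am evicting it.",
]

for (client, nid), targets in evictions_by_client(SAMPLE).items():
    print(nid, client, sorted(targets))
```

A client that shows up against a single OST is not necessarily an exception: a following "Skipped 1 previous similar message" line may be hiding a second eviction, so those lines need to be read alongside the grouped output.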

      Attachments

        1. 160_62_219_92.LOGS.tgz
          12 kB
        2. 192_168_255_254.LOGS.tgz
          12 kB
        3. 192_168_255_254.LOGS.tgz
          12 kB
        4. Networklayout.docx
          48 kB

            People

              isaac Isaac Huang (Inactive)
              pcpiela Peter Piela (Inactive)