Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • lustre 1.8.8 RHEL5
    • 3
    • 5754

    Description

      We have a network problem at a customer site; the clients are still running, but the network is unstable. In that situation the Lustre servers sometimes refuse new connections because they are still waiting for active RPCs to finish.

      e.g.:
      Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
      Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (107) req@ffff8106a3c46400 x1415646605273905/t0 o400><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0

      In some cases we can find the bad client and reboot it, or evict it from the servers and reconnect, and then the situation recovers.

      However, in most cases it is hard to find the bad client and the error messages keep appearing. If we cannot find the bad client, clients can't reconnect until all clients have been rebooted, which is not a good solution.

      Is there a good way to easily find the bad client when the above messages appear?

      Attachments

        Issue Links

          Activity

            [LU-2429] easy to find bad client

            This is the OSS's sysrq-t output that we just collected.

            ihara Shuichi Ihara (Inactive) added a comment

            full OSS"s messages attached.

            Ihara, threads are stuck waiting for commit. Any chance to collect the output of a sysrq-t (or even better a crash dump)?

            johann Johann Lombardi (Inactive) added a comment
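
            For reference, a minimal sketch of how the requested sysrq-t output can be collected on the OSS. The sysctl and trigger files are standard Linux, not Lustre-specific; the output file name is just an example:

            # enable all SysRq functions, then dump the stack of every task;
            # the traces go to the kernel log (and usually /var/log/messages)
            echo 1 > /proc/sys/kernel/sysrq
            echo t > /proc/sysrq-trigger
            dmesg > sysrq-t.$(hostname).$(date +%Y%m%d-%H%M%S).log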

            full OSS"s messages attached.

            ihara Shuichi Ihara (Inactive) added a comment

            I saw "still busy with x active RPCs" problems a couple of time, and posted on here in general.
            But, just in now, we got same problem at the one of our customers. I think it should be a root cause, but want to find what client is stacking RPCs? can we find bad client from following logs on OSS?

            # grep "still busy" 20121210_t2s007037.log 
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 16504:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 448ded8b-6867-b4e3-b095-24a1194a0311@192.168.20.53@tcp1 to 0xffff81060f828e00; still busy with 4 active RPCs
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 20370:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 98aabcbb-79bf-0dd8-3a0e-f869054aa095@192.168.19.31@tcp1 to 0xffff81028b12ba00; still busy with 4 active RPCs
            Dec 10 19:15:28 t2s007037 kernel: Lustre: 5499:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 1e0c4bbc-b2a9-1268-afaf-811307e85c34@192.168.19.80@tcp1 to 0xffff81006d77b600; still busy with 3 active RPCs
            Dec 10 19:15:31 t2s007037 kernel: Lustre: 5534:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from 8e966893-d9e9-3508-a406-c2132095af5f@10.1.10.84@o2ib to 0xffff81018682a200; still busy with 8 active RPCs
            Dec 10 19:15:33 t2s007037 kernel: Lustre: 16481:0:(ldlm_lib.c:874:target_handle_connect()) gscr0-OST0000: refuse reconnection from c2b698fa-a4d9-ff0c-6dc5-298134339777@192.168.19.50@tcp1 to 0xffff8100d2fb0400; still busy with 8 active RPCs
            ...
            
            ihara Shuichi Ihara (Inactive) added a comment
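
            One hedged way to get a first answer from the OSS log alone: the client NID follows the UUID in every "refuse reconnection" line, so it can be extracted and tallied. The sed expression below is written against the messages quoted above and may need adjusting for other log formats:

            # count "refuse reconnection" messages per client NID
            grep "refuse reconnection" 20121210_t2s007037.log \
              | sed -n 's/.*refuse reconnection from [^@]*@\([^ ]*\) to .*/\1/p' \
              | sort | uniq -c | sort -rn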

            Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616).

            While I agree that we should consider removing this protection, I think we first need to understand how a service thread can be stuck forever as reported by Ihara.

            Ihara, there should definitely be some watchdogs printed on the console. It would be very helpful if you could provide us with those logs. Otherwise, there is not much we can do, I'm afraid.

            johann Johann Lombardi (Inactive) added a comment

            We need to make progress here and at least try to understand the real conditions under which your problem occurs.
            So can you at least provide us with the syslogs, covering the problem time frame, from one affected client and the associated server/OSS?
            Thanks.

            bfaccini Bruno Faccini (Inactive) added a comment

            As Johann mentions, the client NID is in this message. It is also possible to grep for the client UUID (6279e611-9d6b-3d6a-bab4-e76cf925282f in this case) in /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid.

            Note that there is a bug open for the "still busy" problem (LU-793), and I believe Oleg had a patch for this (http://review.whamcloud.com/1616). The last time I spoke with him about this he wasn't quite happy with the patch, but maybe someone else could look into fixing and landing it? I think this is a common and long-standing problem and it would be nice to fix it if possible.

            adilger Andreas Dilger added a comment
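
            A short sketch of the lookup Andreas describes, using the UUID and OST name from the message quoted in the description; the export directory whose uuid file matches is named after the client NID:

            # prints the path(s) containing the UUID; the directory name is the NID
            grep -l 6279e611-9d6b-3d6a-bab4-e76cf925282f \
                /proc/fs/lustre/obdfilter/LARGE01-OST004c/exports/*/uuid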

            So, "still busy with 1 active RPCs" means reconnected client's RPC is sitll remained?

            It means that there is still a service thread processing a request from the previous connection which prevents the client from reconnecting.

            Yes, it's aborted normally.

            ok

            But sometimes it doesn't abort, and the client can never reconnect.

            That's not normal. In this case, you should see watchdogs on the server side, and the stack trace would help us understand where the service thread is stuck.

            I wonder if we can force an abort and skip waiting for this processing.

            I'm afraid that we can't.

            johann Johann Lombardi (Inactive) added a comment
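
            As a rough sketch, the server-side watchdog messages mentioned above can be searched for in the syslog; the exact wording differs between Lustre versions, so a few loose patterns are used here:

            # look for stuck-service-thread watchdogs and the accompanying stack dumps
            egrep -i "watchdog|inactive for|dumping the stack" /var/log/messages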

            BTW, are there any messages on the client side (say 560@gni, for example, taken from your server logs) around the same time?

            Also, is there any way to get some debug analysis (a live "crash" tool session, Alt+SysRq, ...) on the client side that may help find out whether some thread is stuck?

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno, yes, understood. Although in this case the network problem caused the situation, the issue is that we have sometimes seen this problem even when there is no network problem. I want to get rid of the still-active RPC and evict that client manually; otherwise we have to wait a very long time for it to reconnect.

            ihara Shuichi Ihara (Inactive) added a comment

            You can also monitor the logs/messages directly on all clients, along with /proc/fs/lustre/osc/*/state; that will give you the picture from the client side.

            But don't forget that if you suspect network/interconnect problems, you should first troubleshoot them using the appropriate tools.

            bfaccini Bruno Faccini (Inactive) added a comment
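
            A minimal sketch of the client-side check Bruno suggests, assuming each osc state file exposes a current_state line:

            # any import not in FULL state points at a connection worth a closer look
            for f in /proc/fs/lustre/osc/*/state; do
                echo "== $f"
                grep current_state "$f"
            done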

            People

              bfaccini Bruno Faccini (Inactive)
              ihara Shuichi Ihara (Inactive)
              Votes:
              0
              Watchers:
              7

              Dates

                Created:
                Updated:
                Resolved: