Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: our lustre source tree is at:
      https://github.com/jlan/lustre-nas
    • Severity: 3
    • 12492

    Description

      mdt threads hung. Forced reboot of the MDS 2 different times.

      uploading the following to ftp site:
      lustre-log.1391239242.7851.txt.gz
      vmcore-dmesg.txt.gz

      Lustre: MGS: haven't heard from client c546719d-1bcc-571f-a4e3-17f67dc35b50 (at 10.151.31.4@o2ib) in 199 seconds. I think it's dead, and I am evicting it. exp ffff880fd0587800, cur 1391236411 expire 1391236261 last 1391236212
      LNet: Service thread pid 7851 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7851, comm: mdt01_055
      
      Call Trace:
       [<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81080610>] ? process_timeout+0x0/0x10
       [<ffffffffa04156d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
       [<ffffffffa06d201d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
       [<ffffffffa06cd790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
       [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
       [<ffffffffa06d1758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
       [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
       [<ffffffffa0dddc0c>] mdt_object_lock0+0x28c/0xaf0 [mdt]
       [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
       [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0dde534>] mdt_object_lock+0x14/0x20 [mdt]
       [<ffffffffa0dde5a1>] mdt_object_find_lock+0x61/0x170 [mdt]
       [<ffffffffa0e0c80c>] mdt_reint_open+0x8cc/0x20e0 [mdt]
       [<ffffffffa043185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
       [<ffffffffa06fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
       [<ffffffffa05921b0>] ? lu_ucred+0x20/0x30 [obdclass]
       [<ffffffffa0dd7015>] ? mdt_ucred+0x15/0x20 [mdt]
       [<ffffffffa0df31cc>] ? mdt_root_squash+0x2c/0x410 [mdt]
       [<ffffffffa0df7981>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0ddcb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0ddd090>] mdt_intent_reint+0x1f0/0x530 [mdt]
       [<ffffffffa0ddaf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
       [<ffffffffa06b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
       [<ffffffffa06d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
       [<ffffffffa0ddb3c6>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0de1ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0e1b615>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa070b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa04155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0426d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0702729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81055813>] ? __wake_up+0x53/0x70
       [<ffffffffa070c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      LustreError: dumping log to /tmp/lustre-log.1391239242.7851
      

    Attachments

    Issue Links

    Activity

      [LU-4572] hung mdt threads

            jay Jinshan Xiong (Inactive) added a comment -

            I decided to fix this issue in ticket LU-7372.


            jay Jinshan Xiong (Inactive) added a comment -

            I reopened this ticket because I tend to think ldlm_expired_completion_wait() should return an error code, at least on the client side, so that the client enqueue process won't be stuck on the completion AST forever.

            The same issue is seen on LU-7372.
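
            To make the idea above concrete, here is a minimal user-space sketch; it is not Lustre code, and fake_lock, expired_completion_wait() and completion_wait() are hypothetical stand-ins. It only models the proposal that the expired-completion handler return an error (e.g. -ETIMEDOUT) on the client side instead of silently re-arming the wait, so the enqueue path can give up rather than block forever:

            /*
             * Illustrative sketch only -- not Lustre code. It models a completion
             * wait whose timeout handler returns an error on the client side so
             * the caller can propagate it instead of waiting forever.
             */
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Hypothetical stand-in for the lock being enqueued. */
            struct fake_lock {
                bool granted;
                bool is_client;   /* client-side enqueue vs. server-side */
            };

            /* Analogue of the expired-completion handler: return an error on the
             * client instead of 0 ("keep waiting"). */
            static int expired_completion_wait(struct fake_lock *lock)
            {
                if (lock->is_client)
                    return -ETIMEDOUT;
                return 0;
            }

            /* Analogue of the completion wait loop: stop as soon as the expired
             * handler reports an error. */
            static int completion_wait(struct fake_lock *lock, int max_waits)
            {
                for (int i = 0; i < max_waits; i++) {
                    if (lock->granted)
                        return 0;
                    int rc = expired_completion_wait(lock);
                    if (rc != 0)
                        return rc;   /* propagate to the enqueue caller */
                }
                return -ETIMEDOUT;
            }

            int main(void)
            {
                struct fake_lock lock = { .granted = false, .is_client = true };
                printf("completion_wait() -> %d\n", completion_wait(&lock, 3));
                return 0;
            }

            In the real code the decision would presumably live in ldlm_expired_completion_wait() itself, as the comment above suggests; the exact plumbing is not shown here.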

            pjones Peter Jones added a comment -

            ok thanks Mahmoud


            mhanafi Mahmoud Hanafi added a comment -

            Please close; we have not seen the issue again.

            green Oleg Drokin added a comment -

            I see. Let's see if switching to HPET helps any, I guess.
            With no CPU usage when this happened, I do not see why things would be delayed by 70 seconds other than a broken clock.
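
            As a side note on the clocksource question, here is a small sketch (assuming the node exposes the standard Linux sysfs clocksource interface under /sys/devices/system/clocksource/clocksource0/) that prints the current and available clocksources before switching to hpet:

            #include <stdio.h>

            /* Print one sysfs attribute if it is readable. */
            static void dump_attr(const char *path)
            {
                char buf[128];
                FILE *f = fopen(path, "r");

                if (f != NULL) {
                    if (fgets(buf, sizeof(buf), f) != NULL)
                        printf("%s: %s", path, buf);
                    fclose(f);
                }
            }

            int main(void)
            {
                dump_attr("/sys/devices/system/clocksource/clocksource0/current_clocksource");
                dump_attr("/sys/devices/system/clocksource/clocksource0/available_clocksource");
                return 0;
            }

            The switch itself would typically be made persistent with the clocksource=hpet kernel boot parameter; treat both the paths and the parameter as things to verify on the actual MDS rather than as a definitive recipe.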


            mhanafi Mahmoud Hanafi added a comment -

            1. I am working through the logs to see if there were any IB fabric issues.

            2. When the system was in that state, CPU utilization was minimal.

            3. We have seen the clocksource message in our logs, but we haven't found it to cause any issues. I will switch it to HPET, though.

            4. This filesystem has 84 OSTs.

            green Oleg Drokin added a comment -

            One more question: how many OSTs do you have in your system?

            green Oleg Drokin added a comment -

            Well, the timeout you posted in the ticket seems to be related to client 10.151.31.4@o2ib being unresponsive. Since the client is unresponsive, it presumably does not send any lock cancels either, so the locks it holds cannot be cancelled and others who need those locks are forced to wait.
            I also see a number of network errors in the logs (and in the client logs too); are you having some network problems there?

            Further on I see a different kind of problem:

            <4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391239042/real 1391239042]  req@ffff8806c8f5d800 x1458619173294584/t0(0) o601->nbp7-MDT0000-lwp-MDT0000@0@lo:12/10 lens 336/336 e 3 to 1 dl 1391240399 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            <4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
            <4>Lustre: 65757:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (1110:247s); client may timeout.  req@ffff8806d18a2000 x1458067638265204/t0(0) o101->446c8dd2-5b53-9b80-51b7-1573c85a021b@10.151.17.177@o2ib:0/0 lens 616/544 e 3 to 0 dl 1391240152 ref 1 fl Complete:/0/0 rc 301/301
            <4>LNet: Service thread pid 7918 completed after 1357.11s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            So here we are trying to send a quota RPC over the locally connected lwp and it's timing out? That does not sound right. The timeout for this local connection is also huge for whatever reason, and I do not see it being assigned any sane value either.

            We can see that over time the timeout there grows even more.

            The other worrying sign is this:

             <4>Clocksource tsc unstable (delta = -68719428120 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
             <4>Lustre: 6577:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 76s  req@ffff880f7ded2850 x1447159886197772/t0(0) o400->449796f8-1d26-c76e-7df6-0123fb984781@10.151.57.111@o2ib:0/0 lens 192/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
             <4>Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
            

            So 70+ seconds to handle the request_in code is way too excessive. When this happens, do you experience huge CPU utilization?
            Also, I see the tsc unstable message - I think the kernel uses that as an internal timestamp counter, so if it deviates this hugely, could it be that all internal timeouts are just not working properly? Is this something you can address, just to be sure?


            mhanafi Mahmoud Hanafi added a comment -

            I have attached the console logs from 10.151.17.136 (r445i5n0.conlog.gz).

            We have both 2.4 and 2.1.5 clients.

            at_max = 600.

            The info was taken from a crash dump.

            green Oleg Drokin added a comment -

            Also: do you run the 2.4 clients from that source as well? Can you please show me dmesg from, e.g., 10.151.17.136 during all of this happening?
            Did you crashdump the node to get all of this info, or did you let it stay alive and just live-dump it?


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 9
