
MDT temporarily unhealthy when restarting

Details


    Description

      Hi,

      When restarting an MDT, we consistently see that its status under /proc/fs/lustre/health_check is temporarily unhealthy.

      Here are some logs:

      00000004:02000400:1.0F:Tue Oct 22 15:23:52 CEST 2013:0:11263:0:(mdt_recovery.c:233:mdt_server_data_init()) fs1-MDT0000: used disk, loading
      00000020:02000000:8.0F:Tue Oct 22 15:23:52 CEST 2013:0:11086:0:(obd_mount_server.c:1776:server_calc_timeout()) fs1-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      00000100:00000400:4.0:Tue Oct 22 15:23:57 CEST 2013:0:5640:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1382448232/real 0]  req@ffff8810789a9000 x1449588289516076/t0(0) o38->fs1-MDT0000-lwp-MDT0000@10.3.0.11@o2ib:12/10 lens 400/544 e 0 to 1 dl 1382448237 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      00000100:00000400:4.0:Tue Oct 22 15:23:57 CEST 2013:0:5640:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1382448232/real 0]  req@ffff881071d6f400 x1449588289513784/t0(0) o8->fs1-OST0006-osc-MDT0000@10.4.0.6@o2ib1:28/4 lens 400/544 e 0 to 1 dl 1382448237 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      00000100:00000400:4.0:Tue Oct 22 15:23:57 CEST 2013:0:5640:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1382448232/real 0]  req@ffff8808653cb800 x1449588289513644/t0(0) o8->fs1-OST0005-osc-MDT0000@10.3.0.6@o2ib:28/4 lens 400/544 e 0 to 1 dl 1382448237 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      00000100:00000400:1.0:Tue Oct 22 15:23:59 CEST 2013:0:6207:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1382448232/real 0]  req@ffff880876dbc000 x1449588289516292/t0(0) o104->MGS@10.3.0.11@o2ib:15/16 lens 296/224 e 0 to 1 dl 1382448239 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      00010000:02020000:1.0:Tue Oct 22 15:23:59 CEST 2013:0:6207:0:(ldlm_lockd.c:641:ldlm_failed_ast()) 138-a: MGS: A client on nid 10.3.0.11@o2ib was evicted due to a lock blocking callback time out: rc -107
      00000100:02000000:9.0F:Tue Oct 22 15:24:18 CEST 2013:0:5640:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) fs1-OST0006-osc-MDT0000: Connection restored to fs1-OST0006 (at 10.4.0.3@o2ib1)
      00010000:02000400:5.0F:Tue Oct 22 15:24:29 CEST 2013:0:11219:0:(ldlm_lib.c:1581:target_start_recovery_timer()) fs1-MDT0000: Will be in recovery for at least 2:30, or until 1 client reconnects
      00010000:02000000:3.0F:Tue Oct 22 15:24:29 CEST 2013:0:11285:0:(ldlm_lib.c:1420:target_finish_recovery()) fs1-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
      00000100:02000000:9.0:Tue Oct 22 15:24:55 CEST 2013:0:5640:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) fs1-OST0005-osc-MDT0000: Connection restored to fs1-OST0005 (at 10.3.0.3@o2ib)
      

      As we can see, as soon as the MDT is started it has trouble connecting to several OSTs. Recovery also begins, but it finishes quickly. However, the MDT becomes healthy only once the connections to all OSTs are restored, i.e. at 15:24:55. So from 15:23:52, when it is started, until 15:24:55, when the connection to the last OST is restored, the MDT reports an unhealthy status.

      We can understand that an MDT that has not been able to connect to its OSTs is considered unhealthy, but we do not understand why it has trouble connecting to them, as there are no errors on the network.
      It seems that with Lustre 2.4 the connection between the MDT and the OSTs is hard to establish and takes some time to be restored (we have other examples where it took more than 2 minutes).

      The problem with this situation is that we monitor Lustre MDT and OST health status for HA purposes: if a target is seen as unhealthy, the node hosting this resource can be fenced.
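
      For illustration, here is a minimal sketch of the kind of probe involved (assuming health_check contains the word "healthy" when every device passes its checks, and "NOT HEALTHY" otherwise; the exit codes are only an HA-agent convention, not part of Lustre):

      /* Minimal health probe sketch: exit 0 if the node reports "healthy",
       * 1 if it reports "NOT HEALTHY", 2 if the file cannot be read.
       * Assumes /proc/fs/lustre/health_check contains the word "healthy"
       * when every OBD device passes its check. */
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              char buf[256];
              FILE *f = fopen("/proc/fs/lustre/health_check", "r");

              if (f == NULL)
                      return 2;               /* Lustre not loaded on this node */
              if (fgets(buf, sizeof(buf), f) == NULL) {
                      fclose(f);
                      return 2;
              }
              fclose(f);

              /* "healthy" => OK; anything else (e.g. "NOT HEALTHY") => failure */
              return strstr(buf, "NOT HEALTHY") == NULL &&
                     strstr(buf, "healthy") != NULL ? 0 : 1;
      }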

      Thanks,
      Sebastien.

      Attachments

        Activity

          [LU-4136] MDT temporarily unhealthy when restarting
          bobijam Zhenyu Xu added a comment - backport for b2_5: http://review.whamcloud.com/8585 and for b2_4: http://review.whamcloud.com/8587

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          We gave the new implementation of the patch at http://review.whamcloud.com/8408 (patch set 2) a try, and with it we no longer see the MDT reported unhealthy when starting while OSTs are in recovery.

          Thanks for the explanations, Mikhail.

          Now we would like to know whether this patch can be merged, or whether it can be used in production.

          Thanks,
          Sebastien.

          tappro Mikhail Pershin added a comment -

          Sebastien, normally each OBD device reports unhealthy while it is in the setup or cleanup process, but each device may also declare its own additional checks: e.g. the OFD checks that statfs returns no error and that os_state is not READONLY, and the OST checks that its ptlrpc services are healthy. The MDT has no specific checks of its own.
          It is still worth checking /proc/fs/lustre/health_check as before, because it reports whether all key devices are healthy and fully set up.
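
          As a rough userspace analogue of the OFD-style check described above (statfs must succeed and the backing filesystem must not be read-only), here is a sketch; it only mirrors the idea and is not the kernel code, and the default mount point path is just an example:

          /* Userspace analogue of an OFD-style device check: the backing
           * filesystem must answer statfs and must not be mounted read-only.
           * This mirrors the idea only; it is not the kernel implementation. */
          #include <stdio.h>
          #include <sys/statvfs.h>

          static int target_health_check(const char *mntpt)
          {
                  struct statvfs st;

                  if (statvfs(mntpt, &st) != 0)
                          return 1;       /* statfs failed: not healthy */
                  if (st.f_flag & ST_RDONLY)
                          return 1;       /* read-only: not healthy */
                  return 0;               /* healthy */
          }

          int main(int argc, char **argv)
          {
                  const char *mntpt = argc > 1 ? argv[1] : "/mnt/mdt";
                  int rc = target_health_check(mntpt);

                  printf("%s: %s\n", mntpt, rc ? "NOT HEALTHY" : "healthy");
                  return rc;
          }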

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          We gave the patch at http://review.whamcloud.com/8408 a try, and with it we no longer see the MDT reported unhealthy when starting while OSTs are in recovery.
          This is very good news for our HA setup, but with this patch we now need to know under which circumstances a Lustre target can be declared unhealthy. What can lead to an MDT or OST being not healthy? Is it still worth examining the contents of /proc/fs/lustre/health_check in an HA context?

          Thanks,
          Sebastien.
          bobijam Zhenyu Xu added a comment - patch tracking at http://review.whamcloud.com/8408

          tappro Mikhail Pershin added a comment -

          Another possible solution for this is to avoid using o_health_check() for reporting network status to the MDT. That could be replaced with o_get_info(), so it would not interfere with health_check.

          tappro Mikhail Pershin added a comment -

          Malcolm, as Kalpak noted, the best way right now is to ignore the 'not healthy' report from devices like MDD, LOD and OSP. The problem is that we are using their o_health_check() functionality to report the network status of the OSTs, so that the MDT can decide when it should accept connections. At the same time, obd_proc_read_health() scans all OBD devices and calls o_health_check() to report their health status. These are not quite the same thing: from proc_read_health() we expect to see the status of the devices themselves rather than network-related state. I am not sure how to properly report different statuses to the MDT and to procfs; one possible solution would be to simply ignore MDD, LOD and OSP in obd_proc_read_health() itself, considering that they are internal devices in the MDS stack: if the MDT (top) and OSD (bottom) devices are healthy, then all the devices in between are healthy too.

          Zhenyu Xu, could you prepare such a patch and push it to gerrit as a first step?
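
          A sketch of that first step (a userspace model for illustration only, not the actual patch; it assumes the internal device type names are "mdd", "lod" and "osp"):

          /* Model of the suggested filtering: skip MDS-internal stack devices
           * ("mdd", "lod", "osp") when aggregating the overall health report,
           * while the top (MDT) and bottom (OSD) devices are still checked. */
          #include <stdio.h>
          #include <string.h>

          struct obd_dev_model {
                  const char *type;       /* e.g. "mdt", "osd-ldiskfs", "osp" */
                  const char *name;
                  int (*health_check)(const struct obd_dev_model *dev);
          };

          static int device_is_internal(const struct obd_dev_model *dev)
          {
                  return strcmp(dev->type, "mdd") == 0 ||
                         strcmp(dev->type, "lod") == 0 ||
                         strcmp(dev->type, "osp") == 0;
          }

          static int proc_read_health_model(const struct obd_dev_model *devs, int n)
          {
                  int healthy = 1;

                  for (int i = 0; i < n; i++) {
                          if (device_is_internal(&devs[i]))
                                  continue;       /* skip MDD/LOD/OSP */
                          if (devs[i].health_check &&
                              devs[i].health_check(&devs[i]))
                                  healthy = 0;
                  }
                  printf("%s\n", healthy ? "healthy" : "NOT HEALTHY");
                  return healthy ? 0 : 1;
          }

          int main(void)
          {
                  const struct obd_dev_model devs[] = {
                          { "mdt", "fs1-MDT0000", NULL },
                          { "osp", "fs1-OST0005-osc-MDT0000", NULL },
                  };

                  return proc_read_health_model(devs, 2);
          }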

          malkolm Malcolm Cowe (Inactive) added a comment -

          Assuming then that the current behaviour is correct, is there a reliable [i.e. programmatic] way to differentiate between a genuinely "unhealthy" MDT and one that is pending connections from an OST? Monitoring health check scripts and HA resource management scripts are somewhat dependent upon a reliable status indicator. If one can be described, then an alternative monitoring probe can be implemented.
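
          One pragmatic probe-side option in the meantime is sketched below (site-side logic only, not a Lustre interface; the grace and poll intervals are arbitrary examples): treat the target as "starting" rather than failed unless health_check stays unhealthy for a whole grace window longer than the expected recovery and reconnection time.

          /* HA-side mitigation sketch: only declare the target failed if
           * /proc/fs/lustre/health_check stays unhealthy for a full grace
           * window, so a freshly restarted MDT that is still waiting for
           * its first OST connections is reported as "starting". */
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          #define GRACE_SECONDS   180     /* example value, site-specific */
          #define POLL_SECONDS    5

          static int read_health(void)    /* 1 = healthy, 0 = not, -1 = error */
          {
                  char buf[256];
                  FILE *f = fopen("/proc/fs/lustre/health_check", "r");

                  if (f == NULL)
                          return -1;
                  if (fgets(buf, sizeof(buf), f) == NULL) {
                          fclose(f);
                          return -1;
                  }
                  fclose(f);
                  return strstr(buf, "NOT HEALTHY") == NULL;
          }

          int main(void)
          {
                  for (int waited = 0; waited < GRACE_SECONDS; waited += POLL_SECONDS) {
                          int h = read_health();

                          if (h == 1) {
                                  puts("healthy");
                                  return 0;
                          }
                          puts(h == 0 ? "starting (not healthy yet)" : "cannot read health file");
                          sleep(POLL_SECONDS);
                  }
                  puts("failed: unhealthy for the whole grace window");
                  return 1;
          }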
          tappro Mikhail Pershin added a comment - edited

          Well, as osp_obd_health_check() says:

          	/*
          	 * 1.8/2.0 behaviour is that OST being connected once at least
          	 * is considired "healthy". and one "healty" OST is enough to
          	 * allow lustre clients to connect to MDS
          	 */
          

          and that is how it behaves. If that is the definition of 'healthy' for the MDT, then it works correctly. If the MDT restarts, then no OST has been 'connected once' yet, so it will wait for the first OST connection to be established. Of course, after that, if some OST goes offline the MDT will remain healthy, because that OST has already been seen 'once'. That is exactly the behaviour described in the comments above, so I tend to think it is correct. If it was not so before, then either that was incorrect behaviour or the definition of a 'healthy MDT' in osp_obd_health_check() is not quite right.

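
          In other words, the check amounts to a latch, as in this illustration (placeholder names only, not the actual osp_device fields):

          /* Illustration of the semantics described above (placeholder names):
           * once the OST has connected at least once, the OSP keeps reporting
           * healthy from then on, even if that OST later goes offline. */
          #include <stdbool.h>
          #include <stdio.h>

          struct osp_model {
                  bool connected_once;    /* latched on the first successful connect */
          };

          static void osp_model_connect_event(struct osp_model *osp)
          {
                  osp->connected_once = true;     /* never cleared afterwards */
          }

          static int osp_model_health_check(const struct osp_model *osp)
          {
                  return osp->connected_once ? 0 : 1;     /* 0 = healthy here */
          }

          int main(void)
          {
                  struct osp_model osp = { .connected_once = false };

                  printf("before first connect: %d\n", osp_model_health_check(&osp));
                  osp_model_connect_event(&osp);
                  printf("after first connect:  %d\n", osp_model_health_check(&osp));
                  return 0;
          }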
          bobijam Zhenyu Xu added a comment -

          Tappro, what do you think about this?


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 9
