LU-3844

Double recovery period in 1.8.9 after OSS failure

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 1.8.9
    • None
    • kernel 2.6.18-348.3.1.el5, rhel5.9, distribution-provided ofed, gni behind o2iblnd routers
    • 3
    • 9949

    Description

      We frequently observe that the recovery process repeats itself after reaching the timer expiration. The first timer usually expires because a client has died, so recovery then runs through the whole period a second time.

      In this example, recovery took 60 minutes rather than the better case of 30. Does this fall under the case implied by the wording 'Will be in recovery for at least 30:00'? During the failure, several OSSes had to be rebooted.
      [blakec@widow-mgmt3 ~]$ tail -10000 /data/log/apps/lustrekernel|grep oss12a2|grep -i recov|grep widow1-OST00b5:
      Aug 27 14:16:46 widow-oss12a2 kernel: [ 544.793348] Lustre: widow1-OST00b5: Now serving widow1-OST00b5 on /dev/mpath/widow-ddn12a-l48 with recovery enabled
      Aug 27 14:16:46 widow-oss12a2 kernel: [ 544.793363] Lustre: widow1-OST00b5: Will be in recovery for at least 30:00, or until 12345 clients reconnect
      Aug 27 14:17:13 widow-oss12a2 kernel: [ 572.156957] LustreError: 16171:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.31@o2ib3 (66a3b9fa-4a34-e8cb-c90e-1c98fa38f2e6): 12336 clients in recovery for 1772s
      Aug 27 14:19:01 widow-oss12a2 kernel: [ 679.468170] LustreError: 15390:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.25@o2ib3 (e65b92af-bef9-b8b2-5069-d8894f06fb19): 12294 clients in recovery for 1665s
      Aug 27 14:55:51 widow-oss12a2 kernel: [ 2885.008362] LustreError: 16132:0:(ldlm_lib.c:946:target_handle_connect()) widow1-OST00b5: denying connection for new client 172.30.221.18@o2ib3 (5ac80ce4-58aa-16fd-8c53-889df4ba3118): 12174 clients in recovery for 1254s
      Aug 27 14:59:35 widow-oss12a2 kernel: [ 3108.624936] Lustre: 15386:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) widow1-OST00b5: 12173 recoverable clients remain
      Aug 27 15:16:46 widow-oss12a2 kernel: [ 4137.296402] Lustre: widow1-OST00b5: Recovery period over after 60:00, of 12345 clients 12342 recovered and 2 were evicted.
      Aug 27 15:16:46 widow-oss12a2 kernel: [ 4137.296415] Lustre: widow1-OST00b5: sending delayed replies to recovered clients
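      The doubling can be checked mechanically by comparing the announced minimum with the reported duration for each target. The following is a minimal Python sketch, not part of the original report. It assumes the two message formats quoted above and reads values such as 30:00 as minutes:seconds, which matches the one-hour gap between the timestamps. The function name `recovery_ratio` is hypothetical.

```python
import re

# Patterns for the two console messages quoted in the log excerpt.
# Assumption: the "MM:SS" field is minutes:seconds.
START_RE = re.compile(r"Lustre: (\S+): Will be in recovery for at least (\d+):(\d+)")
END_RE = re.compile(r"Lustre: (\S+): Recovery period over after (\d+):(\d+)")


def _seconds(minutes, seconds):
    return int(minutes) * 60 + int(seconds)


def recovery_ratio(lines):
    """Return {target: (announced_s, actual_s, actual/announced)}.

    Only targets for which both the start and the end message were
    seen (and the announced time is non-zero) are included.
    """
    announced = {}
    result = {}
    for line in lines:
        m = START_RE.search(line)
        if m:
            announced[m.group(1)] = _seconds(m.group(2), m.group(3))
            continue
        m = END_RE.search(line)
        if m and announced.get(m.group(1)):
            start = announced[m.group(1)]
            actual = _seconds(m.group(2), m.group(3))
            result[m.group(1)] = (start, actual, actual / start)
    return result


sample = [
    "Aug 27 14:16:46 widow-oss12a2 kernel: [ 544.793363] Lustre: widow1-OST00b5: "
    "Will be in recovery for at least 30:00, or until 12345 clients reconnect",
    "Aug 27 15:16:46 widow-oss12a2 kernel: [ 4137.296402] Lustre: widow1-OST00b5: "
    "Recovery period over after 60:00, of 12345 clients 12342 recovered and 2 were evicted.",
]

print(recovery_ratio(sample))
```

      A ratio of 2.0 for a target corresponds to the double recovery period described here. Feeding the function the full kernel log instead of `sample` would list every affected OST.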

      In an earlier case, only the MDS failed, and recovery finished in less than 30 minutes because all clients reconnected. I'm sure there are cases where recovery takes just 30 minutes, but now, when it reaches 30 minutes, it mostly goes on to 60 minutes as well.

      obd_timeout is set to 600
      [root@widow-oss10a1 ~]# cat /proc/sys/lustre/timeout
      600

      ldlm_timeout is set to 200
      [root@widow-oss10a1 ~]# cat /proc/sys/lustre/ldlm_timeout
      200
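      To collect both tunables from several servers in one pass, the two `cat` commands can be wrapped in a small helper. This is only a sketch: the `/proc/sys/lustre` paths are the ones shown above, while the function name `read_lustre_timeouts` and its `proc_dir` parameter are hypothetical.

```python
from pathlib import Path


def read_lustre_timeouts(proc_dir="/proc/sys/lustre"):
    """Return the integer values of the `timeout` and `ldlm_timeout` tunables.

    `proc_dir` is a parameter so the function can be pointed at a copy
    of the files; the default is the path used in the commands above.
    """
    base = Path(proc_dir)
    return {
        name: int((base / name).read_text().strip())
        for name in ("timeout", "ldlm_timeout")
    }


# Only query the live values when running on a node that exposes them.
if Path("/proc/sys/lustre").is_dir():
    print(read_lustre_timeouts())
```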

          People

            hongchao.zhang Hongchao Zhang
            blakecaldwell Blake Caldwell