  Lustre / LU-5724

IR recovery doesn't behave properly with Lustre 2.5

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.5.3
    • MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the completion timer counted down to zero, IR restarted all over again. Then, once the timer got into the 700-second range, it started to go back up. It did this a few times before finally letting the timer run out. Once the timer did finally reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
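
      For reference, the recovery countdown described above can be watched from the servers. This is a minimal sketch, assuming the standard Lustre 2.5 proc layout; the targets are wildcarded rather than naming the actual atlastds targets:

        # On the MDS: watch the recovery countdown, completed-client counts,
        # and the VBR/IR flags during recovery.
        watch -n 1 'lctl get_param -n mdt.*.recovery_status'

        # On each OSS: same view for the OST targets.
        watch -n 1 'lctl get_param -n obdfilter.*.recovery_status'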

      Attachments

        Issue Links

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5
            yujian Jian Yu added a comment -

            With the patch for LU-4119, the recovery issue did not occur in small-scale testing at ORNL. Large-scale testing will be performed.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is the failover mode the same for both tests, on Dec 31, 2014 and on Jan 14, 2015? That is, is there a separate node
            running only the MGS, to which the MDS, OSSs and client nodes connect, with the MDS and OSSs failed over together?


            simmonsja James A Simmons added a comment -

            No, the MGS is left up. We failed over the MDS and OSS together.

            hongchao.zhang Hongchao Zhang added a comment -

            Are both MGS and MDS failed over in this test?

            The IR status will be set to IR_STARTUP after the MGS starts and will change to IR_FULL after "ir_timeout" seconds
            (the default is OBD_IR_MGS_TIMEOUT = "4*obd_timeout"). A target (MDT or OST) registering with the MGS will only be marked
            "LDD_F_IR_CAPABLE" if the IR status is IR_FULL; otherwise "IR" will be printed as "DISABLED" for that target.

            On the client side, imperative_recovery will be marked "ENABLED" if the connection with the server supports recovery
            (i.e. imp->imp_connect_data & OBD_CONNECT_IMP_RECOV is set).

            simmonsja James A Simmons added a comment - - edited

            Testing failover, we see the following with just the MDS being failed over:

            Every 1.0s: cat recovery_status                                Wed Jan 14 14:11:11 2015

            status: COMPLETE
            recovery_start: 1421262582
            recovery_duration: 60
            completed_clients: 82/82
            replayed_requests: 0
            last_transno: 90194315377
            VBR: DISABLED
            IR: DISABLED

            All clients are 2.5+, so there should be no reason for IR to be disabled. Can you reproduce this problem on your side?

            On the MGS we see the following during failover:

            [root@atlas-tds-mds1 MGC10.36.226.79@o2ib]# cat ir_state
            imperative_recovery: ENABLED
            client_state:

            • { client: atlastds-MDT0000, nidtbl_version: 957 }
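
            As a cross-check from the MGS side, the per-filesystem IR state and the IR timeout window can also be dumped. This is a sketch assuming the standard mgs.MGS.* parameters of Lustre 2.5; "atlastds" is the filesystem name used here:

            # Per-filesystem nidtbl/IR state on the MGS, including the
            # imperative_recovery_state section and any non-IR clients.
            lctl get_param mgs.MGS.live.atlastds

            # Window (seconds) before the MGS leaves IR_STARTUP; default 4 * obd_timeout.
            lctl get_param mgs.MGS.ir_timeout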

            hongchao.zhang Hongchao Zhang added a comment -

            Sorry for the misunderstanding! Yes, Lustre does support failing over both the MDT and the OSS.
            yujian Jian Yu added a comment -

            In Lustre test suite, the following sub-tests in insanity.sh test failing MDS and OSS at the same time:

            run_test 2 "Second Failure Mode: MDS/OST `date`"
            run_test 4 "Fourth Failure Mode: OST/MDS `date`"
            

            The basic test steps for sub-test 2 are:

            fail MDS
            fail OSS
            start OSS
            start MDS
            

            And the basic test steps for sub-test 4 are:

            fail OSS
            fail MDS
            start OSS
            start MDS
            

            Here is the insanity test report for Lustre b2_5 build #107: https://testing.hpdd.intel.com/test_sets/bfd812b0-8a4d-11e4-a10b-5254006e85c2
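
            For reference, these sub-tests can be run individually from lustre/tests using the usual ONLY variable; this is a sketch that assumes a multi-node test configuration with failover pairs (e.g. an adapted cfg/local.sh) is already in place:

            cd lustre/tests
            # Run only the MDS/OST and OST/MDS failure-mode sub-tests.
            ONLY="2 4" sh insanity.sh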

            green Oleg Drokin added a comment -

            I think what HongChao is trying to say is that when the MDS and OST both go down, then since the MDS is a client of the OST, the OST recovery can never complete because of the missing client.

            On the other hand, we took two steps to help with this. First, the MDS client UUID should always be the same, so even after a restart it should still be allowed to reconnect as the old known client (this assumes it actually got up and into a reconnecting state in time for OST recovery; if your MDS takes ages to reboot, for example, it might miss this window, especially if it is a shortened window thanks to IR).

            Second, we have VBR to deal with missing clients during recovery, which is especially easy with the MDT client, since it never has any outstanding uncommitted transactions to replay.

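            A hedged way to see whether a restarted MDS made it back into the OST recovery window is to look at the OST export list and recovery status on the OSS. The parameter names below are the standard 2.5 ones and are an assumption here; the MDS shows up as an export with an MDT UUID:

            # On the OSS: UUIDs of clients (including the MDS) known to each OST.
            lctl get_param obdfilter.*.exports.*.uuid

            # Recovery progress, including completed-client counts.
            lctl get_param obdfilter.*.recovery_status
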
            simmonsja James A Simmons added a comment - - edited

            We have had success with MDS+OSS failing over at the same time in the past. We would really like to have that functionality restored.


            hongchao.zhang Hongchao Zhang added a comment -

            Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
            This client mounted at "Dec 29 12:55:33" and unmounted at "Dec 29 14:04:54", but the MDT failed over at "Dec 29 13:55:12" and Lustre was only started again at "Dec 29 14:26:55".

            At the client (rhea513, 10.38.146.45):

            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033]  req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600]  req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client
            

            At the MDT (10.36.226.69):

            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814766]  sdbd:
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.835086]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.845364]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
            Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
            

            By the way, in this failover test both the MDT and OSS were rebooted, and recovery will be slow because the OSSs will wait out the timeout to evict the client from the MDT, which won't reconnect to recover.
            Also, Lustre seems not to support failing over both the MDS and OSS together, IIRC.


            simmonsja James A Simmons added a comment -

            Here are the kern logs for a client and a router. If you want the logs for all the clients, let me know.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: