
IR recovery doesn't behave properly with Lustre 2.5

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.5.3
    • Environment: MDS server running RHEL 6.5 with the ORNL 2.5.3 branch plus about 12 patches.
    • 2
    • 16076

    Description

      Today we experienced a hardware failure with our MDS. The MDS rebooted and then came back. We restarted the MDS, but IR behaved strangely. Four clients got evicted, but when the recovery timer counted down to zero, IR restarted all over again. Then, once it got down to around 700 seconds, the timer started to go up. It did this a few times before finally letting the timer run out. Once the timer did reach zero, the recovery state was still reported as being in recovery. It remained this way for several more minutes before finally reaching a recovered state. In all it took 54 minutes to recover.
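
      For reference, a minimal monitoring sketch in Python (an illustration, not part of the original report): it polls the MDT recovery status via lctl so the timer and state behaviour described above can be captured over time. The target name and the field names (status, time_remaining, completed_clients, evicted_clients) are assumptions based on the usual recovery_status output and may differ between Lustre versions.

      #!/usr/bin/env python
      # Sketch: poll the MDT recovery_status parameter on the MDS and log how the
      # recovery timer and state evolve. Parameter path and field names are
      # assumptions and may differ between Lustre versions.
      import subprocess
      import time

      TARGET = "mdt.atlastds-MDT0000.recovery_status"  # adjust to the local target

      def recovery_status():
          """Return the recovery_status output as a dict of key -> value."""
          out = subprocess.check_output(["lctl", "get_param", "-n", TARGET])
          status = {}
          for line in out.decode().splitlines():
              if ":" in line:
                  key, _, value = line.partition(":")
                  status[key.strip()] = value.strip()
          return status

      if __name__ == "__main__":
          while True:
              s = recovery_status()
              print("%s status=%s time_remaining=%s completed=%s evicted=%s" % (
                  time.strftime("%H:%M:%S"), s.get("status"),
                  s.get("time_remaining"), s.get("completed_clients"),
                  s.get("evicted_clients")))
              if s.get("status") == "COMPLETE":
                  break
              time.sleep(10)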

          Activity

            [LU-5724] IR recovery doesn't behave properly with Lustre 2.5
            simmonsja James A Simmons added a comment - - edited

            We have had success with MDS+OSS failing over at the same time in the past. We would really like to have that functionality restored.


            hongchao.zhang Hongchao Zhang added a comment

            Is the client (rhea513, 10.38.146.45) the Lustre client connected to "atlastds"?
            This client mounted at "Dec 29 12:55:33" and unmounted at "Dec 29 14:04:54", but the MDT failed over at "Dec 29 13:55:12" and Lustre was started again only at "Dec 29 14:26:55".

            at Client (rhea513, 10.38.146.45)

             
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: client wants to enable acl, but mdt not!
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Layout lock feature supported.
            Dec 29 12:55:33 rhea513.ccs.ornl.gov kernel: Lustre: Mounted atlastds-client
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879033/real 1419879033]  req@ffff881000328000 x1487757986577588/t0(0) o400->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 224/224 e 0 to 1 dl 1419879600 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST0023-osc-ffff880820a28c00: Connection to atlastds-OST0023 (at 10.36.226.67@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: atlastds-OST002e-osc-ffff880820a28c00: Connection to atlastds-OST002e (at 10.36.226.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: LustreError: 166-1: MGC10.36.226.79@o2ib: Connection to MGS (at 10.36.226.79@o2ib) was lost; in progress operations using this service will fail
            Dec 29 14:00:00 rhea513.ccs.ornl.gov kernel: Lustre: 5367:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1419879600/real 1419879600]  req@ffff88102a64b800 x1487757986596960/t0(0) o8->atlastds-OST0033-osc-ffff880820a28c00@10.36.226.67@o2ib:28/4 lens 400/544 e 0 to 1 dl 1419879880 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Dec 29 14:04:40 rhea513.ccs.ornl.gov kernel: Lustre: 5358:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Dec 29 14:04:54 rhea513.ccs.ornl.gov kernel: Lustre: Unmounted atlastds-client
            

            at MDT (10.36.226.69)

            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814766]  sdbd:
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.814952] sd 7:0:0:98: [sdaj] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.835086]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.845364]  unknown partition table
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864189] sd 6:0:0:71: [sdau] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.864257] sd 6:0:0:101: [sdbd] Attached SCSI disk
            Dec 29 13:56:16 atlas-tds-oss6.ccs.ornl.gov kernel: [  117.970252] device-mapper: multipath round-robin: version 1.0.0 loaded
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.955421] LNet: HW CPU cores: 16, npartitions: 4
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.963523] alg: No test for crc32 (crc32-table)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.977845] alg: No test for adler32 (adler32-zlib)
            Dec 29 14:26:55 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1957.988788] alg: No test for crc32 (crc32-pclmul)
            Dec 29 14:26:59 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1962.022800] padlock: VIA PadLock Hash Engine not detected.
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.277906] LNet: Added LNI 10.36.226.69@o2ib [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1966.814213] LNet: Added LNI 10.36.226.69@o2ib200 [63/2560/0/180]
            Dec 29 14:27:04 atlas-tds-oss6.ccs.ornl.gov kernel: [ 1967.054096] Lustre: Lustre: Build Version: 2.5.3-g6158f83-CHANGED-2.6.32-431.29.2.el6.atlas.x86_64
            

            By the way, in this failover test both the MDT and the OSS were rebooted, and recovery will be slow because the OSSs will wait some time to evict the client from the MDT, which will not reconnect to recover.
            And Lustre seems not to support failing over both the MDS and OSS at the same time, IIRC.
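
            For illustration, a read-only sketch (an addition, not from this comment), assuming the usual per-target recovery_time_soft/recovery_time_hard tunables are exposed via lctl on the server nodes (exact paths may differ by version): it prints the recovery windows, which bound how long a target can wait for a client that will never reconnect.

            # Sketch: dump the per-target recovery window tunables on a Lustre
            # server node. Parameter paths are assumptions for a 2.5-era setup.
            import subprocess

            PATTERNS = (
                "mdt.*.recovery_time_soft", "mdt.*.recovery_time_hard",
                "obdfilter.*.recovery_time_soft", "obdfilter.*.recovery_time_hard",
            )

            def get_param(pattern):
                """Return {parameter_name: value} for an lctl get_param pattern."""
                out = subprocess.check_output(["lctl", "get_param", pattern])
                params = {}
                for line in out.decode().splitlines():
                    if "=" in line:
                        name, _, value = line.partition("=")
                        params[name.strip()] = value.strip()
                return params

            if __name__ == "__main__":
                for pattern in PATTERNS:
                    try:
                        for name, value in sorted(get_param(pattern).items()):
                            print("%s = %s" % (name, value))
                    except subprocess.CalledProcessError:
                        pass  # no matching targets on this node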


            simmonsja James A Simmons added a comment

            Here are the kern logs for a client and a router. If you want the logs for all the clients let me know.

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Is there only one Lustre client, at 10.38.144.11, in this configuration? Are these logs from the same failover test as above?

            [ 2267.379541] Lustre: atlastds-MDT0000: Will be in recovery for at least 30:00, or until 1 client reconnects
            Dec 29 14:31:02 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2267.409294] Lustre: atlastds-MDT0000: Denying connection for new client 3ae0ecec-84ef-cf8f-c128-51873c53d1ad (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:59
            Dec 29 14:31:08 atlas-tds-mds1.ccs.ornl.gov kernel: [ 2272.910080] Lustre: atlastds-MDT0000: Denying connection for new client 5116891d-0ace-dffd-7497-218db0b23e98 (at 10.38.144.11@o2ib4), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 29:54
            

            The MDT and OSSs are waiting for the client to reconnect for recovery, but it somehow failed to reconnect and appears to be connecting as a new Lustre client, which was denied by the MDS and OSSs because they were recovering from the failover.

            Could you please attach the console and syslog logs of the client? Thanks!
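
            A small hypothetical helper (an addition, assuming the "Denying connection for new client ... (at ...)" console message format shown above): it scans an MDS log and lists the distinct client UUID/NID pairs being denied, so one can see whether a node keeps retrying with fresh UUIDs, i.e. as a brand-new mount, instead of reconnecting as the known client.

            # Sketch: count "Denying connection for new client" messages per
            # (NID, UUID) pair in an MDS console/syslog file.
            import re
            import sys
            from collections import Counter

            DENY_RE = re.compile(
                r"Denying connection for new client (?P<uuid>\S+) \(at (?P<nid>[^)]+)\)")

            def scan(path):
                seen = Counter()
                with open(path) as log:
                    for line in log:
                        m = DENY_RE.search(line)
                        if m:
                            seen[(m.group("nid"), m.group("uuid"))] += 1
                for (nid, uuid), count in seen.most_common():
                    print("%-24s %-40s denied %d times" % (nid, uuid, count))

            if __name__ == "__main__":
                scan(sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages")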


            simmonsja James A Simmons added a comment

            Here you go. These are the logs from the clients and servers.


            simmonsja James A Simmons added a comment

            The OSS reconnected to the MDS but none of the clients ever reconnected. The clients appeared stuck. The client logs are from the client nodes we used. As for the configuration, the MGS is a standalone node and we tested with 4 nodes. Will grab the logs.


            hongchao.zhang Hongchao Zhang added a comment

            According to the log "dump_atlas-tds-mds1-after-recovery.log", 3 out of 4 clients completed the recovery at the MDT.

            00010000:02000000:13.0:1419964653.561987:0:15786:0:(ldlm_lib.c:1392:target_finish_recovery()) atlastds-MDT0000: Recovery over after 30:00, of 4 clients 3 recovered and 1 was evicted.
            

            Which nodes does the client log "client-dump.log" contain? No eviction record was found in this log.

            By the way, did you use 4 clients and a separate MGS in this test? And could you please attach the console/syslog logs along with those debug logs?

            Thanks!


            simmonsja James A Simmons added a comment

            We did another test run for recovery in the case where both the MDS and an OSS fail. I collected logs and placed them at ftp.whamcloud.com/uploads/LU-5724/*.log. The OSSs seemed to recover but the MDS did not recover properly.


            simmonsja James A Simmons added a comment

            No. Only the MDS and OSS were restarted.


            Does "single server node" mean that the MGS was also restarted in the test?


            simmonsja James A Simmons added a comment

            Some more info from today's testing: the failure to recover occurred when both the MDS and an OSS were failed over. If we failed over just the MDS or an OSS, recovery would complete. When we did the second round of testing with a single server node, we noticed that IR was reported as disabled even though we have no non-IR clients. We checked that on the MGS.
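
            For reference, a minimal sketch of how that MGS check can be scripted, assuming the 2.5-era mgs.MGS.live.<fsname> layout (an imperative_recovery section reporting the IR state and the non-IR client count); exact field names may differ by version.

            # Sketch: print the imperative recovery section of mgs.MGS.live.<fsname>
            # on the MGS node. The filesystem name below matches this ticket.
            import subprocess

            FSNAME = "atlastds"

            def ir_state(fsname):
                out = subprocess.check_output(
                    ["lctl", "get_param", "-n", "mgs.MGS.live.%s" % fsname])
                printing = False
                for line in out.decode().splitlines():
                    if "imperative_recovery" in line:
                        printing = True
                    if printing:
                        print(line)

            if __name__ == "__main__":
                ir_state(FSNAME)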


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: