
[LU-6057] replay-dual test_9 failed - post-failover df: 1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: OpenSFS cluster running lustre-master tag 2.6.91 build # 2771 with two MDSs with one MDT each, three OSSs with two OSTs each and three clients.
    • Severity: 3
    • Rank: 16869

    Description

      While running the LFSCK Phase 3 test plan, replay-dual test 9 failed with the following error from the fail() routine:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      c12: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      replay-dual test 10 failed with the same error message:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      c11: stat: cannot read file system information for `/lustre/scratch': Input/output error
      pdsh@c13: c11: ssh exited with exit code 1
      c13: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      The test results are at https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2
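
      For context, the "post-failover df: 1" message comes from the fail() helper in the test framework: after restarting the MDS it waits for each client's MDC import to return to FULL, then checks that every client can still statfs the file system. A rough sketch of that logic, paraphrased from lustre/tests/test-framework.sh (exact details vary between versions), is:

      # Paraphrased sketch of the fail() helper used by replay-dual; the real
      # implementation is in lustre/tests/test-framework.sh.
      fail() {
          local facets=$1
          local clients=${CLIENTS:-$HOSTNAME}

          # Restart/failover the given facet(s), e.g. "mds1".
          facet_failover $* || error "failover: $?"
          # Wait for the clients' MDC imports to reach FULL again; this prints
          # the "mds_server_uuid in FULL state after N sec" lines seen above.
          wait_clients_import_state "$clients" "$facets" FULL
          # Finally make sure every client can statfs the mount point; the
          # failing stat/df on the client is what produces "post-failover df: 1".
          clients_up || error "post-failover df: $?"
      }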

      It’s not clear from the logs what caused this error. For test 9, the client that could not stat the file system, c12, has the following in dmesg right before the test fails:

      00800000:00020000:5.0:1418766109.528501:0:25671:0:(lmv_obd.c:1477:lmv_statfs()) can't stat MDS #0 (scratch-MDT0000-mdc-ffff8808028cbc00), error -5
      
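      For reference, error -5 is -EIO, which is what statfs() returns on the client once its MDC import has been evicted. As a hypothetical follow-up check on c12, the import state can be inspected directly (the grep pattern is only an illustration):

      # On the affected client (c12): an evicted or disconnected import shows a
      # state other than FULL in the MDC import file.
      lctl get_param mdc.scratch-MDT0000-mdc-*.import | grep -E 'state|current_connection'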

      On the primary MDS, MDS0, the recovery looks like it is having issues:

      Lustre: *** cfs_fail_loc=119, val=2147483648***
      LustreError: 12646:0:(ldlm_lib.c:2384:target_send_reply_msg()) @@@ dropping reply  req@ffff880d0ee74c80 x1487677070285728/t128849018882(128849018882) o36->558cba8f-7f43-4143-5d8a-c7adfced85eb@192.168.2.112@o2ib:308/0 lens 488/448 e 0 to 0 dl 1418766108 ref 1 fl Complete:/4/0 rc 0/0
      Lustre: scratch-MDT0000: recovery is timed out, evict stale exports
      Lustre: scratch-MDT0000: disconnecting 1 stale clients
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) Skipped 2 previous similar messages
      Lustre: 12646:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880275011380 x1487683234832604/t0(128849018884) o36->d08d2f7b-4c89-7208-ad20-237f0ed0a102@192.168.2.113@o2ib:294/0 lens 488/0 e 6 to 0 dl 1418766094 ref 1 fl Complete:/4/ffffffff rc 0/-1
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:2060:target_recovery_thread()) too long recovery - read logs
      Lustre: scratch-MDT0000: Recovery over after 3:01, of 7 clients 1 recovered and 6 were evicted.
      LustreError: dumping log to /tmp/lustre-log.1418766079.12646
      Lustre: Skipped 3 previous similar messages
      Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      Lustre: DEBUG MARKER: replay-dual test_9: @@@@@@ FAIL: post-failover df: 1
      
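      The console messages above should match what the per-target recovery_status file reports after the fact; as an illustration, the recovery outcome (aborted by hard timeout, 1 of 7 clients recovered, 6 evicted) could be read back on the MDS with:

      # Run on the restarted MDS; shows the recovery state and the recovered
      # vs. evicted client counts for the MDT.
      lctl get_param mdt.scratch-MDT0000.recovery_status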

          Activity


            yong.fan nasf (Inactive) added a comment - It is another failure instance of LU-6084.

            tappro Mikhail Pershin added a comment - There is an unexpected recovery abort, so the test cannot be considered valid. It is related to LU-6084; I mean the recovery abort issue only.

            jamesanunez James Nunez (Inactive) added a comment - No other programs were running while replay-dual was running.

            yong.fan nasf (Inactive) added a comment - The client on node c12 was evicted. According to its log, the c12 client tried to replay the create RPC after the MDS0 restart:

            00000080:00000001:6.0:1418765877.138456:0:25481:0:(namei.c:927:ll_mknod()) Process entered
            00000080:00200000:6.0:1418765877.138457:0:25481:0:(namei.c:931:ll_mknod()) VFS Op:name=fsa-c12, dir=[0x200000007:0x1:0x0](ffff88080bc08bb8) mode 100644 dev 0
            ...
            00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:2140:ptlrpc_set_wait()) Process entered
            00000100:00000001:6.0:1418765877.338798:0:25510:0:(client.c:1423:ptlrpc_send_new_req()) Process entered
            00000100:00000040:6.0:1418765877.338801:0:25510:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88080a994680 x1487677070285744/t0(0) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/568 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
            ...
            00000100:00100000:4.0:1418765898.178881:0:6899:0:(client.c:2544:ptlrpc_free_committed()) @@@ stopping search  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00000001:4.0:1418765898.178886:0:6899:0:(client.c:2576:ptlrpc_free_committed()) Process leaving
            00000100:00080000:4.0:1418765898.178887:0:6899:0:(recover.c:93:ptlrpc_replay_next()) import ffff88081075c000 from scratch-MDT0000_UUID committed 124554051594 last 0
            00000100:00000001:4.0:1418765898.178890:0:6899:0:(client.c:2842:ptlrpc_replay_req()) Process entered
            00000100:00080000:4.0:1418765898.178892:0:6899:0:(client.c:2866:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 1 fl New:R/4/0 rc 0/0
            00000100:00000001:4.0:1418765898.178897:0:6899:0:(client.c:2643:ptlrpc_request_addref()) Process entered
            00000100:00000001:4.0:1418765898.178898:0:6899:0:(client.c:2645:ptlrpc_request_addref()) Process leaving (rc=18446612166851774080 : -131906857777536 : ffff88080a994680)
            00000100:00000040:4.0:1418765898.178901:0:6899:0:(ptlrpcd.c:246:ptlrpcd_add_req()) @@@ add req [ffff88080a994680] to pc [ptlrpcd_rcv:-1]  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/416 e 0 to 0 dl 1418766088 ref 2 fl New:R/4/0 rc 0/0
            ...
            00000100:00000001:2.0:1418766079.080356:0:6906:0:(ptlrpcd.c:363:ptlrpcd_check()) Process leaving (rc=0 : 0 : 0)
            00000100:00000001:3.0:1418766079.088831:0:6899:0:(client.c:1176:ptlrpc_check_status()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
            00000100:00000001:3.0:1418766079.088834:0:6899:0:(client.c:1340:after_reply()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
            00000100:00000001:7.0:1418766079.088837:0:6902:0:(client.c:2001:ptlrpc_expired_set()) Process entered
            00000100:00000001:7.0:1418766079.088838:0:6902:0:(client.c:2037:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
            00000100:00000040:3.0:1418766079.088838:0:6899:0:(lustre_net.h:3328:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret"  req@ffff88080a994680 x1487677070285744/t128849018888(128849018888) o36->scratch-MDT0000-mdc-ffff880802a13000@192.168.2.125@o2ib:12/10 lens 488/192 e 0 to 0 dl 1418766109 ref 2 fl Rpc:R/4/0 rc -107/-107
            ...
            

            When c12 replayed the create RPC, it finally got failure -107 (-ENOTCONN). The -107 was returned because MDS0 had already marked the recovery as failed.

            On the MDS0 side, some expected RPCs were never sent to MDS0 for replay (we do not know which RPCs were missed), so the recovery did not complete. Unfortunately, the MDS0 debug logs only contain information from after the recovery.
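
            The binary dump that the MDS produced (/tmp/lustre-log.1418766079.12646 in the console log above) can be converted to text with lctl and searched for the xid of the replayed create, although as noted it only covers the post-recovery window. The output path below is just an example:

            # Convert the binary debug dump to text (output path is an example)
            # and look for the replayed create request by its xid.
            lctl debug_file /tmp/lustre-log.1418766079.12646 /tmp/mds0-postrecovery.txt
            grep x1487677070285744 /tmp/mds0-postrecovery.txt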


            yong.fan nasf (Inactive) added a comment - Because some of the clients failed to replay the expected requests during the MDS0 restart, they were evicted by MDS0. So it is normal that a subsequent statfs() on those evicted clients failed. Unfortunately, the debug log on MDS0 only contains events from after the recovery, so we cannot tell what happened during the recovery or what caused the evictions.

            According to our current LFSCK implementation, if a former LFSCK run was paused or crashed before the MDS0 restart, it would be automatically resumed after the MDS0 recovery. I found the related logs on MDS0:

            00000004:00000001:0.0:1418766079.340082:0:12646:0:(mdt_handler.c:5813:mdt_postrecov()) Process entered
            00000004:00000001:0.0:1418766079.340083:0:12646:0:(mdd_device.c:1458:mdd_iocontrol()) Process entered
            00100000:00000001:0.0:1418766079.340085:0:12646:0:(lfsck_lib.c:2618:lfsck_start()) Process entered
            00100000:00000001:0.0:1418766079.340086:0:12646:0:(lfsck_lib.c:2632:lfsck_start()) Process leaving via put (rc=0 : 0 : 0x0)
            00000004:00000001:0.0:1418766079.340088:0:12646:0:(mdd_device.c:1478:mdd_iocontrol()) Process leaving (rc=0 : 0 : 0)
            

            That means MDS0 tried to resume LFSCK after the recovery. Because there was no inconsistency and no paused/crashed LFSCK before the MDS0 restart, lfsck_start() exited directly, as expected. So the replay-dual test_9 failure is not related to LFSCK; it looks more like a general master branch issue. Unfortunately, with no recovery logs on MDS0, it is not easy to investigate further.
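
            For completeness, one way to double-check that no LFSCK run was actually resumed is to dump the per-MDT LFSCK status on MDS0; it should not report a scanning phase:

            # Run on MDS0; shows the namespace LFSCK status for the MDT
            # (assuming the default namespace LFSCK component here).
            lctl get_param mdd.scratch-MDT0000.lfsck_namespace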


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 5
