
sanity test_161a: cannot create regular file '...f.sanity.161a': Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.7.0
    • Affects Version: Lustre 2.6.0
    • 3
    • 14840

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ba2bfa4a-6e59-11e3-bae0-52540035b04c.

      The sub-test test_161a failed with the following error:

      failed to hardlink many files

      Info required for matching: sanity 161a

          Activity

            [LU-5312] sanity test_161a: cannot create regular file '...f.sanity.161a': Input/output error

            jlevi Jodi Levi (Inactive) added a comment - Patches landed to Master.
            di.wang Di Wang added a comment - http://review.whamcloud.com/11443
            di.wang Di Wang added a comment -

            This is actually a different problem; it seems related to OSP instead of LWP:

            00000100:00080000:1.0:1407905310.504717:0:3659:0:(import.c:1004:ptlrpc_connect_interpret()) @@@ lustre-MDT0000-osp-MDT0001: evicting (reconnect/recover flags not set: 4)  req@ffff88007c0c2800 x1476289415644816/t0(0) o38->lustre-MDT0000-osp-MDT0001@10.2.4.185@tcp:24/4 lens 400/264 e 0 to 0 dl 1407905346 ref 1 fl Interpret:RN/0/0 rc 0/0
            00000100:00080000:1.0:1407905310.504720:0:3659:0:(import.c:1007:ptlrpc_connect_interpret()) ffff880079bdf800 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED
            

            So it seems MDT1 is being evicted by MDT0 (which is being restarted during the test).

            According to the debug log:

            1. MDT1 is in the final stage of recovery at 1407570696, so it sends the final PING to MDT0

            00000100:00080000:1.0:1407570696.905010:0:3680:0:(import.c:1420:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
            00010000:00010000:1.0:1407570696.905011:0:3680:0:(ldlm_request.c:2278:ldlm_cancel_unused_locks_for_replay()) Dropping as many unused locks as possible beforereplay for namespace lustre-MDT0000-osp-MDT0001 (0)
            00010000:00010000:1.0:1407570696.905012:0:3680:0:(ldlm_request.c:2287:ldlm_cancel_unused_locks_for_replay()) Canceled 0 unused locks from namespace lustre-MDT0000-osp-MDT0001
            00000100:00080000:1.0:1407570696.905287:0:3680:0:(import.c:1430:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
            00000100:00100000:1.0:1407570696.905292:0:3680:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525140:10.2.4.194@tcp:38
            00000100:00100000:1.0:1407570696.905297:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0002-mdtlov_UUID:3680:1475938571525152:10.2.4.194@tcp:400
            00000100:00100000:1.0:1407570696.905318:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525156:10.2.4.194@tcp:400
            

            2. MDT0 queues the ping at the same time, 1407570696

            00000100:00100000:0.0:1407570696.907586:0:710:0:(service.c:2092:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400
            00010000:00080000:0.0:1407570696.907589:0:710:0:(ldlm_lib.c:2226:target_queue_recovery_request()) @@@ queue final req  req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 0 to 0 dl 1407570702 ref 2 fl Interpret:/c0/ffffffff rc 0/-1
            00000100:00100000:0.0:1407570696.907593:0:710:0:(service.c:2142:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400 Request procesed in 7us (51us total) trans 0 rc 0/-999
            

            3. MDT0 processes the final ping request 24 seconds later; it should reply to MDT1 with "RECOVERY complete" to indicate that recovery is done.

            00010000:00080000:1.0:1407570720.856384:0:747:0:(ldlm_lib.c:2061:target_recovery_thread()) @@@ processing final ping from 10.2.4.198@tcp:   req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 710876 to 0 dl 1407570712 ref 1 fl Complete:/c0/ffffffff rc 0/-1
            

            4. For some unknown reason, MDT1 gets the "RECOVERY complete" reply from MDT0 about 20 seconds later

            00000100:00080000:1.0:1407570744.024908:0:3680:0:(import.c:1439:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_WAIT to RECOVER
            00000100:00080000:1.0:1407570744.024910:0:3680:0:(import.c:1446:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.2.4.194@tcp
            00000100:00080000:1.0:1407570744.024911:0:3680:0:(client.c:2578:ptlrpc_resend_req()) @@@ going to resend  req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
            00000100:00080000:1.0:1407570744.024914:0:3680:0:(client.c:2585:ptlrpc_resend_req()) @@@ it has reply, so skip it  req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
            00000100:00080000:1.0:1407570744.024916:0:3680:0:(import.c:1451:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
            00000100:02000000:1.0:1407570744.024918:0:3680:0:(import.c:1459:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-osp-MDT0001: Connection restored to lustre-MDT0000 (at 10.2.4.194@tcp)
            

            5. In the meantime, the ping_evictor on MDT0 evicts the export for MDT1, because MDT1 cannot ping MDT0 during the recovery stage, i.e. while the import state is not FULL

            00000100:00080000:0.0:1407570744.291927:0:700:0:(pinger.c:641:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1407570714
            00000100:02000400:0.0:1407570744.291936:0:700:0:(pinger.c:667:ping_evictor_main()) lustre-MDT0000: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 10.2.4.198@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88007a9cec00, cur 1407570744 expire 1407570714 last 1407570695
            

            I am not sure why we see this now; it is probably due to some recent changes, but I have not dug into it yet. I think the way to fix this might be to update exp_last_request_time in stage 3, because the client cannot ping the server while it is waiting for the "final recovery" signal. So we should refresh exp_last_request_time once the server is ready to accept pings and other requests.

            green Oleg Drokin added a comment -

            Apparently this is still being hit from time to time:
            https://testing.hpdd.intel.com/test_sets/9c3f9cea-22d0-11e4-94dd-5254006e85c2

            di.wang Di Wang added a comment - http://review.whamcloud.com/11039

            adilger Andreas Dilger added a comment - This appears to have the same symptoms as LU-4420, but we're moving it to a separate bug because LU-4420 had a fix landed for 2.6.0 and it isn't clear whether this new failure has the same root cause or just the same failure symptoms. There haven't been any other failures of this test in the past 4 weeks, while LU-4420 failed regularly.

            People

              Assignee: di.wang Di Wang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 5
