
MDT recovery timer goes negative, recovery never ends

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version: None
    • Affects Version: Lustre 2.8.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      When attempting to mount a client, the recovery timer counts down and then apparently rolls over to a negative value, so recovery never ends:

      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.116@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 13 previous similar messages
      Lustre: Skipped 2 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 7:55
      Lustre: Skipped 4 previous similar messages
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439394552/real 1439394552]  req@ffff880815c0dcc0 x1509313907525360/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439394607 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      LustreError: Skipped 23 previous similar messages
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:20
      Lustre: Skipped 10 previous similar messages
      LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439395077/real 1439395077]  req@ffff880812ded9c0 x1509313907526388/t0(0) o38->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439395088 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 183 previous similar messages
      LustreError: Skipped 46 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      LustreError: Skipped 1 previous similar message
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 21188499:54
      

      Attachments

        Issue Links

          Activity

            [LU-6994] MDT recovery timer goes negative, recovery never ends
            pjones Peter Jones added a comment -

            Thanks for the tipoff Di

            di.wang Di Wang added a comment -

            This recovery status reporting issue will be resolved by LU-8407.

            pjones Peter Jones added a comment -

            Ok Giuseppe I'll reopen the ticket and defer to Mike to comment. For now, I'll drop the priority and move this to 2.9 to reflect the reduced criticality of the issue.


            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Peter,

            I still believe this is a minor issue, nothing critical. I think the issue is that recovery can in fact fail (or enter an unrecoverable state) and that is not being reported properly. I suggest that this issue can be used to implement reporting that the recovery status is failure/unrecoverable if the timer expires. Thoughts?

            Giuseppe
            pjones Peter Jones added a comment -

            The current belief is that this is a duplicate of LU-7039 and/or LU-7450. We can reopen if evidence comes to light that contradicts this.

            di.wang Di Wang added a comment - - edited

            Just found another problem which might contribute to this issue: LU-7450.

            pjones Peter Jones added a comment -

            Giuseppe

            Does the recommended fix solve the issue for your reproducer?

            Peter


            tappro Mikhail Pershin added a comment -

            yes, that looks like LU-7039
            di.wang Di Wang added a comment -
            2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
            2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
            

            This should be fixed by the patch http://review.whamcloud.com/#/c/16969/ in LU-7039.


            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Thanks! Went ahead and created a new ticket with the details and I attached the llog files I mentioned.

            https://jira.hpdd.intel.com/browse/LU-7419

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: