
MDT recovery timer goes negative, recovery never ends

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 2.8.0
    • 3

    Description

      When attempting to mount a client, the recovery timer counts down and then apparently rolls over to a negative value; recovery never ends.

      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.116@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 13 previous similar messages
      Lustre: Skipped 2 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 7:55
      Lustre: Skipped 4 previous similar messages
      LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439394552/real 1439394552]  req@ffff880815c0dcc0 x1509313907525360/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439394607 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 109 previous similar messages
      LustreError: Skipped 23 previous similar messages
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:20
      Lustre: Skipped 10 previous similar messages
      LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1439395077/real 1439395077]  req@ffff880812ded9c0 x1509313907526388/t0(0) o38->soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1439395088 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 4255:0:(client.c:2020:ptlrpc_expire_one_request()) Skipped 183 previous similar messages
      LustreError: Skipped 46 previous similar messages
      LustreError: 11-0: soaked-MDT0003-osp-MDT0000: operation mds_connect to node 0@lo failed: rc = -19
      LustreError: Skipped 1 previous similar message
      Lustre: soaked-MDT0000: Denying connection for new client 7f50b61a-34a7-dd26-60bd-7487f4a8a6ee(at 192.168.1.116@o2ib100), waiting for 7 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 21188499:54
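
      The bogus countdown in the last line above ("21188499:54") is consistent with the remaining recovery time being computed as a raw difference and printed without being clamped once the deadline is missed or reset. The sketch below is illustrative only (hypothetical names, not Lustre's actual recovery-timer code) and shows how formatting an unclamped difference as minutes:seconds can print a negative or absurdly large countdown:

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      /* Illustrative only: names and layout are hypothetical, not Lustre's code. */
      static void print_remaining(time_t deadline, time_t now)
      {
              long remaining = (long)(deadline - now); /* negative once the deadline passes */

              /* Without a clamp such as "if (remaining < 0) remaining = 0;",
               * a missed or re-initialized deadline prints nonsense. */
              printf("to recover in %ld:%02ld\n", remaining / 60, labs(remaining % 60));
      }

      int main(void)
      {
              time_t now = time(NULL);

              print_remaining(now + 475, now); /* healthy case: prints "7:55" */
              print_remaining(0, now);         /* cleared deadline: huge negative value */
              return 0;
      }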
      

      Attachments

        Issue Links

          Activity

            [LU-6994] MDT recovery timer goes negative, recovery never ends

            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Peter,

            I still believe this is a minor issue, nothing critical. I think the issue is that recovery can in fact fail (or enter an unrecoverable state) and that is not being reported properly. I suggest this ticket be used to implement reporting the recovery status as failed/unrecoverable when the timer expires. Thoughts?

            Giuseppe
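
            A minimal sketch of the kind of clamp-and-report behavior being suggested here (hypothetical names and status values, not Lustre's actual recovery code):

            #include <time.h>

            /* Hypothetical status values and field names; not Lustre's actual API. */
            enum recovery_status { RECOVERING, RECOVERY_COMPLETE, RECOVERY_FAILED };

            struct recovery_state {
                    time_t               rs_deadline;
                    enum recovery_status rs_status;
            };

            /* Clamp the countdown at zero and flip to a terminal status instead of
             * letting the printed remaining time go negative forever. */
            static long recovery_time_left(struct recovery_state *rs, time_t now)
            {
                    if (now >= rs->rs_deadline) {
                            rs->rs_status = RECOVERY_FAILED; /* report failure, don't keep counting */
                            return 0;
                    }
                    return (long)(rs->rs_deadline - now);
            }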

            pjones Peter Jones added a comment -

            The current belief is that this is a duplicate of LU-7039 and/or LU-7450. We can reopen if evidence comes to light that contradicts this.

            di.wang Di Wang (Inactive) added a comment - edited

            Just found another problem which might contribute to this issue LU-7450.

            pjones Peter Jones added a comment -

            Giuseppe

            Does the recommended fix solve the issue for your reproducer?

            Peter


            tappro Mikhail Pershin added a comment -

            Yes, that looks like LU-7039.

            di.wang Di Wang (Inactive) added a comment -

            2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
            2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
            

            This should be fixed by the patch http://review.whamcloud.com/#/c/16969/ in LU-7039.


            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            Thanks! I went ahead and created a new ticket with the details, and I attached the llog files I mentioned.

            https://jira.hpdd.intel.com/browse/LU-7419


            morrone Christopher Morrone (Inactive) added a comment -

            Yes, that is good work!

            One way or another we probably do need another ticket. I suspect that fixing the assertion and the llog corruption will be the blocker for 2.8; at that point, the fact that the recovery timer can go negative will drop in priority. The recovery timer really needs more sane behavior in the face of errors, and we don't want to lose track of that issue.

            pjones Peter Jones added a comment -

            Excellent work Giuseppe (or Joe if you prefer)!


            dinatale2 Giuseppe Di Natale (Inactive) added a comment - edited

            I managed to reproduce this issue. Using the setup in my previous comment, an assertion was hit while a new llog catalog record was being initialized: the check that the llog_handle's header pointer (lgh_hdr) is NULL failed, which triggered an LBUG and a kernel panic (details below). The panic occurred while I was running an mdtest job writing to a striped directory from 32 client nodes with 4 threads each, and it left an llog file corrupted. I manually repaired the llog file, restarted my MDSs, and recovery now completes. Perhaps a negative timer is a symptom of an "unrecoverable" error? If so, is it reflected anywhere that recovery cannot complete?

            The details:
            To summarize my setup: I am running a test cluster with 4 MDSs, 2 OSTs, and 32 client nodes with the filesystem mounted, and no failover. The Lustre version is 2.7.62. The error messages and call stack from MDS1 are below:

            2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle->lgh_hdr == ((void *)0) ) failed:
            2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
            2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
            2015-11-04 13:36:26 Nov  4 13:36:26
            ...
            2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
            2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
            2015-11-04 13:36:26 Call Trace:
            2015-11-04 13:36:26  [<ffffffff8152d471>] ? panic+0xa7/0x16f
            2015-11-04 13:36:26  [<ffffffffa0847f2b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            2015-11-04 13:36:26  [<ffffffffa09a62cf>] ? llog_init_handle+0x86f/0xb10 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09ac809>] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09a4663>] ? llog_declare_write_rec+0x93/0x210 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09ad616>] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa09a406f>] ? llog_declare_add+0x7f/0x1b0 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa0c9c19c>] ? top_trans_start+0x17c/0x960 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa127cc11>] ? lod_trans_start+0x61/0x70 [lod]
            2015-11-04 13:36:26  [<ffffffffa13248b4>] ? mdd_trans_start+0x14/0x20 [mdd]
            2015-11-04 13:36:26  [<ffffffffa1313333>] ? mdd_create+0xe53/0x1aa0 [mdd]
            2015-11-04 13:36:26  [<ffffffffa11c6784>] ? mdt_version_save+0x84/0x1a0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11c8f46>] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0a13230>] ? lu_ucred+0x20/0x30 [obdclass]
            2015-11-04 13:36:26  [<ffffffffa11a8675>] ? mdt_ucred+0x15/0x20 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11c183c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0c43d32>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa11c597d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11b177b>] ? mdt_reint_internal+0x62b/0xb80 [mdt]
            2015-11-04 13:36:26  [<ffffffffa11b216b>] ? mdt_reint+0x6b/0x120 [mdt]
            2015-11-04 13:36:26  [<ffffffffa0c8621c>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffffa0c2da21>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffff8106d740>] ? pick_next_task_fair+0xd0/0x130
            2015-11-04 13:36:26  [<ffffffff8152d8f6>] ? schedule+0x176/0x3a0
            2015-11-04 13:36:26  [<ffffffffa0c2cbe0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
            2015-11-04 13:36:26  [<ffffffff8109fffe>] ? kthread+0x9e/0xc0
            2015-11-04 13:36:27  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
            2015-11-04 13:36:27  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
            2015-11-04 13:36:27  [<ffffffff8100c240>] ? child_rip+0x0/0x20
            

            After rebooting MDS1, I started to see llog corruption messages for an llog file that was on MDS4 (remember the panic was on MDS1) shown below:

            2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
            2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted
            

            Eventually, the recovery timer went negative and the MDS logged messages similar to those in the ticket description. I manually fixed the llog file on MDS4 and recovery now completes. I think that covers it. If necessary, I can put all this information in another ticket, and I should be able to provide the corrupted and fixed llog files for diagnosis as well.
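
            For reference, the failed assertion in the trace above is a guard against initializing a handle that already has a header attached. A generic sketch of that pattern (hypothetical stand-in types, not the actual obdclass code):

            #include <assert.h>
            #include <stddef.h>
            #include <stdlib.h>

            /* Hypothetical stand-ins for the llog handle/header; not Lustre's structs. */
            struct log_header { int lh_flags; };
            struct log_handle { struct log_header *lgh_hdr; };

            /* A handle that already carries a header is being initialized twice,
             * so the init path asserts that the header pointer is still NULL
             * before allocating one. */
            static int log_init_handle(struct log_handle *h)
            {
                    assert(h->lgh_hdr == NULL); /* analogous to the LASSERT that fired */
                    h->lgh_hdr = calloc(1, sizeof(*h->lgh_hdr));
                    return h->lgh_hdr != NULL ? 0 : -1;
            }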


            dinatale2 Giuseppe Di Natale (Inactive) added a comment - edited

            Peter,

            I am currently running a test cluster with DNE that has 4 MDSs and 2 OSTs; no failover is set up. I had an MDS crash while running a multi-node job that was writing to the Lustre filesystem. I rebooted the MDS and it managed to re-establish a connection with the other MDSs. I believe I powered off the clients for some reason. The recovering MDS's timer went negative and it displayed messages similar to those in the ticket. The clients were never evicted by the hard timeout that was supposed to occur. Attempting to abort the recovery against the MDT on the recovering MDS would evict the clients, but the abort would eventually hang and never finish. Unfortunately, I had to reformat the filesystem and am currently trying to reproduce the above.

            Giuseppe


            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: