Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8806

LFSCK hangs on MDT - osp_precreate_cleanup_orphans - cannot cleanup orphans: rc = -107

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.9.0, Lustre 2.10.0, Lustre 2.10.3
    • Soak cluster, tip of Lustre-master lustre: 2.8.59_79_gb8811a0
    • 3
    • 9223372036854775807

    Description

      After a failover, lfsck is started.
      2016-11-04 15:18:59,065:fsmgmt.fsmgmt:INFO lfsck started on lola-8
      lfsck never completes.
      2016-11-06 01:08:25,976:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 122865s
      Examining the logs, we see that the connection to OST0002 is going up and down, orphans cannot be cleaned up.

      Nov  4 15:18:32 lola-8 kernel: LustreError: 11-0: soaked-OST0017-osc-MDT0000: operation obd_ping to node 192.168.1.106@o2ib10 failed: rc = -107
      Nov  4 15:18:36 lola-8 kernel: LustreError: 11-0: soaked-OST0005-osc-MDT0000: operation ost_statfs to node 192.168.1.106@o2ib10 failed: rc = -107
      Nov  4 15:18:46 lola-8 kernel: LustreError: 6010:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) soaked-OST0002-osc-MDT0000: cannot cleanup orphans: rc = -107
      Nov  4 15:18:57 lola-8 kernel: LustreError: 11-0: soaked-OST000b-osc-MDT0000: operation obd_ping to node 192.168.1.106@o2ib10 failed: rc = -107
      Nov  4 15:18:57 lola-8 kernel: LustreError: Skipped 1 previous similar message
      Nov  4 15:19:27 lola-8 kernel: Lustre: 5851:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478297366/real 1478297366]  req@ffff8807e61689c0 x1550005401109488/t0(0) o6->soaked-OST0002-osc-MDT0000@192.168.1.104@o2ib10:28/4 lens 664/432 e 23 to 1 dl 1478297967 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      Nov  4 15:19:27 lola-8 kernel: Lustre: soaked-OST0002-osc-MDT0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Nov  4 15:19:27 lola-8 kernel: Lustre: Skipped 9 previous similar messages
      Nov  4 15:19:27 lola-8 kernel: Lustre: soaked-OST0002-osc-MDT0000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
      Nov  4 15:19:27 lola-8 kernel: Lustre: Skipped 18 previous similar messages
      Nov  4 15:19:27 lola-8 kernel: Lustre: 5851:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
      Nov  4 15:29:28 lola-8 kernel: Lustre: 5830:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478297967/real 1478297967]  req@ffff8803e013d380 x1550005401094688/t0(0) o1000->soaked-OST0002-osc-MDT0000@192.168.1.104@o2ib10:24/4 lens 520/4320 e 23 to 1 dl 1478298568 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      Nov  4 15:29:28 lola-8 kernel: Lustre: 5830:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Nov  4 15:29:28 lola-8 kernel: Lustre: soaked-OST0002-osc-MDT0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      Nov  4 15:29:28 lola-8 kernel: Lustre: Skipped 1 previous similar message
      Nov  4 15:29:28 lola-8 kernel: Lustre: soaked-OST0002-osc-MDT0000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
      Nov  4 15:29:28 lola-8 kernel: Lustre: Skipped 1 previous similar message
      Nov  4 15:31:22 lola-8 kernel: LustreError: 6010:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) soaked-OST0002-osc-MDT0000: cannot cleanup orphans: rc = -107
      Nov  4 15:39:29 lola-8 kernel: Lustre: 5840:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478298568/real 1478298568]  req@ffff8803c0c980c0 x1550005401096032/t0(0) o1000->soaked-OST0002-osc-MDT0000@192.168.1.104@o2ib10:24/4 lens 520/4320 e 23 to 1 dl 1478299169 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      Nov  4 15:39:29 lola-8 kernel: Lustre: 5840:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      Nov  4 15:39:29 lola-8 kernel: Lustre: soaked-OST0002-osc-MDT0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      

      lola-4 during this time was experiencing difficulties communicating with the MDS, prior to the start of the lfsck.

      Nov  4 14:49:18 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Nov  4 14:49:18 lola-4 kernel:  req@ffff88050f9c2450 
      x1550005401098800/t0(0) o1000->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib10:-1/-1 lens 520/0 e 0 to 0 dl 1478296163 ref 2 fl New:/2/ffffffff rc 0/-1
      Nov  4 14:49:18 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 96 previous similar messages
      Nov  4 14:49:24 lola-4 kernel: Lustre: soaked-OST0002: Client soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib10) reconnecting
      Nov  4 14:38:26 lola-4 kernel: Lustre: soaked-OST0002: Export ffff8802e106a800 already connecting from 192.168.1.113@o2ib100
      Nov  4 14:38:26 lola-4 kernel: Lustre: Skipped 12 previous similar messages
      Nov  4 14:39:17 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Nov  4 14:39:17 lola-4 kernel:  req@ffff88052335ec50 x1550005401085472/t0(0) o1000->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib10:-1/-1 lens 520/0 e 0 to 0 dl 1478295562 ref 2 fl New:/2/ffffffff rc 0/-1
      Nov  4 14:39:17 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 95 previous similar messages
      Nov  4 14:39:23 lola-4 kernel: Lustre: soaked-OST0002: Client soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib10) reconnecting
      Nov  4 14:39:23 lola-4 kernel: Lustre: Skipped 13 previous similar messages
      Nov  4 14:39:23 lola-4 kernel: Lustre: soaked-OST0002: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib10)
      Nov  4 14:49:16 lola-4 kernel: Lustre: soaked-OST0002: Export ffff8802e106a800 already connecting from 192.168.1.113@o2ib100
      Nov  4 14:49:16 lola-4 kernel: Lustre: Skipped 12 previous similar messages
      Nov  4 14:49:18 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Nov  4 14:49:18 lola-4 kernel:  req@ffff88050f9c2450 x1550005401098800/t0(0) o1000->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib10:-1/-1 lens 520/0 e 0 to 0 dl 1478296163 ref 2 fl New:/2/ffffffff rc 0/-1
      Nov  4 14:49:18 lola-4 kernel: Lustre: 33277:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 96 previous similar messages
      Nov  4 14:49:24 lola-4 kernel: Lustre: soaked-OST0002: Client soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib10) reconnecting
      

      lola-4 continued dumping stacks, finally restarted the cluster
      Prior to restart, dumped lctl logs from lola-4 and lola-8, those are attached with message logs

      Attachments

        1. lola-4.syslog.txt
          1.56 MB
        2. lola-8.lustre-log.txt.gz
          889 kB
        3. lola-8.syslog.txt
          3.46 MB
        4. oss.lola-4.lustre-log.txt.gz
          2.63 MB

        Issue Links

          Activity

            [LU-8806] LFSCK hangs on MDT - osp_precreate_cleanup_orphans - cannot cleanup orphans: rc = -107
            adilger Andreas Dilger made changes -
            Fix Version/s Original: Lustre 2.12.0 [ 13495 ]
            Resolution New: Duplicate [ 3 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]

            Closing this as a duplicate of LU-10048, which landed a patch for 2.12.

            adilger Andreas Dilger added a comment - Closing this as a duplicate of LU-10048 , which landed a patch for 2.12.
            standan Saurabh Tandan (Inactive) made changes -
            Remote Link New: This issue links to "Page (HPDD Community Wiki)" [ 22711 ]
            standan Saurabh Tandan (Inactive) made changes -
            Affects Version/s New: Lustre 2.10.3 [ 13591 ]

            Alexey, the https://review.whamcloud.com/23689 patch points at https://review.whamcloud.com/27488 and 23689 has been abandoned. If you are having problems with this issue, then we welcome you to review and test that patch as you said you would do a few months ago, or submit a patch of your own. That is especially useful if this code is broken continuously for you, since it means you have a good reproducer and can verify if the fix works. I would also welcome an alternative patch based on the old ext3 delete thread, that pushes the extents/blocks to be truncated off to another temporary inode during the main transaction, and then does the actual truncate asynchronously. That would speed up all ext4 truncate/unlink operations, and not require changes to Lustre.

            adilger Andreas Dilger added a comment - Alexey, the https://review.whamcloud.com/23689 patch points at https://review.whamcloud.com/27488 and 23689 has been abandoned. If you are having problems with this issue, then we welcome you to review and test that patch as you said you would do a few months ago, or submit a patch of your own. That is especially useful if this code is broken continuously for you, since it means you have a good reproducer and can verify if the fix works. I would also welcome an alternative patch based on the old ext3 delete thread, that pushes the extents/blocks to be truncated off to another temporary inode during the main transaction, and then does the actual truncate asynchronously. That would speed up all ext4 truncate/unlink operations, and not require changes to Lustre.

            Peter,

            You have say this is targeted for 2.11 but i don't see a progress in ticket and lfsck broken continuously.

            shadow Alexey Lyashkov added a comment - Peter, You have say this is targeted for 2.11 but i don't see a progress in ticket and lfsck broken continuously.
            yong.fan nasf (Inactive) made changes -
            Assignee Original: nasf [ yong.fan ] New: Alex Zhuravlev [ bzzz ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s New: Lustre 2.12.0 [ 13495 ]
            Fix Version/s Original: Lustre 2.11.0 [ 13091 ]

            Peter,

            i see no updates for this patch for year. If you talk about an Alex patch, it's just a part of this solution and i will add to watchers to it.
            As about alternative solution - i don't see any good for now. It just a notification - it have a 2.7,2.8, 2.9 affected.
            Not about 2.10/2.11.

            shadow Alexey Lyashkov added a comment - Peter, i see no updates for this patch for year. If you talk about an Alex patch, it's just a part of this solution and i will add to watchers to it. As about alternative solution - i don't see any good for now. It just a notification - it have a 2.7,2.8, 2.9 affected. Not about 2.10/2.11.
            pjones Peter Jones added a comment -

            Alexey

            This issue is presently targeted for the 2.11 release and a draft patch of one possible patch exists already. Feel free to provide an alternative approach or feedback on any testing you conduct on the existing prototype.

            Peter

            pjones Peter Jones added a comment - Alexey This issue is presently targeted for the 2.11 release and a draft patch of one possible patch exists already. Feel free to provide an alternative approach or feedback on any testing you conduct on the existing prototype. Peter

            People

              bzzz Alex Zhuravlev
              cliffw Cliff White (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: