Details


    Description

      Hi,

      When trying to unmount a Lustre client, we got the following problem:

      Lustre: DEBUG MARKER: Wed Nov 21 06:25:01 2012
      
      LustreError: 11559:0:(ldlm_lock.c:1697:ldlm_lock_cancel()) ### lock still has references ns:
      ptmp-MDT0000-mdc-ffff88030871bc00 lock: ffff88060dbd2d80/0x4618f3ec8d79d8be lrc: 4/0,1 mode: PW/PW res: 8590405073/266
      rrc: 2 type: FLK pid: 4414 [0->551] flags: 0x22002890 remote: 0xc8980c051f8f6afd expref: -99 pid: 4414 timeout: 0
      LustreError: 11559:0:(ldlm_lock.c:1698:ldlm_lock_cancel()) LBUG
      Pid: 11559, comm: umount
      
      Call Trace:
       [<ffffffffa040d7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa040de07>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa063343d>] ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
       [<ffffffffa064d245>] ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
       [<ffffffffa06510b8>] ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
       [<ffffffffa063ae18>] cleanup_resource+0x168/0x300 [ptlrpc]
       [<ffffffffa063afda>] ldlm_resource_clean+0x2a/0x50 [ptlrpc]
       [<ffffffffa041e28f>] cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa041fcaf>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
       [<ffffffffa0637a69>] ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
       [<ffffffffa0638adb>] __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa041fcb7>] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
       [<ffffffffa063903f>] ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
       [<ffffffffa063fc4c>] client_disconnect_export+0x23c/0x460 [ptlrpc]
       [<ffffffffa0b42a44>] lmv_disconnect+0x644/0xc70 [lmv]
       [<ffffffffa0a470bc>] client_common_put_super+0x46c/0xe80 [lustre]
       [<ffffffffa0a47ba0>] ll_put_super+0xd0/0x360 [lustre]
       [<ffffffff8117e01c>] ? dispose_list+0x11c/0x140
       [<ffffffff8117e4a8>] ? invalidate_inodes+0x158/0x1a0
       [<ffffffff8116542b>] generic_shutdown_super+0x5b/0x110
       [<ffffffff81165546>] kill_anon_super+0x16/0x60
       [<ffffffffa050897a>] lustre_kill_super+0x4a/0x60 [obdclass]
       [<ffffffff811664e0>] deactivate_super+0x70/0x90
       [<ffffffff811826bf>] mntput_no_expire+0xbf/0x110
       [<ffffffff81183188>] sys_umount+0x78/0x3c0
       [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      
      Kernel panic - not syncing: LBUG
      Pid: 11559, comm: umount Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 #1
      Call Trace:
       [<ffffffff81484650>] ? panic+0x78/0x143
       [<ffffffffa040de5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa063343d>] ? ldlm_lock_cancel+0x1ad/0x1b0 [ptlrpc]
       [<ffffffffa064d245>] ? ldlm_cli_cancel_local+0xb5/0x380 [ptlrpc]
       [<ffffffffa06510b8>] ? ldlm_cli_cancel+0x58/0x3b0 [ptlrpc]
       [<ffffffffa063ae18>] ? cleanup_resource+0x168/0x300 [ptlrpc]
       [<ffffffffa063afda>] ? ldlm_resource_clean+0x2a/0x50 [ptlrpc]
       [<ffffffffa041e28f>] ? cfs_hash_for_each_relax+0x17f/0x380 [libcfs]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa063afb0>] ? ldlm_resource_clean+0x0/0x50 [ptlrpc]
       [<ffffffffa041fcaf>] ? cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
       [<ffffffffa0637a69>] ? ldlm_namespace_cleanup+0x29/0xb0 [ptlrpc]
       [<ffffffffa0638adb>] ? __ldlm_namespace_free+0x4b/0x540 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa06502d0>] ? ldlm_cli_hash_cancel_unused+0x0/0xa0 [ptlrpc]
       [<ffffffffa041fcb7>] ? cfs_hash_for_each_nolock+0x87/0x1c0 [libcfs]
       [<ffffffffa063903f>] ? ldlm_namespace_free_prior+0x6f/0x230 [ptlrpc]
       [<ffffffffa063fc4c>] ? client_disconnect_export+0x23c/0x460 [ptlrpc]
       [<ffffffffa0b42a44>] ? lmv_disconnect+0x644/0xc70 [lmv]
       [<ffffffffa0a470bc>] ? client_common_put_super+0x46c/0xe80 [lustre]
       [<ffffffffa0a47ba0>] ? ll_put_super+0xd0/0x360 [lustre]
       [<ffffffff8117e01c>] ? dispose_list+0x11c/0x140
       [<ffffffff8117e4a8>] ? invalidate_inodes+0x158/0x1a0
       [<ffffffff8116542b>] ? generic_shutdown_super+0x5b/0x110
       [<ffffffff81165546>] ? kill_anon_super+0x16/0x60
       [<ffffffffa050897a>] ? lustre_kill_super+0x4a/0x60 [obdclass]
       [<ffffffff811664e0>] ? deactivate_super+0x70/0x90
       [<ffffffff811826bf>] ? mntput_no_expire+0xbf/0x110
       [<ffffffff81183188>] ? sys_umount+0x78/0x3c0
       [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      

      This issue is exactly the same as the one described in LU-1429, which is a duplicate of LU-1328, which itself seems to be related to LU-1421.
      The issue seems to be resolved, but it is unclear to me which patches are needed to completely fix it.
      I should add that we need a fix for b2_1.

      Can you please advise?

      TIA,
      Sebastien.

          Activity

            [LU-2665] LBUG while unmounting client
            yujian Jian Yu added a comment -

            Patch http://review.whamcloud.com/6415 was cherry-picked to the Lustre b2_4 branch.

            pjones Peter Jones added a comment -

            ok - thanks Sebastien!


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Sure, this bug can be closed, as additional work was carried out by Bruno in LU-3701.

            Thanks,
            Sebastien.

            adilger Andreas Dilger added a comment -

            Patches have landed to b2_1 and master for 2.5.0. Can this bug be closed?

            > But "retry" mechanism doesn't guarantee that lock will reach MDS
            Yes, but we need to do our best! And particularly for FLock/F_UNLCKs (where Server must know, unless other Clients/processes will stay stuck forever) which can not be trashed. At least with retries we now also cover MDS crashes and communications problems, if reboot/restart/failover/fix/... occurs finally and if not Lustre is dead on this Client and who cares that we retry forever ?

            > cleanup_resource() should deal with orphaned locks also
            Only during evict or "umount -f".

            bfaccini Bruno Faccini (Inactive) added a comment - > But "retry" mechanism doesn't guarantee that lock will reach MDS Yes, but we need to do our best! And particularly for FLock/F_UNLCKs (where Server must know, unless other Clients/processes will stay stuck forever) which can not be trashed. At least with retries we now also cover MDS crashes and communications problems, if reboot/restart/failover/fix/... occurs finally and if not Lustre is dead on this Client and who cares that we retry forever ? > cleanup_resource() should deal with orphaned locks also Only during evict or "umount -f".
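            For illustration only (this is not Lustre code): a minimal, self-contained userspace sketch of the retry idea described above. send_unlock_rpc() and its failure pattern are hypothetical placeholders standing in for the client resending an F_UNLCK request until the server acknowledges it.

            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Hypothetical transport call: returns 0 once the server acks the
             * unlock, a negative errno while the MDS is unreachable. */
            static int send_unlock_rpc(unsigned long long lock_cookie)
            {
                static int attempts;
                (void)lock_cookie;
                /* Stub: pretend the first two attempts hit a network error. */
                return (++attempts < 3) ? -EIO : 0;
            }

            int main(void)
            {
                unsigned long long cookie = 0x4618f3ec8d79d8beULL; /* cookie taken from the log above, for flavor */
                int rc;

                /* An F_UNLCK must reach the server, otherwise other clients stay
                 * blocked forever, so retry until it is acknowledged; if Lustre is
                 * dead on this client anyway, looping here costs nothing. */
                do {
                    rc = send_unlock_rpc(cookie);
                    if (rc) {
                        fprintf(stderr, "unlock RPC failed (%d), retrying\n", rc);
                        sleep(1);
                    }
                } while (rc);

                printf("unlock acknowledged by server\n");
                return 0;
            }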

            But "retry" mechanism doesn't guarantee that lock will reach MDS. cleanup_resource() should deal with orphaned locks also.

            askulysh Andriy Skulysh added a comment - But "retry" mechanism doesn't guarantee that lock will reach MDS. cleanup_resource() should deal with orphaned locks also.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Cory: thanks for adding a reference to LU-3701. It is the follow-on to this ticket, as this ticket's change introduced some regressions against the POSIX test suite.

            Andriy: if you carefully read the history/comments of this ticket (LU-2665), you will find that it addresses a very particular scenario, which can be summarized as "FLock/F_UNLCK requests can be trashed upon an MDS crash or a communication problem, leaving orphaned FLocks". This is definitely not the race problem described in LU-2177. And the "retry" mechanism in the LU-3701 change is now fully limited to FLock/F_UNLCK requests.
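            For context, the userspace side of that scenario can be illustrated with standard POSIX record-locking calls (ordinary fcntl() usage, not code from the ticket). The final F_UNLCK step is the request that, per this ticket, could be trashed on an MDS crash or communication problem, leaving the record locked for every other client. The file path is just a placeholder; on a Lustre client it would be a file inside the Lustre mount.

            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <unistd.h>

            int main(void)
            {
                /* Placeholder path; on a Lustre client this would be a file in the mount. */
                int fd = open("/tmp/flock-demo", O_RDWR | O_CREAT, 0644);
                if (fd < 0) { perror("open"); return EXIT_FAILURE; }

                struct flock fl = {
                    .l_type   = F_WRLCK,   /* a write lock, matching the PW mode in the log */
                    .l_whence = SEEK_SET,
                    .l_start  = 0,
                    .l_len    = 552,       /* the log shows an extent of [0->551] */
                };

                /* Acquire the record lock (blocks until granted). */
                if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("F_SETLKW"); return EXIT_FAILURE; }
                printf("lock acquired\n");

                /* Release it: on a Lustre client this becomes the F_UNLCK request to the MDS. */
                fl.l_type = F_UNLCK;
                if (fcntl(fd, F_SETLK, &fl) < 0) { perror("F_UNLCK"); return EXIT_FAILURE; }
                printf("lock released\n");

                close(fd);
                return 0;
            }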

            askulysh Andriy Skulysh added a comment -

            I don't like the idea of resending flocks. There is no guarantee that the request will succeed.
            The client already has a cleanup mechanism.
            ldlm_flock_completion_ast() should call ldlm_lock_decref_internal() correctly.
            The patch should be combined with the similar patch from LU-2177.
            spitzcor Cory Spitz added a comment -

            Related to LU-3701 and http://review.whamcloud.com/#/c/7453.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Bruno,

            We have retrieved the patch for b2_1. It will be rolled out at CEA's next maintenance, planned for early September.

            Cheers,
            Sebastien.

            People

              bfaccini Bruno Faccini (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 12
