Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10949

LBUG lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0, Lustre 2.12.6
    • Lustre 2.12.0, Lustre 2.10.4
    • Soak cluster - lustre-master-ib build 81
    • 3
    • 9223372036854775807

    Description

      This also occurred under 2.10.4, but I missed it due to some syslog issues.
      System failed over soak-8 (MDS) to soak-9, upon failback, soak-9 crashed.
      vmcore available.

      [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
      [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
      [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
      [67387.811814] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) LBUG
      [67387.811815] Pid: 8615, comm: mdt01_014
      [67387.811816]
      Call Trace:
      [67387.811831]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [67387.811837]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [67387.811847] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
      [67387.811850] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) LBUG
      [67387.811851] Pid: 8692, comm: mdt01_017
      [67387.811851]
      Call Trace:
      [67387.811860]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
      [67387.811861]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [67387.811867]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [67387.811873]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.811884]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
      [67387.811895]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.811901]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.811912]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.811918]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.811927]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.811928]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.811940]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.811949]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.811965]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.811968]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.811980]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      [67387.811984]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.811990]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
      [67387.811995]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      [67387.812002]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]  
      [67387.812006]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
      [67387.812012]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [67387.812017]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]
      [67387.812022]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
      [67387.812027]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [67387.812032]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
      [67387.812036]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
      [67387.812046]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
      [67387.812090]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [67387.812099]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [67387.812102]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [67387.812111]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [67387.812127]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [67387.812136]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [67387.812137]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [67387.812145]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [67387.812164]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
      [67387.812173]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
      [67387.812198]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [67387.812204]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812245]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812288]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [67387.812294]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [67387.812322]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [67387.812330]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [67387.812355]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [67387.812361]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      [67387.812364]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      [67387.812366]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [67387.812370]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      [67387.812372]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      [67387.812397]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
      [67387.812405]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
      [67387.812429]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
      [67387.812433]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [67387.812435]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812439]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [67387.812440]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
      [67387.812441]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812442]
      [67387.812443]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [67387.812443] Kernel panic - not syncing: LBUG
      [67387.812446] CPU: 25 PID: 8615 Comm: mdt01_014 Tainted: P        W  OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
      [67387.812447]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812448] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [67387.812449] Call Trace:
      [67387.812450]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [67387.812468]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [67387.812469]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812471]  [<ffffffff816a8634>] panic+0xe8/0x21f
      [67387.812472]
      [67387.812478]  [<ffffffffc0a56854>] lbug_with_loc+0x64/0xb0 [libcfs]
      [67387.812492]  [<ffffffffc12527c9>] lu_object_get.isra.33.part.34+0x36/0x36 [mdt]
      [67387.812502]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.812521]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.812531]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.812543]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.812563]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.812582]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.812592]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      .....
      

      Full vmcore at /scratch/dumps/soak-9.spirit.hpdd.intel.com/10.10.1.109-2018-04-24-18:20:36

      Attachments

        Issue Links

          Activity

            [LU-10949] LBUG lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39976/
            Subject: LU-10949 mdt: lost reference on mdt_md_root
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 2040144a54fbd0da81f3e1fdacb6b35f473689e6

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39976/ Subject: LU-10949 mdt: lost reference on mdt_md_root Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 2040144a54fbd0da81f3e1fdacb6b35f473689e6

            Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39976
            Subject: LU-10949 mdt: lost reference on mdt_md_root
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 46bd7408f6e0f06cd9af2848660800cd90e5c945

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39976 Subject: LU-10949 mdt: lost reference on mdt_md_root Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 46bd7408f6e0f06cd9af2848660800cd90e5c945
            pjones Peter Jones added a comment -

            Landed for 2.13

            pjones Peter Jones added a comment - Landed for 2.13

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34181/
            Subject: LU-10949 mdt: lost reference on mdt_md_root
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b2368774a01eb89981e2ceb92be9673e4b403d62

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34181/ Subject: LU-10949 mdt: lost reference on mdt_md_root Project: fs/lustre-release Branch: master Current Patch Set: Commit: b2368774a01eb89981e2ceb92be9673e4b403d62

            Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/34181
            Subject: LU-10949 mdt: lost reference on mdt_md_root
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cf41c8ad2ec38b0697b9cc029c016ec49607bfc

            gerrit Gerrit Updater added a comment - Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/34181 Subject: LU-10949 mdt: lost reference on mdt_md_root Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 1cf41c8ad2ec38b0697b9cc029c016ec49607bfc
            laisiyao Lai Siyao added a comment -

            I'm sure the debug patch doesn't change any logic, so I'll push to land the debug patch, so in the future it can help debug.

            laisiyao Lai Siyao added a comment - I'm sure the debug patch doesn't change any logic, so I'll push to land the debug patch, so in the future it can help debug.

            Tested for 24 hours with the debug patch, the error is not repeating. Any idea what is going on?

            cliffw Cliff White (Inactive) added a comment - Tested for 24 hours with the debug patch, the error is not repeating. Any idea what is going on?
            laisiyao Lai Siyao added a comment -

            Hi Cliff, I just uploaded a debug patch, would you test with that build?

            laisiyao Lai Siyao added a comment - Hi Cliff, I just uploaded a debug patch, would you test with that build?

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32244
            Subject: LU-10949 debug: assert object refcount > 1 in lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7ec5d9141c1eed09397b2e501494e6213e2a56a4

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32244 Subject: LU-10949 debug: assert object refcount > 1 in lock cancel Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7ec5d9141c1eed09397b2e501494e6213e2a56a4
            laisiyao Lai Siyao added a comment -

            This asserts on mdt_lock_root_xattr() -> mdt_remote_object_lock(), and the message shows it was evicted:

            [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
            [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
            [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
            

            Since root XATTR lock holds a refcount of root, there should be race here.

            laisiyao Lai Siyao added a comment - This asserts on mdt_lock_root_xattr() -> mdt_remote_object_lock(), and the message shows it was evicted: [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail. [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc = -5 [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: Since root XATTR lock holds a refcount of root, there should be race here.

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: