Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10949

LBUG lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0, Lustre 2.12.6
    • Lustre 2.12.0, Lustre 2.10.4
    • Soak cluster - lustre-master-ib build 81
    • 3
    • 9223372036854775807

    Description

      This also occurred under 2.10.4, but I missed it due to some syslog issues.
      System failed over soak-8 (MDS) to soak-9, upon failback, soak-9 crashed.
      vmcore available.

      [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
      [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
      [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
      [67387.811814] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) LBUG
      [67387.811815] Pid: 8615, comm: mdt01_014
      [67387.811816]
      Call Trace:
      [67387.811831]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [67387.811837]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [67387.811847] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed:
      [67387.811850] LustreError: 8692:0:(lu_object.h:688:lu_object_get()) LBUG
      [67387.811851] Pid: 8692, comm: mdt01_017
      [67387.811851]
      Call Trace:
      [67387.811860]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
      [67387.811861]  [<ffffffffc0a567ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [67387.811867]  [<ffffffffc0a5683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [67387.811873]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.811884]  [<ffffffffc12527c9>] err_serious.part.38+0x0/0x36 [mdt]
      [67387.811895]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.811901]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.811912]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.811918]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.811927]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.811928]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.811940]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.811949]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.811965]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.811968]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.811980]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      [67387.811984]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.811990]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
      [67387.811995]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      [67387.812002]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]  
      [67387.812006]  [<ffffffffc12105c1>] ? mdt_root_squash+0x21/0x430 [mdt]
      [67387.812012]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [67387.812017]  [<ffffffffc12208e3>] mdt_reint_rec+0x83/0x210 [mdt]
      [67387.812022]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
      [67387.812027]  [<ffffffffc120020b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      [67387.812032]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
      [67387.812036]  [<ffffffffc120c787>] mdt_intent_reint+0x157/0x420 [mdt]
      [67387.812046]  [<ffffffffc1203365>] mdt_intent_opc+0x455/0xae0 [mdt]
      [67387.812090]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [67387.812099]  [<ffffffffc0dd3920>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
      [67387.812102]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [67387.812111]  [<ffffffffc120afb3>] mdt_intent_policy+0x1a3/0x360 [mdt]
      [67387.812127]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [67387.812136]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [67387.812137]  [<ffffffffc0d83f1e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [67387.812145]  [<ffffffffc0a6a63e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [67387.812164]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
      [67387.812173]  [<ffffffffc0dabb13>] ldlm_handle_enqueue0+0x8f3/0x1410 [ptlrpc]
      [67387.812198]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [67387.812204]  [<ffffffffc0dd39a0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812238]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812245]  [<ffffffffc0e33392>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [67387.812288]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [67387.812294]  [<ffffffffc0e3994a>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
      [67387.812322]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [67387.812330]  [<ffffffffc0ddda53>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
      [67387.812355]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [67387.812361]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      [67387.812364]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      [67387.812366]  [<ffffffffc0ddb5d8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [67387.812370]  [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      [67387.812372]  [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      [67387.812397]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
      [67387.812405]  [<ffffffffc0de1202>] ptlrpc_main+0xa92/0x1f40 [ptlrpc]
      [67387.812429]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
      [67387.812433]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [67387.812435]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812439]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [67387.812440]  [<ffffffffc0de0770>] ? ptlrpc_main+0x0/0x1f40 [ptlrpc]
      [67387.812441]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812442]
      [67387.812443]  [<ffffffff810b4031>] kthread+0xd1/0xe0
      [67387.812443] Kernel panic - not syncing: LBUG
      [67387.812446] CPU: 25 PID: 8615 Comm: mdt01_014 Tainted: P        W  OE  ------------   3.10.0-693.21.1.el7_lustre.x86_64 #1
      [67387.812447]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812448] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [67387.812449] Call Trace:
      [67387.812450]  [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [67387.812468]  [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [67387.812469]  [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      [67387.812471]  [<ffffffff816a8634>] panic+0xe8/0x21f
      [67387.812472]
      [67387.812478]  [<ffffffffc0a56854>] lbug_with_loc+0x64/0xb0 [libcfs]
      [67387.812492]  [<ffffffffc12527c9>] lu_object_get.isra.33.part.34+0x36/0x36 [mdt]
      [67387.812502]  [<ffffffffc1204146>] mdt_remote_object_lock_try+0x736/0x770 [mdt]
      [67387.812521]  [<ffffffffc0b74be4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      [67387.812531]  [<ffffffffc12041aa>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [67387.812543]  [<ffffffffc122ccf9>] mdt_reint_open+0x2769/0x3270 [mdt]
      [67387.812563]  [<ffffffffc0baae51>] ? upcall_cache_get_entry+0x211/0x8e0 [obdclass]
      [67387.812582]  [<ffffffffc0bb0b2e>] ? lu_ucred+0x1e/0x30 [obdclass]
      [67387.812592]  [<ffffffffc120fd35>] ? mdt_ucred+0x15/0x20 [mdt]
      .....
      

      Full vmcore at /scratch/dumps/soak-9.spirit.hpdd.intel.com/10.10.1.109-2018-04-24-18:20:36

      Attachments

        Issue Links

          Activity

            [LU-10949] LBUG lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed

            Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/34181
            Subject: LU-10949 mdt: lost reference on mdt_md_root
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1cf41c8ad2ec38b0697b9cc029c016ec49607bfc

            gerrit Gerrit Updater added a comment - Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/34181 Subject: LU-10949 mdt: lost reference on mdt_md_root Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 1cf41c8ad2ec38b0697b9cc029c016ec49607bfc
            laisiyao Lai Siyao added a comment -

            I'm sure the debug patch doesn't change any logic, so I'll push to land the debug patch, so in the future it can help debug.

            laisiyao Lai Siyao added a comment - I'm sure the debug patch doesn't change any logic, so I'll push to land the debug patch, so in the future it can help debug.

            Tested for 24 hours with the debug patch, the error is not repeating. Any idea what is going on?

            cliffw Cliff White (Inactive) added a comment - Tested for 24 hours with the debug patch, the error is not repeating. Any idea what is going on?
            laisiyao Lai Siyao added a comment -

            Hi Cliff, I just uploaded a debug patch, would you test with that build?

            laisiyao Lai Siyao added a comment - Hi Cliff, I just uploaded a debug patch, would you test with that build?

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32244
            Subject: LU-10949 debug: assert object refcount > 1 in lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7ec5d9141c1eed09397b2e501494e6213e2a56a4

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/32244 Subject: LU-10949 debug: assert object refcount > 1 in lock cancel Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7ec5d9141c1eed09397b2e501494e6213e2a56a4
            laisiyao Lai Siyao added a comment -

            This asserts on mdt_lock_root_xattr() -> mdt_remote_object_lock(), and the message shows it was evicted:

            [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
            [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc  = -5
            [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: 
            

            Since root XATTR lock holds a refcount of root, there should be race here.

            laisiyao Lai Siyao added a comment - This asserts on mdt_lock_root_xattr() -> mdt_remote_object_lock(), and the message shows it was evicted: [67387.761484] LustreError: 167-0: soaked-MDT0000-osp-MDT0001: This client was evicted by soaked-MDT0000; in progress operations using this service will fail. [67387.810434] LustreError: 8279:0:(mdt_reint.c:2181:mdt_reint_rename_or_migrate()) soaked-MDT0001: can't lock FS for rename: rc = -5 [67387.810699] LNet: Service thread pid 8242 completed after 336.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [67387.810700] LNet: Service thread pid 8615 completed after 335.77s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [67387.811812] LustreError: 8615:0:(lu_object.h:688:lu_object_get()) ASSERTION( atomic_read(&o->lo_header->loh_ref) > 0 ) failed: Since root XATTR lock holds a refcount of root, there should be race here.

            Lai - Would you please review this issue and comment?

            Thanks,
            James

            jamesanunez James Nunez (Inactive) added a comment - Lai - Would you please review this issue and comment? Thanks, James

            We hit this repeatedly on master. About every two hours.

            cliffw Cliff White (Inactive) added a comment - We hit this repeatedly on master. About every two hours.

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: