
sanityn: ASSERTION( orro->oo_ref == 0 ) in 77d

Details

    • Bug
    • Resolution: Fixed
    • Minor

    Description

      This issue was created by maloo for Alex Zhuravlev <bzzz@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8e9b7081-bc95-4443-a2cb-32f33c6b9f54

      [18399.419463] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) ASSERTION( orro->oo_ref == 0 ) failed: Busy NRS TRR policy object for OST with index 3, with 1 refs
      [18399.422350] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) LBUG
      [18399.423651] Pid: 358716, comm: lctl 4.18.0-372.26.1.el8_lustre.x86_64 #1 SMP Wed Oct 5 15:10:35 UTC 2022
      [18399.425421] Call Trace TBD:
      [18399.426173] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
      [18399.427258] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
      [18399.428199] [<0>] nrs_trr_hop_exit+0x11c/0x150 [ptlrpc]
      [18399.429791] [<0>] cfs_hash_putref+0x1c8/0x4b0 [libcfs]
      [18399.430814] [<0>] nrs_orr_stop+0x65/0x270 [ptlrpc]
      [18399.431868] [<0>] nrs_policy_stop0+0x38/0x1b0 [ptlrpc]
      [18399.432985] [<0>] nrs_policy_stop_primary.isra.10+0x181/0x1d0 [ptlrpc]
      [18399.434361] [<0>] nrs_policy_start_locked+0x467/0x660 [ptlrpc]
      [18399.435581] [<0>] nrs_policy_ctl+0x203/0x2d0 [ptlrpc]
      [18399.436678] [<0>] ptlrpc_nrs_policy_control+0x10f/0x2f0 [ptlrpc]
      [18399.437926] [<0>] ptlrpc_lprocfs_nrs_policies_seq_write+0x473/0x5e0 [ptlrpc]
      [18399.439365] [<0>] full_proxy_write+0x53/0x80
      [18399.440266] [<0>] vfs_write+0xa5/0x1a0
      [18399.441031] [<0>] ksys_write+0x4f/0xb0
      [18399.441796] [<0>] do_syscall_64+0x5b/0x1a0
      

          Activity

            [LU-16253] sanityn: ASSERTION( orro->oo_ref == 0 ) in 77d

            adilger Andreas Dilger added a comment - The patch https://review.whamcloud.com/51260 "LU-16253 ptlrpc: define nrs_orr_object.oo_ref atomic_t" was abandoned for master, but would be suitable for backporting to b2_15 if this problem is hit there frequently.
            sarah Sarah Liu added a comment - also hit on 2.15.5 https://testing.whamcloud.com/test_sets/832cc83a-46c8-4577-bd89-45adb3a44c80

            simmonsja James A Simmons added a comment - Patch 40113 landed which resolved this problem.
            simmonsja James A Simmons added a comment - Patch https://review.whamcloud.com/#/c/fs/lustre-release/+/40113 for LU-8130 already addresses this issue.
            gerrit Gerrit Updater added a comment - - edited

            "Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51260
            Subject: LU-16253 ptlrpc: define nrs_orr_object.oo_ref atomic_t
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 28782c745172b65b2992ef1ab751bf79161382a0

            flei Feng Lei added a comment -

            orro->oo_ref is a reference count but not an atomic type, so a lock must protect it. It is normally protected by the hash bucket lock when it is changed via nrs_trr_hash_ops.hs_get() or nrs_trr_hash_ops.hs_put_locked(). If it is changed via cfs_hash_ops.hs_put(), however, the callback must take the lock itself.

            nrs_trr_hash_ops.hs_put is set to nrs_orr_hop_put(), which does not take the bucket lock. This is a race condition.

            Does that make sense?
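
            The race described in this comment, and the direction taken by the patch titled "define nrs_orr_object.oo_ref atomic_t", can be illustrated with a small userspace sketch. This is not the Lustre code: struct demo_object, demo_get(), demo_put() and run_refcount_demo() are hypothetical names, and C11 atomics plus pthreads stand in for the kernel's atomic_t and the ptlrpc service threads. The sketch only shows that an atomic counter stays balanced when get/put run concurrently without a shared lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for nrs_orr_object; only the refcount is modelled. */
struct demo_object {
	atomic_int ref;		/* atomic, so get/put need no external lock */
};

struct worker_arg {
	struct demo_object *obj;
	int iterations;
};

/* Take a reference (analogous in spirit to an hs_get callback). */
static void demo_get(struct demo_object *obj)
{
	atomic_fetch_add(&obj->ref, 1);
}

/* Drop a reference without holding any lock; returns the new count. */
static int demo_put(struct demo_object *obj)
{
	return atomic_fetch_sub(&obj->ref, 1) - 1;
}

/* Each worker repeatedly takes and drops one reference. */
static void *worker(void *p)
{
	struct worker_arg *arg = p;

	for (int i = 0; i < arg->iterations; i++) {
		demo_get(arg->obj);
		demo_put(arg->obj);
	}
	return NULL;
}

/*
 * Run nthreads workers concurrently and return the final reference count.
 * With an atomic counter the result is always 0; with a plain int and no
 * lock, increments and decrements could be lost and leave a stray reference.
 */
int run_refcount_demo(int nthreads, int iterations)
{
	enum { MAX_THREADS = 16 };
	pthread_t tids[MAX_THREADS];
	struct demo_object obj;
	struct worker_arg arg = { &obj, iterations };

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	atomic_init(&obj.ref, 0);

	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tids[i], NULL, worker, &arg);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	return atomic_load(&obj.ref);
}
```

            Calling run_refcount_demo(4, 100000) from a main() built with -pthread should return 0. The alternative mentioned in the comment, taking the bucket lock inside the put callback, would correspond here to wrapping a plain int counter in a pthread_mutex_t instead of using atomic_int.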


            eaujames Etienne Aujames added a comment -

            Andreas, sorry for my late answer (I missed this comment).
            This does not seem likely: https://review.whamcloud.com/48494 implements a force mode for TBF and does not change the policy start/stop logic. Also, nrs_orr_req_get() does not implement "force" mode, so it should not be affected by this patch.
            https://review.whamcloud.com/48523/ "LU-14976 nrs: change nrs policies at run time" is more likely to provoke this kind of crash.

            adilger Andreas Dilger added a comment - Etienne, is this related to your recently landed patch https://review.whamcloud.com/48494 "LU-16144 nrs: implement force mode for nrs_tbf_req_get()"?

            People

              flei Feng Lei
              maloo Maloo