[LU-16253] sanityn: ASSERTION( orro->oo_ref == 0 ) in 77d
Created: 19/Oct/22 | Updated: 20/Aug/23 | Resolved: 20/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Feng Lei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Alex Zhuravlev <bzzz@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8e9b7081-bc95-4443-a2cb-32f33c6b9f54
[18399.419463] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) ASSERTION( orro->oo_ref == 0 ) failed: Busy NRS TRR policy object for OST with index 3, with 1 refs
[18399.422350] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) LBUG
[18399.423651] Pid: 358716, comm: lctl 4.18.0-372.26.1.el8_lustre.x86_64 #1 SMP Wed Oct 5 15:10:35 UTC 2022
[18399.425421] Call Trace TBD:
[18399.426173] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[18399.427258] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[18399.428199] [<0>] nrs_trr_hop_exit+0x11c/0x150 [ptlrpc]
[18399.429791] [<0>] cfs_hash_putref+0x1c8/0x4b0 [libcfs]
[18399.430814] [<0>] nrs_orr_stop+0x65/0x270 [ptlrpc]
[18399.431868] [<0>] nrs_policy_stop0+0x38/0x1b0 [ptlrpc]
[18399.432985] [<0>] nrs_policy_stop_primary.isra.10+0x181/0x1d0 [ptlrpc]
[18399.434361] [<0>] nrs_policy_start_locked+0x467/0x660 [ptlrpc]
[18399.435581] [<0>] nrs_policy_ctl+0x203/0x2d0 [ptlrpc]
[18399.436678] [<0>] ptlrpc_nrs_policy_control+0x10f/0x2f0 [ptlrpc]
[18399.437926] [<0>] ptlrpc_lprocfs_nrs_policies_seq_write+0x473/0x5e0 [ptlrpc]
[18399.439365] [<0>] full_proxy_write+0x53/0x80
[18399.440266] [<0>] vfs_write+0xa5/0x1a0
[18399.441031] [<0>] ksys_write+0x4f/0xb0
[18399.441796] [<0>] do_syscall_64+0x5b/0x1a0 |
| Comments |
| Comment by Andreas Dilger [ 19/Oct/22 ] |
|
Etienne, is this related to your recently landed patch https://review.whamcloud.com/48494 " |
| Comment by Etienne Aujames [ 19/Dec/22 ] |
|
Andreas, sorry for my late answer (I missed this comment). |
| Comment by Feng Lei [ 09/Jun/23 ] |
|
orro->oo_ref is a reference count but not an atomic type, so a lock must protect it. It is normally protected by the hash bucket lock when it is changed through the call path nrs_trr_hash_ops.hs_get() or nrs_trr_hash_ops.hs_put_locked(). When it is changed through cfs_hash_ops.hs_put(), however, the callback has to take the lock itself. nrs_trr_hash_ops.hs_put is set to nrs_orr_hop_put(), which does not take the bucket lock, so there is a race condition. Does that make sense? |
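The locking rule described in the comment above can be sketched in user space. This is only an illustration under stated assumptions, not Lustre code and not the patch that eventually landed: struct demo_obj, the demo_* functions and the pthread mutex standing in for the hash bucket lock are all hypothetical names. The *_locked helpers expect the caller to hold the lock (like the hs_get()/hs_put_locked() paths), while demo_put() is entered without the lock (like the hs_put() path) and therefore takes it itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an NRS ORR/TRR object; NOT the real Lustre struct. */
struct demo_obj {
	pthread_mutex_t lock; /* plays the role of the hash bucket lock */
	long ref;             /* plain, non-atomic reference count */
};

static void demo_obj_init(struct demo_obj *o)
{
	pthread_mutex_init(&o->lock, NULL);
	o->ref = 0;
}

/* Caller already holds o->lock, like the hs_get()/hs_put_locked() paths. */
static void demo_get_locked(struct demo_obj *o)
{
	o->ref++;
}

static void demo_put_locked(struct demo_obj *o)
{
	assert(o->ref > 0);
	o->ref--;
}

/* Lookup path: takes the lock around the increment. */
static void demo_get(struct demo_obj *o)
{
	pthread_mutex_lock(&o->lock);
	demo_get_locked(o);
	pthread_mutex_unlock(&o->lock);
}

/* Caller does NOT hold the lock, like the hs_put() path: take it here. */
static void demo_put(struct demo_obj *o)
{
	pthread_mutex_lock(&o->lock);
	demo_put_locked(o);
	pthread_mutex_unlock(&o->lock);
}

enum { DEMO_ITERS = 100000, DEMO_MAX_THREADS = 8 };

static void *demo_worker(void *arg)
{
	struct demo_obj *o = arg;

	for (int i = 0; i < DEMO_ITERS; i++) {
		demo_get(o);
		demo_put(o);
	}
	return NULL;
}

/* Run up to nthreads workers and return the final count; 0 means balanced. */
static long demo_run(int nthreads)
{
	struct demo_obj o;
	pthread_t tids[DEMO_MAX_THREADS];
	int started = 0;
	long ref;

	if (nthreads > DEMO_MAX_THREADS)
		nthreads = DEMO_MAX_THREADS;
	demo_obj_init(&o);
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[started], NULL, demo_worker, &o) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	ref = o.ref;
	pthread_mutex_destroy(&o.lock);
	return ref;
}
```

With the lock taken in demo_put(), every get is paired with a put under the same mutex, so demo_run() ends with a count of 0. If demo_put() modified the counter without the lock, concurrent updates could be lost and a teardown check like the oo_ref == 0 assertion could see a stale nonzero count.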
| Comment by Gerrit Updater [ 09/Jun/23 ] |
|
"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51260 |
| Comment by James A Simmons [ 10/Jun/23 ] |
|
Patch https://review.whamcloud.com/#/c/fs/lustre-release/+/40113 for LU-8130 already addresses this issue. |
| Comment by James A Simmons [ 20/Aug/23 ] |
|
Patch 40113 landed, which resolved this problem. |