[LU-16253] sanityn: ASSERTION( orro->oo_ref == 0 ) in 77d Created: 19/Oct/22  Updated: 20/Aug/23  Resolved: 20/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Feng Lei
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-6849 sanity-quota test 22 LBUG with “ASSER... Open
Related
is related to LU-16144 OST crash at umount in ptlrpc_nrs_req... Resolved
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Alex Zhuravlev <bzzz@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8e9b7081-bc95-4443-a2cb-32f33c6b9f54

[18399.419463] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) ASSERTION( orro->oo_ref == 0 ) failed: Busy NRS TRR policy object for OST with index 3, with 1 refs
[18399.422350] LustreError: 358716:0:(nrs_orr.c:481:nrs_trr_hop_exit()) LBUG
[18399.423651] Pid: 358716, comm: lctl 4.18.0-372.26.1.el8_lustre.x86_64 #1 SMP Wed Oct 5 15:10:35 UTC 2022
[18399.425421] Call Trace TBD:
[18399.426173] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[18399.427258] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[18399.428199] [<0>] nrs_trr_hop_exit+0x11c/0x150 [ptlrpc]
[18399.429791] [<0>] cfs_hash_putref+0x1c8/0x4b0 [libcfs]
[18399.430814] [<0>] nrs_orr_stop+0x65/0x270 [ptlrpc]
[18399.431868] [<0>] nrs_policy_stop0+0x38/0x1b0 [ptlrpc]
[18399.432985] [<0>] nrs_policy_stop_primary.isra.10+0x181/0x1d0 [ptlrpc]
[18399.434361] [<0>] nrs_policy_start_locked+0x467/0x660 [ptlrpc]
[18399.435581] [<0>] nrs_policy_ctl+0x203/0x2d0 [ptlrpc]
[18399.436678] [<0>] ptlrpc_nrs_policy_control+0x10f/0x2f0 [ptlrpc]
[18399.437926] [<0>] ptlrpc_lprocfs_nrs_policies_seq_write+0x473/0x5e0 [ptlrpc]
[18399.439365] [<0>] full_proxy_write+0x53/0x80
[18399.440266] [<0>] vfs_write+0xa5/0x1a0
[18399.441031] [<0>] ksys_write+0x4f/0xb0
[18399.441796] [<0>] do_syscall_64+0x5b/0x1a0


 Comments   
Comment by Andreas Dilger [ 19/Oct/22 ]

Etienne, is this related to your recently landed patch https://review.whamcloud.com/48494 "LU-16144 nrs: implement force mode for nrs_tbf_req_get()"?

Comment by Etienne Aujames [ 19/Dec/22 ]

Andreas, sorry for my late answer (I missed this comment).
This does not seem likely: https://review.whamcloud.com/48494 implements a force mode for TBF and does not change the policy start/stop logic. nrs_orr_req_get() does not implement "force" mode either, so it should not be affected by this patch.
https://review.whamcloud.com/48523/ "LU-14976 nrs: change nrs policies at run time" is more likely to provoke those kinds of crashes.

Comment by Feng Lei [ 09/Jun/23 ]

orro->oo_ref is a reference count but not an atomic type, so a lock must protect it. It is normally protected by the hash bucket lock when it is changed through nrs_trr_hash_ops.hs_get() or nrs_trr_hash_ops.hs_put_locked(). When it is changed through cfs_hash_ops.hs_put(), however, the function must take the lock itself.

nrs_trr_hash_ops.hs_put is set to nrs_orr_hop_put(), which does not take the bucket lock, so this is a race condition.

Make sense?
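
To illustrate the reasoning above, here is a minimal userspace sketch of a reference count that stays consistent without an external lock. It is not the Lustre code or the patch itself: struct demo_object, demo_get(), demo_put() and demo_run() are hypothetical names, and C11 atomic_int stands in for the kernel's atomic_t. Several threads take and drop references concurrently, and the count returns to zero only because each update is atomic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_THREADS 4
#define DEMO_ITERS   100000

/* Hypothetical stand-in for struct nrs_orr_object; only the refcount is kept. */
struct demo_object {
	atomic_int oo_ref;	/* assumption: userspace analogue of atomic_t */
};

/* Take a reference; safe without holding any bucket lock. */
static void demo_get(struct demo_object *o)
{
	atomic_fetch_add(&o->oo_ref, 1);
}

/* Drop a reference; safe without holding any bucket lock. */
static void demo_put(struct demo_object *o)
{
	atomic_fetch_sub(&o->oo_ref, 1);
}

/* Each worker repeatedly takes and drops one reference. */
static void *demo_worker(void *arg)
{
	struct demo_object *o = arg;

	for (int i = 0; i < DEMO_ITERS; i++) {
		demo_get(o);
		demo_put(o);
	}
	return NULL;
}

/* Run the workers concurrently and return the final reference count. */
static int demo_run(void)
{
	struct demo_object o;
	pthread_t tids[DEMO_THREADS];
	int started = 0;

	atomic_init(&o.oo_ref, 0);

	/* Only join the threads that were actually created. */
	for (int i = 0; i < DEMO_THREADS; i++) {
		if (pthread_create(&tids[started], NULL, demo_worker, &o) == 0)
			started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	return atomic_load(&o.oo_ref);
}
```

Calling demo_run() from a test program built with -pthread should return 0. With a plain int and no lock, the same increments and decrements would be a data race and could lose updates, which is the kind of imbalance the oo_ref assertion catches at policy stop.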

Comment by Gerrit Updater [ 09/Jun/23 ]

"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51260
Subject: LU-16253 ptlrpc: define nrs_orr_object.oo_ref atomic_t
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 28782c745172b65b2992ef1ab751bf79161382a0

Comment by James A Simmons [ 10/Jun/23 ]

Patch https://review.whamcloud.com/#/c/fs/lustre-release/+/40113 for LU-8130 already addresses this issue.

Comment by James A Simmons [ 20/Aug/23 ]

Patch 40113 has landed, which resolves this problem.
