[LU-7853] Fixes bitfield in lod qos code Created: 07/Mar/16  Updated: 15/Jul/21  Resolved: 26/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Rahul Deshmukh (Inactive) Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-14277 any create blocked due any OST fail Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Replaced bitfields viz. lq_dirty, lq_same_space, and lq_reset to named bits and use atomic functions. As those are more safe.



 Comments   
Comment by Gerrit Updater [ 07/Mar/16 ]

Rahul Deshmukh (rahul.deshmukh@seagate.com) uploaded a new patch: http://review.whamcloud.com/18812
Subject: LU-7853 lod: Fixes bitfield in lod qos code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3f153f8b4a365b801934d1de843389e4e51a90c8

Comment by Oleg Drokin [ 07/Mar/16 ]

So were tyou hittign any real problems with the code that you feel you needed the atomic bit operations?

Comment by Rahul Deshmukh (Inactive) [ 08/Mar/16 ]

Yes, problem was hit on older code i.e. version 2.5 and it was seen for lq_dirty. So as a fixed other bitfield with those. Actually lq_dirty is not present in latest code but still to be safer side ported this patch.

Comment by Andreas Dilger [ 16/Jan/17 ]

I agree with Oleg that "more safe" doesn't provide justification for landing the patch. I can imagine that there is some problem with these bitfields in the code, though it seems lq_rw_sem is held when lq_dirty and lq_same_space are accessed and modified, except in lod_qos_priofree_seq_write() and lod_qos_thresholdrr_seq_write(). Since there is already a lock protecting all of these accesses, adding atomic bit operations is just overhead unless there is a reason to do so.

Please provide more details about the original bug so there is justification for the patch to be landed.

Comment by Alexander Zarochentsev [ 22/Jan/19 ]

Andreas, there was an old crash with "ASSERTION( lov->lov_qos.lq_statfs_in_progress )" failure:

Oct 11 01:24:25 lus207n03 kernel: LustreError: 140684:0:(lov_qos.c:1136:qos_statfs_done()) ASSERTION( lov->lov_qos.lq_statfs_in_progress ) failed:
Oct 11 01:24:25 lus207n03 kernel: LustreError: 140684:0:(lov_qos.c:1136:qos_statfs_done()) LBUG
Oct 11 01:24:25 lus207n03 kernel: Pid: 140684, comm: ptlrpcd
Oct 11 01:24:25 lus207n03 kernel:
Oct 11 01:24:25 lus207n03 kernel: Call Trace:
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa042c825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa042ce27>] lbug_with_loc+0x47/0xb0 [libcfs]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa0a56829>] qos_statfs_done+0x89/0x90 [lov]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa0a4a166>] cb_statfs_update+0x346/0x580 [lov]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa09cdcb6>] osc_statfs_interpret+0xd6/0x400 [osc]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa073171b>] ptlrpc_check_set+0x29b/0x1d50 [ptlrpc]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff814eaf6a>] ? schedule_timeout+0x19a/0x2e0
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff8107c180>] ? process_timeout+0x0/0x10
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa07636b0>] ptlrpcd_check+0x1a0/0x230 [ptlrpc]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa0763a43>] ptlrpcd+0x303/0x370 [ptlrpc]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff8105f9f0>] ? default_wake_function+0x0/0x20
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffffa0763740>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff81090806>] kthread+0x96/0xa0
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff8100c14a>] child_rip+0xa/0x20
Oct 11 01:24:25 lus207n03 kernel:  [<ffffffff81090770>] ? kthread+0x0/0xa0

however the assertion doesn't exist in recent Lustre code. a wrong update of bitfield flags still possible due to an unprotected access through lod_qos_thresholdrr_seq_write() , though w/o a crash, just some updates to  lq_reset/lq_same_space can be lost.

Comment by Andreas Dilger [ 22/Jan/19 ]

It would be good to include such explanation into the commit message of the patch, so that it is clear why it should land.

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/18812/
Subject: LU-7853 lod: fixes bitfield in lod qos code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3bae39f0a5b98a279fb5f7b8d00211ac0d09366f

Comment by Peter Jones [ 26/Feb/21 ]

Landed for 2.15

Generated at Sat Feb 10 02:12:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.