[LU-16160] take ldlm lock when queue sync pages Created: 15/Sep/22  Updated: 19/Jan/24  Resolved: 14/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Major
Reporter: Zhenyu Xu Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File 0001-LUS-10932-SIGBUS-is-possible-on-a-race-with-page-rec.patch    
Issue Links:
Related
is related to LU-16156 stale read during IOR test due LU-14541 Open
is related to LU-14541 Memory reclaim caused a stale data read Resolved
is related to LU-15815 fast_read/stale data/reclaim workroun... Resolved
is related to LU-16401 various crashes with cl_page_discard ... Open
is related to LU-16224 rw_seq_cst_vs_drop_caches dies with S... Resolved
Severity: 3

 Description   

osc_queue_sync_pages() adds an osc_extent to the osc_object's IO extent list without taking ldlm locks, and then calls osc_io_unplug_async() to queue the IO work for the client.

I think the IO extent should hold ldlm locks while it waits in the IO work queue.
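
The proposal amounts to pinning the lock for as long as the extent sits in the queue, so that the lock cannot be cancelled underneath pending IO. A minimal user-space sketch of that idea follows, assuming a plain reference count stands in for an ldlm lock. Every toy_* name is hypothetical; this is not Lustre's API and not the landed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an ldlm lock: only a reference count is modelled. */
struct toy_lock {
	int refs;
};

/* Toy stand-in for an osc_extent waiting on the object's IO extent list. */
struct toy_extent {
	struct toy_lock *lock;	/* lock pinned while the extent is queued */
	bool queued;
};

static void toy_lock_get(struct toy_lock *lk)
{
	lk->refs++;
}

static void toy_lock_put(struct toy_lock *lk)
{
	assert(lk->refs > 0);	/* guard against an unbalanced put */
	lk->refs--;
}

/* In this model a lock may be cancelled only when nobody holds a reference. */
static bool toy_lock_can_cancel(const struct toy_lock *lk)
{
	return lk->refs == 0;
}

/* Take the reference before the extent is marked as queued. */
static void toy_queue_extent(struct toy_extent *ext, struct toy_lock *lk)
{
	toy_lock_get(lk);
	ext->lock = lk;
	ext->queued = true;
}

/* Drop the reference only once the queued IO has completed. */
static void toy_complete_extent(struct toy_extent *ext)
{
	ext->queued = false;
	toy_lock_put(ext->lock);
	ext->lock = NULL;
}
```

In this model, toy_lock_can_cancel() returns false between toy_queue_extent() and toy_complete_extent(), which is the window the description is concerned about.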



 Comments   
Comment by Gerrit Updater [ 15/Sep/22 ]

"Bobi Jam <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/48557
Subject: LU-16160 osc: take ldlm lock when queue sync pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5fddc18876c2cdc7321bfeb9a1e8e30f733a4aa7

Comment by Gerrit Updater [ 20/Sep/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/48607
Subject: LU-16160 llite: clear stale page's uptodate bit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c04d9374b0bf05cccc84f353200ff29cc65b2af7

Comment by Gerrit Updater [ 24/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48557/
Subject: LU-16160 osc: take ldlm lock when queue sync pages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67aca1fcc6bed20794832decdba590a758d67d8f

Comment by Alexey Lyashkov [ 27/Sep/22 ]

Just for the record.
Cray is now hitting a stale page read, but it is not related to the race described in the "Subject: LU-16160 llite: clear stale page's uptodate bit" patch.

Based on the current logs, there is a long time between the lock cancel and the next read, so it is highly likely that the old race (from LU-14541) has returned. It is not a race between swapd and read. I am working on gathering additional details now.

Comment by Gerrit Updater [ 27/Sep/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/48673
Subject: LU-16160 llite: clear page uptodate bit on cache drop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 265f4a8a08f299a5de10bdc07167d875d8fb2531

Comment by Gerrit Updater [ 08/Oct/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48804
Subject: LU-16160 llite: clear stale page's uptodate bit
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d2dc3487cdb845dd79124856101d62ae6b9f8f10

Comment by Gerrit Updater [ 08/Oct/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48805
Subject: LU-16160 llite: clear page uptodate bit on cache drop
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 47eb0bee9bd74e6f2472b1b4a372aadc2c591ad7

Comment by Alexey Lyashkov [ 26/Oct/22 ]

I reproduced a situation in which a page lives in the cache with the uptodate flag set and without a cl_page.
A coredump, modules, and kernel are available at http://shadowland.me:8080/webshare/WhamCloud/16150/log1/.
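
The state reported here can be written as a simple predicate. A minimal user-space sketch follows, assuming that a cached page is only trustworthy while a cl_page is attached to it. The toy_page structure and the function name are hypothetical and do not correspond to kernel or Lustre definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a cached page; not the kernel's struct page. */
struct toy_page {
	bool uptodate;	/* models the uptodate flag on the cached page */
	void *cl_page;	/* models the attached cl_page, NULL when absent */
};

/*
 * True for the reported state: the page claims to hold valid data
 * but no cl_page is attached to it.
 */
static bool toy_page_is_inconsistent(const struct toy_page *pg)
{
	return pg->uptodate && pg->cl_page == NULL;
}
```

A page that is not uptodate, or that still has a cl_page attached, does not match the predicate.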

Comment by Gerrit Updater [ 02/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48607/
Subject: LU-16160 llite: clear stale page's uptodate bit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5b911e03261c3de6b0c2934c86dd191f01af4f2f

Comment by Peter Jones [ 02/Nov/22 ]

All existing patches have landed. Please reopen if more work is to be tracked under this ticket.

Comment by Gerrit Updater [ 12/Dec/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49372
Subject: LU-16160 llite: make page state consistent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7307e91ee06955c3fafad60e0a6c6f329736ccfd

Comment by Gerrit Updater [ 30/Dec/22 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49534
Subject: LU-16160 llite: handle filemap_fault() returns SIGBUS error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 437ee8064530b3db7a7c8a8a26e0694273bf0919

Comment by Gerrit Updater [ 03/Jan/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49541
Subject: LU-16160 llite: revert commit 5b911e03261c3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4e63d5c7b94b42c8b8936a591a967b8138ad04fc

Comment by Cory Spitz [ 03/Jan/23 ]

> Please reopen if more work is to be tracked under this ticket
Reopening since bobijam has posted a revert and more work against this ticket.

Comment by Gerrit Updater [ 04/Jan/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49549
Subject: LU-16160 llite: handle filemap_fault() returns SIGBUS error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2e20813f4f355d6c8d0cf0ac861fefbf35d7f6f3

Comment by Gerrit Updater [ 04/Jan/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49553
Subject: LU-16160 llite: handle filemap_fault() returns SIGBUS error
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 7b9342563fc08f800fa4379b189f3b5df6ebd570

Comment by Andrew Perepechko [ 12/Jan/23 ]

I've uploaded, as an attachment, the workaround for the stale data + SIGBUS issue that we are currently using. We're going to replace it with a better solution (design-wise and performance-wise) if and when we have one. Not sure if it's useful, but we'd like to share it.

Comment by Peter Jones [ 12/Jan/23 ]

Thanks very much Andrew. Is it possible to push the patch into gerrit so it is easier for us to provide testing/review feedback?

Comment by Peter Jones [ 16/Jan/23 ]

Andrew

Bobijam feels that your attached patch takes a similar approach to his latest LU-16160 patch. Could you please review the latter in gerrit to flag any issues that should deter us from landing this to master?

Thanks

Peter

Comment by Andrew Perepechko [ 16/Jan/23 ]

Peter, sure, I'll do that. Thank you

Comment by Gerrit Updater [ 16/Jan/23 ]

"Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49647
Subject: LU-16160 llite: SIGBUS is possible on a race with page reclaim
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 371d94fe71b103c1712bf8f2b1bb1c026cf31de4

Comment by Peter Jones [ 16/Jan/23 ]

panda, Patrick ported your attached patch to master and pushed it into gerrit so that we can compare the two similar approaches. To that end, can you please confirm that nothing was altered during the porting? Thanks!

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49541/
Subject: LU-16160 revert: "llite: clear stale page's uptodate bit"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 84c9618190f9e3a526ce51dc4995fcfa3a9ed265

Comment by Gerrit Updater [ 27/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49647/
Subject: LU-16160 llite: SIGBUS is possible on a race with page reclaim
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b4da788a819f82d35b685d6ee7f02809c05ca005

Comment by Gerrit Updater [ 03/Mar/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50202
Subject: LU-16160 llite: SIGBUS is possible on a race with page reclaim
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: f817fcd2c553c223fc89087f62c9d59517bb8e59

Comment by Peter Jones [ 14/Mar/23 ]

Looks like everything tracked under this ticket has landed for 2.16

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50202/
Subject: LU-16160 llite: SIGBUS is possible on a race with page reclaim
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: b0a6d4d08e19d06661deabdb7278f07662d8b6e8
