[LU-15189] GDS support improvements and fixes. Created: 03/Nov/21  Updated: 09/Sep/22  Resolved: 30/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The Whamcloud GDS code has several oddities.
1. It lacks autoconf support, which caused the wrong structures / macros to be used for GDS<>Lustre interaction
(these changed in the latest nvidia-fs code).

2. GDS has a bug in its page state testing function. It causes the force_rdma flag to be set on the ptlrpc message buffer. This is because

bool nvfs_is_gpu_page(struct page *page)
{
        nvfs_mgroup_ptr_t nvfs_mgroup;

        nvfs_mgroup = __nvfs_mgroup_from_page(page, false);
        if (nvfs_mgroup == NULL) {
                return false;
        } else if (unlikely(IS_ERR(nvfs_mgroup))) {
                // This is a GPU page but we did not take reference as we are in shutdown path
                // But, we will return true to the caller so that caller doesn't think it is a
                // CPU page and fall back to CPU path 
                return true; // <<< true if no magic.

It's very easy to detect: just push the force_rdma flag into the memory mapping function and check whether the force_rdma buffer can be mapped with the GDS code.
But instead of investigating it, Whamcloud added an odd workaround that tries to execute both mapping functions, which causes LNet slowness.

Let's fix it.
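
A minimal sketch of the idea above, not the landed patch: the force_rdma flag is passed into a single mapping entry point, which tries the GDS path only when the flag is set and treats a refused GDS mapping as a detected false positive. All names here (struct io_buf, gds_map, cpu_map, map_buffer) are hypothetical stand-ins and not Lustre or nvidia-fs symbols; falling back to the CPU path on a false positive is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum map_path { MAP_CPU = 0, MAP_GPU = 1 };

/* Hypothetical buffer descriptor, for illustration only. */
struct io_buf {
	bool force_rdma;  /* set by the upper layer when it believes pages are GPU pages */
	bool gpu_backed;  /* stand-in for "GDS can really map this buffer" */
	int  nfrags;      /* number of fragments to map */
};

/* Hypothetical GDS mapping: returns fragments mapped, 0 if not GPU memory. */
static int gds_map(const struct io_buf *b)
{
	return b->gpu_backed ? b->nfrags : 0;
}

/* Hypothetical CPU (regular DMA) mapping. */
static int cpu_map(const struct io_buf *b)
{
	return b->nfrags;
}

/*
 * Single entry point: the GDS path is attempted only when force_rdma is set,
 * so ordinary CPU buffers never pay for an extra GDS call.
 */
static enum map_path map_buffer(const struct io_buf *b, int *mapped)
{
	assert(b != NULL && mapped != NULL);

	if (b->force_rdma) {
		*mapped = gds_map(b);
		if (*mapped > 0)
			return MAP_GPU;
		/* force_rdma was set but GDS refused: a false positive is caught here. */
	}
	*mapped = cpu_map(b);
	return MAP_CPU;
}
```

With this shape, a buffer without force_rdma goes through exactly one mapping call, and a wrongly flagged buffer is visible at the point where the GDS mapping returns nothing.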



 Comments   
Comment by Alexey Lyashkov [ 04/Nov/21 ]

The WC GPU code lacks protection against GDS module unload; the existing code is racy:
nvfs_get_ops has no protection between the shutdown check and the ref count increment.
If the thread is interrupted by an IRQ or something else between the two, this check will not work.
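
One way to close such a window, sketched here in userspace C11 atomics as an illustration only: fold the shutdown check and the increment into a single compare-and-swap, in the spirit of the kernel's atomic_inc_not_zero() idiom. struct gds_ops_ref and the gds_ref_* helpers are hypothetical names, not the actual Lustre or nvidia-fs code, and using a zero count to mean "shut down" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference state; not the real Lustre/nvidia-fs layout. */
struct gds_ops_ref {
	/* >= 1 while registered (registration holds one reference), 0 once shut down */
	atomic_long count;
};

static void gds_ref_init(struct gds_ops_ref *r)
{
	atomic_init(&r->count, 1);
}

/*
 * Take a reference only if the ops are still live. The liveness check and
 * the increment happen in one CAS, so an interruption between them cannot
 * let a reference slip in after shutdown.
 */
static bool gds_ref_get(struct gds_ops_ref *r)
{
	long old = atomic_load(&r->count);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
		/* A failed CAS reloads 'old'; retry with the fresh value. */
	}
	return false;
}

/* Drop a reference; returns true when the last one is gone. */
static bool gds_ref_put(struct gds_ops_ref *r)
{
	long prev = atomic_fetch_sub(&r->count, 1);

	assert(prev > 0); /* catch unbalanced puts */
	return prev == 1;
}
```

In this sketch the unload path would drop the registration reference with gds_ref_put() and wait until the count reaches zero; after that, every gds_ref_get() fails instead of racing with the teardown.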

Comment by Gerrit Updater [ 08/Nov/21 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45480
Subject: LU-15189 build: add GDS configure options
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: da0738ac67986bdfa98a1dd5e4b92e80fa5ac596

Comment by Gerrit Updater [ 08/Nov/21 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45481
Subject: LU-15189 osc: don't have extra nvidia call
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 02f73e76f324dd8e602fd9b7d11b0860696b2875

Comment by Gerrit Updater [ 08/Nov/21 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45482
Subject: LU-15189 lnet: fix memory mapping.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 534e7be7ac56a4b9f7aad4420e7588c933238d61

Comment by Gerrit Updater [ 06/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45481/
Subject: LU-15189 osc: don't have extra nvidia call
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a75f1a90611038ea097912384413813e8d350290

Comment by Gerrit Updater [ 03/Mar/22 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46692
Subject: LU-15189 build: add GDS configure options
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 40aff8174de7e76038bdba29740bf52a0176329b

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45480/
Subject: LU-15189 build: add GDS configure options
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c65eabc2b1136d6bc2cf2d86d6434d5b4ad300e7

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45482/
Subject: LU-15189 lnet: fix memory mapping.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 959304eac7ec5b156b4bfa57f47cbbf9ef3c8315

Comment by Peter Jones [ 30/May/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:16:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.