Details
Bug
Status: Resolved
Critical
Resolution: Fixed
Description
The Whamcloud GDS code has several oddities.
1. It lacks autoconf support, which caused the wrong structures / macros to be used for GDS<>Lustre interaction (these changed in the latest nvidia-fs code).
2. GDS has a bug in its page state testing function. It causes the force_rdma flag to be set on the ptlrpc message buffer. This is because of the following code:
bool nvfs_is_gpu_page(struct page *page)
{
        nvfs_mgroup_ptr_t nvfs_mgroup;

        nvfs_mgroup = __nvfs_mgroup_from_page(page, false);
        if (nvfs_mgroup == NULL) {
                return false;
        } else if (unlikely(IS_ERR(nvfs_mgroup))) {
                // This is a GPU page but we did not take reference as we are in shutdown path
                // But, we will return true to the caller so that caller doesn't think it is a
                // CPU page and fall back to CPU path
                return true;   <<< true if no magic.
This is very easy to detect: just pass the force_rdma flag into the memory mapping function and check whether a force_rdma buffer can actually be mapped by the GDS code.
But instead of investigating it, Whamcloud added an odd workaround that tries to execute both mapping functions, which causes LNet slowness.
Let's fix it.
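The detection idea described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct buf`, `gds_map()`, `cpu_map()` and `map_buffer()` are hypothetical stand-ins invented for this sketch and are not the real nvidia-fs, LNet or ptlrpc interfaces. The point is that the mapping function receives the force_rdma flag, calls exactly one mapping path, and reports a mismatch when a force_rdma buffer cannot be mapped by the (modelled) GDS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model types; not the real nvidia-fs or LNet structures. */
enum map_result { MAP_CPU, MAP_GPU, MAP_MISMATCH };

struct buf {
	bool force_rdma;   /* flag set by the upper layer on the message buffer */
	bool gds_mappable; /* whether the modelled GDS code can really map it */
};

/* Modelled GDS mapping: number of mapped entries, 0 if not a GPU buffer. */
static int gds_map(const struct buf *b)
{
	return b->gds_mappable ? 1 : 0;
}

/* Modelled CPU mapping; the counter lets callers see how often it ran. */
static int cpu_map_calls;

static int cpu_map(const struct buf *b)
{
	(void)b;
	cpu_map_calls++;
	return 1;
}

/*
 * Choose exactly one mapping path based on force_rdma instead of trying
 * both. A force_rdma buffer that GDS cannot map is reported as a
 * mismatch, which is the symptom of the page-state bug in the ticket.
 */
static enum map_result map_buffer(const struct buf *b)
{
	assert(b != NULL);

	if (b->force_rdma) {
		if (gds_map(b) > 0)
			return MAP_GPU;
		fprintf(stderr, "force_rdma set but GDS cannot map the buffer\n");
		return MAP_MISMATCH;
	}
	return cpu_map(b) > 0 ? MAP_CPU : MAP_MISMATCH;
}
```

In this model a buffer with `force_rdma` set never touches the CPU mapping path, so the cost of running both mapping functions for every buffer goes away, and an inconsistent flag shows up as `MAP_MISMATCH` rather than being hidden.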