[LU-13181] kiblnd_fmr_pool_map error on the AARCH64 with 64k pages Created: 31/Jan/20  Updated: 16/Feb/21  Resolved: 16/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug
Priority: Blocker
Reporter: Alexey Lyashkov
Assignee: Alexey Lyashkov
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10157 LNET_MAX_IOV hard coded to 256 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An AARCH64 client is unable to do any bulk transfers; every attempt fails with the following errors:

[  339.806240] LNetError: 9537:0:(o2iblnd.c:1926:kiblnd_fmr_pool_map()) Failed to map mr 1/16 elements
[  339.806243] LNetError: 9535:0:(o2iblnd.c:1926:kiblnd_fmr_pool_map()) Failed to map mr 1/16 elements
[  339.806249] LNetError: 9538:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 1048576 pages: -22
[  339.806251] LNetError: 9535:0:(o2iblnd.c:1926:kiblnd_fmr_pool_map()) Skipped 1 previous similar message
[  339.806255] LNetError: 9536:0:(o2iblnd_cb.c:1841:kiblnd_reply()) Can't setup GET src for 10.149.4.6@o2ib: -22

Tracing shows some interesting information:

kiblnd_sd_03_00-9535  [044] d...   488.776602: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x65260000
 kiblnd_sd_03_00-9535  [044] d...   488.776604: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776605: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x65270000
 kiblnd_sd_03_00-9535  [044] d...   488.776607: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776608: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x65280000
 kiblnd_sd_03_00-9535  [044] d...   488.776609: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776610: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x65290000
 kiblnd_sd_03_00-9535  [044] d...   488.776612: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776613: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652a0000
 kiblnd_sd_03_00-9535  [044] d...   488.776614: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776615: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652b0000
 kiblnd_sd_03_00-9535  [044] d...   488.776617: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776618: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652c0000
 kiblnd_sd_03_00-9535  [044] d...   488.776620: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776621: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652d0000
 kiblnd_sd_03_00-9535  [044] d...   488.776622: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776623: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652e0000
 kiblnd_sd_03_00-9535  [044] d...   488.776625: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776626: p_mlx5_set_page_0: (mlx5_set_page+0x0/0x60 [mlx5_ib]) arg1=0xffff8089529a2c00 arg2=0x652f0000
 kiblnd_sd_03_00-9535  [044] d...   488.776627: r_mlx5_set_page_0: (ib_sg_to_pages+0xc4/0x1b8 [ib_core] <- mlx5_set_page) arg1=0x0
 kiblnd_sd_03_00-9535  [044] d...   488.776628: r_ib_sg_to_pages_0: (mlx5_ib_map_mr_sg+0x8c/0x240 [mlx5_ib] <- ib_sg_to_pages) arg1=0x1 arg2=0xffff00001906f7f0

Dumping the scatterlist entry gives:

struct scatterlist {
  page_link = 0xffff7fe0245c3600,
  offset = 0x0,
  length = 0x10000,
  dma_address = 0x80500000,
  dma_length = 0x100000
}

So the DMA length (0x100000) covers the whole 1 MB transfer as a single entry.

This means ib_dma_map_sg has merged all pages into a single region. Its return value is stored in rd->rd_nfrags, but the mapped count is checked against tx->tx_nfrags, which holds the number of fragments before mapping. This incorrect check generates a false error and the transfer is stopped.
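
The mismatch described above can be reproduced with a small user-space model. This is only a sketch under stated assumptions: struct frag, fill_contiguous(), coalesce_frags(), check_mapped() and demo_1mb_transfer() are hypothetical names made up for illustration. They are not o2iblnd or kernel code, and this is not the landed patch. The numbers (16 pages of 0x10000 bytes starting at DMA address 0x80500000) are taken from the scatterlist dump and the "1/16 elements" message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct frag {
	uint64_t dma_address;
	uint32_t dma_length;
};

/* Fill frags[] with npages back-to-back pages of page_size bytes. */
void fill_contiguous(struct frag *frags, size_t npages,
		     uint64_t base, uint32_t page_size)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		frags[i].dma_address = base + (uint64_t)i * page_size;
		frags[i].dma_length = page_size;
	}
}

/*
 * Toy model of a DMA layer that merges adjacent entries in place.
 * Returns the number of entries left after merging.
 */
size_t coalesce_frags(struct frag *frags, size_t nfrags)
{
	size_t out = 0;
	size_t i;

	if (nfrags == 0)
		return 0;

	for (i = 1; i < nfrags; i++) {
		if (frags[out].dma_address + frags[out].dma_length ==
		    frags[i].dma_address)
			frags[out].dma_length += frags[i].dma_length;
		else
			frags[++out] = frags[i];
	}
	return out + 1;
}

/* Model of the check: mapped entries must equal the expected count. */
int check_mapped(size_t mapped, size_t expected)
{
	return mapped == expected ? 0 : -EINVAL;
}

/*
 * Model a 1 MB transfer on 64k pages. With use_post_map_count == 0 the
 * check uses the pre-mapping count (the failing case in this report);
 * otherwise it uses the count returned by the mapping step.
 */
int demo_1mb_transfer(int use_post_map_count)
{
	struct frag frags[16];
	size_t tx_nfrags = 16;	/* fragments before mapping */
	size_t rd_nfrags;	/* fragments after mapping */

	fill_contiguous(frags, tx_nfrags, 0x80500000ULL, 0x10000);
	rd_nfrags = coalesce_frags(frags, tx_nfrags);
	assert(rd_nfrags <= tx_nfrags);

	/* The MR maps exactly the entries that exist after merging. */
	return check_mapped(rd_nfrags,
			    use_post_map_count ? rd_nfrags : tx_nfrags);
}
```

In this model demo_1mb_transfer(0) returns -EINVAL (the "-22" seen in the log) because 1 mapped entry is compared with 16 pre-mapping fragments, while demo_1mb_transfer(1) succeeds because the comparison uses the post-mapping count.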



 Comments   
Comment by Gerrit Updater [ 31/Jan/20 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/37388
Subject: LU-13181 o2ib: fix page mapping error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7bc4e8656476653c5a1d9d3a9e9beaa6975f828b

Comment by Gerrit Updater [ 16/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37388/
Subject: LU-13181 o2ib: fix page mapping error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 40385cda7afbd62faf7de2e956f0c7f4fa1a3fed

Comment by Peter Jones [ 16/Jun/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 23/Jan/21 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41303
Subject: LU-13181 o2ib: fix page mapping error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 21d151a36cd74e7895d5adda8bc4718bb2cba6c8

Comment by Gerrit Updater [ 16/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41303/
Subject: LU-13181 o2ib: fix page mapping error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: efab0ee6dced2c8e0b3f0c17338b11f8dcf2d66f

Generated at Sat Feb 10 02:59:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.