[LU-10089] kiblnd_fmr_pool_map() Failed to map mr 10/11 elements Created: 05/Oct/17  Updated: 02/Feb/18  Resolved: 24/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

Build Version: 2.8.0_12.chaos
See ssh://review.whamcloud.com/fs/lustre-release-fe-llnl
RHEL 7.4 kernel


Attachments: Text File console.log.jet1.lu-10089.txt    
Issue Links:
Blocker
is blocking LU-10115 Backport appropriate patches to 2.8 F... Resolved
Related
is related to LU-9983 LBUG llog_osd.c:327:llog_osd_declare_... Resolved
is related to LU-10091 o2iblnd fast reg crash on shutdown Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following group of messages appear in the console logs of MDTs.

2017-10-04 08:11:00 [407096.858161] LNetError: 174158:0:(o2iblnd.c:1893:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
2017-10-04 08:11:00 [407096.869697] LNetError: 174158:0:(o2iblnd_cb.c:590:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
2017-10-04 08:11:00 [407096.880686] LNetError: 174158:0:(o2iblnd_cb.c:1582:kiblnd_send()) Can't setup GET sink for 172.19.1.112@o2ib100: -22
2017-10-04 08:11:00 [407096.893504] LustreError: 174158:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff883ebaa9bb00
2017-10-04 08:12:40 [407196.901157] LustreError: 174158:0:(ldlm_lib.c:3186:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883f27232850 x1579913603429696/t0(0) o1000->lquake-MDT0001-mdtlov_UUID@172.19.1.112@o2ib100:-1/-1 lens 352/0 e 0 to 0 dl 1507130003 ref 1 fl Interpret:/0/ffffffff rc 0/-1

The nodes have Mellanox ConnectX-4 IB adapters.



 Comments   
Comment by Olaf Faaland [ 05/Oct/17 ]

The file system has 16 MDTs, each on a separate MDS. This log snippet is from the server hosting the MGS and MDT0, NID 172.19.1.111@o2ib100. The nodes referenced by NIDs in the range 172.19.1.112@o2ib100 to 172.19.1.126@o2ib100 are the other MDSs.

Comment by Olaf Faaland [ 05/Oct/17 ]

The console log for the MGS/MDS0000 node, in which the "Failed to map mr" messages appear, is attached as console.log.jet1.lu-10089.txt.

Comment by Olaf Faaland [ 05/Oct/17 ]

I spot-checked a few examples. There appears to be a corresponding error message on the node referred to by the "timeout on bulk WRITE" message. For example:

jet1:
2017-10-05 06:55:32 [488964.187806] LNetError: 174158:0:(o2iblnd.c:1893:kiblnd_fmr_pool_map()) Failed to map mr 10/11 elements
2017-10-05 06:55:32 [488964.199374] LNetError: 174158:0:(o2iblnd_cb.c:590:kiblnd_fmr_map_tx()) Can't map 41033 pages: -22
2017-10-05 06:55:32 [488964.210361] LNetError: 174158:0:(o2iblnd_cb.c:1582:kiblnd_send()) Can't setup GET sink for 172.19.1.112@o2ib100: -22
2017-10-05 06:55:32 [488964.223175] LustreError: 174158:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff883ebaa9bf00
2017-10-05 06:57:12 [489064.231425] LustreError: 174158:0:(ldlm_lib.c:3186:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff883f27239c50 x1579913603914692/t0(0) o1000->lquake-MDT0001-mdtlov_UUID@172.19.1.112@o2ib100:-1/-1 lens 352/0 e 2 to 0 dl 1507211843 ref 1 fl Interpret:/0/ffffffff rc 0/-1
jet2:
2017-10-05 06:57:12 [489015.158245] LustreError: 11-0: lquake-MDT0000-osp-MDT0001: operation out_update to node 172.19.1.111@o2ib100 failed: rc = -110
2017-10-05 06:57:12 [489015.173453] LustreError: 16866:0:(layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply' (1 of 1) in format `OUT_UPDATE': 0 vs. 4096 (server)
2017-10-05 06:57:12 [489015.173453]   req@ffff882b342cdd00 x1579913603914692/t0(0) o1000->lquake-MDT0000-osp-MDT0001@172.19.1.111@o2ib100:24/4 lens 352/192 e 2 to 0 dl 1507211888 ref 2 fl Interpret:ReM/0/0 rc -110/-110
2017-10-05 06:57:16 [489019.757615] LustreError: 17382:0:(llog_cat.c:744:llog_cat_cancel_records()) lquake-MDT0000-osp-MDT0001: fail to cancel 1 of 1 llog-records: rc = -116

The nodes use NTP to sync their clocks, and all report being within about 1/20th of a second of each other.

Comment by Amir Shehata (Inactive) [ 05/Oct/17 ]

Can you try: https://review.whamcloud.com/29290

Comment by Olaf Faaland [ 05/Oct/17 ]

Can you try: https://review.whamcloud.com/29290

Yes, I will.

Creating remote directories seems to hang, then fail, and trigger these messages. Is there likely a separate problem, or does it make sense that these two symptoms would be connected?

Comment by Amir Shehata (Inactive) [ 05/Oct/17 ]

I was able to reproduce this issue on mlx5 as well.

It looks to be due to:
LU-6215 o2iblnd: port to new fast reg API introduced in 4.4

Will need to investigate this.
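
For context, here is a minimal sketch (not the actual o2iblnd code, and the ib_map_mr_sg() signature varies slightly between kernel versions) of the fast-reg mapping path that LU-6215 switched to. An IB_MR_TYPE_MEM_REG MR can only describe a gapless page list, so ib_map_mr_sg() may map fewer SG entries than requested; o2iblnd reports that case as "Failed to map mr N/M elements":

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

static int sketch_fastreg_map(struct ib_pd *pd, struct scatterlist *sg,
                              int sg_nents)
{
        struct ib_mr *mr;
        int n;

        /* allocate a fast-reg MR large enough for sg_nents pages */
        mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
        if (IS_ERR(mr))
                return PTR_ERR(mr);

        /* map the scatterlist onto the MR's internal page list */
        n = ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE);
        if (n != sg_nents) {
                /* partial mapping, e.g. a gap mid-list that MEM_REG
                 * cannot represent -> "Failed to map mr n/sg_nents" */
                ib_dereg_mr(mr);
                return n < 0 ? n : -EINVAL;
        }

        /* ... post an IB_WR_REG_MR work request, then use mr->lkey/rkey
         * for the RDMA transfer ... */
        ib_dereg_mr(mr);
        return 0;
}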

Comment by Olaf Faaland [ 06/Oct/17 ]

Amir,

OK. Please check our patch stack to see what we've got; a pointer to it is under "Environment" above.

With patch 29290, I see the following in the console logs of the MDSs, on a quiet filesystem:

2017-10-05 18:19:01 [ 1342.476344] LustreError: 15171:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff887f17347e00
2017-10-05 18:19:01 [ 1342.481443] mlx5_0:dump_cqe:262:(pid 15170): dump error cqe
2017-10-05 18:19:01 [ 1342.481444] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.481445] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.481445] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.481445] 00000000 9d005304 0800180b 000ff6d2
2017-10-05 18:19:01 [ 1342.481630] LustreError: 15170:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff887f1cb76800
2017-10-05 18:19:01 [ 1342.486614] mlx5_0:dump_cqe:262:(pid 15170): dump error cqe
2017-10-05 18:19:01 [ 1342.486617] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.486618] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.486619] 00000000 00000000 00000000 00000000
2017-10-05 18:19:01 [ 1342.486620] 00000000 9d005304 0800180c 000ef0d2
2017-10-05 18:19:01 [ 1342.486810] LustreError: 15170:0:(events.c:449:server_bulk_callback()) event type 5, status -5, desc ffff887f17347a00

Comment by James A Simmons [ 06/Oct/17 ]

Olaf, are you carrying the LU-9500 patch (https://review.whamcloud.com/27149)?

Comment by Olaf Faaland [ 07/Oct/17 ]

James:
Yes. We are carrying backports of patches from LU-9026, LU-9500, and LU-9472. I don't have the Gerrit links handy but can get them.

Comment by Olaf Faaland [ 09/Oct/17 ]

Amir,
Have you learned anything more? Thanks.

Comment by Amir Shehata (Inactive) [ 10/Oct/17 ]

Olaf, yes. We fixed this issue for MLX-5, but the same fix doesn't work on MLX-4. I can upload a test patch for MLX-5 for now, since I think that is what you're using, correct?

Comment by Olaf Faaland [ 10/Oct/17 ]

Amir, thanks.

We have both MLX-4 and MLX-5 hardware in our center, so we need both to work.

You're correct, the symptoms reported in this issue appear on servers with MLX-5. The filesystem is mounted via routers with MLX-5 and OmniPath, and by clients using OmniPath (based on the loaded drivers, mlx5_core/ib).

I can give your existing patch a try if it would be helpful.

Comment by Gerrit Updater [ 10/Oct/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/29551
Subject: LU-10089 o2iblnd: use IB_MR_TYPE_SG_GAPS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0527177a8e5690c7e9418f4da3070fecf5ef5a4

Comment by Amir Shehata (Inactive) [ 10/Oct/17 ]

Ok, so I think we have it resolved. There are three different fixes:
1. https://review.whamcloud.com/29551 - required for MLX-5; it is specific to fastreg
2. https://review.whamcloud.com/29290 - required for both MLX-4 and MLX-5, since FMR and fastreg have different requirements for the RDMA descriptor
3. https://review.whamcloud.com/29341 - a revert of LU-9810, required for MLX-4 and OPA: both claim to support fastreg, but fastreg doesn't work on either of them, so we need to continue preferring FMR. I'm investigating why neither MLX-4 nor OPA supports fastreg properly. Since fastreg is an MLX feature it makes sense that OPA doesn't support it, but it's not clear why MLX-4 chokes on it.

Could you try these three patches and see if they resolve your issue?

Comment by Olaf Faaland [ 11/Oct/17 ]

I ran on an MLX-5 machine with good success. I'll test on MLX-4 and OPA this morning.

Comment by Olaf Faaland [ 11/Oct/17 ]

With brief testing, I see no errors on MLX-4 and OPA machines. What is the next step?

Comment by Amir Shehata (Inactive) [ 11/Oct/17 ]

I believe that I'll push for landing these three patches, as they stabilize master as well.

I'm not sure how you guys will pick up the patches. Do you maintain your own tree, or would you need these patches backported?

Comment by Olaf Faaland [ 11/Oct/17 ]

Backport them to 2.8 fe, please. They will likely apply cleanly.

The relationship between our stack and 2.8fe is complicated but they are not that different, and we will be switching to 2.8fe + a small stack of commits very soon.

Comment by Olaf Faaland [ 12/Oct/17 ]

Note that we cannot land these patches to our production tree until they are through your review and testing process, and are merged to master at a minimum.  Let me know if there's anything I can do to help that along.

Comment by Amir Shehata (Inactive) [ 13/Oct/17 ]

Some notes:
1. Fastreg with MLX4 + OFED + ib_alloc_mr(.., IB_MR_TYPE_MEM_REG): gets BIND_ERR
2. Fastreg with MLX4 + MOFED 4.1 + ib_alloc_mr(.., IB_MR_TYPE_MEM_REG): works
3. IB_MR_TYPE_SG_GAPS is not supported for MLX4 on either OFED or MOFED
4. MLX4 FMR mapping works differently on MOFED vs OFED.

So the best solution is to:
1. Always use FMR for MLX4
2. Always use FMR for OPA
3. FMR is not available for MLX5 so fastreg will be used

The three patches I described earlier seem like the ideal solution for now.
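
A rough sketch of that selection policy (hypothetical helper, not the actual o2iblnd code): prefer FMR whenever the device exposes the FMR verbs, and fall back to fast registration only where FMR is unavailable. On kernels of this era the FMR verbs appear as function pointers directly on struct ib_device:

#include <rdma/ib_verbs.h>

enum sketch_reg_method { SKETCH_REG_FMR, SKETCH_REG_FASTREG };

static enum sketch_reg_method sketch_pick_reg_method(struct ib_device *dev)
{
        /* devices that provide the FMR verbs (e.g. mlx4, OPA) keep FMR */
        if (dev->alloc_fmr && dev->dealloc_fmr &&
            dev->map_phys_fmr && dev->unmap_fmr)
                return SKETCH_REG_FMR;

        /* mlx5 has no FMR support, so fast registration is used */
        return SKETCH_REG_FASTREG;
}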

Comment by Olaf Faaland [ 16/Oct/17 ]

This appears to be working well in my tests.

Comment by Olaf Faaland [ 19/Oct/17 ]

Hi Amir,

I see https://review.whamcloud.com/#/c/29551/ has status "Ready to land", but hasn't been landed.  Is there further work needed, or is it just waiting for the next time a set of patches get merged?  Thanks.

Comment by James A Simmons [ 19/Oct/17 ]

There is a question about querying the IB device to see whether it supports IB_MR_TYPE_SG_GAPS, instead of assuming IB_MR_TYPE_SG_GAPS is always available.
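
For what it's worth, one option that avoids an explicit capability query is to attempt the allocation and fall back on failure. This is only a sketch of that idea, not necessarily what the final patch set does:

#include <linux/err.h>
#include <rdma/ib_verbs.h>

static struct ib_mr *sketch_alloc_frd_mr(struct ib_pd *pd, int max_pages)
{
        struct ib_mr *mr;

        /* try the gap-tolerant MR type first */
        mr = ib_alloc_mr(pd, IB_MR_TYPE_SG_GAPS, max_pages);
        if (!IS_ERR(mr))
                return mr;

        /* e.g. mlx4 does not implement IB_MR_TYPE_SG_GAPS; fall back */
        return ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_pages);
}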

Comment by Amir Shehata (Inactive) [ 19/Oct/17 ]

I updated https://review.whamcloud.com/#/c/29551/ to address the comments.

One thing to note: if you're using OPA you should set map-on-demand to 256. I'm still analyzing this issue and hopefully will have a patch soon. This issue is tracked under LU-10129.
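
For reference, map-on-demand here is the ko2iblnd module parameter map_on_demand; an illustrative way to set it is a modprobe option (the file name below is just an example):

# /etc/modprobe.d/ko2iblnd.conf (illustrative)
options ko2iblnd map_on_demand=256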

Comment by Gerrit Updater [ 24/Oct/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29551/
Subject: LU-10089 o2iblnd: use IB_MR_TYPE_SG_GAPS
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1b609396e468949f2420f14fed5ebfc999366b62

Comment by Peter Jones [ 24/Oct/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 25/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29771
Subject: LU-10089 o2iblnd: use IB_MR_TYPE_SG_GAPS
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: aacf8a650f495f50faf0135c6201ad0e446faf74

Comment by Minh Diep [ 02/Feb/18 ]

We don't need this in 2.10.
