[LU-10578] request_in_callback() event type 2, status -103, service ost_io Created: 30/Jan/18  Updated: 13/Apr/18  Resolved: 13/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.10.2, Lustre 2.10.3 RC1, CentOS 7.4


Attachments: Text File oak-gw04.kernel.log     File oak-gw22_lustre.log.gz     Text File oak-gw22_lustre_neterror.log     Text File oak-io1-s1.vmcore-dmesg.txt     File oak-io1-s2_lustre.log.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We're seeing a recurrent crash on OSSes running 2.10.2 on Oak. Many of the following log messages can be seen before the crash:

[928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io

Eventually, the server crashes after the NMI watchdog triggers:

[928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24

See oak-io1-s1.vmcore-dmesg.txt for full log.

After restarting the OSS, I can see the following in the logs:

Jan 29 12:45:21 oak-io1-s1 kernel: LNet: Using FMR for registration
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 1386 previous similar messages
Jan 29 12:45:21 oak-io1-s1 kernel: Lustre: oak-OST001c: Connection restored to 7cfcbcde-275c-eb1e-9911-bb9f7ea0c616 (at 10.0.2.223@o2ib5)
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff881c45339850 x1590957016747344/t0(0) o4->7f9fc76a-8204-2cde-2104-5efdf97586ca@10.0.2.223@o2ib5:162/0 lens 3440/1152 e 0 to 0 dl 1517258732 ref 1 fl Interpret:/0/0 rc 0/0
Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected

"message length 0 too small.." can also be found in LU-9983 (fixed in 2.10.3) and LU-10068 (not clear if fully resolved).

10.0.2.223@o2ib5 is oak-gw04, a Lustre client running in a VM with SR-IOV on IB (mlx4), serving as an SMB gateway. It had been working fine for almost a year while we were running 2.9, but we're now having this recurrent issue after upgrading clients and servers to the 2.10.x branch. For a while I suspected the change in map_on_demand, now set to 256 instead of 0, but after reading the changelogs I don't think it should have such an impact... We're using mlx4 on both clients and servers on Oak (but it's connected remotely to several LNet routers with both mlx4 and mlx5).
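
One way to rule that out, if needed, would be to pin map_on_demand back to 0 on a single test client via a module option; just a sketch, assuming module options rather than lnet.conf are used on that node:

# /etc/modprobe.d/ko2iblnd.conf on the test client
options ko2iblnd map_on_demand=0
# reload the LNet/Lustre modules, then verify:
lnetctl net show -v | grep map_on_demand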

I tried upgrading oak-gw04 to 2.10.3 RC1 and the same thing happened. I'm now in the process of upgrading all Lustre servers to 2.10.3 RC1, because it should also fix another issue we have, LU-10267, which is not related to this one. I will update this ticket if the issue happens again with all servers running 2.10.3 RC1...

A crash usually occurs after a few hours or days when oak-gw04 is up. Note that I don't get any OSS crash if oak-gw04 stays down. We have numerous other VMs using SR-IOV and have no issue with them, only this one (serving SMB). This "SMB gateway" is experimental; our users love it, but it's not critical for production.

Any ideas welcome...

Thanks!
Stephane



 Comments   
Comment by Peter Jones [ 30/Jan/18 ]

Let us know how you get on

Comment by Stephane Thiell [ 30/Jan/18 ]

Sure. Just restarted everything under 2.10.3 RC1 and started the SMB gateway. Will update as needed.

Comment by Stephane Thiell [ 31/Jan/18 ]

Unfortunately, a first OSS crashed again, and the logs there again point to the SMB gateway (10.0.2.223@o2ib5):

[168210.670422] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.683576] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.695777] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.712729] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.719227] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168246.606295] INFO: rcu_sched detected stalls on CPUs/tasks: { 18} (detected by 7, t=60002 jiffies, g=7390585, c=7390584, q=494642)
[168246.606296] Task dump for CPU 18:
[168246.606300] ll_ost_io00_103 R  running task        0 284327      2 0x00000088
[168246.606304]  ffffffff810ea5fc ffff881cabe1bd28 ffffffff81099f95 ffff881c6214d458
[168246.606306]  ffff88015329dee0 ffff881c32960b40 ffff8801b6c00000 ffff8801b6c00000
[168246.606308]  ffff881c3247f280 ffff883dcc6b4e00 ffff881d53a0fc00 ffff881cabe1bd90
[168246.606309] Call Trace:
[168246.606318]  [<ffffffff810ea5fc>] ? ktime_get+0x4c/0xd0
[168246.606324]  [<ffffffff81099f95>] ? mod_timer+0x185/0x220
[168246.606348]  [<ffffffffc0842bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[168246.606419]  [<ffffffffc0bbd9e8>] ? __lustre_unpack_msg+0x88/0x430 [ptlrpc]
[168246.606482]  [<ffffffffc0bf0273>] ? sptlrpc_svc_unwrap_request+0x73/0x600 [ptlrpc]
[168246.606536]  [<ffffffffc0bd1115>] ? ptlrpc_main+0x955/0x1e40 [ptlrpc]
[168246.606588]  [<ffffffffc0bd07c0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[168246.606592]  [<ffffffff810b098f>] ? kthread+0xcf/0xe0
[168246.606595]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[168246.606600]  [<ffffffff816b4f58>] ? ret_from_fork+0x58/0x90
[168246.606602]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[168260.719089] Lustre: oak-OST0002: Client b12d2aa6-e7f0-54da-6947-a475d16c533a (at 10.0.2.223@o2ib5) reconnecting
[168260.719091] Lustre: Skipped 2387 previous similar messages
[168260.720910] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168260.720939] LustreError: 284292:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[168260.720941] LustreError: 284292:0:(pack_generic.c:590:__lustre_unpack_msg()) Skipped 2660 previous similar messages
[168260.720944] LustreError: 284292:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1591039248701024
[168260.720945] LustreError: 284292:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 2416 previous similar messages
[168285.720074] Lustre: oak-OST0002: Connection restored to  (at 10.0.2.223@o2ib5)
[168285.720076] Lustre: Skipped 2662 previous similar messages
[168285.722111] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.729113] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.735719] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.742268] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.749380] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io


crash:

[168411.523150] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.533367] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.543574] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.554742] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.565872] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.576389] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.587405] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.598215] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.605427] NMI watchdog: Watchdog detected hard LOCKUP on cpu 26
[168411.605458] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase dell_rbu rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr lpc_ich mei_me mei dm_service_time ses enclosure sg ipmi_si ipmi_devintf ipmi_msghandler
[168411.605474]  shpchp wmi acpi_power_meter nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx4_core drm tg3 ahci crct10dif_pclmul crct10dif_common mpt3sas libahci crc32c_intel ptp raid_class libata megaraid_sas i2c_core devlink scsi_transport_sas pps_core
[168411.605476] CPU: 26 PID: 280365 Comm: lc_watchdogd Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1


After the OSTs failed over, the failover-partner OSS crashed too:

[101277.229184] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.240162] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.240175] LustreError: 181754:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[101277.240178] LustreError: 181754:0:(pack_generic.c:590:__lustre_unpack_msg()) Skipped 10158 previous similar messages
[101277.240182] LustreError: 181754:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1591039248701024
[101277.240183] LustreError: 181754:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 10158 previous similar messages
[101277.255135] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.269872] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io


crash:

...
[101298.937807] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.952543] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.966895] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.980745] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.994897] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101299.005048] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101299.005530] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[101299.005574] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm uas vfat fat iTCO_wdt iTCO_vendor_support usb_storage mpt2sas mptctl mptbase rpcsec_gss_krb5 nfsv4 dell_rbu dns_resolver nfs fscache mxm_wmi dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr mlx4_en mlx4_ib ib_core mlx4_core devlink lpc_ich mei_me mei dm_service_time ses enclosure sg ipmi_si
[101299.005593]  ipmi_devintf ipmi_msghandler shpchp wmi acpi_power_meter dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tg3 ttm ahci crct10dif_pclmul crct10dif_common drm mpt3sas libahci crc32c_intel ptp libata megaraid_sas i2c_core raid_class pps_core scsi_transport_sas
[101299.005596] CPU: 7 PID: 148188 Comm: lc_watchdogd Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1
...


No logs could be seen on the SMB gateway itself (the client), just OSTs disconnecting when the OSSes crashed. We've now shut down the SMB gateway to avoid further service interruptions. This is super weird. All of the above are running 2.10.3 RC1.

Stephane

Comment by Stephane Thiell [ 05/Feb/18 ]

The server crash itself could be caused by an issue with the serial console, similar to the one described in LU-7886. We're investigating this possibility. However, this wouldn't explain the flood of request_in_callback error messages, nor the sptlrpc_svc_unwrap_request() "error unpacking request from 12345-10.0.2.223@o2ib5" messages seen in the first place.

Comment by Stephane Thiell [ 16/Mar/18 ]

Just to note that this issue occurred on another Lustre client, also in a VM using SR-IOV for IB, but not exporting the filesystem with Samba, so it is likely not related to SMB. Now I wonder whether the issue is related to IB/SR-IOV?

Comment by Peter Jones [ 19/Mar/18 ]

Sonia

Can you please investigate?

Thanks

Peter

Comment by John Hammond [ 19/Mar/18 ]

Sonia,

It looks like errno 103 (ECONNABORTED) could only be coming from o2iblnd in this case. Could you take a look?

Comment by Sonia Sharma (Inactive) [ 20/Mar/18 ]

Hi Stephane

Looks like neterror logging is not turned on. Can you enable it and provide the logs from /var/log/messages and the dmesg output when the issue occurs again? That would show the network-related errors occurring when we see ECONNABORTED.

What does the configuration look like on the servers, clients, and routers, specifically on oak-gw04?

Can you run the command below on the nodes and share the output:

lnetctl net show -v

Thanks

Comment by Stephane Thiell [ 20/Mar/18 ]

Hi Sonia,

Thanks for looking at this! I have enabled neterror logging on oak-gw04 and the servers. I'll try to reproduce the error, but we are trying to reduce the impact on production, so we have temporarily limited access to the SMB gateway. I'll see what I can do to make it happen again...
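
For the record, this is roughly what I ran to turn it on (a sketch from memory, using the standard lctl set_param syntax):

lctl set_param debug=+neterror      # add neterror to the Lustre debug mask
lctl set_param printk=+neterror     # optionally mirror those messages to syslog too
lctl dk > /tmp/lustre_neterror.log  # dump the debug buffer after the next occurrence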

Note: oak-gw04 is on the same Lustre network as the servers (o2ib5), so there are no routers between them.

[root@oak-gw04 ~]# lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib5
      local NI(s):
        - nid: 10.0.2.223@o2ib5
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 203
              recv_count: 203
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 256
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"

Servers (I checked all of them and they all look the same, just the NIDs and counters change):

[root@oak-io1-s1 ~]# lnetctl net show -v
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
    - net type: o2ib5
      local NI(s):
        - nid: 10.0.2.101@o2ib5
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 361208357
              recv_count: 310772624
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 256
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"

Also, the IB HBA firmware is recent (2.42.5000). The cards are Mellanox mlx4 FDR, PSID MT_1100120019.
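
(For reference, the firmware version and board PSID can be double-checked with ibv_devinfo, which prints fw_ver and board_id; just a quick sketch:)

ibv_devinfo | grep -E 'fw_ver|board_id'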

Thanks,
Stephane 

 

Comment by Stephane Thiell [ 22/Mar/18 ]

Hi,

Ah, we just got another occurrence of the issue from another client!

Attaching:

  • logs from the client oak-gw22 (10.0.2.242@o2ib5) as oak-gw22_lustre.log.gz
  • logs from the impacted OSS oak-io1-s2 (10.0.2.102@o2ib5) as oak-io1-s2_lustre.log.gz

Does that help?

Thanks!

Stephane

Comment by Stephane Thiell [ 22/Mar/18 ]

Note that the OSS didn't crash this time, probably because we completely disabled the serial console.
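
In case it's useful to others: disabling it basically meant dropping the console=ttyS... entries from the kernel command line. A rough sketch for CentOS 7; the exact console= argument varies per node:

# remove the console=ttyS0,115200 entry from GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg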

Comment by Stephane Thiell [ 23/Mar/18 ]

Happened again. Attached oak-gw22_lustre_neterror.log this time with neterror:

00000100:02000000:0.0F:1521773263.363016:0:1734:0:(import.c:1542:ptlrpc_import_recovery_state_machine()) oak-OST0013-osc-ffff8800b8462000: Connection restored to 10.0.2.102@o2ib5 (at 10.0.2.102@o2ib5)
00000800:00000100:3.0F:1521773263.363190:0:1675:0:(o2iblnd_cb.c:3470:kiblnd_complete()) FastReg failed: 6
00000800:00000100:3.0:1521773263.363193:0:1675:0:(o2iblnd_cb.c:3481:kiblnd_complete()) RDMA (tx: ffffc90001d095d8) failed: 5
00000800:00000100:3.0:1521773263.363196:0:1675:0:(o2iblnd_cb.c:967:kiblnd_tx_complete()) Tx -> 10.0.2.102@o2ib5 cookie 0x2671f sending 1 waiting 0: failed 5
00000800:00000100:3.0:1521773263.363199:0:1675:0:(o2iblnd_cb.c:1914:kiblnd_close_conn_locked()) Closing conn to 10.0.2.102@o2ib5: error -5(waiting)

 

Comment by Stephane Thiell [ 23/Mar/18 ]

Apparently this issue is already known... reported by James Simmons in LU-9810 last September. Is there a patch? I see that https://review.whamcloud.com/28279/ has been merged on master but I can't find it in 2.10. Could you please advise, as this has a big impact on our users. Thanks!!

Comment by Stephane Thiell [ 23/Mar/18 ]

Hello,

I applied https://review.whamcloud.com/#/c/28279/ on the client last night just in case, but this "FastReg failed: 6" error is still reproducible under load. Is there a workaround to avoid FastReg entirely? This error appeared for us with 2.10; we never saw it when using Lustre 2.9. Thanks much.
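
For reference, I pulled the change in roughly like this (a sketch, assuming the standard Gerrit refs/changes layout and the fs/lustre-release project path; the trailing patch-set number is just a placeholder):

git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/79/28279/1
git cherry-pick FETCH_HEAD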

Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ]

Hi Stephane,

LU-9983 had three patches. Of those, only two were applied to 2.10; https://review.whamcloud.com/29290/ was not applied to 2.10 because it was thought that the other two patches would take care of the issue observed with the LU-9810 "lnet: prefer Fast Reg" patch.

Basically, the issues were seen with this patch: https://review.whamcloud.com/#/c/28278/. You can try to revert it and see if it helps.

Or you can try to apply https://review.whamcloud.com/29290/ from LU-9983 as well. But note that this patch triggered another issue, which is taken care of in LU-10129.
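
If it helps, reverting it locally would look roughly like this (a sketch; locate the commit from change 28278 by its subject, and <commit-hash> is a placeholder):

git log --oneline | grep -i 'fast reg'   # or whatever subject matches change 28278 in your tree
git revert <commit-hash>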

Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ]

My bad. I see the LU-9810 "lnet: prefer Fast Reg" patch has already been reverted and is not in 2.10.

Comment by Stephane Thiell [ 23/Mar/18 ]

Thanks much Sonia! I'll check. One thing seems weird: my 2.10.3 clients (using lustre-client RPMs) start with

 kernel: LNet: Using FastReg for registration

whereas my Lustre 2.10.3 servers start with:

kernel: LNet: Using FMR for registration

Both are using mlx4... hmm...
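
A quick way to compare nodes is simply to grep for the registration message shown above on each of them:

dmesg | grep -E 'Using (FMR|FastReg) for registration'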

Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ]

Yes, this shouldn't be the case for clients. How are you getting the lustre-client RPMs? Are you building from source?

Comment by Stephane Thiell [ 23/Mar/18 ]

I tried both ways actually: building from source (kmod) and with the Intel-provided RPMs (lustre-client-dkms).
I wonder if this is related to SR-IOV? On other clients using mlx4 but not VFs/slaves, I can see "LNet: Using FMR for registration".

On this old thread, FMR doesn't seem supported on slaves:
https://www.spinics.net/lists/linux-rdma/msg22014.html

Any idea? Thanks!

Comment by Stephane Thiell [ 23/Mar/18 ]

I bet there was a change in the way FastReg is handled in 2.10 that is breaking the use of FastReg on mlx4 slaves...

Comment by Stephane Thiell [ 23/Mar/18 ]

I applied https://review.whamcloud.com/29290/ on top of 2.10.3, and I am still able to reproduce the issue.

00000100:02000000:1.0:1521826954.348730:0:1622:0:(import.c:1542:ptlrpc_import_recovery_state_machine()) oak-OST0012-osc-ffff8800bb5c9000: Connection restored to 10.0.2.101@o2ib5 (at 10.0.2.101@o2ib5)
00000800:00000100:0.0:1521826954.348944:0:1558:0:(o2iblnd_cb.c:3476:kiblnd_complete()) FastReg failed: 6
00000800:00000100:0.0:1521826954.348946:0:1558:0:(o2iblnd_cb.c:3487:kiblnd_complete()) RDMA (tx: ffffc90001d60940) failed: 5
00000800:00000100:0.0:1521826954.348949:0:1558:0:(o2iblnd_cb.c:973:kiblnd_tx_complete()) Tx -> 10.0.2.101@o2ib5 cookie 0x25ef sending 1 waiting 0: failed 5
00000800:00000100:0.0:1521826954.348951:0:1558:0:(o2iblnd_cb.c:1920:kiblnd_close_conn_locked()) Closing conn to 10.0.2.101@o2ib5: error -5(waiting)

Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ]

Yes, that patch wouldn't resolve it. I was mistaken earlier; I see now that LU-9810 is already reverted in 2.10.
The question now is why, in the absence of the "Prefer Fast-Reg" patch, mlx4 clients are still using Fast-Reg while the servers are using FMR.

And if it is because of SR-IOV, then as I understand it you said the issues did not appear with Lustre 2.9 on the same SR-IOV mlx4 clients.

I am looking into the changes from 2.9 to 2.10 that might be related to this.

Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ]

Stephane 

Can you confirm that you are seeing this issue with Lustre 2.10 but not with 2.9 on the same kernel and MOFED versions?

Comment by Stephane Thiell [ 23/Mar/18 ]

Hi Sonia,

We're using the in-kernel OFED stack on clients and servers.

I cannot confirm this, as another major change between our Lustre 2.9 and 2.10 client setups is the kernel version (CentOS 7.3 to 7.4) and the removal of ib_get_dma_mr from the in-kernel OFED stack in 7.4. For example, I'm not able to build Lustre 2.9 on CentOS 7.4 because of this. I could perhaps try with MOFED...

 

Comment by Stephane Thiell [ 23/Mar/18 ]

Also, I was checking the documentation of our InfiniBand FDR HBAs (PSID MT_1100120019) at http://www.mellanox.com/pdf/firmware/ConnectX3-FW-2_42_5000-release_notes.pdf

and there is a mention of "FMR for SRIOV" in Beta support in a quite old firmware (2.11.0500, and we have the latest firmware, 2.42.5000). So I do have the feeling it should be supported. In that case, I'm wondering why Lustre isn't picking FMR.

Comment by Stephane Thiell [ 23/Mar/18 ]

Hi Sonia,

I'm testing right now with Lustre 2.10.3 + MOFED 4.2-1.2.0.0 (el7.4) in the VM (guest OS). The reproducer doesn't seem to trigger the issue so far. Fingers crossed!
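
For reference, the client build against MOFED was along these lines (a sketch, assuming the MOFED kernel sources are in the usual /usr/src/ofa_kernel/default location):

./configure --with-o2ib=/usr/src/ofa_kernel/default --disable-server
make rpms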

Note: in that case, when loading LNet / Lustre, there is no FMR or FastReg message at all, just this:

[   24.306279] LNet: HW NUMA nodes: 1, HW CPU cores: 4, npartitions: 1
[   24.310169] alg: No test for adler32 (adler32-zlib)
[   24.310935] alg: No test for crc32 (crc32-table)
[   25.141828] Lustre: Lustre: Build Version: 2.10.3_MOFED
[   25.282150] LNet: Added LNI 10.0.2.242@o2ib5 [8/256/0/180]
[   50.563366] Lustre: Mounted oak-client

Comment by Stephane Thiell [ 28/Mar/18 ]

Hi Sonia,

The VMs using MOFED are now working fine. I wasn't able to reproduce the issue. We're still running some tests, but that looks good. We now believe the issue was introduced in RHEL/CentOS 7.4. We didn't see this with Lustre 2.9 because we were still on CentOS 7.3 or earlier, and it used to work well. I don't know the details of what has changed in the OFA stack in EL7.4, but I'll try to open a Red Hat ticket when I get some time. Our solution going forward is to use MOFED 4.2 on the VMs that use SR-IOV over mlx4.
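
For our notes, installing MOFED in the guests was essentially the stock procedure, something like the following (a sketch; exact flags may differ per setup):

./mlnxofedinstall --add-kernel-support --force
/etc/init.d/openibd restart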

Thanks much.

Stephane

Comment by Sonia Sharma (Inactive) [ 28/Mar/18 ]

Thanks Stephane for the update. I will keep a note of this as well.

Comment by Peter Jones [ 13/Apr/18 ]

Stephane

Do you need anything further on this ticket?

Peter

Comment by Stephane Thiell [ 13/Apr/18 ]

Hi Peter – no thanks, we're good; using MOFED 4.2 solved our issue and everything is working well on our Oak client VMs. It's up to you (Intel) whether you want to investigate why CentOS 7.4 with in-kernel OFA + SR-IOV + Lustre 2.10 doesn't work under I/O load. It might not be worth it, especially since RHEL 7.5 is now out and probably comes with an updated OFA stack.

Thanks for your help in finding the root cause!

Stephane


Comment by Peter Jones [ 13/Apr/18 ]

OK, thanks Stephane. Onwards and upwards, I think.
