[LU-10578] request_in_callback() event type 2, status -103, service ost_io Created: 30/Jan/18 Updated: 13/Apr/18 Resolved: 13/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.10.2, Lustre 2.10.3 RC1, CentOS 7.4 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We're seeing a recurrent crash on OSSes running 2.10.2 on Oak. Many of the following log messages can be seen before the crash:

[928518.391343] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.401925] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[928518.410543] LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io

Finally, the server crashes after an NMI watchdog is triggered:

[928639.223483] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24

See oak-io1-s1.vmcore-dmesg.txt for the full log. After restarting the OSS, I can see the following in the logs:

Jan 29 12:45:21 oak-io1-s1 kernel: LNet: Using FMR for registration
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1590957016747328
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 1386 previous similar messages
Jan 29 12:45:21 oak-io1-s1 kernel: Lustre: oak-OST001c: Connection restored to 7cfcbcde-275c-eb1e-9911-bb9f7ea0c616 (at 10.0.2.223@o2ib5)
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 325230:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
Jan 29 12:45:21 oak-io1-s1 kernel: LustreError: 359716:0:(ldlm_lib.c:3247:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff881c45339850 x1590957016747344/t0(0) o4->7f9fc76a-8204-2cde-2104-5efdf97586ca@10.0.2.223@o2ib5:162/0 lens 3440/1152 e 0 to 0 dl 1517258732 ref 1 fl Interpret:/0/0 rc 0/0
Jan 29 12:45:21 oak-io1-s1 kernel: LNet: 325230:0:(o2iblnd_cb.c:1350:kiblnd_reconnect_peer()) Abort reconnection of 10.0.2.223@o2ib5: connected

The "message length 0 too small.." messages can also be found in

10.0.2.223@o2ib5 is oak-gw04, a Lustre client running in a VM with SR-IOV on IB (mlx4), serving as an SMB gateway. It had been working fine for almost a year as long as we were running 2.9, but we're now hitting this recurrent issue after upgrading clients and servers to the 2.10.x branch. For a while I suspected the change of map_on_demand, now set to 256 instead of 0, but after reading the changelogs I don't think it should have such an impact... We're using mlx4 on both clients and servers on Oak (but it's connected to several LNet routers with both mlx4 and mlx5 remotely).

I tried to upgrade oak-gw04 to 2.10.3 RC1 and the same happened. I'm now in the process of upgrading all Lustre servers to 2.10.3 RC1, because it should also fix another issue that we have. A crash usually occurs after a few hours/days when oak-gw04 is up. Note that I don't get any OSS crash if oak-gw04 stays down. We have numerous other VMs using SR-IOV and have no issue with them, only this one (serving SMB). This "SMB gateway" is experimental; our users love it but it's not critical for production.

Any ideas welcome... Thanks! |
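A minimal shell sketch (assuming the attached oak-io1-s1.vmcore-dmesg.txt is at hand; the grep patterns simply match the messages quoted above) for sizing the error burst that precedes the lockup:

# Count the request_in_callback errors and locate the watchdog lockup
grep -c 'request_in_callback() event type 2, status -103' oak-io1-s1.vmcore-dmesg.txt
grep -n 'NMI watchdog: Watchdog detected hard LOCKUP' oak-io1-s1.vmcore-dmesg.txt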
| Comments |
| Comment by Peter Jones [ 30/Jan/18 ] |
|
Let us know how you get on |
| Comment by Stephane Thiell [ 30/Jan/18 ] |
|
Sure. Just restarted everything under 2.10.3 RC1 and started the SMB gateway. Will update as needed. |
| Comment by Stephane Thiell [ 31/Jan/18 ] |
|
Unfortunately, a first OSS crashed again, and the logs there again point to the SMB gateway (10.0.2.223@o2ib5):
[168210.670422] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.683576] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.695777] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.712729] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168210.719227] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168246.606295] INFO: rcu_sched detected stalls on CPUs/tasks: { 18} (detected by 7, t=60002 jiffies, g=7390585, c=7390584, q=494642)
[168246.606296] Task dump for CPU 18:
[168246.606300] ll_ost_io00_103 R running task 0 284327 2 0x00000088
[168246.606304] ffffffff810ea5fc ffff881cabe1bd28 ffffffff81099f95 ffff881c6214d458
[168246.606306] ffff88015329dee0 ffff881c32960b40 ffff8801b6c00000 ffff8801b6c00000
[168246.606308] ffff881c3247f280 ffff883dcc6b4e00 ffff881d53a0fc00 ffff881cabe1bd90
[168246.606309] Call Trace:
[168246.606318] [<ffffffff810ea5fc>] ? ktime_get+0x4c/0xd0
[168246.606324] [<ffffffff81099f95>] ? mod_timer+0x185/0x220
[168246.606348] [<ffffffffc0842bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[168246.606419] [<ffffffffc0bbd9e8>] ? __lustre_unpack_msg+0x88/0x430 [ptlrpc]
[168246.606482] [<ffffffffc0bf0273>] ? sptlrpc_svc_unwrap_request+0x73/0x600 [ptlrpc]
[168246.606536] [<ffffffffc0bd1115>] ? ptlrpc_main+0x955/0x1e40 [ptlrpc]
[168246.606588] [<ffffffffc0bd07c0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[168246.606592] [<ffffffff810b098f>] ? kthread+0xcf/0xe0
[168246.606595] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[168246.606600] [<ffffffff816b4f58>] ? ret_from_fork+0x58/0x90
[168246.606602] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[168260.719089] Lustre: oak-OST0002: Client b12d2aa6-e7f0-54da-6947-a475d16c533a (at 10.0.2.223@o2ib5) reconnecting
[168260.719091] Lustre: Skipped 2387 previous similar messages
[168260.720910] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168260.720939] LustreError: 284292:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[168260.720941] LustreError: 284292:0:(pack_generic.c:590:__lustre_unpack_msg()) Skipped 2660 previous similar messages
[168260.720944] LustreError: 284292:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1591039248701024
[168260.720945] LustreError: 284292:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 2416 previous similar messages
[168285.720074] Lustre: oak-OST0002: Connection restored to (at 10.0.2.223@o2ib5)
[168285.720076] Lustre: Skipped 2662 previous similar messages
[168285.722111] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.729113] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.735719] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.742268] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168285.749380] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
crash:
[168411.523150] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.533367] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.543574] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.554742] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.565872] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.576389] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.587405] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.598215] LustreError: 271224:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[168411.605427] NMI watchdog: Watchdog detected hard LOCKUP on cpu 26
[168411.605458] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase dell_rbu rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr lpc_ich mei_me mei dm_service_time ses enclosure sg ipmi_si ipmi_devintf ipmi_msghandler
[168411.605474] shpchp wmi acpi_power_meter nfsd dm_multipath dm_mod auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx4_core drm tg3 ahci crct10dif_pclmul crct10dif_common mpt3sas libahci crc32c_intel ptp raid_class libata megaraid_sas i2c_core devlink scsi_transport_sas pps_core
[168411.605476] CPU: 26 PID: 280365 Comm: lc_watchdogd Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1

After the OSTs failed over, the OSS pair crashed too:
[101277.229184] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.240162] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.240175] LustreError: 181754:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[101277.240178] LustreError: 181754:0:(pack_generic.c:590:__lustre_unpack_msg()) Skipped 10158 previous similar messages
[101277.240182] LustreError: 181754:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.0.2.223@o2ib5 x1591039248701024
[101277.240183] LustreError: 181754:0:(sec.c:2069:sptlrpc_svc_unwrap_request()) Skipped 10158 previous similar messages
[101277.255135] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101277.269872] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
crash:
...
[101298.937807] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.952543] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.966895] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.980745] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101298.994897] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101299.005048] LustreError: 144304:0:(events.c:304:request_in_callback()) event type 2, status -103, service ost_io
[101299.005530] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[101299.005574] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq libcrc32c async_xor xor async_tx ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm uas vfat fat iTCO_wdt iTCO_vendor_support usb_storage mpt2sas mptctl mptbase rpcsec_gss_krb5 nfsv4 dell_rbu dns_resolver nfs fscache mxm_wmi dcdbas sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr mlx4_en mlx4_ib ib_core mlx4_core devlink lpc_ich mei_me mei dm_service_time ses enclosure sg ipmi_si
[101299.005593] ipmi_devintf ipmi_msghandler shpchp wmi acpi_power_meter dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tg3 ttm ahci crct10dif_pclmul crct10dif_common drm mpt3sas libahci crc32c_intel ptp libata megaraid_sas i2c_core raid_class pps_core scsi_transport_sas
[101299.005596] CPU: 7 PID: 148188 Comm: lc_watchdogd Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1
...

No logs could be seen on the SMB gateway itself (the client), just OSTs disconnecting when the OSSes crashed. We've now shut down the SMB gateway to avoid further service interruptions. This is super weird. All of the above are running 2.10.3 RC1. Stephane |
| Comment by Stephane Thiell [ 05/Feb/18 ] |
|
The server crash itself could be caused by an issue with the serial console, similar to the one described in |
| Comment by Stephane Thiell [ 16/Mar/18 ] |
|
Just to note that this issue occurred on another Lustre client, also in a VM using SR-IOV for IB, but not exporting the filesystem via Samba, so it is likely not related to SMB. Now I wonder whether the issue is related to IB/SR-IOV? |
| Comment by Peter Jones [ 19/Mar/18 ] |
|
Sonia, can you please investigate? Thanks, Peter |
| Comment by John Hammond [ 19/Mar/18 ] |
|
Sonia, It looks like errno 103 (ECONNABORTED) could only be coming from o2iblnd in this case. Could you take a look? |
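For reference, errno 103 does correspond to ECONNABORTED on Linux; a quick check (the header path assumes a standard kernel-headers install):

# ECONNABORTED is defined as 103 in the generic errno header
grep -w ECONNABORTED /usr/include/asm-generic/errno.h
# expected: #define ECONNABORTED 103 /* Software caused connection abort */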
| Comment by Sonia Sharma (Inactive) [ 20/Mar/18 ] |
|
Hi Stephane, It looks like neterror logging is not turned on. Can you enable it and provide the logs from /var/log/messages and the dmesg output when the issue occurs again? That would show the network-related errors occurring when we see ECONNABORTED. What does the configuration look like on the servers, clients, and routers, specifically on oak-gw04? Can you run the command below on the nodes and share the output?

lnetctl net show -v

Thanks |
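A minimal sketch of one way to turn on and capture the neterror debug messages (lctl set_param debug and lctl dk are standard; the buffer size and output path below are arbitrary choices):

# add the neterror flag to the running debug mask and enlarge the debug buffer
lctl set_param debug=+neterror
lctl set_param debug_mb=256
# once the problem reproduces, dump the kernel debug log to a file for attachment
lctl dk /tmp/lustre-neterror.log
# and collect the LNet/LND configuration for comparison across nodes
lnetctl net show -v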
| Comment by Stephane Thiell [ 20/Mar/18 ] |
|
Hi Sonia, Thanks for looking at this! I have enabled neterror logging on oak-gw04 and the servers. I'll try to reproduce the error, but we are trying to reduce the impact on production, so we have temporarily limited access to the SMB gateway. I'll see what I can do to make it happen again... Note: oak-gw04 is on the same Lustre network as the servers (o2ib5), so there are no routers between them.
[root@oak-gw04 ~]# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
lnd tunables:
tcp bonding: 0
dev cpt: 0
CPT: "[0]"
- net type: o2ib5
local NI(s):
- nid: 10.0.2.223@o2ib5
status: up
interfaces:
0: ib0
statistics:
send_count: 203
recv_count: 203
drop_count: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 256
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
tcp bonding: 0
dev cpt: -1
CPT: "[0]"
Servers (I checked all of them and they all look the same, just the NIDs and counters change):
[root@oak-io1-s1 ~]# lnetctl net show -v
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
statistics:
send_count: 0
recv_count: 0
drop_count: 0
tunables:
peer_timeout: 0
peer_credits: 0
peer_buffer_credits: 0
credits: 0
lnd tunables:
tcp bonding: 0
dev cpt: 0
CPT: "[0,1]"
- net type: o2ib5
local NI(s):
- nid: 10.0.2.101@o2ib5
status: up
interfaces:
0: ib0
statistics:
send_count: 361208357
recv_count: 310772624
drop_count: 0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 256
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
tcp bonding: 0
dev cpt: 1
CPT: "[0,1]"
Also, the IB HBA firmware is recent: 2.42.5000. The cards are Mellanox mlx4 FDR, PSID MT_1100120019. Thanks,
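A small sketch (the node names and ssh access are assumptions) to diff just the LND configuration between the gateway and one OSS; only the NIDs and traffic counters are expected to differ:

# compare lnetctl output side by side; tunables should match, NIDs/counters won't
diff <(ssh oak-gw04 lnetctl net show -v) <(ssh oak-io1-s1 lnetctl net show -v)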
|
| Comment by Stephane Thiell [ 22/Mar/18 ] |
|
Hi, Ah, we just got another occurrence of the issue from another client! Attaching:
Does that help? Thanks! Stephane |
| Comment by Stephane Thiell [ 22/Mar/18 ] |
|
Note that the OSS didn't crash this time probably because we completely disabled the serial console. |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Happened again. Attached oak-gw22_lustre_neterror.log this time with neterror:
00000100:02000000:0.0F:1521773263.363016:0:1734:0:(import.c:1542:ptlrpc_import_recovery_state_machine()) oak-OST0013-osc-ffff8800b8462000: Connection restored to 10.0.2.102@o2ib5 (at 10.0.2.102@o2ib5)
00000800:00000100:3.0F:1521773263.363190:0:1675:0:(o2iblnd_cb.c:3470:kiblnd_complete()) FastReg failed: 6
00000800:00000100:3.0:1521773263.363193:0:1675:0:(o2iblnd_cb.c:3481:kiblnd_complete()) RDMA (tx: ffffc90001d095d8) failed: 5
00000800:00000100:3.0:1521773263.363196:0:1675:0:(o2iblnd_cb.c:967:kiblnd_tx_complete()) Tx -> 10.0.2.102@o2ib5 cookie 0x2671f sending 1 waiting 0: failed 5
00000800:00000100:3.0:1521773263.363199:0:1675:0:(o2iblnd_cb.c:1914:kiblnd_close_conn_locked()) Closing conn to 10.0.2.102@o2ib5: error -5(waiting)
|
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Apparently this issue is already known... reported by James Simmons in |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Hello, I applied https://review.whamcloud.com/#/c/28279/ on the client last night just in case, but this "FastReg failed: 6" error is still reproducible under load. Is there a workaround to avoid FastReg altogether? This error appeared for us with 2.10; we never saw it when using Lustre 2.9. Thanks much. |
| Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ] |
|
Hi Stephane,
Basically, the issues were seen with this patch: https://review.whamcloud.com/#/c/28278/ . You can try to revert it and see if it helps. Or you can also try to apply https://review.whamcloud.com/29290/ from |
| Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ] |
|
My bad. I see |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Thanks much Sonia! I'll check. One thing seems weird: my 2.10.3 clients (using lustre-client RPMs) start with:
kernel: LNet: Using FastReg for registration
whereas my Lustre 2.10.3 servers start with:
kernel: LNet: Using FMR for registration
Both are using mlx4... hmm... |
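A quick way to compare which registration scheme each node picked at LNI setup, using the console messages quoted above:

# ko2iblnd logs the chosen memory-registration mode when the LNI is added
dmesg | grep -E 'LNet: Using (FMR|FastReg) for registration'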
| Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ] |
|
Yes this shouldn't be the case for clients. How are you getting lustre-client RPMs, building from source? |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
I tried both ways actually: building from source (kmod) and with the Intel-provided RPMs (lustre-client-dkms). According to this old thread, FMR doesn't seem to be supported on slaves: Any idea? Thanks! |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
I bet there was a change in the way FastReg is handled in 2.10 that is breaking the use of FastReg on mlx4 slaves... |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
I applied https://review.whamcloud.com/29290/ on top of 2.10.3, and I am still able to reproduce the issue.
00000100:02000000:1.0:1521826954.348730:0:1622:0:(import.c:1542:ptlrpc_import_recovery_state_machine()) oak-OST0012-osc-ffff8800bb5c9000: Connection restored to 10.0.2.101@o2ib5 (at 10.0.2.101@o2ib5)
00000800:00000100:0.0:1521826954.348944:0:1558:0:(o2iblnd_cb.c:3476:kiblnd_complete()) FastReg failed: 6
00000800:00000100:0.0:1521826954.348946:0:1558:0:(o2iblnd_cb.c:3487:kiblnd_complete()) RDMA (tx: ffffc90001d60940) failed: 5
00000800:00000100:0.0:1521826954.348949:0:1558:0:(o2iblnd_cb.c:973:kiblnd_tx_complete()) Tx -> 10.0.2.101@o2ib5 cookie 0x25ef sending 1 waiting 0: failed 5
00000800:00000100:0.0:1521826954.348951:0:1558:0:(o2iblnd_cb.c:1920:kiblnd_close_conn_locked()) Closing conn to 10.0.2.101@o2ib5: error -5(waiting) |
| Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ] |
|
Yes, it wouldn't resolve it. I was mistaken earlier and saw that
And if it is because of SR-IOV, then I am going by what you said: the issues did not appear with Lustre 2.9 on the same SR-IOV mlx4 clients. I am looking into the changes from 2.9 to 2.10 that might be related to this. |
| Comment by Sonia Sharma (Inactive) [ 23/Mar/18 ] |
|
Stephane, can you confirm that it is with the same kernel and MOFED versions that you are seeing this issue in Lustre 2.10 but not in 2.9? |
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Hi Sonia, We're using the in-kernel OFED stack on clients and servers. I cannot confirm this, as another major change between our Lustre 2.9 and 2.10 client setups is actually the kernel version (CentOS 7.3 to 7.4) and the removal of ib_get_dma_mr from the in-kernel OFED stack in 7.4. For example, I'm not able to build Lustre 2.9 on CentOS 7.4 due to this issue. Now, I could try with MOFED maybe...
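For reference, a hedged sketch of pointing a client build at MOFED instead of the in-kernel stack (--with-o2ib is the standard configure switch; the ofa_kernel path varies by MOFED release and is an assumption here):

# build lustre-client against the MOFED kernel sources rather than in-kernel OFED
./configure --disable-server --with-o2ib=/usr/src/ofa_kernel/default
make rpms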
|
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Also, I was checking the documentation of our InfiniBand FDR HBAs (PSID MT_1100120019) at http://www.mellanox.com/pdf/firmware/ConnectX3-FW-2_42_5000-release_notes.pdf, and there is a mention of "FMR for SRIOV" as Beta support in quite old firmware (2.11.0500; we have the latest fw, 2.42.5000). So I do have the feeling it should be supported, in which case I'm wondering why Lustre isn't picking up FMR. |
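A small sketch (assuming the usual Mellanox tooling is installed; the PCI address is only an example) to confirm the running firmware and PSID on the HCA:

# firmware version as reported through the verbs stack
ibv_devinfo | grep fw_ver
# PSID and firmware version read from the adapter itself
mstflint -d 04:00.0 query | grep -E 'FW Version|PSID'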
| Comment by Stephane Thiell [ 23/Mar/18 ] |
|
Hi Sonia, I'm testing right now with Lustre 2.10.3 + MOFED 4.2-1.2.0.0 (el7.4) in the VM (guest OS). The reproducer doesn't seem to trigger the issue so far. Fingers crossed! Note: in that case, when loading LNet / Lustre, there is no FMR nor FastReg message at all, just this:
[ 24.306279] LNet: HW NUMA nodes: 1, HW CPU cores: 4, npartitions: 1
[ 24.310169] alg: No test for adler32 (adler32-zlib)
[ 24.310935] alg: No test for crc32 (crc32-table)
[ 25.141828] Lustre: Lustre: Build Version: 2.10.3_MOFED
[ 25.282150] LNet: Added LNI 10.0.2.242@o2ib5 [8/256/0/180]
[ 50.563366] Lustre: Mounted oak-client |
| Comment by Stephane Thiell [ 28/Mar/18 ] |
|
Hi Sonia, The VMs using MOFED are now working fine. I wasn't able to reproduce the issue. We're still running some tests, but that looks good. We now believe the issue was introduced in RHEL/CentOS 7.4. We didn't see this with Lustre 2.9 because we were still on CentOS 7.3 or earlier, and it used to work well. I don't know the details of what has changed in OFA/EL7.4; I'll try to open a Red Hat ticket when I get some time, but our solution going forward is to use MOFED 4.2 on the VMs that use SR-IOV over mlx4. Thanks much. Stephane |
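A quick sketch (ofed_info ships with MOFED) to confirm that a VM is actually running the MOFED stack rather than the distro drivers:

# installed MOFED release, e.g. MLNX_OFED_LINUX-4.2-1.2.0.0
ofed_info -s
# the mlx4 modules should resolve to the MOFED path, not the distro one
modinfo mlx4_core | grep -E '^(filename|version):'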
| Comment by Sonia Sharma (Inactive) [ 28/Mar/18 ] |
|
Thanks Stephane for the update. I will keep a note of this as well. |
| Comment by Peter Jones [ 13/Apr/18 ] |
|
Stephane, do you need anything further on this ticket? Peter |
| Comment by Stephane Thiell [ 13/Apr/18 ] |
|
Hi Peter, no thanks, we're good: using MOFED 4.2 solved our issue, and everything is working well on our Oak client VMs. It's up to you (Intel) if you want to investigate why CentOS 7.4 with in-kernel OFA + SR-IOV + Lustre 2.10 doesn't work under I/O load. It might not be worth it, especially since RHEL 7.5 is now out and probably comes with an updated OFA. Thanks for your help in finding the root cause! Stephane
|
| Comment by Peter Jones [ 13/Apr/18 ] |
|
ok - thanks Stephane. I think upwards and onwards |