[LU-12856] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336) Created: 14/Oct/19  Updated: 03/Jun/22  Resolved: 06/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

L2.12.2 server & L2.10.8 server
L2.12.2 client & L2.11 client


Attachments: File 0001-LU-12856-debug.patch     Text File 0002-LU-12856-revert-LU-9983.patch     File client.10.151.53.134.debug.gz     File nbp7-mds.2019-10-16.out.gz     File r407i2n14.out2.gz     File server.xid.1647322708401808.debug.gz    
Issue Links:
Duplicate
duplicates LU-12772 bulk timeout after 2.12.2 clients upg... Closed
Related
is related to LU-1757 Short I/O support Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We are seeing:

 Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0
Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0
Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages
Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib)
Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages
Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0
Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages

We have the 3 patches reported in LU-12385 and LU-12772 applied to both client and server.
Client Configs

r593i6n16 ~ # lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 100
    health_sensitivity: 0
    recovery_interval: 1
----
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
options lnet networks=o2ib(ib1)
options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
options lnet lnet_transaction_timeout=100
options ptlrpc at_max=600 at_min=200

Server Configs

nbp7-oss7 ~ # lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 100
    health_sensitivity: 0
    recovery_interval: 1
----
options ko2iblnd require_privileged_port=0 use_privileged_port=0
options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32


#lnet
options lnet networks=o2ib(ib1) 
options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
options lnet dead_router_check_interval=60 live_router_check_interval=30
options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
options lnet lnet_transaction_timeout=100
options ptlrpc at_max=600 at_min=200
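
For reference, the LNet-level tunables above (set here via modprobe options on both the clients and the servers) can also be inspected or changed at runtime with lnetctl. A minimal sketch, assuming a 2.12-era lnetctl (verify the sub-commands with "lnetctl set --help" before relying on them):

lnetctl set transaction_timeout 100   # mirrors "options lnet lnet_transaction_timeout=100"
lnetctl set retry_count 0             # LNet-level resends disabled, as configured here
lnetctl set health_sensitivity 0      # LNet health disabled, as configured here
lnetctl global show                   # confirm the values took effect

# The ko2iblnd settings are module parameters and can be checked via sysfs:
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/concurrent_sends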

These bulk errors are a major issue. We thought matching client and server at 2.12.2 would solve it, but it doesn't look like it did.



 Comments   
Comment by Mahmoud Hanafi [ 14/Oct/19 ]

Attaching client and server debug logs that show the trace of xid 1647322708401808:

server.xid.1647322708401808.debug.gz

client.10.151.53.134.debug.gz

Comment by Peter Jones [ 14/Oct/19 ]

Amir

Can you please advise here?

Thanks

Peter

Comment by Mahmoud Hanafi [ 16/Oct/19 ]

I was able to reproduce this with 10 clients running l2.12.2 (MOFED 4.6.1) and 1 server running 2.10.8 + the 3 patches (MOFED 4.6.1).

(The server had the MDT + OSTs.)

Started IOR

mpiexec -n 200 /u/mhanafi/bin/IOR -a POSIX -C -Q 1 -e -i 200 -r -w -t 4096 -z -b 30m -F -O keepFile=1

Let IO run for a few iterations.

On the server, unmounted the OSTs, then rebooted the server with the MGS/MDT still mounted. After the reboot the server logged:

Oct 15 20:05:21 nbp7-mds2 kernel: [  390.574619] LDISKFS-fs (dm-0): recovery complete
Oct 15 20:05:21 nbp7-mds2 kernel: [  390.574678] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: errors=panic,user_xattr,errors=panic,user_xattr,no_mbcache,nodelalloc
Oct 15 20:05:21 nbp7-mds2 kernel: [  390.900389] Lustre: MGS: Connection restored to f3912282-13f7-c80a-d5e2-9ebf63e9b11f (at 0@lo)
Oct 15 20:05:22 nbp7-mds2 kernel: [  391.369402] LDISKFS-fs (dm-1): recovery complete
Oct 15 20:05:22 nbp7-mds2 kernel: [  391.369463] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,errors=remount-ro,acl,user_xattr,no_mbcache,nodelalloc
Oct 15 20:05:22 nbp7-mds2 kernel: [  391.856711] Lustre: nbptest-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.502806] LDISKFS-fs (dm-6): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.583519] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.602199] LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.606903] LDISKFS-fs (dm-3): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.608884] LDISKFS-fs (dm-4): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.613246] LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.637495] LDISKFS-fs (dm-5): file extents enabled, maximum tree depth=5
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.698220] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.703712] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.737637] LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.758227] LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:23 nbp7-mds2 kernel: [  392.782114] LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. Opts: errors=panic,errors=panic,,no_mbcache,nodelalloc
Oct 15 20:05:24 nbp7-mds2 kernel: [  393.422465] Lustre: nbptest-MDT0000: Connection restored to f3912282-13f7-c80a-d5e2-9ebf63e9b11f (at 0@lo)
Oct 15 20:05:24 nbp7-mds2 kernel: [  393.422778] Lustre: nbptest-OST0001: Imperative Recovery not enabled, recovery window 300-900
Oct 15 20:05:25 nbp7-mds2 kernel: [  394.552737] Lustre: nbptest-MDT0000: Connection restored to f3912282-13f7-c80a-d5e2-9ebf63e9b11f (at 0@lo)
Oct 15 20:05:25 nbp7-mds2 kernel: [  394.552741] Lustre: Skipped 4 previous similar messages
Oct 15 20:05:25 nbp7-mds2 kernel: [  394.553004] Lustre: nbptest-OST0003: Imperative Recovery not enabled, recovery window 300-900
Oct 15 20:05:25 nbp7-mds2 kernel: [  394.553010] Lustre: Skipped 4 previous similar messages
Oct 15 20:05:50 nbp7-mds2 kernel: [  419.552979] Lustre: nbptest-MDT0000: Connection restored to f3912282-13f7-c80a-d5e2-9ebf63e9b11f (at 0@lo)
Oct 15 20:05:54 nbp7-mds2 ntpd[4973]: 0.0.0.0 c61c 0c clock_step +0.674395 s
Oct 15 20:05:54 nbp7-mds2 ntpd[4973]: 0.0.0.0 c614 04 freq_mode
Oct 15 20:05:54 nbp7-mds2 systemd[1]: Time has been changed
Oct 15 20:05:55 nbp7-mds2 ntpd[4973]: 0.0.0.0 c618 08 no_sys_peer
Oct 15 20:06:23 nbp7-mds2 kernel: [  451.803748] Lustre: nbptest-OST0000: Will be in recovery for at least 5:00, or until 11 clients reconnect
Oct 15 20:06:23 nbp7-mds2 kernel: [  451.803751] Lustre: MGS: Connection restored to 8f0ab64d-4319-bfcb-2f49-357e6e697403 (at 10.151.8.71@o2ib)
Oct 15 20:06:23 nbp7-mds2 kernel: [  451.814450] Lustre: Skipped 1 previous similar message
Oct 15 20:06:23 nbp7-mds2 sec[5242]: SEC_EVENT |msg recovery start|target nbptest-OST0000|expected_time 5:00|expected_clients 11
Oct 15 20:06:34 nbp7-mds2 kernel: [  463.484331] Lustre: MGS: Connection restored to 808df143-d3b2-edd2-eb46-31297e7b8ace (at 10.151.8.26@o2ib)
Oct 15 20:06:34 nbp7-mds2 kernel: [  463.484339] Lustre: Skipped 55 previous similar messages
Oct 15 20:06:38 nbp7-mds2 kernel: [  466.932419] Lustre: nbptest-MDT0000: Recovery over after 0:15, of 10 clients 10 recovered and 0 were evicted.
Oct 15 20:08:47 nbp7-mds2 kernel: [  596.552836] Lustre: 6924:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571195122/real 1571195122]  req@ffff928c3ac31500 x1647517323231488/t0(0) o8->nbptest-OST0002-osc-MDT0000@0@lo:28/4 lens 520/544 e 0 to 1 dl 1571195327 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 20:08:47 nbp7-mds2 kernel: [  596.583846] Lustre: nbptest-OST0001: Connection restored to f3912282-13f7-c80a-d5e2-9ebf63e9b11f (at 0@lo)
Oct 15 20:08:47 nbp7-mds2 kernel: [  596.583853] Lustre: Skipped 23 previous similar messages
Oct 15 20:08:48 nbp7-mds2 kernel: [  596.593770] Lustre: nbptest-OST0005: Recovery over after 2:25, of 11 clients 11 recovered and 0 were evicted.
Oct 15 20:08:48 nbp7-mds2 kernel: [  596.615721] Lustre: nbptest-OST0004: deleting orphan objects from 0x0:4031 to 0x0:4065
Oct 15 20:08:48 nbp7-mds2 kernel: [  596.617546] Lustre: nbptest-OST0000: deleting orphan objects from 0x0:3855 to 0x0:3937
Oct 15 20:08:48 nbp7-mds2 kernel: [  596.829956] Lustre: nbptest-OST0001: deleting orphan objects from 0x0:3957 to 0x0:4033
Oct 15 20:08:48 nbp7-mds2 kernel: [  597.094516] Lustre: nbptest-OST0002: deleting orphan objects from 0x0:3922 to 0x0:4001
Oct 15 20:08:49 nbp7-mds2 kernel: [  597.850050] Lustre: nbptest-OST0003: deleting orphan objects from 0x0:3855 to 0x0:3873
Oct 15 20:08:49 nbp7-mds2 kernel: [  597.850173] Lustre: nbptest-OST0005: deleting orphan objects from 0x0:3672 to 0x0:3745
Oct 15 20:08:50 nbp7-mds2 kernel: [  598.606590] LustreError: 6082:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff929420c4bc00
Oct 15 20:08:50 nbp7-mds2 kernel: [  598.618461] LustreError: 6082:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff929420c4bc00
Oct 15 20:08:50 nbp7-mds2 kernel: [  598.630348] LustreError: 19480:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff929439985450 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:568/0 lens 4584/448 e 0 to 0 dl 1571195583 ref 1 fl Interpret:/0/0 rc 0/0
Oct 15 20:08:50 nbp7-mds2 kernel: [  598.657827] Lustre: nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1 (at 10.151.10.96@o2ib), client will retry: rc = -110
Oct 15 20:16:23 nbp7-mds2 kernel: [ 1052.567312] Lustre: nbptest-OST0004: Client b9e645b5-be32-890c-4f1d-ef22af2686a1 (at 10.151.10.96@o2ib) reconnecting
Oct 15 20:16:23 nbp7-mds2 kernel: [ 1052.579182] Lustre: nbptest-OST0004: Connection restored to 484e1dc0-5396-3b63-f200-4783636c115f (at 10.151.10.96@o2ib)
Oct 15 20:16:23 nbp7-mds2 kernel: [ 1052.579188] Lustre: Skipped 15 previous similar messages
Oct 15 20:16:23 nbp7-mds2 kernel: [ 1052.579673] LustreError: 6080:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff929421a7b000
Oct 15 20:16:24 nbp7-mds2 kernel: [ 1052.591559] LustreError: 6080:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff929421a7b000
Oct 15 20:16:24 nbp7-mds2 kernel: [ 1052.591583] LustreError: 19763:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff9294c569c850 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:268/0 lens 4584/448 e 0 to 0 dl 1571196038 ref 1 fl Interpret:/2/0 rc 0/0
Oct 15 20:16:24 nbp7-mds2 kernel: [ 1052.591606] Lustre: nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1 (at 10.151.10.96@o2ib), client will retry: rc = -110

The client with nid 10.151.10.96@o2ib got stuck.

Client logs:

Oct 15 19:53:12 r413i7n14 kernel: [1571194392.031434] LustreError: 11-0: nbptest-OST0004-osc-ffff988e44349000: operation ost_write to node 10.151.27.39@o2ib failed: rc = -19
Oct 15 19:53:12 r413i7n14 kernel: [1571194392.031434] Lustre: nbptest-OST0004-osc-ffff988e44349000: Connection to nbptest-OST0004 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 15 19:53:15 r413i7n14 kernel: [1571194395.911244] LustreError: 11-0: nbptest-OST0000-osc-ffff988e44349000: operation ost_write to node 10.151.27.39@o2ib failed: rc = -19
Oct 15 19:53:15 r413i7n14 kernel: [1571194395.911244] Lustre: nbptest-OST0000-osc-ffff988e44349000: Connection to nbptest-OST0000 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 15 19:53:17 r413i7n14 kernel: [1571194397.463168] LustreError: 11-0: nbptest-OST0000-osc-ffff988e44349000: operation ost_write to node 10.151.27.39@o2ib failed: rc = -19
Oct 15 19:53:17 r413i7n14 kernel: [1571194397.463168] LustreError: Skipped 2 previous similar messages
Oct 15 19:53:20 r413i7n14 kernel: [1571194400.727009] LustreError: 11-0: nbptest-OST0001-osc-ffff988e44349000: operation ost_write to node 10.151.27.39@o2ib failed: rc = -19
Oct 15 19:53:20 r413i7n14 kernel: [1571194400.727009] LustreError: Skipped 4 previous similar messages
Oct 15 19:53:20 r413i7n14 kernel: [1571194400.727009] Lustre: nbptest-OST0001-osc-ffff988e44349000: Connection to nbptest-OST0001 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 15 19:53:35 r413i7n14 kernel: [1571194415.870268] LNet: 3152:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.27.39@o2ib
Oct 15 19:53:35 r413i7n14 kernel: [1571194415.870268] LNet: 3152:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) Skipped 4 previous similar messages
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.548743] LNetError: 3146:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 14 seconds
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.548743] LNetError: 3146:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.151.27.39@o2ib (142): c: 3, oc: 0, rc: 32
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.596741] Lustre: 3194:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1571194577/real 1571194692]  req@ffff9891c50aadc0 x1646705467808112/t0(0) o400->nbptest-MDT0000-mdc-ffff988e44349000@10.151.27.39@o2ib:12/10 lens 224/224 e 0 to 1 dl 1571195032 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.596741] Lustre: nbptest-OST0002-osc-ffff988e44349000: Connection to nbptest-OST0002 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.596741] LustreError: 166-1: MGC10.151.27.39@o2ib: Connection to MGS (at 10.151.27.39@o2ib) was lost; in progress operations using this service will fail
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.596741] Lustre: 3194:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Oct 15 19:58:12 r413i7n14 kernel: [1571194692.596741] Lustre: Skipped 3 previous similar messages
Oct 15 20:01:10 r413i7n14 kernel: [1571194870.568042] Lustre: 3161:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194415/real 1571194415]  req@ffff988c68871680 x1646705467731008/t0(0) o4->nbptest-OST0002-osc-ffff988e44349000@10.151.27.39@o2ib:6/4 lens 2808/1024 e 0 to 1 dl 1571194870 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 15 20:01:10 r413i7n14 kernel: [1571194870.568042] Lustre: 3178:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194415/real 1571194415]  req@ffff988d61f1fb00 x1646705467730960/t0(0) o4->nbptest-OST0002-osc-ffff988e44349000@10.151.27.39@o2ib:6/4 lens 2328/904 e 0 to 1 dl 1571194870 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 15 20:01:10 r413i7n14 kernel: [1571194870.568042] Lustre: 3178:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 15 20:01:10 r413i7n14 kernel: [1571194870.568042] Lustre: 3161:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Oct 15 20:01:39 r413i7n14 kernel: [1571194899.222641] Lustre: 3194:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194444/real 1571194444]  req@ffff989078cb0d80 x1646705467753088/t0(0) o400->nbptest-OST0002-osc-ffff988e44349000@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571194899 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 20:01:39 r413i7n14 kernel: [1571194899.222641] Lustre: 3194:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 15 20:02:06 r413i7n14 kernel: [1571194926.869290] Lustre: 3184:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194471/real 1571194471]  req@ffff9892834a1240 x1646705467764112/t0(0) o400->nbptest-OST0003-osc-ffff988e44349000@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571194926 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 20:02:06 r413i7n14 kernel: [1571194926.869290] Lustre: 3184:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 15 20:02:32 r413i7n14 kernel: [1571194952.468039] Lustre: 3192:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194497/real 1571194497]  req@ffff9891c53f6040 x1646705467775120/t0(0) o400->nbptest-OST0003-osc-ffff988e44349000@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571194952 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 20:02:32 r413i7n14 kernel: [1571194952.468039] Lustre: 3192:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 15 20:02:39 r413i7n14 kernel: [1571194959.519694] LNetError: 7898:0:(o2iblnd_cb.c:2946:kiblnd_rejected()) 10.151.27.39@o2ib rejected: o2iblnd fatal error
Oct 15 20:02:47 r413i7n14 kernel: [1571194967.891285] Lustre: Unmounted nbp2-client
Oct 15 20:02:50 r413i7n14 kernel: [1571194970.235170] Lustre: Unmounted nbp13-client
Oct 15 20:02:59 r413i7n14 kernel: [1571194979.090737] Lustre: 3199:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571194524/real 1571194524]  req@ffff9890165d8000 x1646705467786112/t0(0) o400->nbptest-OST0002-osc-ffff988e44349000@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571194979 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 15 20:02:59 r413i7n14 kernel: [1571194979.090737] Lustre: 3199:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 15 20:03:21 r413i7n14 kernel: [1571195001.613636] LustreError: 29579:0:(osc_cache.c:955:osc_extent_wait()) extent ffff988d42fdc4d0@{[1 -> 2/1023], [3|0|+|rpc|wiY|ffff988d0849c940], [32768|2|+|-|ffff988a0ab4b000|1024|ffff988d7a00c980]} nbptest-OST0000-osc-ffff988e44349000: wait ext to 0 timedout, recovery in progress?
Oct 15 20:03:21 r413i7n14 kernel: [1571195001.613636] LustreError: 29582:0:(osc_cache.c:955:osc_extent_wait()) extent ffff988bee572f20@{[3160 -> 3163/4095], [3|0|+|rpc|wiuY|ffff988a0f8b6140], [40960|4|+|-|ffff988a3c82bb40|1024|ffff988d7a040bc0]} nbptest-OST0004-osc-ffff988e44349000: wait ext to 0 timedout, recovery in progress?
Oct 15 20:03:21 r413i7n14 kernel: [1571195001.613636] LustreError: 29586:0:(osc_cache.c:955:osc_extent_wait()) extent ffff98931678ad10@{[1739 -> 1741/2047], [3|0|+|rpc|wiY|ffff98954b466700], [36864|3|+|-|ffff98927d17eb40|1024|ffff988d9cb10cc0]} nbptest-OST0001-osc-ffff988e44349000: wait ext to 0 timedout, recovery in progress?
Oct 15 20:03:21 r413i7n14 kernel: [1571195001.613636] LustreError: 29586:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff98931678ad10 ns: nbptest-OST0001-osc-ffff988e44349000 lock: ffff98927d17eb40/0x7791a7c5926cc8b6 lrc: 189/0,0 mode: PW/PW res: [0xf69:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 15372288->15376383) flags: 0x20000000000 nid: local remote: 0x4c6849e559209ae5 expref: -99 pid: 29586 timeout: 0 lvb_type: 1
Oct 15 20:06:38 r413i7n14 kernel: [1571195198.204030] Lustre: Evicted from MGS (at 10.151.27.39@o2ib) after server handle changed from 0x4c6849e55916ae08 to 0x8abce496dbe3554e
Oct 15 20:06:38 r413i7n14 kernel: [1571195198.204030] Lustre: MGC10.151.27.39@o2ib: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 15 20:06:38 r413i7n14 kernel: [1571195198.340023] LustreError: 3159:0:(client.c:3025:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff988a896b9680 x1646705467704992/t8589952500(8589952500) o101->nbptest-MDT0000-mdc-ffff988e44349000@10.151.27.39@o2ib:12/10 lens 616/544 e 0 to 0 dl 1571195653 ref 2 fl Interpret:RP/4/0 rc 301/301
Oct 15 20:08:48 r413i7n14 kernel: [1571195328.009691] Lustre: nbptest-OST0005-osc-ffff988e44349000: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 15 20:08:48 r413i7n14 kernel: [1571195328.009691] Lustre: Skipped 1 previous similar message
Oct 15 20:16:23 r413i7n14 kernel: [1571195783.955424] Lustre: 3169:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571195328/real 1571195328]  req@ffff988ca4938480 x1646705467974160/t0(0) o4->nbptest-OST0004-osc-ffff988e44349000@10.151.27.39@o2ib:6/4 lens 4584/448 e 0 to 1 dl 1571195783 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 15 20:16:23 r413i7n14 kernel: [1571195783.955424] Lustre: 3169:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 15 20:16:23 r413i7n14 kernel: [1571195783.955424] Lustre: nbptest-OST0004-osc-ffff988e44349000: Connection to nbptest-OST0004 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 15 20:16:23 r413i7n14 kernel: [1571195783.991422] Lustre: nbptest-OST0004-osc-ffff988e44349000: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)

I have never seen this message before:

Oct 15 20:02:39 r413i7n14 kernel: [1571194959.519694] LNetError: 7898:0:(o2iblnd_cb.c:2946:kiblnd_rejected()) 10.151.27.39@o2ib rejected: o2iblnd fatal error

After I killed my IOR task, one file remained open on the node and it can't be closed:

r413i7n14 ~ # lsof /nobackuptest/
COMMAND   PID    USER   FD   TYPE     DEVICE SIZE/OFF               NODE NAME
IOR     29574 mhanafi  cwd    DIR 523,627074    69632 144115205272502273 /nobackuptest/mhanafi
IOR     29574 mhanafi   11u   REG 523,627074 31457280 144115238843711816 /nobackuptest/mhanafi/testFile.00000121

I will check to see if we have a 2.12.2 client with MOFED4.5.

Comment by Mahmoud Hanafi [ 16/Oct/19 ]

I tried l2.12.2 with MOFED 4.5 and got the same error. I was able to get a debug dump from the client side.

I killed my job and it left files open.

r407i2n14 ~ # lsof /nobackuptest/
COMMAND  PID    USER   FD   TYPE     DEVICE SIZE/OFF               NODE NAME
IOR     3965 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3965 mhanafi   11u   REG 523,627074 31457280 144115272398143530 /nobackuptest/mhanafi/testdir/testFile.00000031
IOR     3966 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3966 mhanafi   11u   REG 523,627074 31457280 144115272398143535 /nobackuptest/mhanafi/testdir/testFile.00000032
IOR     3968 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3968 mhanafi   11u   REG 523,627074 31457280 144115272398143537 /nobackuptest/mhanafi/testdir/testFile.00000034
IOR     3969 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3969 mhanafi   11u   REG 523,627074 31457280 144115272398143531 /nobackuptest/mhanafi/testdir/testFile.00000035
IOR     3972 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3972 mhanafi   11u   REG 523,627074 31457280 144115272398143533 /nobackuptest/mhanafi/testdir/testFile.00000038
IOR     3973 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3973 mhanafi   11u   REG 523,627074 31457280 144115272398143532 /nobackuptest/mhanafi/testdir/testFile.00000039
IOR     3977 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3977 mhanafi   11u   REG 523,627074 31457280 144115272398143541 /nobackuptest/mhanafi/testdir/testFile.00000133
IOR     3978 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
IOR     3978 mhanafi   11u   REG 523,627074 31457280 144115272398143539 /nobackuptest/mhanafi/testdir/testFile.00000134

Here are the logs from r407i2n14:

Oct 16 08:33:01 r407i2n14 kernel: [1571239981.657626] Lustre: Mounted nbptest-client
Oct 16 08:36:38 r407i2n14 kernel: [1571240198.107290] LNet: 3987:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] LustreError: 11-0: nbptest-OST0002-osc-ffff9b1c83b70800: operation ost_read to node 10.151.27.39@o2ib failed: rc = -19
Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection to nbptest-OST0002 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] LustreError: Skipped 5 previous similar messages
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.739797] LNetError: 2935:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.739797] LNetError: 2935:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.151.27.39@o2ib (32): c: 28, oc: 0, rc: 32
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: 2959:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1571240433/real 1571240438]  req@ffff9b1f1f58cdc0 x1647564412471488/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 1 dl 1571240888 ref 2 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: nbptest-OST0005-osc-ffff9b1c83b70800: Connection to nbptest-OST0005 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: 2959:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: Skipped 4 previous similar messages
Oct 16 08:40:38 r407i2n14 kernel: [1571240438.819795] LustreError: 166-1: MGC10.151.27.39@o2ib: Connection to MGS (at 10.151.27.39@o2ib) was lost; in progress operations using this service will fail
Oct 16 08:47:15 r407i2n14 kernel: [1571240835.082847] Lustre: 2966:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571240380/real 1571240380]  req@ffff9b1f1faf2000 x1647564412461280/t0(0) o400->nbptest-OST0000-osc-ffff9b1c83b70800@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571240835 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 08:47:15 r407i2n14 kernel: [1571240835.082847] Lustre: 2966:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Oct 16 08:47:42 r407i2n14 kernel: [1571240862.726231] Lustre: 2963:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571240407/real 1571240407]  req@ffff9b1f1f444940 x1647564412466432/t0(0) o400->nbptest-OST0000-osc-ffff9b1c83b70800@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571240862 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 08:47:42 r407i2n14 kernel: [1571240862.726231] Lustre: 2963:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Oct 16 08:50:21 r407i2n14 kernel: [1571241021.438678] Lustre: Evicted from MGS (at 10.151.27.39@o2ib) after server handle changed from 0x8abce496dbe36dab to 0x438c0068703e8a86
Oct 16 08:50:21 r407i2n14 kernel: [1571241021.438678] Lustre: MGC10.151.27.39@o2ib: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 08:50:21 r407i2n14 kernel: [1571241021.694672] LustreError: 2948:0:(client.c:3025:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff9b27bcc17200 x1647564409414608/t12884904463(12884904463) o101->nbptest-MDT0000-mdc-ffff9b1c83b70800@10.151.27.39@o2ib:12/10 lens 616/544 e 0 to 0 dl 1571241476 ref 2 fl Interpret:RP/4/0 rc 301/301
Oct 16 08:50:25 r407i2n14 kernel: [1571241025.850579] Lustre: nbptest-MDT0000-mdc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 08:51:30 r407i2n14 kernel: [1571241090.517125] Lustre: nbptest-OST0005-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2968:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b20b34236c0 x1647564412456208/t0(0) o3->nbptest-OST0003-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2962:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1f3c7544c0 x1647564412456112/t0(0) o3->nbptest-OST0003-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2952:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1fc7f784c0 x1647564412455712/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2962:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2952:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: nbptest-OST0001-osc-ffff9b1c83b70800: Connection to nbptest-OST0001 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: Skipped 1 previous similar message
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: Skipped 5 previous similar messages
Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2968:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: 2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241545/real 1571241545]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection to nbptest-OST0002 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: nbptest-OST0001-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: Skipped 2 previous similar messages
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: 2956:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Oct 16 09:06:40 r407i2n14 kernel: [1571242000.413048] Lustre: Skipped 4 previous similar messages
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: 2950:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571242000/real 1571242000]  req@ffff9b1fc7f3f200 x1647564412454624/t0(0) o3->nbptest-OST0002-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: nbptest-OST0003-osc-ffff9b1c83b70800: Connection to nbptest-OST0003 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: Skipped 2 previous similar messages
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: 2950:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Oct 16 09:14:15 r407i2n14 kernel: [1571242455.070902] Lustre: Skipped 2 previous similar messages

Here are the debug logs from the client side:
r407i2n14.out2.gz

Server-side logs:

nbp7-mds.2019-10-16.out.gz

Comment by Amir Shehata (Inactive) [ 16/Oct/19 ]

I'm looking at the logs. In the meantime, can we verify that the problem doesn't occur if you use a 2.10.8 client?

Comment by Amir Shehata (Inactive) [ 16/Oct/19 ]

I'm seeing the following in the server logs:

6794493 00000020:00000040:7.0:1571241545.526189:0:10654:0:(genops.c:884:class_conn2export()) looking for export cookie 0x438c0068703e8998
6794494 00000020:00000040:7.0:1571241545.526191:0:10654:0:(genops.c:971:class_export_get()) GETting export ffff8d363c5db000 : new refcount 6
6794495 00010000:00000040:7.0:1571241545.526203:0:10654:0:(ldlm_resource.c:1487:ldlm_resource_getref()) getref res: ffff8d3f1dd94300 count: 2
6794496 00010000:00000040:7.0:1571241545.526206:0:10654:0:(ldlm_resource.c:1526:ldlm_resource_putref()) putref res: ffff8d3f1dd94300 count: 1
6794497 00000100:00000040:7.0:1571241545.526215:0:10654:0:(service.c:1840:ptlrpc_server_request_get()) RPC GETting export ffff8d363c5db000 : new rpc_count 1
6794498 00000100:00000040:7.0:1571241545.526219:0:10654:0:(lustre_net.h:2481:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret"  req@ffff8d3e5c84ac50 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/0 e 0 to 0 dl 1571241800 ref 1 fl New:/2/ffffffff rc 0/-1
6794499 00000100:00000200:7.0:1571241545.526228:0:10654:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1646705467974160
6794500 00000020:00000040:7.0:1571241545.526254:0:10654:0:(genops.c:971:class_export_get()) GETting export ffff8d363c5db000 : new refcount 7
6794501 00000400:00000200:7.0:1571241545.526261:0:10654:0:(lib-move.c:2998:LNetGet()) LNetGet -> 12345-10.151.10.96@o2ib
6794502 00000400:00000200:7.0:1571241545.526273:0:10654:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.10.96@o2ib(10.151.10.96@o2ib:10.151.10.96@o2ib) : GET
6794503 00000800:00000200:7.0:1571241545.526278:0:10654:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.151.10.96@o2ib
6794504 00000800:00000200:7.0:1571241545.526281:0:10654:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
6794505 00000400:00000200:7.0:1571241545.526289:0:10654:0:(lib-move.c:2877:lnet_create_reply_msg()) 10.151.27.39@o2ib: Reply from 12345-10.151.10.96@o2ib md ffff8d3e5cf61400
6794506 00000800:00000200:7.0:1571241545.526293:0:10654:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff8d3f1bab2e00] -> 10.151.10.96@o2ib (2) version: 12
6794507 00000800:00000200:7.0:1571241545.526296:0:10654:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8d3efb3dba00] (68)++

Later on (based on the connection ID) it appears that the client is responding with -ENODATA:

6794519 00000800:00000200:3.0:1571241545.526389:0:6062:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8d3efb3dba00] (71)--
6794520 00000100:00020000:3.0:1571241545.526392:0:6062:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8d363c04aa00
6794521 00000100:00020000:3.0:1571241545.538264:0:6062:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8d363c04aa00
6794522 00000400:00000200:3.0:1571241545.550138:0:6062:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8d3e5cf61400
6794523 00000800:00000200:3.0:1571241545.550141:0:6062:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8d3efb3dba00] (69)++
6794524 00000800:00000200:3.0:1571241545.550143:0:6062:0:(o2iblnd_cb.c:228:kiblnd_post_rx()) conn[ffff8d3efb3dba00] (70)--
6794525 00010000:00020000:9.0:1571241545.550145:0:10654:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8d3e5c84ac50 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/0 rc 0/0
6794526 00000800:00000200:3.0:1571241545.550145:0:6062:0:(o2iblnd_cb.c:3663:kiblnd_scheduler()) conn[ffff8d3efb3dba00] (69)--
6794527 00000020:00000040:9.0:1571241545.577696:0:10654:0:(genops.c:981:class_export_put()) PUTting export ffff8d363c5db000 : new refcount 6
6794528 00000020:02000400:9.0:1571241545.577703:0:10654:0:(tgt_handler.c:2347:tgt_brw_write()) nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1         (at 10.151.10.96@o2ib), client will retry: rc = -110
6794529 00010000:00000080:9.0:1571241545.592859:0:10654:0:(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff8d3e5c84ac50 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/ffffffff rc -110/-1
6794530 00010000:00000040:9.0:1571241545.592868:0:10654:0:(ldlm_lib.c:2886:target_committed_to_req()) last_committed 17179869458, transno 0, xid 1646705467974160
6794531 00000100:00000040:9.0:1571241545.592871:0:10654:0:(lustre_net.h:2481:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff8d3e5c84ac50 x1646705467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/ffffffff rc -110/-1
6794532 00000100:00000040:9.0:1571241545.592893:0:10654:0:(service.c:1034:ptlrpc_server_finish_active_request()) RPC PUTting export ffff8d363c5db000 : new rpc_count 0
6794533 00010000:00000040:9.0:1571241545.592897:0:10654:0:(ldlm_resource.c:1487:ldlm_resource_getref()) getref res: ffff8d3f1dd94300 count: 2
6794534 00010000:00000040:9.0:1571241545.592900:0:10654:0:(ldlm_resource.c:1526:ldlm_resource_putref()) putref res: ffff8d3f1dd94300 count: 1
6794535 00000020:00000040:9.0:1571241545.592903:0:10654:0:(genops.c:981:class_export_put()) PUTting export ffff8d363c5db000 : new refcount 5
6794536 00000100:00000200:9.0:1571241545.592906:0:10654:0:(niobuf.c:953:ptlrpc_register_rqbd()) LNetMEAttach: portal 6

The server reports:

(tgt_handler.c:2347:tgt_brw_write()) nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1         (at 10.151.10.96@o2ib), client will retry: rc = -110
(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update 

This happens in the following scenario:

Server --GET (MD cookie) --> client
Client tries to look up the MD but can't find it
Server <--GET_DONE (-ENODATA) -- client

Did the server reboot happen around Wed Oct 16 08:59:05 PDT 2019?

However, the xid in this log snippet (x1646705467974160) is not to/from the client that the attached client logs belong to; it's to/from 10.151.10.96@o2ib. Can we confirm whether that client is hung on this request?

Do you know which particular xid is stuck in the logs provided?

Comment by Mahmoud Hanafi [ 16/Oct/19 ]

x1646705467974160 is from the client 10.151.10.96@o2ib.
That was the client that got stuck yesterday; I should have cleaned it up before retesting.
We are looking at r407i2n14, which has nid 10.151.7.114@o2ib, but I think the client logs start after the recovery.

If you can't get any useful info I can retry and capture longer logs on the client.

This is the timeline:

1571240785           Server reboot
1571241024          Remount and recovery
1571241068.466059   Recovery finished
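
The timeline values are raw epoch seconds; to correlate them with the syslog timestamps in the excerpts above, they can be converted with GNU date, e.g.:

date -d @1571240785   # server reboot (local time; use -u for UTC)
date -d @1571241024   # remount and recovery
date -d @1571241068   # recovery finished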

One of the stuck xids (x1647564412455904):
SERVER LOG

00010000:00020000:11.0:1571241290.559039:0:9679:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c84b450 x1647564412455904/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:15.0:1571241290.559177:0:10654:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c849c50 x1647564412456000/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:15.0:1571241290.581931:0:10321:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c84d050 x1647564412455712/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0

(r407i2n14) Client Log

00000100:00080000:39.0:1571241090.523097:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00000200:29.0:1571241090.523223:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
00000100:00000040:29.0:1571241090.523227:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000200:29.0:1571241090.523275:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
00000100:00000200:17.0:1571241090.523405:0:2940:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000040:17.0:1571241090.523414:0:2940:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000400:24.0:1571241545.699041:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:24.0:1571241545.699047:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:24.0:1571241545.699051:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00080000:39.0:1571241545.717146:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
00000100:00000200:24.0:1571241545.717171:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
00000100:00000040:24.0:1571241545.717173:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
00000100:00000200:24.0:1571241545.717179:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
00000100:00000200:34.0:1571241545.717231:0:2942:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
00000100:00000040:34.0:1571241545.717235:0:2942:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
00000100:00000400:29.0:1571242000.355038:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241545/real 1571241545]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00080000:39.0:1571242000.355521:0:2948:0:(client.c:2816:ptlrpc_resend_req()) @@@ going to resend  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
00000100:00000200:29.0:1571242000.384628:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 2 fl Rpc:S/2/ffffffff rc -11/-1
00000100:00000200:29.0:1571242000.384632:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 2 fl Rpc:S/2/ffffffff rc -11/-1
00000100:00000200:29.0:1571242000.384646:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
00000100:00000040:29.0:1571242000.384648:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
00000100:00000200:29.0:1571242000.384653:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
00000100:00000200:34.0:1571242000.384688:0:2942:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
00000100:00000040:34.0:1571242000.384692:0:2942:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
00000100:00000400:8.0:1571242455.015047:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571242000/real 1571242000]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000200:8.0:1571242455.015105:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000200:8.0:1571242455.015118:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00080000:39.0:1571242455.074990:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
00000100:00000200:8.0:1571242455.075055:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
00000100:00000040:8.0:1571242455.075063:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
00000100:00000200:8.0:1571242455.075092:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
00000100:00000200:14.0:1571242455.075206:0:2941:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
00000100:00000040:14.0:1571242455.075215:0:2941:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
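
These excerpts presumably come from the attached dumps; assuming the dumps are plain lctl debug text, all lines for a given xid can be pulled out with, e.g.:

zgrep 1647564412455904 nbp7-mds.2019-10-16.out.gz
zgrep 1647564412455904 r407i2n14.out2.gz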

Comment by Amir Shehata (Inactive) [ 17/Oct/19 ]

Is it possible to get access to the test system? Now that we have a way to reproduce, I can add extra debug output to try to track what's going on. I'm having a bit of trouble tracking down from the logs why the bulk is timing out. The first instance of the problematic xids on the server is when we get:

00010000:00020000:9.0:1571241290.565033:0:10331:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5cf94050 x1647564415493712/t0(0)     o3->51a7d23c-3cd9-fd1b-280d-498c636237e0@10.151.8.26@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
 

But I don't see any instance of the xid before this. And the last instance that I see is:

00000100:00000200:9.0:1571241545.717364:0:10676:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1647564412455904

I don't see how the bulk times out.

At the LNet level there seem to be no glaring communication issues, i.e. there are no timeouts or failures reported by LNet or the LND (at least in the logs I see). The only problem is the rejection you saw, but that is probably because the connection request came early; it was retried and the connection was established. It's not responsible for the stuck requests.

One possibility is that the server gets the request and the response is queued but never sent. I cannot verify that from the logs. The reason I suspect this is because of messages like:

00000100:00000040:4.0:1571241024.735760:0:6055:0:(events.c:359:request_in_callback()) Buffer complete: 63 buffers still posted
 

There are many of these messages, and the maximum number of buffers I see is 63; it doesn't go below the 50s.

Then there is the case of the PUT_NACK:

00000800:00000400:9.0:1571241545.699924:0:6060:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:5.0:1571241545.699930:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:15.0:1571241545.699948:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:5.0:1571241545.717462:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:5.0:1571241545.717484:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:15.0:1571241545.717490:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:9.0:1571241545.758883:0:6061:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
00000800:00000400:15.0:1571241545.841772:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib

 
They all happen at the same second the request is received from the client: 1571241545.

Moreover, I sampled a subset of the xids which the client reports as timing out, and on the server they all appear to arrive at the same second, 1571241545. It's likely that an event occurred that caused some requests to get stuck in this state; they time out and are resent close to each other.

On the client all the xids exhibit the same behaviour: send the request, time out after 455s, resend, repeat.

These are a bunch of clues, but I haven't been able to put them together to form a specific conclusion yet. It's likely I'll need a few iterations of adding extra debugging and reproducing the problem, if I'm able to get access to the test system. Let me know if that's possible.

Otherwise, the first step is to capture enough server logs so we can see what happens after the server receives the problematic xid.
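
A minimal capture sketch for that (generic lctl usage; the debug mask and buffer size below are placeholders, not necessarily what was used for the attached dumps):

lctl set_param debug=+net          # LNet/LND tracing
lctl set_param debug=+rpctrace     # ptlrpc request tracing
lctl set_param debug_mb=1024       # enlarge the kernel debug buffer so the window of interest isn't overwritten
lctl clear                         # start from an empty buffer
# ... reproduce the bulk timeout ...
lctl dk /tmp/oss-debug.log         # dump the buffer covering the problematic xid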

Comment by Mahmoud Hanafi [ 17/Oct/19 ]

Re-ran and got better debugging. Here, xid 1646444722512224 to nid 10.151.7.67@o2ib (r407i0n3) times out. This is the client side after the reboot of the server:

00000100:00080000:0.0F:1571286150.709202:0:3642:0:(client.c:1575:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00000200:0.0F:1571286866.293058:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 512 bytes, xid 1646444722514672, portal 25
00000100:00000200:0.0:1571286866.293067:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 224 bytes to portal 26, xid 1646444722514672, offset 0
00000400:00000200:0.0:1571286866.293074:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
00000400:00000200:0.0:1571286866.293083:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
00000400:00000200:0.0:1571286866.293096:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
00000800:00000200:0.0:1571286866.293103:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.151.27.39@o2ib
00000800:00000200:0.0:1571286866.293110:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
00000800:00000200:0.0:1571286866.293113:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
00000800:00000200:0.0:1571286866.293116:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
00000800:00000200:0.0:1571286866.293121:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
00000100:00000200:0.0:1571286866.293338:0:3642:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
00000100:00000200:0.0:1571286866.293350:0:3642:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
00000400:00000200:0.0:1571286866.293360:0:3642:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff99fa69f23ee0
00000100:00080000:11.0:1571286883.004713:0:3637:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff99fa992b4c00):  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00000200:0.0F:1571286883.004777:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1646444722512224, portal 4
00000100:00000200:0.0:1571286883.004785:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1646444722512224, offset 0
00000400:00000200:0.0:1571286883.004793:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
00000400:00000200:0.0:1571286883.004798:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
00000400:00000200:0.0:1571286883.004809:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
00000800:00000200:0.0:1571286883.004815:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 488 bytes in 1 frags to 12345-10.151.27.39@o2ib
00000800:00000200:0.0:1571286883.004830:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
00000800:00000200:0.0:1571286883.004832:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
00000800:00000200:0.0:1571286883.004835:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
00000800:00000200:0.0:1571286883.004839:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
00000100:00000200:37.0:1571286883.004870:0:3629:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571287337 ref 3 fl Rpc:/0/ffffffff rc 0/-1
00000100:00000200:1.0:1571286883.086207:0:3642:0:(niobuf.c:429:ptlrpc_register_bulk()) Setup 1 bulk put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8

Here is the server side:

00000100:00000200:1.0:1571286883.030498:0:10906:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1646444722512224
00000400:00000200:3.0:1571286883.030956:0:10906:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.151.7.67@o2ib
00000400:00000200:3.0:1571286883.030964:0:10906:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.7.67@o2ib(10.151.7.67@o2ib:10.151.7.67@o2ib) : PUT
00000800:00000200:3.0:1571286883.030968:0:10906:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.7.67@o2ib
00000800:00000200:3.0:1571286883.030970:0:10906:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
00000800:00000200:3.0:1571286883.030973:0:10906:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff8c3c04acc300] -> 10.151.7.67@o2ib (2) version: 12
00000800:00000200:3.0:1571286883.030975:0:10906:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (68)++
00000800:00000200:3.0:1571286883.030976:0:10906:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff8c3ba8064600] (69)++
00000800:00000200:3.0:1571286883.030979:0:10906:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (70)--
00000100:00000200:3.0:1571286883.030982:0:10906:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0
00010000:00020000:13.0:1571286983.030947:0:10906:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8c3b6e3ba850 x1646444722512224/t0(0) o3->61da07f7-88f2-b96a-5d40-8a120a8a2bcb@10.151.7.67@o2ib:12/0 lens 488/432 e 0 to 0 dl 1571287137 ref 1 fl Interpret:/0/0 rc 0/0

I don't understand this. The client set up its buffers like this:

 put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8

But the server sent:

Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0

Don't the mbits have to match?

Btw, my test is doing 4k IO.
I uploaded all logs to ftp.whamcloud.com/uploads/LU-12856/nasa.debug.logs.tgz

Comment by Mahmoud Hanafi [ 17/Oct/19 ]

I got it to reproduce with a 2.11 client. Maybe it's a server-side issue. I am going to try an older server build. I know we never had this issue with 2.7 servers and 2.7 clients. I'll try a 2.10.6 server.

I tried a 2.10.6 server and it reproduced.

All transfers with mbits 0x0-0x0 end up with a bulk timeout.

 

Comment by Amir Shehata (Inactive) [ 18/Oct/19 ]

You're right that the mbits need to match. The question is why they don't for the same xid. I'll need to dig deeper, as this is not an LNet issue.
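
To make the rule concrete, here is a rough sketch of how I think about the matching (illustrative only, not the actual ptlrpc/LNet code; the names here are made up): the client registers its bulk sink pages under a 64-bit match-bits value, and the server's bulk PUT is only delivered into those pages if it carries the same value, so a mismatch leaves the sink empty until target_bulk_io() gives up.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the client-side bulk registration: in the
 * traces above the sink is posted under 0x5d96ec8d403b0 (derived from
 * the request xid). */
struct bulk_sink {
        uint64_t match_bits;
};

/* Would a server-side bulk PUT carrying put_mbits land in this sink?
 * Assuming no ignore bits are used for bulk, the bits must agree exactly. */
static bool bulk_put_matches(const struct bulk_sink *sink, uint64_t put_mbits)
{
        return sink->match_bits == put_mbits;
}

/* In the failing case the server PUTs with mbits 0x0 against a sink
 * registered with the xid-derived value, so nothing is ever delivered
 * and the bulk READ times out after 100s. */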

One question though: if we go back to the original problem reported in LU-12772, can we confirm that it's the same problem, i.e. that the mbits are 0x0?

Comment by Mahmoud Hanafi [ 18/Oct/19 ]

My last test was with a 2.10.6 server, so that would be the case reported in LU-12772. We can close that one as a duplicate and track the issue here.

Comment by Peter Jones [ 18/Oct/19 ]

ok

Comment by Mahmoud Hanafi [ 19/Oct/19 ]

This should not have been closed.

 

Comment by Mahmoud Hanafi [ 19/Oct/19 ]

I tried a couple of times to reproduce using a 2.10.8 client and it wouldn't reproduce.

So it looks like the bug was introduced after 2.10.8.

Comment by Amir Shehata (Inactive) [ 21/Oct/19 ]

I need to clarify the test matrix:

2.10.8 client (sles12sp3) / 2.10.6 server (cent7) - no problem
2.10.8 client (sles12sp3) / 2.10.8 server (cent7) - no problem
2.10.8 client (sles12sp3) / 2.12.2 server (cent7) - no problem
2.11 client (sles12sp3)   / 2.10.6 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.10.6 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.10.8 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.12.2 server (cent7) - problem

Basically, once we move the clients past 2.10.8 we see the problem?

Is that correct? That will narrow down the list of changes to check.

thanks

amir

Comment by Mahmoud Hanafi [ 21/Oct/19 ]
2.10.8 client (sles12sp3) / 2.10.6 server (cent7) - no problem
2.10.8 client (sles12sp3) / 2.10.8 server (cent7) - no problem
2.10.8 client (sles12sp3) / 2.12.2 server (cent7) - no problem
2.11 client (sles12sp3)   / 2.10.6 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.10.6 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.10.8 server (cent7) - problem
2.12.2 client (sles12sp4) / 2.12.2 server (cent7) - problem

Correct, clients past 2.10.8 are the issue.

Comment by Amir Shehata (Inactive) [ 21/Oct/19 ]

Can we try something, just to eliminate one of the variables?
On the 2.12.2 clients, can you disable discovery:

options lnet lnet_peer_discovery_disabled=1

or via lnetctl:

lnetctl set discovery 0 

and try to reproduce with the 2.12.2 clients / 2.10.x servers.

One of the features that went in for 2.11 and later is LNet discovery, and I want to ensure that it's not introducing an issue.

Comment by Mahmoud Hanafi [ 21/Oct/19 ]

Turning off discovery didn't help. I was able to reproduce with a 2.12.2 client and a 2.10.6 server.

 

Comment by Mahmoud Hanafi [ 22/Oct/19 ]

We can't figure out how 2.10.8 fits into master and how it relates to 2.11.

Comment by Mahmoud Hanafi [ 24/Oct/19 ]

We tested tag 2.10.52 and it was OK, but tag 2.10.53 failed. So the issue was introduced after tag 2.10.52.

Comment by Amir Shehata (Inactive) [ 25/Oct/19 ]

The discovery feature landed between 2.10.52 and 2.10.53. Since we're going down the path of bisecting, can we try:

 1c45d9051764e0637ba90b3db06ba8fa37722916 LU-9913 lnet: balance references in lnet_discover_peer_locked()

That's after the landing of the discovery feature.

I'm looking at the rest of the patches.

Comment by Mahmoud Hanafi [ 25/Oct/19 ]

Looks like there was an issue with our last 2.10.53 build. Rebuilding and retesting shows it does NOT have the issue, so we have to look further up.

Comment by Amir Shehata (Inactive) [ 26/Oct/19 ]

Hi Mahmoud,

Attached is a short patch to collect some more info around the problem message. I'll probably be adding more debug patches to try and track down the problem. We're also trying to reproduce it locally.

0001-LU-12856-debug.patch

Comment by Mahmoud Hanafi [ 26/Oct/19 ]

Ran a bunch more tests. Tag 2.10.54 is good and tag 2.10.55 is very bad; writes will even fail without rebooting the server. I will try the debug patch next.

 

I tried to do a bisect but I hit compilation errors.

 * 61f26ea47f - (HEAD) LU-9578 llite: use security context if it's enabled in the kernel (2 years ago)
* 82e794e268 - LU-9452 ldlm: remove MSG_CONNECT_LIBCLIENT support (2 years ago)
* 627d0133d9 - LU-7990 llite: increase whole-file readahead to RPC size (2 years ago)
* 97671436c5 - LU-9405 utils: remove device path parsing from mount.lustre (2 years ago)
* a3f734db9b - LU-9019 ofd: migrate to 64 bit time (2 years ago)
* 685ef61f0f - LU-9814 ldiskfs: restore simple_strtol in prealloc (2 years ago)
* 383ef1a93b - LU-4923 osd-ldiskfs: dirdata is not needed on MGS (2 years ago)
* 1eb0573fde - LU-9782 osd-ldiskfs: avoid extra search (2 years ago)
* 45900a7777 - LU-4134 obdclass: obd_device improvement (2 years ago)
* 3187d551d5 - LU-9990 lnet: add backwards compatibility for YAML config (2 years ago)
* 52a1befd75 - LU-9968 tests: correct stripe index sanity 300g (2 years ago)
* d4f7bb22d8 - LU-9860 tests: Run command on MGS for conf-sanity 33a (2 years ago)
* 66bb2d13f8 - (2.10.9956) LU-9956 kernel: kernel upgrade [SLES12 SP3 4.4.82-6.3] (2 years ago)
* 66abf6ffe9 - LU-9140 nrs: measure the runtime of dd directly (2 years ago)
* dd23aa4a64 - LU-10119 scripts: Correct shebang/hashpling format (2 years ago)
* 45e5e76e32 - LU-8344 test: fix sanity 256 (2 years ago)
* 9d06804860 - LU-10029 osd-ldiskfs: make project inherit attr removeable (2 years ago)
* bdb0407957 - LU-9416 hsm: add kkuc before sending registration RPCs (2 years ago)
* a1eb6de081 - LU-9752 man: Reference zgenhostid instead of genhostid (2 years ago)
* 84f690eee2 - LU-9469 ldiskfs: add additional attach_jinode call (2 years ago)
* 20787a89ad - LU-9908 tests: force umount client in test 70e, 41b, and 105 (2 years ago)
* 7954a52042 - LU-9983 ko2iblnd: allow for discontiguous fragments (2 years ago)
* 7a024b535b - LU-10051 build: Build with ZFS 0.7.2 (2 years ago)
* 3a2f24fefb - LU-9158 test: Use project ID for project quota for quota_scan (2 years ago)
* e81847bd06 - LU-9660 ptlrpc: do not wakeup every second (2 years ago)
* d94de5c04e - LU-5170 lfs: Standardize error messages in lfs_setstripe() (2 years ago)
* 1cc354d559 - LU-9741 test: Correct check of stripe count for directories (2 years ago)
* f354258015 - LU-9672 gss: fix expiration time of sunrpc cache (2 years ago)
* 481bd7754d - LU-9590 tests: remove replay-single tests from ALWAYS_EXCEPT (2 years ago)
* aa83ef5a9e - LU-9462 doc: update lfs setstripe man page and usage (2 years ago)
* 29293649b5 - LU-8721 tests: add parallel-scale fio test (2 years ago)
* 036641a1e1 - LU-10047 tests: stop skipping test_102 subtests (2 years ago)
* 30f82889c1 - (2.10.54, refs/bisect/good-30f82889c194ce9a05a9659cea81cc8e5078d0b1) New tag 2.10.54 (2 years ago)

 

 In file included from /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs.h:39:0,
                 from /GITCHECKOUT/lustre-nas/lustre/include/obd_support.h:45,
                 from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:40:
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c: In function 'lustre_assert_wire_constants':
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:1794:11: error: 'LMV_HASH_FLAG_DEAD' undeclared (first use in this function)
  CLASSERT(LMV_HASH_FLAG_DEAD == 0x40000000);
           ^
/GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:360:46: note: in definition of macro 'CLASSERT'
 #define CLASSERT(cond) do {switch (1) {case (cond): case 0: break; } } while (0)
                                              ^
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:1794:11: note: each undeclared identifier is reported only once for each function it appears in
  CLASSERT(LMV_HASH_FLAG_DEAD == 0x40000000);
           ^
/GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:360:46: note: in definition of macro 'CLASSERT'
 #define CLASSERT(cond) do {switch (1) {case (cond): case 0: break; } } while (0)
                                              ^
In file included from /usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/linkage.h:4:0,
                 from /usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/fs.h:4,
                 from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:36:
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3550:23: error: invalid application of 'sizeof' to incomplete type 'struct mgs_send_param'
  LASSERTF((int)sizeof(struct mgs_send_param) == 1024, "found %lld\n",
                       ^
/usr/src/linux-4.4.143-94.47.1.20180815nasa/include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
 # define unlikely(x) __builtin_expect(!!(x), 0)
                                          ^
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3550:2: note: in expansion of macro 'LASSERTF'
  LASSERTF((int)sizeof(struct mgs_send_param) == 1024, "found %lld\n",
  ^
In file included from /GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs.h:39:0,
                 from /GITCHECKOUT/lustre-nas/lustre/include/obd_support.h:45,
                 from /GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:40:
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3551:27: error: invalid application of 'sizeof' to incomplete type 'struct mgs_send_param'
    (long long)(int)sizeof(struct mgs_send_param));
                           ^
/GITCHECKOUT/lustre-nas/libcfs/include/libcfs/libcfs_private.h:88:9: note: in definition of macro 'LASSERTF'
      ## __VA_ARGS__);   \
         ^
/GITCHECKOUT/lustre-nas/lustre/ptlrpc/wiretest.c:3552:11: error: â undeclared (first use in this function)
Comment by Mahmoud Hanafi [ 26/Oct/19 ]

I reverted

 7954a52042 - LU-9983 ko2iblnd: allow for discontiguous fragments

at 2.10.55. Doing so fixed all the bulk timeout issues. But I can't easily revert it at 2.12.2 to absolutely verify that this is what introduced the issue. Can you take a closer look at this commit, and maybe give us a patch that reverts it for 2.12.2 to try?

 

Here are the results of the tests:

Client                Server   Pass/Fail
2.10.8                2.10.8   pass
2.10.8                2.12.2   pass
2.12.2                2.12.2   fail
2.12.0                2.10.8   fail
2.12.2                2.10.8   fail
2.11.0                2.10.8   fail
tag: 2.10.57          2.10.8   fail
tag: 2.10.56          2.10.8   fail with write bulk errors
tag: 2.10.55          2.10.8   fail with write bulk errors
tag: 2.10.54          2.10.8   pass
tag: 2.10.56 (revert 7954a52 - LU-9983 ko2iblnd: allow for discontiguous fragments)
                      2.10.8   pass
Comment by Serguei Smirnov [ 29/Oct/19 ]

Hi mhanafi,

LU-9983 introduced a change that is not in 2.10.56 but is in 2.10.57: "1f50b1e494ff1b4988508c6d6398ee6769467931 LU-9983 osp: align the OSP request size by 4k". This change is supposed to make sure that pages are aligned so that fragmentation doesn't occur.

Could you please try the attached 0002-LU-12856-revert-LU-9983.patch on top of 2.10.57? It reverts the fragmentation-related changes introduced by LU-9983 (which are in both 2.10.56 and 2.10.57) and adds a debug message that is printed if fragmentation does occur. 

If the errors can still be seen, it will mean that some other change between 2.10.56 and 2.10.57 is causing it.

Thanks,

Serguei.

Comment by Mahmoud Hanafi [ 29/Oct/19 ]

Here is the debug output. This was a 2.11.0 client with a 2.10.8 server plus the debug patch.
Client Side

00000100:00000200:24.0F:1572319805.927751:0:5247:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1648686524462144, portal 4
00000100:00000200:24.0:1572319805.927756:0:5247:0:(niobuf.c:85:ptl_send_buf()) Sending 608 bytes to portal 6, xid 1648686524462144, offset 0
00000400:00000200:24.0:1572319805.927759:0:5247:0:(lib-move.c:2983:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
00000400:00000200:24.0:1572319805.927764:0:5247:0:(lib-move.c:2113:lnet_select_pathway()) TRACE: 10.151.4.80@o2ib(10.151.4.80@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT
00000800:00000200:24.0:1572319805.927767:0:5247:0:(o2iblnd_cb.c:1617:kiblnd_send()) sending 608 bytes in 1 frags to 12345-10.151.27.39@o2ib
00000800:00000200:24.0:1572319805.927771:0:5247:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88090474fc40] -> 10.151.27.39@o2ib (2) version: 12
00000800:00000200:24.0:1572319805.927772:0:5247:0:(o2iblnd_cb.c:1498:kiblnd_launch_tx()) conn[ffff88103ff79c00] (68)++
00000800:00000200:24.0:1572319805.927773:0:5247:0:(o2iblnd_cb.c:1273:kiblnd_queue_tx_locked()) conn[ffff88103ff79c00] (69)++
00000800:00000200:24.0:1572319805.927776:0:5247:0:(o2iblnd_cb.c:1504:kiblnd_launch_tx()) conn[ffff88103ff79c00] (70)--
00000100:00000200:24.0F:1572319806.043820:0:5247:0:(niobuf.c:429:ptlrpc_register_bulk()) Setup 1 bulk put-sink buffers: 1 pages 4096 bytes, mbits x0x5db78bea32fe0-0x5db78bea32fe0, portal 8
00000100:00000200:24.0:1572319806.043826:0:5247:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 1024 bytes, xid 1648686524477408, portal 4
00000100:00000200:24.0:1572319806.043827:0:5247:0:(niobuf.c:85:ptl_send_buf()) Sending 608 bytes to portal 6, xid 1648686524477408, offset 0
00000400:00000200:24.0:1572319806.043830:0:5247:0:(lib-move.c:2983:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
00000400:00000200:24.0:1572319806.043835:0:5247:0:(lib-move.c:2113:lnet_select_pathway()) TRACE: 10.151.4.80@o2ib(10.151.4.80@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT
00000800:00000200:24.0:1572319806.043837:0:5247:0:(o2iblnd_cb.c:1617:kiblnd_send()) sending 608 bytes in 1 frags to 12345-10.151.27.39@o2ib
00000800:00000200:24.0:1572319806.043839:0:5247:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88090474fc40] -> 10.151.27.39@o2ib (2) version: 12
00000800:00000200:24.0:1572319806.043840:0:5247:0:(o2iblnd_cb.c:1498:kiblnd_launch_tx()) conn[ffff88103ff79c00] (70)++
00000800:00000200:24.0:1572319806.043841:0:5247:0:(o2iblnd_cb.c:1273:kiblnd_queue_tx_locked()) conn[ffff88103ff79c00] (71)++
00000800:00000200:24.0:1572319806.043843:0:5247:0:(o2iblnd_cb.c:1504:kiblnd_launch_tx()) conn[ffff88103ff79c00] (73)--
00000100:00000200:24.0:1572319806.044181:0:5247:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff8807b7d589c0 x1648686524477408/t0(0) o3->nbptest-OST0001-osc-ffff88085dfaf800@10.151.27.39@o2ib:6/4 lens 608/440 e 0 to 0 dl 1572320211 ref 2 fl Rpc

Server Side


 00000100:00000200:2.0:1572319602.131825:0:19476:0:(service.c:2540:ptlrpc_main()) service thread 1 (#2) started
00000100:00000200:14.0:1572319805.872010:0:19476:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1648686524462144
00000100:00000200:10.0:1572319805.887904:0:19476:0:(niobuf.c:193:ptlrpc_start_bulk_transfer()) NASA rq_mbits = 0x0, rq->rq_xid = 0x5db78bea2f440, mask = 0xfffffffffffffff0
00000100:00000200:10.0:1572319805.887906:0:19476:0:(niobuf.c:207:ptlrpc_start_bulk_transfer()) NASA posted_md = 0, total_md = 1, mbits = 0x0
00000400:00000200:10.0:1572319805.887913:0:19476:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.151.4.80@o2ib
00000400:00000200:10.0:1572319805.887925:0:19476:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.4.80@o2ib(10.151.4.80@o2ib:10.151.4.80@o2ib) : PUT
00000800:00000200:10.0:1572319805.887930:0:19476:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.4.80@o2ib
00000800:00000200:10.0:1572319805.887933:0:19476:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
00000800:00000200:10.0:1572319805.887938:0:19476:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff89169ab0b280] -> 10.151.4.80@o2ib (2) version: 12
00000800:00000200:10.0:1572319805.887940:0:19476:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8916777d4200] (68)++
00000800:00000200:10.0:1572319805.887942:0:19476:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff8916777d4200] (69)++
00000800:00000200:10.0:1572319805.887946:0:19476:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff8916777d4200] (70)--
00000100:00000200:10.0:1572319805.887949:0:19476:0:(niobuf.c:268:ptlrpc_start_bulk_transfer()) NASA Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.4.80@o2ib mbits 0x0-0x0 posted_md = 1
00010000:00020000:0.0:1572319905.887258:0:19476:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8915c5c3f850 x1648686524462144/t0(0) o3->22dff754-ac95-23ef-d877-0f51317e2d84@10.151.4.80@o2ib:95/0 lens 608/432 e 0 to 0 dl 1572320060 ref 1 fl Interpret:/0/0 rc 0/0
00000100:00000200:0.0:1572319905.887272:0:19476:0:(events.c:449:server_bulk_callback()) event type 6, status 0, desc ffff8915bf42fe00
00000400:00000200:0.0:1572319905.887274:0:19476:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff891679fdb480
00000020:02000400:0.0:1572319905.887327:0:19476:0:(tgt_handler.c:2046:tgt_brw_read()) nbptest-OST0000: Bulk IO read error with 22dff754-ac95-23ef-d877-0f51317e2d84 (at 10.151.4.80@o2ib), client will retry: rc -110
00010000:00000080:0.0:1572319905.887337:0:19476:0:(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff8915c5c3f850 x1648686524462144/t0(0) o3->22dff754-ac95-23ef-d877-0f51317e2d84@10.151.4.80@o2ib:95/0 lens 608/432 e 0 to 0 dl 1572320060 ref 1 fl Interpret:/0/ffffffff rc -110/-1
00000100:00000200:0.0:1572319905.887360:0:19476:0:(niobuf.c:958:ptlrpc_register_rqbd()) LNetMEAttach: portal 6
Comment by Mahmoud Hanafi [ 30/Oct/19 ]

Doing a git bisect between 2.10.57 and 2.10.56, I was able to identify the commit that introduced this issue. It is:

commit 3483e195314bddb8d72594ebb10307c83a4bb860
Author: Patrick Farrell <paf@cray.com>
Date:   Thu Dec 7 07:00:58 2017 -0600

    LU-1757 brw: Fix short i/o and enable for mdc
    
    The short i/o flag was left out of the OST flags in the
    original patch, meaning it was not really on.  Also, the
    short_io_size value was used uninitialized, meaning it
    was sometimes non-zero, which coudl lead to several issues.
    
    Also add the short i/o flag to the MDC/MDT for data on MDT.
    Quick testing suggests this works fine with no further
    changes.
    
    Cray-bug-id: LUS-187
    Signed-off-by: Patrick Farrell <paf@cray.com>
    Change-Id: I4154b87d5ad73b53467b0382368fad7c5ba177fe
    Reviewed-on: https://review.whamcloud.com/30435
    Tested-by: Jenkins
    Reviewed-by: Mike Pershin <mike.pershin@intel.com>
    Reviewed-by: Alexandr Boyko <c17825@cray.com>
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

I tested reverting the commit in 2.11.0 and it fixed the issue. I tested the revert in the 2.12.2 client and it DID NOT fix the issue. Now I am really confused.

2.11.0 is sles12sp3 and 2.12.2 is sles12sp4.

Comment by Peter Jones [ 31/Oct/19 ]

Mike

Can you identify what the problem is with the LU-1757 patch?

Peter

Comment by Mikhail Pershin [ 31/Oct/19 ]

Mahmoud, can you explicitly disable short_io on the clients and run the tests?
lctl set_param osc.*.short_io_bytes=0
The patch you've reverted fixes the short_io feature agreement between server and client; that actually explains the client-server error matrix, since only after that patch are clients really able to use the short_io feature. So it seems the real source of the problem is that feature, and we can check that by disabling it on the clients.
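
To make concrete what disabling it changes, here is a rough sketch of the client-side decision, assuming only the documented meaning of short_io_bytes (this is not the actual osc code): I/Os at or below the threshold are carried inline in the RPC buffers instead of going through a bulk descriptor, and a threshold of 0 pushes everything back onto the normal bulk path.

#include <stdbool.h>

/* Illustrative helper, not a Lustre function: short_io_bytes is the
 * per-OSC tunable (the default appears later in this ticket to be 16384;
 * 0 disables the feature). */
static bool use_short_io(unsigned int io_bytes, unsigned int short_io_bytes)
{
        /* lctl set_param osc.*.short_io_bytes=0 makes this always false,
         * so every transfer goes through the bulk path. */
        return short_io_bytes != 0 && io_bytes <= short_io_bytes;
}

/* With the default threshold, the 4k test I/Os in this ticket are
 * short_io candidates; with the threshold at 0 they are not. */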

Comment by Mahmoud Hanafi [ 31/Oct/19 ]

I tested (lctl set_param osc.*.short_io_bytes=0) on 100 clients and it did not reproduce the bulk timeout issue.

So it looks like we don't need to revert the patch. What debug info would you like in order to further diagnose the issue?

Comment by Mikhail Pershin [ 31/Oct/19 ]

Mahmoud, that is useful info, thanks. I will check the related code first.

Comment by Mahmoud Hanafi [ 31/Oct/19 ]

The short_io_bytes param doesn't show up in /proc or /sys.

 401i0n3 ~ # ls -l /sys/kernel/debug/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000
total 0
-r--r--r-- 1 root root 0 Oct 31 09:43 srpc_contexts
-r--r--r-- 1 root root 0 Oct 31 09:43 srpc_info
--w------- 1 root root 0 Oct 31 09:43 srpc_sepol
-rw-r--r-- 1 root root 0 Oct 31 09:43 stats
r401i0n3 ~ # ls -l /proc/fs/lustre/osc/nbp2-OST0000-osc-ffff9bd2999f5000/
total 0
-rw-r--r-- 1 root root 0 Oct 31 10:16 checksum_type
-r--r--r-- 1 root root 0 Oct 31 10:16 connect_flags
-rw-r--r-- 1 root root 0 Oct 31 10:16 cur_grant_bytes
-rw-r--r-- 1 root root 0 Oct 31 10:16 import
-rw-r--r-- 1 root root 0 Oct 31 10:16 max_pages_per_rpc
-rw-r--r-- 1 root root 0 Oct 31 10:16 osc_cached_mb
-rw-r--r-- 1 root root 0 Oct 31 10:16 osc_stats
-r--r--r-- 1 root root 0 Oct 31 10:16 ost_server_uuid
-rw-r--r-- 1 root root 0 Oct 31 10:16 pinger_recov
-rw-r--r-- 1 root root 0 Oct 31 10:16 rpc_stats
-r--r--r-- 1 root root 0 Oct 31 10:16 state
-r--r--r-- 1 root root 0 Oct 31 10:16 timeouts
-r--r--r-- 1 root root 0 Oct 31 10:16 unstable_stats
r401i0n3 ~ # modinfo lustre
filename:       /lib/modules/4.12.14-95.19.1.20190617-nasa/updates/lustre-client/fs/lustre.ko
license:        GPL
version:        2.12.2
description:    Lustre Client File System
author:         OpenSFS, Inc. <http://www.lustre.org/>
suserelease:    SLE12-SP4
srcversion:     DA4405C9155F0D08BCC73DB
depends:        obdclass,ptlrpc,libcfs,lnet,lmv,mdc,lov
retpoline:      Y
vermagic:       4.12.14-95.19.1.20190617-nasa SMP mod_unload modversions retpoline 


r401i0n3 ~ # lctl get_param osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes
osc.nbp2-OST0000-osc-ffff9bd2999f5000.short_io_bytes=0
Comment by Mikhail Pershin [ 31/Oct/19 ]

Mahmoud, just to be sure: we are still considering the problem as described in the ticket, the error due to truncated bulk READ, right?

Comment by Mahmoud Hanafi [ 31/Oct/19 ]

2.10.8 servers (LU-12772) error with a bulk timeout. 2.12.2 servers report a truncated read error. In both cases the client keeps retrying the IO and the files can't be closed.

Comment by Mikhail Pershin [ 31/Oct/19 ]

btw, what is the default value of short_io_bytes prior to setting it to 0?

Comment by Mahmoud Hanafi [ 31/Oct/19 ]

looks like 16k

 # lctl get_param osc.nbptest*.short_io_bytes
osc.nbptest-OST0000-osc-ffff9573da091800.short_io_bytes=16384
osc.nbptest-OST0001-osc-ffff9573da091800.short_io_bytes=16384
osc.nbptest-OST0002-osc-ffff9573da091800.short_io_bytes=16384
osc.nbptest-OST0003-osc-ffff9573da091800.short_io_bytes=16384
osc.nbptest-OST0004-osc-ffff9573da091800.short_io_bytes=16384
osc.nbptest-OST0005-osc-ffff9573da091800.short_io_bytes=16384
Comment by Gerrit Updater [ 31/Oct/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36632
Subject: LU-12856 target: check FLFLAGS are valid while accessing them
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 304b3626108ecf2accb2de9a61494807fe7995a4

Comment by Mikhail Pershin [ 31/Oct/19 ]

Please note, this patch just fixes the only bad thing with short_io I've found. It doesn't look directly related to this issue: in theory that bug could cause the server to wrongly handle a bulk request as short_io, but here we have errors in bulk handling, so that is not short_io. It is needed anyway and worth trying too.
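
For reference, a minimal sketch of the idea behind the fix, inferred from the patch subject and the note above rather than copied from the change itself: o_flags in the incoming request is only meaningful when the FLFLAGS valid bit is set, so the server should test the valid mask before acting on flags such as the short_io flag.

#include <stdbool.h>
#include <stdint.h>

/* Stand-in types and bit values for illustration only; the real
 * OBD_MD_FLFLAGS / OBD_FL_SHORT_IO constants live in the Lustre wire
 * protocol headers. */
#define MD_FLFLAGS_VALID   0x1ULL
#define FL_SHORT_IO        0x1U

struct obdo_lite {
        uint64_t o_valid;   /* mask saying which obdo fields are meaningful */
        uint32_t o_flags;   /* undefined unless the FLFLAGS valid bit is set */
};

static bool request_is_short_io(const struct obdo_lite *oa)
{
        /* Check the valid mask first: otherwise o_flags may be stale
         * garbage and a normal bulk request could be misread as short_io. */
        return (oa->o_valid & MD_FLFLAGS_VALID) && (oa->o_flags & FL_SHORT_IO);
}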

Comment by Gerrit Updater [ 06/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36632/
Subject: LU-12856 target: check FLFLAGS are valid while accessing them
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 707f5a982e895c9a484dcdb8d1644e3f63c7c5cc

Comment by Peter Jones [ 06/Dec/19 ]

Landed for 2.14

Comment by Gerrit Updater [ 12/Dec/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37000
Subject: LU-12856 target: check FLFLAGS are valid while accessing them
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2309beb5d11b84e2e4828a4f81e09e2657b47c60

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37000/
Subject: LU-12856 target: check FLFLAGS are valid while accessing them
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5519429c5c4cee15d43c7483f29318899e3cf95e
