[LU-12065] Client got evicted when lock callback timer expired on OSS Created: 12/Mar/19  Updated: 01/Apr/19  Resolved: 21/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: soak

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Testing soak with b2_10-ib build #107 (same as RC1), seeing the following errors on the OSS side. The client got evicted; not sure if this is a dup of LU-11818.

app error 245441-iorssf.out

task 19 writing /mnt/soaked/soaktest/test/iorssf/245441/ssf
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 245441.0 ON soak-17 CANCELLED AT 2019-03-12T19:12:57 ***
srun: error: soak-39: task 19: Exited with exit code 255
srun: Terminating job step 245441.0
srun: error: soak-23: task 12: Killed

client soak-39 syslog

Mar 12 19:11:59 soak-39 systemd: Removed slice User Slice of root.
Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0]  req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000
Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages
Mar 12 19:12:40 soak-39 sshd[18764]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Mar 12 19:12:41 soak-39 sshd[18764]: Accepted publickey for root from 10.10.1.116 port 34086 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar 12 19:12:41 soak-39 systemd: Created slice User Slice of root.
Mar 12 19:12:41 soak-39 systemd-logind: New session 114 of user root.
Mar 12 19:12:41 soak-39 systemd: Started Session 114 of user root.
Mar 12 19:12:41 soak-39 sshd[18764]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 12 19:12:43 soak-39 sshd[18764]: pam_unix(sshd:session): session closed for user root
Mar 12 19:12:43 soak-39 systemd-logind: Removed session 114.
Mar 12 19:12:43 soak-39 systemd: Removed slice User Slice of root.
Mar 12 19:12:48 soak-39 sshd[18798]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Mar 12 19:12:48 soak-39 sshd[18798]: Accepted publickey for root from 10.10.1.116 port 34138 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar 12 19:12:48 soak-39 systemd: Created slice User Slice of root.
Mar 12 19:12:48 soak-39 systemd-logind: New session 115 of user root.
Mar 12 19:12:48 soak-39 systemd: Started Session 115 of user root.
Mar 12 19:12:48 soak-39 sshd[18798]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:339:ptlrpc_invalidate_import()) soaked-OST0003_UUID: rc = -110 waiting for callback (6 != 0)
Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:365:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9ae731b20000 x1627823901365600/t0(0) o103->soaked-OST0003-osc-ffff9aef55f82800@192.168.1.107@o2ib:17/18 lens 328/224 e 0 to 0 dl 1552417817 ref 2 fl UnregRPC:ES/0/ffffffff rc -5/-1
Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:379:ptlrpc_invalidate_import()) soaked-OST0003_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out.

OSS syslog

Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 2
Mar 12 19:12:46 soak-7 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
Mar 12 19:12:46 soak-7 kernel: LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 19:12:47 soak-7 kernel: LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 19:12:49 soak-7 kernel: Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
Mar 12 19:14:14 soak-7 sshd[208612]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Mar 12 19:14:14 soak-7 sshd[208612]: Accepted publickey for root from 10.10.1.116 port 32972 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar 12 19:14:14 soak-7 systemd: Created slice User Slice of root.
Mar 12 19:14:14 soak-7 systemd-logind: New session 128 of user root.
Mar 12 19:14:14 soak-7 systemd: Started Session 128 of user root.

console log around the same time from the same OSS node

[ 4246.048619] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 138s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST000b_UUID lock: ffff8e20898cb600/0x1cc0bc14571a1fd6 lrc: 3/0,0 mode: PW/PW res: [0x5c0000400:0x16da44a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000400030020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103dc4 expref: 182 pid: 40382 timeout: 4298912796 lvb_type: 0
[ 4258.931645] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417673/real 0]  req@ffff8e20080d8c00 x1627824065646496/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417680 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 4258.961747] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 4307.013970] LNetError: 22817:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 6 seconds
[ 4307.025344] LNetError: 22817:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.139@o2ib (12): c: 0, oc: 0, rc: 7
[ 4307.040012] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c20bbce00
[ 4307.052087] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c20bbce00
[ 4307.064166] LustreError: 72826:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c28b19850 x1627823901265440/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:334/0 lens 608/448 e 1 to 0 dl 1552417744 ref 1 fl Interpret:/0/0 rc 0/0
[ 4307.065457] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.139@o2ib) returned error from blocking AST (req@ffff8e210cc45a00 x1627824065648432 status -107 rc -107), evict it ns: filter-soaked-OST000f_UUID lock: ffff8e219abde400/0x1cc0bc1456b1733c lrc: 4/0,0 mode: PR/PR res: [0x2d1d58f:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f7500fff0a expref: 134 pid: 5311 timeout: 4299074300 lvb_type: 1
[ 4307.065460] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message
[ 4307.065462] LustreError: 138-a: soaked-OST000f: A client on nid 192.168.1.139@o2ib was evicted due to a lock blocking callback time out: rc -107
[ 4307.065903] LustreError: 24506:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417728 with bad export cookie 2071862624129549424
[ 4307.169926] LustreError: 64567:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8e21812d8300 x1627824065709840/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[ 4307.212124] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
[ 4348.569869] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c27fd4450 x1627823901305376/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:454/0 lens 608/448 e 0 to 0 dl 1552417864 ref 1 fl Interpret:/0/0 rc 0/0
[ 4348.574183] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
[ 4348.612117] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) Skipped 2 previous similar messages
[ 4381.366667] Lustre: soaked-OST0003: Client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) reconnecting
[ 4381.378443] Lustre: Skipped 11 previous similar messages
[ 4402.265592] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417809/real 1552417811]  req@ffff8e2181a73300 x1627824066270928/t0(0) o104->soaked-OST0003@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417823 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 4402.296573] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
[ 4407.189878] LustreError: 24531:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8e1c1961e850 x1627823901357664/t0(0) o3->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/432 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
[ 4407.217287] Lustre: soaked-OST0003: Bulk IO read error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc -107
[ 4407.240530] LustreError: 25151:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c291dc850 x1627823901357648/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/448 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
[ 4407.267684] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
[ 4407.282849] Lustre: Skipped 2 previous similar messages
[ 4497.255109] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c24ef2c00
[ 4497.267214] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c24ef2c00
[ 4497.279320] LustreError: 23844:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c291d9850 x1627823901365088/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:552/0 lens 608/448 e 0 to 0 dl 1552417962 ref 1 fl Interpret:/0/0 rc 0/0
[ 4497.306831] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
[ 4545.033624] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
[ 4545.082619] LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
[ 4545.739600] LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
[ 4548.106298] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
[ 4548.142590] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
[ 4548.160127] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
[ 4774.815013] Lustre: soaked-OST0003: haven't heard from client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8e212a901400, cur 1552418196 expire 1552418046 last 1552417969


 Comments   
Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Eviction appears to be a secondary effect of not being able to complete I/O:

Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0]  req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000
Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages 
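
To illustrate the eviction mechanism named in the ticket title, here is a minimal userspace sketch (not the actual ldlm_lockd.c code; the struct and names are hypothetical): on the server, each lock with an outstanding blocking AST carries a deadline, and a periodic check evicts the client if its cancel never arrives, which is what happens when the client's RPCs are all timing out at the network layer.

/*
 * Simplified sketch of the server-side waiting-locks logic behind the
 * "lock callback timer expired ... evicting client" message.
 * Hypothetical types/names; not the Lustre implementation.
 */
#include <stdio.h>
#include <time.h>

struct waiting_lock {
	const char	*client_nid;	/* e.g. "192.168.1.139@o2ib" */
	time_t		 deadline;	/* when the blocking AST must be answered */
	int		 cancelled;	/* set once the client's cancel arrives */
};

/* Runs periodically, in the spirit of waiting_locks_callback(). */
static void check_waiting_locks(const struct waiting_lock *locks, int n,
				time_t now)
{
	for (int i = 0; i < n; i++) {
		if (locks[i].cancelled || now <= locks[i].deadline)
			continue;
		/* The client never answered the blocking AST in time;
		 * the only safe option left for the server is eviction. */
		printf("lock callback timer expired: evicting client at %s\n",
		       locks[i].client_nid);
	}
}

int main(void)
{
	/* One lock whose AST deadline passed 112s ago, as in the logs. */
	struct waiting_lock locks[] = {
		{ "192.168.1.139@o2ib", time(NULL) - 112, 0 },
	};

	check_waiting_locks(locks, 1, time(NULL));
	return 0;
}
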
Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

So, for soak-39:

[ 6703.820812] Lustre: 14304:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417310/real 1552417315]  req@ffff9ae731b7e300 x1627823901253984/t0(0) o101->soaked-OST0000-osc-ffff9aef55f82800@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1552417318 ref 1 fl Rpc:X/0/ffffffff rc 0/-1^M
[ 6703.862819] Lustre: 14304:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 54 previous similar messages^M
[ 6899.713947] LustreError: 11-0: soaked-OST000b-osc-ffff9aef55f82800: operation ost_write to node 192.168.1.107@o2ib failed: rc = -107^M
[ 6899.714590] LustreError: 167-0: soaked-OST000b-osc-ffff9aef55f82800: This client was evicted by soaked-OST000b; in progress operations using this service will fail.^M
[ 6899.755561] LustreError: Skipped 2 previous similar messages^M
[ 6934.425363] LustreError: 167-0: soaked-OST0001-osc-ffff9aef55f82800: This client was evicted by soaked-OST0001; in progress operations using this service will fail.^M
[ 6934.449120] LustreError: 18193:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST0001-osc-ffff9aef55f82800: namespace resource [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[ 6934.479588] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount = 2^M
[ 6934.500652] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000401:0x147f90a:0x0].0x0 (ffff9aeecb3baa80) refcount = 2^M
[ 6934.521808] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount = 2^M
[ 6934.542429] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000401:0x147f90a:0x0].0x0 (ffff9aeecb3baa80) refcount = 2^M
[ 6960.134237] LustreError: 14276:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef52b9ee00^M
[ 6960.160379] LustreError: 11-0: soaked-OST000f-osc-ffff9aef55f82800: operation ldlm_enqueue to node 192.168.1.107@o2ib failed: rc = -107^M
[ 6960.174917] LustreError: 167-0: soaked-OST000f-osc-ffff9aef55f82800: This client was evicted by soaked-OST000f; in progress operations using this service will fail.^M
[ 6960.210221] LustreError: 18198:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST000f-osc-ffff9aef55f82800: namespace resource [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount nonzero (1) after lock cleanup; forcing cleanup.^M
[ 6960.241028] LustreError: 18198:0:(ldlm_resource.c:1101:ldlm_resource_complain()) Skipped 3 previous similar messages^M
[ 6960.256020] LustreError: 18198:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount = 2^M
[ 6960.277689] LustreError: 18198:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount = 2^M
[ 7150.289937] LNetError: 14275:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds^M
[ 7150.304607] LNetError: 14275:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (3): c: 1, oc: 3, rc: 8^M
[ 7150.324916] LNetError: 0:0:(o2iblnd_cb.c:3454:kiblnd_qp_event()) 192.168.1.107@o2ib: Async QP event type 1^M
[ 7150.338879] mlx4_core 0000:02:00.0: CQ overrun on CQN 0000aa^M
[ 7150.348197] LNetError: 0:0:(o2iblnd_cb.c:3532:kiblnd_cq_event()) 192.168.1.107@o2ib: async CQ event type 0^M
[ 7150.362068] Lustre: soaked-OST0003-osc-ffff9aef55f82800: Connection to soaked-OST0003 (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete^M
[ 7150.388074] Lustre: Skipped 41 previous similar messages^M
[ 7201.253697] LustreError: 11-0: soaked-OST0003-osc-ffff9aef55f82800: operation ost_write to node 192.168.1.107@o2ib failed: rc = -107^M
[ 7201.254206] LustreError: 167-0: soaked-OST0003-osc-ffff9aef55f82800: This client was evicted by soaked-OST0003; in progress operations using this service will fail.^M
[ 7201.296713] LustreError: Skipped 3 previous similar messages^M
 

Those are the four OST evictions.  I believe all of them are from the same OSS (192.168.1.107), along with a number of odd kiblnd messages and RDMA timeouts.  Basically, everything suddenly starts timing out and we get evicted.

I see no particular indication of trouble on the OSS when this happens.  (I do see some kiblnd, etc., failures, but no indication of extreme load or other problems.)
The OSS for these OSTs is soak-5.

Is someone with IB expertise able to take a look?  I'd like to be persuaded there's no IB-level issue.
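
For context on the "Timed out tx" messages above, a minimal userspace sketch of the kind of deadline scan involved (not the actual o2iblnd code; the struct and helper are hypothetical): transmits wait on a list for an RDMA completion, and a periodic check flags any that have sat past their deadline.

/*
 * Simplified sketch of the scan behind "Timed out tx: active_txs".
 * Analogous in spirit to kiblnd_check_txs_locked(); hypothetical names.
 */
#include <stdio.h>
#include <time.h>

struct tx {
	long	cookie;		/* identifies the transmit */
	time_t	deadline;	/* when a completion was expected */
};

static int check_txs(const struct tx *txs, int n, time_t now)
{
	int timed_out = 0;

	for (int i = 0; i < n; i++) {
		if (now <= txs[i].deadline)
			continue;
		/* No completion event arrived in time: the connection is
		 * torn down and upper layers see RPC timeouts/evictions. */
		printf("Timed out tx %ld: %ld seconds overdue\n",
		       txs[i].cookie, (long)(now - txs[i].deadline));
		timed_out++;
	}
	return timed_out;
}

int main(void)
{
	time_t now = time(NULL);
	struct tx txs[] = { { 1, now - 6 }, { 2, now + 50 } };

	check_txs(txs, 2, now);
	return 0;
}
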

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

ashehata sharmaso ^

Comment by Amir Shehata (Inactive) [ 13/Mar/19 ]

Is this problem reproducible? There is one change I suspect, and I'd like to try to reproduce without it.

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Somewhat, Amir - it requires running on Soak for a while, and there's no guarantee.  So we could make a change, run for a while, and see if Soak runs better...  That's the best we can do.

Comment by Amir Shehata (Inactive) [ 14/Mar/19 ]

I made a mistake: I thought part of the patch for LU-10291 was in 2.10.7, but it's not. At this point, I'm not sure what would cause this. It looks like the timeouts are happening because transmits are not completing; we're seeing this on a couple of sites. What was the last build that passed? What MOFED version was being used? Maybe we can bisect.

Comment by Sarah Liu [ 14/Mar/19 ]

I think I saw some network issues on b2_10-ib build #101 when testing for LU-11943. The 2.10.6-RC build #92 should be a good one.
The MOFED version of #92 is 4.5-OFED.4.5.1.0.1.1.gb4fdfac.x86_64.

Comment by Patrick Farrell (Inactive) [ 14/Mar/19 ]

Sarah, what is the MOFED version in build #101?

Comment by Sarah Liu [ 14/Mar/19 ]

The MOFED version hasn't changed between #92 and #107.

Comment by Sarah Liu [ 14/Mar/19 ]

Just found that we updated the kernel from 3.10.0-957.el7 to 3.10.0-957.1.3.el7 in build #107, while the client nodes are not using the updated kernel. Not sure whether that may affect the results.

Comment by Amir Shehata (Inactive) [ 14/Mar/19 ]

I looked at the lnet/lnd-related changes between 2.10.6-RC1 and 2.10.7-RC1. I only see the following patches:

LU-9810 lnet: fix build with M-OFED 4.1
LU-10124 lnet: Correctly add peer MR value while importing
LU-11006 lnet: fix show peer yaml tree with no peer
LU-11424 lnet: copy the correct amount of cpts to lnet_cpts
LU-9810 lnd: use less CQ entries for each connection
LU-10213 lnd: calculate qp max_send_wrs properly

I'm currently suspecting "lnd: use less CQ entries for each connection". I'm not sure what happens if we run out of CQ entries; if the completion events are being dropped or delayed, that could explain the timeouts we're seeing while transmits are on the active_tx queue. It would mean they never receive completion events and stay on the queue until they time out.

I'm going to push a revert of this change. It would be nice to try it out on soak.
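
To make the suspicion concrete, a small sketch of the sizing argument (illustrative arithmetic only; the constants and helper are hypothetical, not the actual o2iblnd sizing code): each posted send and receive work request can generate one completion queue entry, so a CQ sized below the total posted WRs can overrun, consistent with the "CQ overrun on CQN" message in the client console log above. Dropped completions then leave transmits parked on active_txs until they time out.

#include <stdio.h>

#define RECV_WRS_PER_CONN	256	/* hypothetical receive-queue depth */
#define SEND_WRS_PER_CONN	128	/* hypothetical send-queue depth */

/* Hypothetical helper: CQ entries needed for a set of connections. */
static int cq_entries(int conns, int wrs_per_conn)
{
	return conns * wrs_per_conn;
}

int main(void)
{
	int conns = 4;
	int needed = cq_entries(conns, RECV_WRS_PER_CONN + SEND_WRS_PER_CONN);
	int undersized = cq_entries(conns, RECV_WRS_PER_CONN); /* sends not counted */

	printf("CQ entries needed: %d\n", needed);
	printf("undersized CQ:     %d (risks overrun; completions are lost and\n"
	       "                   txs linger on active_txs until they time out)\n",
	       undersized);
	return 0;
}
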

Comment by Gerrit Updater [ 14/Mar/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34427
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: aed3633ce8e257b9d03ef03dec33c8e764e95da1

Comment by Sarah Liu [ 15/Mar/19 ]

Soak has been running with Amir's patch #34427 for about 13 hours, and it helps the situation a lot: the pass/fail rate is 240 pass / 22 fail, which is similar to 2.10.6. It still shows RDMA errors, but no more "Async QP" events:

./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (57): c: 0, oc: 0, rc: 8
./soak-6.log:Mar 15 12:57:39 soak-6 kernel: LNetError: 97:0:(o2iblnd_cb.c:2759:kiblnd_rejected()) 192.168.1.108@o2ib rejected: o2iblnd fatal error
./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 3, rc: 8
./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (60): c: 0, oc: 1, rc: 8
./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 11 seconds
./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 1, rc: 8
./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 9 seconds
./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (59): c: 0, oc: 0, rc: 8
./soak-8.log:Mar 15 12:57:33 soak-8 kernel: LNetError: 233:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.1.106@o2ib on NA (ib0:0:192.168.1.108): bad dst nid 192.168.1.108@o2ib

We still see some evictions, but at that time soak was performing an OSS restart:

Mar 15 10:34:36 soak-39 kernel: Lustre: soaked-OST000e-osc-ffffa038b9975800: Connection to soaked-OST000e (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 15 10:34:36 soak-39 kernel: LustreError: 167-0: soaked-OST000e-osc-ffffa038b9975800: This client was evicted by soaked-OST000e; in progress operations using this service will fail.
Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST000e-osc-ffffa038b9975800: namespace resource [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount = 2
Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):
Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: soaked-OST000e-osc-ffffa038b9975800 lock: ffffa030817c1800/0xd39fa3d0f8107339 lrc: 3/0,1 mode: PW/PW res: [0x2fea580:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x526480000000 nid: local remote: 0x56c636f7ad02f38 expref: -99 pid: 49356 timeout: 0 lvb_type: 1
2019-03-15 10:30:52,961:fsmgmt.fsmgmt:INFO     triggering fault oss_restart
2019-03-15 10:30:52,963:fsmgmt.fsmgmt:INFO     executing cmd pm -h powerman -c soak-7> /dev/null
2019-03-15 10:31:23,525:fsmgmt.fsmgmt:INFO     trying to connect to soak-7 ...
2019-03-15 10:31:33,538:fsmgmt.fsmgmt:INFO     trying to connect to soak-7 ...
Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ]

I looked at the logs around that eviction highlighted by Sarah at the end there.  It does not have any obvious connection to kiblnd issues, so it's a bit of a mystery.  But we don't have a zero rate of evictions on any branch right now, so it's not a huge cause for concern at the moment.

Comment by Gerrit Updater [ 20/Mar/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34473
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e8743ff884130a6fcc663df615033d8d14014a74

Comment by Gerrit Updater [ 20/Mar/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34474
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 45f85254238a0fa53225897bed8a6bcbe58faf58

Comment by Gerrit Updater [ 21/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34473/
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bf3fc7f1a7bf82c02181c64810bc3f93b3303703

Comment by Gerrit Updater [ 21/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34427/
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 6c12330a350d3e082c835fb61cfcc5c9c93e9bdd

Comment by Peter Jones [ 21/Mar/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 21/Mar/19 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34474/
Subject: LU-12065 lnd: increase CQ entries
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0205858f6533c7cff8fe8b43d09cece15e0b5385
