[LU-12065] Client got evicted when lock callback timer expired on OSS Created: 12/Mar/19 Updated: 01/Apr/19 Resolved: 21/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Testing soak with b2_10-ib build #107 (same as rc1), seeing following errors on OSS side. Client got evicted, not sure if this is a dup of LU-11818 app error 245441-iorssf.out task 19 writing /mnt/soaked/soaktest/test/iorssf/245441/ssf ** error ** ERROR in aiori-POSIX.c (line 256): transfer failed. ERROR: Input/output error ** exiting ** srun: Job step aborted: Waiting up to 32 seconds for job step to finish. slurmstepd: error: *** STEP 245441.0 ON soak-17 CANCELLED AT 2019-03-12T19:12:57 *** srun: error: soak-39: task 19: Exited with exit code 255 srun: Terminating job step 245441.0 srun: error: soak-23: task 12: Killed client soak-39 syslog Mar 12 19:11:59 soak-39 systemd: Removed slice User Slice of root. Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0] req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000 Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages Mar 12 19:12:40 soak-39 sshd[18764]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Mar 12 19:12:41 soak-39 sshd[18764]: Accepted publickey for root from 10.10.1.116 port 34086 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Mar 12 19:12:41 soak-39 systemd: Created slice User Slice of root. Mar 12 19:12:41 soak-39 systemd-logind: New session 114 of user root. Mar 12 19:12:41 soak-39 systemd: Started Session 114 of user root. Mar 12 19:12:41 soak-39 sshd[18764]: pam_unix(sshd:session): session opened for user root by (uid=0) Mar 12 19:12:43 soak-39 sshd[18764]: pam_unix(sshd:session): session closed for user root Mar 12 19:12:43 soak-39 systemd-logind: Removed session 114. Mar 12 19:12:43 soak-39 systemd: Removed slice User Slice of root. Mar 12 19:12:48 soak-39 sshd[18798]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Mar 12 19:12:48 soak-39 sshd[18798]: Accepted publickey for root from 10.10.1.116 port 34138 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Mar 12 19:12:48 soak-39 systemd: Created slice User Slice of root. Mar 12 19:12:48 soak-39 systemd-logind: New session 115 of user root. Mar 12 19:12:48 soak-39 systemd: Started Session 115 of user root. Mar 12 19:12:48 soak-39 sshd[18798]: pam_unix(sshd:session): session opened for user root by (uid=0) Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:339:ptlrpc_invalidate_import()) soaked-OST0003_UUID: rc = -110 waiting for callback (6 != 0) Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:365:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff9ae731b20000 x1627823901365600/t0(0) o103->soaked-OST0003-osc-ffff9aef55f82800@192.168.1.107@o2ib:17/18 lens 328/224 e 0 to 0 dl 1552417817 ref 2 fl UnregRPC:ES/0/ffffffff rc -5/-1 Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:379:ptlrpc_invalidate_import()) soaked-OST0003_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out. OSS syslog Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1 Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1 Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 2 Mar 12 19:12:46 soak-7 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1 Mar 12 19:12:46 soak-7 kernel: LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0 Mar 12 19:12:47 soak-7 kernel: LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0 Mar 12 19:12:49 soak-7 kernel: Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107 Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417 Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages Mar 12 19:14:14 soak-7 sshd[208612]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key Mar 12 19:14:14 soak-7 sshd[208612]: Accepted publickey for root from 10.10.1.116 port 32972 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM Mar 12 19:14:14 soak-7 systemd: Created slice User Slice of root. Mar 12 19:14:14 soak-7 systemd-logind: New session 128 of user root. Mar 12 19:14:14 soak-7 systemd: Started Session 128 of user root. console log around the same time from same OSS node [ 4246.048619] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 138s: evicting client at 192.168.1.139@o2ib ns: filter-soaked-OST000b_UUID lock: ffff8e20898cb600/0x1cc0bc14571a1fd6 lrc: 3/0,0 mode: PW/PW res: [0x5c0000400:0x16da44a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000400030020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103dc4 expref: 182 pid: 40382 timeout: 4298912796 lvb_type: 0 [ 4258.931645] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417673/real 0] req@ffff8e20080d8c00 x1627824065646496/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417680 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 [ 4258.961747] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message [ 4307.013970] LNetError: 22817:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 6 seconds [ 4307.025344] LNetError: 22817:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.139@o2ib (12): c: 0, oc: 0, rc: 7 [ 4307.040012] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c20bbce00 [ 4307.052087] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c20bbce00 [ 4307.064166] LustreError: 72826:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8e1c28b19850 x1627823901265440/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:334/0 lens 608/448 e 1 to 0 dl 1552417744 ref 1 fl Interpret:/0/0 rc 0/0 [ 4307.065457] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.139@o2ib) returned error from blocking AST (req@ffff8e210cc45a00 x1627824065648432 status -107 rc -107), evict it ns: filter-soaked-OST000f_UUID lock: ffff8e219abde400/0x1cc0bc1456b1733c lrc: 4/0,0 mode: PR/PR res: [0x2d1d58f:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f7500fff0a expref: 134 pid: 5311 timeout: 4299074300 lvb_type: 1 [ 4307.065460] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message [ 4307.065462] LustreError: 138-a: soaked-OST000f: A client on nid 192.168.1.139@o2ib was evicted due to a lock blocking callback time out: rc -107 [ 4307.065903] LustreError: 24506:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417728 with bad export cookie 2071862624129549424 [ 4307.169926] LustreError: 64567:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8e21812d8300 x1627824065709840/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 [ 4307.212124] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110 [ 4348.569869] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff8e1c27fd4450 x1627823901305376/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:454/0 lens 608/448 e 0 to 0 dl 1552417864 ref 1 fl Interpret:/0/0 rc 0/0 [ 4348.574183] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110 [ 4348.612117] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) Skipped 2 previous similar messages [ 4381.366667] Lustre: soaked-OST0003: Client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) reconnecting [ 4381.378443] Lustre: Skipped 11 previous similar messages [ 4402.265592] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417809/real 1552417811] req@ffff8e2181a73300 x1627824066270928/t0(0) o104->soaked-OST0003@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417823 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 [ 4402.296573] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 31 previous similar messages [ 4407.189878] LustreError: 24531:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107 req@ffff8e1c1961e850 x1627823901357664/t0(0) o3->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/432 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0 [ 4407.217287] Lustre: soaked-OST0003: Bulk IO read error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc -107 [ 4407.240530] LustreError: 25151:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff8e1c291dc850 x1627823901357648/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/448 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0 [ 4407.267684] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110 [ 4407.282849] Lustre: Skipped 2 previous similar messages [ 4497.255109] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c24ef2c00 [ 4497.267214] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c24ef2c00 [ 4497.279320] LustreError: 23844:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8e1c291d9850 x1627823901365088/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:552/0 lens 608/448 e 0 to 0 dl 1552417962 ref 1 fl Interpret:/0/0 rc 0/0 [ 4497.306831] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110 [ 4545.033624] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1 [ 4545.082619] LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0 [ 4545.739600] LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0 [ 4548.106298] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107 [ 4548.142590] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417 [ 4548.160127] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages [ 4774.815013] Lustre: soaked-OST0003: haven't heard from client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8e212a901400, cur 1552418196 expire 1552418046 last 1552417969 |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Eviction appears to be a secondary impact from not being able to complete i/o: Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0] req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000 Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
So, for soak-39: [ 6703.820812] Lustre: 14304:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417310/real 1552417315] req@ffff9ae731b7e300 x1627823901253984/t0(0) o101->soaked-OST0000-osc-ffff9aef55f82800@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1552417318 ref 1 fl Rpc:X/0/ffffffff rc 0/-1^M [ 6703.862819] Lustre: 14304:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 54 previous similar messages^M [ 6899.713947] LustreError: 11-0: soaked-OST000b-osc-ffff9aef55f82800: operation ost_write to node 192.168.1.107@o2ib failed: rc = -107^M [ 6899.714590] LustreError: 167-0: soaked-OST000b-osc-ffff9aef55f82800: This client was evicted by soaked-OST000b; in progress operations using this service will fail.^M [ 6899.755561] LustreError: Skipped 2 previous similar messages^M [ 6934.425363] LustreError: 167-0: soaked-OST0001-osc-ffff9aef55f82800: This client was evicted by soaked-OST0001; in progress operations using this service will fail.^M [ 6934.449120] LustreError: 18193:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST0001-osc-ffff9aef55f82800: namespace resource [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount nonzero (1) after lock cleanup; forcing cleanup.^M [ 6934.479588] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount = 2^M [ 6934.500652] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000401:0x147f90a:0x0].0x0 (ffff9aeecb3baa80) refcount = 2^M [ 6934.521808] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000402:0x163422f:0x0].0x0 (ffff9ae755216c00) refcount = 2^M [ 6934.542429] LustreError: 18193:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x340000401:0x147f90a:0x0].0x0 (ffff9aeecb3baa80) refcount = 2^M [ 6960.134237] LustreError: 14276:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef52b9ee00^M [ 6960.160379] LustreError: 11-0: soaked-OST000f-osc-ffff9aef55f82800: operation ldlm_enqueue to node 192.168.1.107@o2ib failed: rc = -107^M [ 6960.174917] LustreError: 167-0: soaked-OST000f-osc-ffff9aef55f82800: This client was evicted by soaked-OST000f; in progress operations using this service will fail.^M [ 6960.210221] LustreError: 18198:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST000f-osc-ffff9aef55f82800: namespace resource [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount nonzero (1) after lock cleanup; forcing cleanup.^M [ 6960.241028] LustreError: 18198:0:(ldlm_resource.c:1101:ldlm_resource_complain()) Skipped 3 previous similar messages^M [ 6960.256020] LustreError: 18198:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount = 2^M [ 6960.277689] LustreError: 18198:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x6c0000402:0x163f767:0x0].0x0 (ffff9ae7552166c0) refcount = 2^M [ 7150.289937] LNetError: 14275:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds^M [ 7150.304607] LNetError: 14275:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (3): c: 1, oc: 3, rc: 8^M [ 7150.324916] LNetError: 0:0:(o2iblnd_cb.c:3454:kiblnd_qp_event()) 192.168.1.107@o2ib: Async QP event type 1^M [ 7150.338879] mlx4_core 0000:02:00.0: CQ overrun on CQN 0000aa^M [ 7150.348197] LNetError: 0:0:(o2iblnd_cb.c:3532:kiblnd_cq_event()) 192.168.1.107@o2ib: async CQ event type 0^M [ 7150.362068] Lustre: soaked-OST0003-osc-ffff9aef55f82800: Connection to soaked-OST0003 (at 192.168.1.107@o2ib) was lost; in progress operations using this service will wait for recovery to complete^M [ 7150.388074] Lustre: Skipped 41 previous similar messages^M [ 7201.253697] LustreError: 11-0: soaked-OST0003-osc-ffff9aef55f82800: operation ost_write to node 192.168.1.107@o2ib failed: rc = -107^M [ 7201.254206] LustreError: 167-0: soaked-OST0003-osc-ffff9aef55f82800: This client was evicted by soaked-OST0003; in progress operations using this service will fail.^M [ 7201.296713] LustreError: Skipped 3 previous similar messages^M Those are the four OST evictions. I believe all of them are from the same OSS (192.168.1.107), and I see a number of weird kiblnd messages and RDMA timeouts. Basically, everything starts timing out suddenly, we get evicted, and there are some (to my eye) weird kiblnd messages. I see no particular indication of trouble on the OSS when this happens. (I do see some kiblnd, etc, failures, but no indication of extreme load or other problems.) Is someone with IB expertise able to take a look? I'd like to be persuaded there's no IB level issue. |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
| Comment by Amir Shehata (Inactive) [ 13/Mar/19 ] |
|
Is this problem reproducible? There is one change I'm suspecting, that I'd like to try and reproduce without. |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Somewhat, Amir - It requires running on Soak for a while, and there's no guarantee. So we could make a change and run for a while and see if Soak runs better... That's the best we can do. |
| Comment by Amir Shehata (Inactive) [ 14/Mar/19 ] |
|
I made a mistake. I thought part of the patch for |
| Comment by Sarah Liu [ 14/Mar/19 ] |
|
I think I saw some network issues on b2_10-ib build #101 when doing test for |
| Comment by Patrick Farrell (Inactive) [ 14/Mar/19 ] |
|
Sarah, what is the MOFED version in buid #101? |
| Comment by Sarah Liu [ 14/Mar/19 ] |
|
The MOFED version hasn't been changed from #92 to #107 |
| Comment by Sarah Liu [ 14/Mar/19 ] |
|
Just found that we updated kernel from 3.10.0-957.el7 to 3.10.0-957.1.3.el7 in build #107, while client nodes are not using the updated kernel, not sure if it may affect the results. |
| Comment by Amir Shehata (Inactive) [ 14/Mar/19 ] |
|
I looked at the lnet/lnd related changes between 2.10.6-RC1 and 2.10.7-RC1. I only see the following patches: LU-9810 lnet: fix build with M-OFED 4.1 LU-10124 lnet: Correctly add peer MR value while importing LU-11006 lnet: fix show peer yaml tree with no peer LU-11424 lnet: copy the correct amount of cpts to lnet_cpts LU-9810 lnd: use less CQ entries for each connection LU-10213 lnd: calculate qp max_send_wrs properly I'm currently suspecting "lnd: use less CQ entries for each connection". I'm not sure what happens if we run out of CQ entries. If the completion events are being dropped or delayed, that could explain the timeouts we're seeing while transmits are on the active_tx queue. It means they never received completion events and stay on the queue until they timeout. I'm going to commit a revert to this change. Would be nice to try this out on soak. |
| Comment by Gerrit Updater [ 14/Mar/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34427 |
| Comment by Sarah Liu [ 15/Mar/19 ] |
|
soak has been running with Amir's patch #34427 for about 13 hours, it does help the situation a lot, the pass/fail rate is 240 pass/22 fail, which is similar to the 2.10.6. While it still showed RDMA errors, but not seeing the "Async QP" anymore ./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds ./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (57): c: 0, oc: 0, rc: 8 ./soak-6.log:Mar 15 12:57:39 soak-6 kernel: LNetError: 97:0:(o2iblnd_cb.c:2759:kiblnd_rejected()) 192.168.1.108@o2ib rejected: o2iblnd fatal error ./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds ./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 3, rc: 8 ./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds ./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (60): c: 0, oc: 1, rc: 8 ./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 11 seconds ./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 1, rc: 8 ./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 9 seconds ./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (59): c: 0, oc: 0, rc: 8 ./soak-8.log:Mar 15 12:57:33 soak-8 kernel: LNetError: 233:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.1.106@o2ib on NA (ib0:0:192.168.1.108): bad dst nid 192.168.1.108@o2ib we still have some eviction, but at that time, soak was doing OSS restart Mar 15 10:34:36 soak-39 kernel: Lustre: soaked-OST000e-osc-ffffa038b9975800: Connection to soaked-OST000e (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete Mar 15 10:34:36 soak-39 kernel: LustreError: 167-0: soaked-OST000e-osc-ffffa038b9975800: This client was evicted by soaked-OST000e; in progress operations using this service will fail. Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST000e-osc-ffffa038b9975800: namespace resource [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount = 2 Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order): Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: soaked-OST000e-osc-ffffa038b9975800 lock: ffffa030817c1800/0xd39fa3d0f8107339 lrc: 3/0,1 mode: PW/PW res: [0x2fea580:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x526480000000 nid: local remote: 0x56c636f7ad02f38 expref: -99 pid: 49356 timeout: 0 lvb_type: 1 2019-03-15 10:30:52,961:fsmgmt.fsmgmt:INFO triggering fault oss_restart 2019-03-15 10:30:52,963:fsmgmt.fsmgmt:INFO executing cmd pm -h powerman -c soak-7> /dev/null 2019-03-15 10:31:23,525:fsmgmt.fsmgmt:INFO trying to connect to soak-7 ... 2019-03-15 10:31:33,538:fsmgmt.fsmgmt:INFO trying to connect to soak-7 ... |
| Comment by Patrick Farrell (Inactive) [ 15/Mar/19 ] |
|
I looked at the logs around that eviction highlighted by Sara at the end there. It does not have any obvious connection to kiblnd issues, so it's a bit of a mystery. But we don't have a zero rate of evictions on any branch right now, so it's not a huge cause for concern at the moment. |
| Comment by Gerrit Updater [ 20/Mar/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34473 |
| Comment by Gerrit Updater [ 20/Mar/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34474 |
| Comment by Gerrit Updater [ 21/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34473/ |
| Comment by Gerrit Updater [ 21/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34427/ |
| Comment by Peter Jones [ 21/Mar/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 21/Mar/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34474/ |