
[LU-12065] Client got evicted when lock callback timer expired on OSS

Details


    Description

      Testing soak with b2_10-ib build #107 (same as rc1); seeing the following errors on the OSS side. The client got evicted. Not sure if this is a dup of LU-11818.

      app error 245441-iorssf.out

      task 19 writing /mnt/soaked/soaktest/test/iorssf/245441/ssf
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: *** STEP 245441.0 ON soak-17 CANCELLED AT 2019-03-12T19:12:57 ***
      srun: error: soak-39: task 19: Exited with exit code 255
      srun: Terminating job step 245441.0
      srun: error: soak-23: task 12: Killed
      

      client soak-39 syslog

      Mar 12 19:11:59 soak-39 systemd: Removed slice User Slice of root.
      Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0]  req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000
      Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages
      Mar 12 19:12:40 soak-39 sshd[18764]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:12:41 soak-39 sshd[18764]: Accepted publickey for root from 10.10.1.116 port 34086 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:12:41 soak-39 systemd: Created slice User Slice of root.
      Mar 12 19:12:41 soak-39 systemd-logind: New session 114 of user root.
      Mar 12 19:12:41 soak-39 systemd: Started Session 114 of user root.
      Mar 12 19:12:41 soak-39 sshd[18764]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Mar 12 19:12:43 soak-39 sshd[18764]: pam_unix(sshd:session): session closed for user root
      Mar 12 19:12:43 soak-39 systemd-logind: Removed session 114.
      Mar 12 19:12:43 soak-39 systemd: Removed slice User Slice of root.
      Mar 12 19:12:48 soak-39 sshd[18798]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:12:48 soak-39 sshd[18798]: Accepted publickey for root from 10.10.1.116 port 34138 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:12:48 soak-39 systemd: Created slice User Slice of root.
      Mar 12 19:12:48 soak-39 systemd-logind: New session 115 of user root.
      Mar 12 19:12:48 soak-39 systemd: Started Session 115 of user root.
      Mar 12 19:12:48 soak-39 sshd[18798]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:339:ptlrpc_invalidate_import()) soaked-OST0003_UUID: rc = -110 waiting for callback (6 != 0)
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:365:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9ae731b20000 x1627823901365600/t0(0) o103->soaked-OST0003-osc-ffff9aef55f82800@192.168.1.107@o2ib:17/18 lens 328/224 e 0 to 0 dl 1552417817 ref 2 fl UnregRPC:ES/0/ffffffff rc -5/-1
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:379:ptlrpc_invalidate_import()) soaked-OST0003_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out.
      

      OSS syslog

      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 2
      Mar 12 19:12:46 soak-7 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
      Mar 12 19:12:46 soak-7 kernel: LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
      Mar 12 19:12:47 soak-7 kernel: LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
      Mar 12 19:12:49 soak-7 kernel: Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
      Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
      Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
      Mar 12 19:14:14 soak-7 sshd[208612]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:14:14 soak-7 sshd[208612]: Accepted publickey for root from 10.10.1.116 port 32972 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:14:14 soak-7 systemd: Created slice User Slice of root.
      Mar 12 19:14:14 soak-7 systemd-logind: New session 128 of user root.
      Mar 12 19:14:14 soak-7 systemd: Started Session 128 of user root.
      

      console log around the same time from the same OSS node

      [ 4246.048619] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 138s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST000b_UUID lock: ffff8e20898cb600/0x1cc0bc14571a1fd6 lrc: 3/0,0 mode: PW/PW res: [0x5c0000400:0x16da44a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000400030020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103dc4 expref: 182 pid: 40382 timeout: 4298912796 lvb_type: 0
      [ 4258.931645] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417673/real 0]  req@ffff8e20080d8c00 x1627824065646496/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417680 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4258.961747] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [ 4307.013970] LNetError: 22817:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 6 seconds
      [ 4307.025344] LNetError: 22817:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.139@o2ib (12): c: 0, oc: 0, rc: 7
      [ 4307.040012] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c20bbce00
      [ 4307.052087] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c20bbce00
      [ 4307.064166] LustreError: 72826:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c28b19850 x1627823901265440/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:334/0 lens 608/448 e 1 to 0 dl 1552417744 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4307.065457] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.139@o2ib) returned error from blocking AST (req@ffff8e210cc45a00 x1627824065648432 status -107 rc -107), evict it ns: filter-soaked-OST000f_UUID lock: ffff8e219abde400/0x1cc0bc1456b1733c lrc: 4/0,0 mode: PR/PR res: [0x2d1d58f:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f7500fff0a expref: 134 pid: 5311 timeout: 4299074300 lvb_type: 1
      [ 4307.065460] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message
      [ 4307.065462] LustreError: 138-a: soaked-OST000f: A client on nid 192.168.1.139@o2ib was evicted due to a lock blocking callback time out: rc -107
      [ 4307.065903] LustreError: 24506:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417728 with bad export cookie 2071862624129549424
      [ 4307.169926] LustreError: 64567:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8e21812d8300 x1627824065709840/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [ 4307.212124] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4348.569869] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c27fd4450 x1627823901305376/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:454/0 lens 608/448 e 0 to 0 dl 1552417864 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4348.574183] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4348.612117] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) Skipped 2 previous similar messages
      [ 4381.366667] Lustre: soaked-OST0003: Client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) reconnecting
      [ 4381.378443] Lustre: Skipped 11 previous similar messages
      [ 4402.265592] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417809/real 1552417811]  req@ffff8e2181a73300 x1627824066270928/t0(0) o104->soaked-OST0003@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417823 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4402.296573] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
      [ 4407.189878] LustreError: 24531:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8e1c1961e850 x1627823901357664/t0(0) o3->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/432 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4407.217287] Lustre: soaked-OST0003: Bulk IO read error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc -107
      [ 4407.240530] LustreError: 25151:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c291dc850 x1627823901357648/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/448 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4407.267684] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4407.282849] Lustre: Skipped 2 previous similar messages
      [ 4497.255109] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c24ef2c00
      [ 4497.267214] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c24ef2c00
      [ 4497.279320] LustreError: 23844:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c291d9850 x1627823901365088/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:552/0 lens 608/448 e 0 to 0 dl 1552417962 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4497.306831] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4545.033624] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
      [ 4545.082619] LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4545.739600] LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4548.106298] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
      [ 4548.142590] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
      [ 4548.160127] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
      [ 4774.815013] Lustre: soaked-OST0003: haven't heard from client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8e212a901400, cur 1552418196 expire 1552418046 last 1552417969
      

          Activity


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34474/
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 0205858f6533c7cff8fe8b43d09cece15e0b5385
            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34427/
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 6c12330a350d3e082c835fb61cfcc5c9c93e9bdd

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/34473/
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bf3fc7f1a7bf82c02181c64810bc3f93b3303703

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34474
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 45f85254238a0fa53225897bed8a6bcbe58faf58

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34473
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e8743ff884130a6fcc663df615033d8d14014a74

            pfarrell Patrick Farrell (Inactive) added a comment -

            I looked at the logs around the eviction highlighted by Sarah at the end there. It does not have any obvious connection to the kiblnd issues, so it's a bit of a mystery. But we don't have a zero rate of evictions on any branch right now, so it's not a huge cause for concern at the moment.
            sarah Sarah Liu added a comment - edited

            Soak has been running with Amir's patch #34427 for about 13 hours, and it does help the situation a lot: the pass/fail rate is 240 pass / 22 fail, which is similar to 2.10.6. It still shows RDMA errors, but the "Async QP" errors are no longer seen:

            ./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 7 seconds
            ./soak-20.log:Mar 15 10:32:36 soak-20 kernel: LNetError: 11172:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (57): c: 0, oc: 0, rc: 8
            ./soak-6.log:Mar 15 12:57:39 soak-6 kernel: LNetError: 97:0:(o2iblnd_cb.c:2759:kiblnd_rejected()) 192.168.1.108@o2ib rejected: o2iblnd fatal error
            ./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 11 seconds
            ./soak-28.log:Mar 15 10:33:22 soak-28 kernel: LNetError: 10081:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 3, rc: 8
            ./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
            ./soak-29.log:Mar 15 10:29:35 soak-29 kernel: LNetError: 13587:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (60): c: 0, oc: 1, rc: 8
            ./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 11 seconds
            ./soak-31.log:Mar 15 10:29:24 soak-31 kernel: LNetError: 13599:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (61): c: 0, oc: 1, rc: 8
            ./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 9 seconds
            ./soak-18.log:Mar 15 10:31:24 soak-18 kernel: LNetError: 10902:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.107@o2ib (59): c: 0, oc: 0, rc: 8
            ./soak-8.log:Mar 15 12:57:33 soak-8 kernel: LNetError: 233:0:(o2iblnd_cb.c:2299:kiblnd_passive_connect()) Can't accept conn from 192.168.1.106@o2ib on NA (ib0:0:192.168.1.108): bad dst nid 192.168.1.108@o2ib
            

            We still see some evictions, but at that time soak was performing an OSS restart:

            Mar 15 10:34:36 soak-39 kernel: Lustre: soaked-OST000e-osc-ffffa038b9975800: Connection to soaked-OST000e (at 192.168.1.106@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Mar 15 10:34:36 soak-39 kernel: LustreError: 167-0: soaked-OST000e-osc-ffffa038b9975800: This client was evicted by soaked-OST000e; in progress operations using this service will fail.
            Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1101:ldlm_resource_complain()) soaked-OST000e-osc-ffffa038b9975800: namespace resource [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
            Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2fea580:0x0:0x0].0x0 (ffffa035ee3469c0) refcount = 2
            Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):
            Mar 15 10:34:37 soak-39 kernel: LustreError: 50354:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: soaked-OST000e-osc-ffffa038b9975800 lock: ffffa030817c1800/0xd39fa3d0f8107339 lrc: 3/0,1 mode: PW/PW res: [0x2fea580:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x526480000000 nid: local remote: 0x56c636f7ad02f38 expref: -99 pid: 49356 timeout: 0 lvb_type: 1
            
            2019-03-15 10:30:52,961:fsmgmt.fsmgmt:INFO     triggering fault oss_restart
            2019-03-15 10:30:52,963:fsmgmt.fsmgmt:INFO     executing cmd pm -h powerman -c soak-7> /dev/null
            2019-03-15 10:31:23,525:fsmgmt.fsmgmt:INFO     trying to connect to soak-7 ...
            2019-03-15 10:31:33,538:fsmgmt.fsmgmt:INFO     trying to connect to soak-7 ...
            

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34427
            Subject: LU-12065 lnd: increase CQ entries
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: aed3633ce8e257b9d03ef03dec33c8e764e95da1

            ashehata Amir Shehata (Inactive) added a comment -

            I looked at the lnet/lnd-related changes between 2.10.6-RC1 and 2.10.7-RC1. I only see the following patches:

            LU-9810 lnet: fix build with M-OFED 4.1
            LU-10124 lnet: Correctly add peer MR value while importing
            LU-11006 lnet: fix show peer yaml tree with no peer
            LU-11424 lnet: copy the correct amount of cpts to lnet_cpts
            LU-9810 lnd: use less CQ entries for each connection
            LU-10213 lnd: calculate qp max_send_wrs properly

            I currently suspect "lnd: use less CQ entries for each connection". I'm not sure what happens if we run out of CQ entries; if the completion events are being dropped or delayed, that could explain the timeouts we're seeing while transmits are on the active_txs queue. It means they never received completion events and stay on the queue until they time out.

            I'm going to commit a revert of this change. It would be nice to try this out on soak.
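
            For reference on why an undersized CQ could produce these symptoms: every work request posted on a connection's queue pair eventually generates a completion entry, and if the CQ cannot hold all of them the HCA can overrun it, completions may be dropped, and the tx stays on active_txs until the timeout fires. The standalone sketch below only illustrates that sizing arithmetic; the struct fields, helper names, and numbers are invented for the example and are not the actual o2iblnd code or the patch.

            /*
             * Illustrative sketch only -- not the o2iblnd sources or the LU-12065
             * patch; parameter names and values here are made up.
             *
             * The CQ of a connection must hold at least one entry per work request
             * that can complete on it: all posted receives plus the worst-case
             * number of send WRs (a fragmented bulk send posts more than one WR).
             * Sizing the CQ below that risks an overrun, and a tx whose completion
             * is lost sits on active_txs until kiblnd_check_txs_locked() times it
             * out, as in the console log above.
             */
            #include <stdio.h>

            struct conn_params {
                int queue_depth;       /* posted receive buffers (credits)       */
                int max_frags;         /* max RDMA fragments per bulk transfer   */
                int concurrent_sends;  /* sends allowed in flight per connection */
            };

            /* one work request per posted receive */
            static int recv_wrs(const struct conn_params *p)
            {
                return p->queue_depth;
            }

            /* worst case: every in-flight send is fully fragmented */
            static int worst_case_send_wrs(const struct conn_params *p)
            {
                return p->concurrent_sends * (p->max_frags + 1);
            }

            int main(void)
            {
                struct conn_params p = {
                    .queue_depth = 8,
                    .max_frags = 256,
                    .concurrent_sends = 8,
                };

                /* "use less CQ entries": sized as if sends were never fragmented */
                int cq_small = recv_wrs(&p) + 2 * p.concurrent_sends;
                /* "increase CQ entries": sized for the worst-case WR count */
                int cq_safe = recv_wrs(&p) + worst_case_send_wrs(&p);

                printf("worst-case outstanding WRs: %d\n",
                       recv_wrs(&p) + worst_case_send_wrs(&p));
                printf("small CQ: %d entries (can be overrun)\n", cq_small);
                printf("safe CQ:  %d entries\n", cq_safe);
                return 0;
            }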


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 11
