Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12065

Client got evicted when lock callback timer expired on OSS

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      Testing soak with b2_10-ib build #107 (same as rc1), seeing following errors on OSS side. Client got evicted, not sure if this is a dup of LU-11818

      app error 245441-iorssf.out

      task 19 writing /mnt/soaked/soaktest/test/iorssf/245441/ssf
      ** error **
      ERROR in aiori-POSIX.c (line 256): transfer failed.
      ERROR: Input/output error
      ** exiting **
      srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
      slurmstepd: error: *** STEP 245441.0 ON soak-17 CANCELLED AT 2019-03-12T19:12:57 ***
      srun: error: soak-39: task 19: Exited with exit code 255
      srun: Terminating job step 245441.0
      srun: error: soak-23: task 12: Killed
      

      client soak-39 syslog

      Mar 12 19:11:59 soak-39 systemd: Removed slice User Slice of root.
      Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417897/real 0]  req@ffff9aef3e3d1e00 x1627823901379328/t0(0) o103->soaked-OST000c-osc-ffff9aef55f82800@192.168.1.104@o2ib:17/18 lens 328/224 e 0 to 1 dl 1552417924 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Mar 12 19:12:04 soak-39 kernel: LustreError: 14277:0:(events.c:199:client_bulk_callback()) event type 1, status -5, desc ffff9aef7f7be000
      Mar 12 19:12:04 soak-39 kernel: Lustre: 14341:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 93 previous similar messages
      Mar 12 19:12:40 soak-39 sshd[18764]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:12:41 soak-39 sshd[18764]: Accepted publickey for root from 10.10.1.116 port 34086 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:12:41 soak-39 systemd: Created slice User Slice of root.
      Mar 12 19:12:41 soak-39 systemd-logind: New session 114 of user root.
      Mar 12 19:12:41 soak-39 systemd: Started Session 114 of user root.
      Mar 12 19:12:41 soak-39 sshd[18764]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Mar 12 19:12:43 soak-39 sshd[18764]: pam_unix(sshd:session): session closed for user root
      Mar 12 19:12:43 soak-39 systemd-logind: Removed session 114.
      Mar 12 19:12:43 soak-39 systemd: Removed slice User Slice of root.
      Mar 12 19:12:48 soak-39 sshd[18798]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:12:48 soak-39 sshd[18798]: Accepted publickey for root from 10.10.1.116 port 34138 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:12:48 soak-39 systemd: Created slice User Slice of root.
      Mar 12 19:12:48 soak-39 systemd-logind: New session 115 of user root.
      Mar 12 19:12:48 soak-39 systemd: Started Session 115 of user root.
      Mar 12 19:12:48 soak-39 sshd[18798]: pam_unix(sshd:session): session opened for user root by (uid=0)
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:339:ptlrpc_invalidate_import()) soaked-OST0003_UUID: rc = -110 waiting for callback (6 != 0)
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:365:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff9ae731b20000 x1627823901365600/t0(0) o103->soaked-OST0003-osc-ffff9aef55f82800@192.168.1.107@o2ib:17/18 lens 328/224 e 0 to 0 dl 1552417817 ref 2 fl UnregRPC:ES/0/ffffffff rc -5/-1
      Mar 12 19:13:02 soak-39 kernel: LustreError: 18489:0:(import.c:379:ptlrpc_invalidate_import()) soaked-OST0003_UUID: Unregistering RPCs found (6). Network is sluggish? Waiting them to error out.
      

      OSS syslog

      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 1
      Mar 12 19:12:33 soak-7 sssd_be: GSSAPI client step 2
      Mar 12 19:12:46 soak-7 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
      Mar 12 19:12:46 soak-7 kernel: LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
      Mar 12 19:12:47 soak-7 kernel: LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
      Mar 12 19:12:49 soak-7 kernel: Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
      Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
      Mar 12 19:12:49 soak-7 kernel: LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
      Mar 12 19:14:14 soak-7 sshd[208612]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
      Mar 12 19:14:14 soak-7 sshd[208612]: Accepted publickey for root from 10.10.1.116 port 32972 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
      Mar 12 19:14:14 soak-7 systemd: Created slice User Slice of root.
      Mar 12 19:14:14 soak-7 systemd-logind: New session 128 of user root.
      Mar 12 19:14:14 soak-7 systemd: Started Session 128 of user root.
      

      console log around the same time from same OSS node

      [ 4246.048619] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 138s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST000b_UUID lock: ffff8e20898cb600/0x1cc0bc14571a1fd6 lrc: 3/0,0 mode: PW/PW res: [0x5c0000400:0x16da44a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000400030020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103dc4 expref: 182 pid: 40382 timeout: 4298912796 lvb_type: 0
      [ 4258.931645] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1552417673/real 0]  req@ffff8e20080d8c00 x1627824065646496/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417680 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4258.961747] Lustre: 24961:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [ 4307.013970] LNetError: 22817:0:(o2iblnd_cb.c:3147:kiblnd_check_txs_locked()) Timed out tx: active_txs, 6 seconds
      [ 4307.025344] LNetError: 22817:0:(o2iblnd_cb.c:3222:kiblnd_check_conns()) Timed out RDMA with 192.168.1.139@o2ib (12): c: 0, oc: 0, rc: 7
      [ 4307.040012] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c20bbce00
      [ 4307.052087] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c20bbce00
      [ 4307.064166] LustreError: 72826:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c28b19850 x1627823901265440/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:334/0 lens 608/448 e 1 to 0 dl 1552417744 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4307.065457] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 192.168.1.139@o2ib) returned error from blocking AST (req@ffff8e210cc45a00 x1627824065648432 status -107 rc -107), evict it ns: filter-soaked-OST000f_UUID lock: ffff8e219abde400/0x1cc0bc1456b1733c lrc: 4/0,0 mode: PR/PR res: [0x2d1d58f:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f7500fff0a expref: 134 pid: 5311 timeout: 4299074300 lvb_type: 1
      [ 4307.065460] LustreError: 24962:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) Skipped 1 previous similar message
      [ 4307.065462] LustreError: 138-a: soaked-OST000f: A client on nid 192.168.1.139@o2ib was evicted due to a lock blocking callback time out: rc -107
      [ 4307.065903] LustreError: 24506:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417728 with bad export cookie 2071862624129549424
      [ 4307.169926] LustreError: 64567:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8e21812d8300 x1627824065709840/t0(0) o104->soaked-OST000f@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      [ 4307.212124] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4348.569869] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c27fd4450 x1627823901305376/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:454/0 lens 608/448 e 0 to 0 dl 1552417864 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4348.574183] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4348.612117] LustreError: 23845:0:(ldlm_lib.c:3249:target_bulk_io()) Skipped 2 previous similar messages
      [ 4381.366667] Lustre: soaked-OST0003: Client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) reconnecting
      [ 4381.378443] Lustre: Skipped 11 previous similar messages
      [ 4402.265592] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417809/real 1552417811]  req@ffff8e2181a73300 x1627824066270928/t0(0) o104->soaked-OST0003@192.168.1.139@o2ib:15/16 lens 296/224 e 0 to 1 dl 1552417823 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4402.296573] Lustre: 71155:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
      [ 4407.189878] LustreError: 24531:0:(ldlm_lib.c:3199:target_bulk_io()) @@@ bulk READ failed: rc -107  req@ffff8e1c1961e850 x1627823901357664/t0(0) o3->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/432 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4407.217287] Lustre: soaked-OST0003: Bulk IO read error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc -107
      [ 4407.240530] LustreError: 25151:0:(ldlm_lib.c:3249:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff8e1c291dc850 x1627823901357648/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:513/0 lens 608/448 e 0 to 0 dl 1552417923 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4407.267684] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4407.282849] Lustre: Skipped 2 previous similar messages
      [ 4497.255109] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff8e1c24ef2c00
      [ 4497.267214] LustreError: 22817:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8e1c24ef2c00
      [ 4497.279320] LustreError: 23844:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8e1c291d9850 x1627823901365088/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:552/0 lens 608/448 e 0 to 0 dl 1552417962 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4497.306831] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -110
      [ 4545.033624] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 112s: evicting client at 192.168.1.139@o2ib  ns: filter-soaked-OST0003_UUID lock: ffff8e1d4f6b4a00/0x1cc0bc14571229e8 lrc: 3/0,0 mode: PR/PR res: [0x3c0000401:0x1477d0b:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.139@o2ib remote: 0xc961c1f750103360 expref: 13 pid: 243304 timeout: 4299211822 lvb_type: 1
      [ 4545.082619] LustreError: 24531:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1c17972450 x1627823901368672/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:641/0 lens 608/448 e 0 to 0 dl 1552418051 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4545.739600] LustreError: 153453:0:(ldlm_lib.c:3243:target_bulk_io()) @@@ Eviction on bulk WRITE  req@ffff8e1bfbc4b050 x1627823901368608/t0(0) o4->535d4896-d57a-5632-2383-d69ccb94eddf@192.168.1.139@o2ib:637/0 lens 608/448 e 0 to 0 dl 1552418047 ref 1 fl Interpret:/0/0 rc 0/0
      [ 4548.106298] Lustre: soaked-OST0003: Bulk IO write error with 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib), client will retry: rc = -107
      [ 4548.142590] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) ldlm_cancel from 192.168.1.139@o2ib arrived at 1552417969 with bad export cookie 2071862624129549417
      [ 4548.160127] LustreError: 57344:0:(ldlm_lockd.c:2362:ldlm_cancel_handler()) Skipped 6 previous similar messages
      [ 4774.815013] Lustre: soaked-OST0003: haven't heard from client 535d4896-d57a-5632-2383-d69ccb94eddf (at 192.168.1.139@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8e212a901400, cur 1552418196 expire 1552418046 last 1552417969
      

      Attachments

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: