Lustre / LU-3090

client fails to restore connection to OSTs

Details

    • 3
    • 7507

    Description

      Three OSS nodes running 2.1.4-3chaos rebooted this morning. A client login node (aztec3) has since failed to reconnect to most of the OSTs they serve. I will attach complete console logs.

      # aztec3 /root > lfs check servers | grep -v active
      lsc-OST0004-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0008-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST000c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0010-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0014-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0018-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST001c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0020-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0024-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0028-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST002c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0030-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0034-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0038-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0000-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST007f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0083-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0087-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST008b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST008f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0093-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0097-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST009b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST009f-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST00a3-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST00a7-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST00ab-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST00af-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST00b3-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST007b-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0130-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0134-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0138-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST013c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0140-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0144-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0148-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST014c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0150-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0154-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0158-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST015c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0160-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST0164-osc-ffff880c3043a000: check error: Resource temporarily unavailable
      lsc-OST012c-osc-ffff880c3043a000: check error: Resource temporarily unavailable
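
A listing this long is easier to reason about once it is reduced to OST indices. The following is a minimal sketch, assuming the `lfs check servers` lines have exactly the `<fs>-OST<hex index>-osc-<instance>: check error: <message>` shape shown above; the function names `failed_osts` and `summarize` are hypothetical helpers, not part of any Lustre tool.

```python
import re

# Matches one failing-target line in the shape seen in the listing above.
# The naming scheme is inferred from that output, not from a format spec.
CHECK_ERROR = re.compile(
    r"^(?P<fs>[\w.-]+?)-OST(?P<index>[0-9a-fA-F]{4})-osc-(?P<inst>[0-9a-f]+): "
    r"check error: (?P<error>.+)$"
)


def failed_osts(output):
    """Return a sorted list of (ost_index, error_text) for every check error."""
    failures = []
    for line in output.splitlines():
        match = CHECK_ERROR.match(line.strip())
        if match:
            failures.append((int(match.group("index"), 16), match.group("error")))
    return sorted(failures)


def summarize(failures):
    """Count failing targets per distinct error string."""
    counts = {}
    for _, error in failures:
        counts[error] = counts.get(error, 0) + 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the listing above, one from each index range.
    sample = (
        "lsc-OST0004-osc-ffff880c3043a000: check error: Resource temporarily unavailable\n"
        "lsc-OST007f-osc-ffff880c3043a000: check error: Resource temporarily unavailable\n"
        "lsc-OST012c-osc-ffff880c3043a000: check error: Resource temporarily unavailable\n"
    )
    failures = failed_osts(sample)
    for index, error in failures:
        print(f"OST{index:04x} (index {index}): {error}")
    print(summarize(failures))
```

Feeding the full output through `failed_osts` gives a sorted index list, which makes it straightforward to compare the failing targets against the OSTs served by each rebooted OSS.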
      

      Example dmesg from client:

      # aztec3 /root > dmesg | grep OST012c
      Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918372/real 1364918372] req@ffff88011258ec00 x1430682336999865/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: lsc-OST012c-osc-ffff880c3043a000: Connection to lsc-OST012c (at 172.19.1.121@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918397/real 1364918397] req@ffff8804cf6c4800 x1430682337000483/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918503 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364918447/real 1364918447] req@ffff880633f37c00 x1430682337002326/t0(0) o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 lens 192/192 e 0 to 1 dl 1364918553 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
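
The `sent` and `dl` fields in these messages can be compared directly to see how long each request was allowed to wait. The sketch below assumes both are Unix epoch seconds and that `dl` is the request deadline; `parse_expired` is a hypothetical helper. Opcode 400 should correspond to `OBD_PING` in the Lustre wire protocol, but that mapping is stated here from memory and is not taken from this report.

```python
import re
from datetime import datetime, timezone

# Fields pulled from a ptlrpc_expire_one_request() console line as shown above.
EXPIRED = re.compile(
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*?"
    r" o(?P<opc>\d+)->(?P<target>\S+?)@(?P<nid>\S+?):\d+/\d+ .*?"
    r" dl (?P<deadline>\d+) "
)


def parse_expired(line):
    """Extract opcode, target, send time, and allowed wait from one log line."""
    m = EXPIRED.search(line)
    if m is None:
        return None
    sent = int(m.group("sent"))
    deadline = int(m.group("deadline"))
    return {
        "opcode": int(m.group("opc")),
        "target": m.group("target"),
        "nid": m.group("nid"),
        # Assumption: "sent" is seconds since the Unix epoch.
        "sent_utc": datetime.fromtimestamp(sent, tz=timezone.utc),
        # Assumption: "dl" is the deadline in the same clock as "sent".
        "allowed_seconds": deadline - sent,
    }


if __name__ == "__main__":
    line = (
        "Lustre: 7485:0:(client.c:1820:ptlrpc_expire_one_request()) @@@ Request sent "
        "has timed out for slow reply: [sent 1364918372/real 1364918372] "
        "req@ffff88011258ec00 x1430682336999865/t0(0) "
        "o400->lsc-OST012c-osc-ffff880c3043a000@172.19.1.121@o2ib100:28/4 "
        "lens 192/192 e 0 to 1 dl 1364918478 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1"
    )
    info = parse_expired(line)
    print(info["target"], info["nid"], "opcode", info["opcode"])
    print("sent at", info["sent_utc"].isoformat())
    print("allowed", info["allowed_seconds"], "seconds")
```

Running the same parser over all three lines makes the spacing between successive attempts and the per-request deadline easy to tabulate.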
      

      Example dmesg for one server:

      # sumom23 /root > dmesg | grep 172.16.66.53
      Lustre: lsc-OST0156: Client ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp) refused reconnection, still busy with 5 active RPCs
      Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
      Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
      Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
      Lustre: lsc-OST013e: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
      Lustre: lsc-OST0156: Client ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp) refused reconnection, still busy with 1 active RPCs
      LustreError: 23397:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff8802fb0af000 x1430682219726723/t0(0) o3->ea981a59-5970-b37c-8c49-7b6886344dc2@172.16.66.53@tcp:0/0 lens 456/400 e 1 to 0 dl 1364611876 ref 1 fl Interpret:/2/0 rc 0/0
      Lustre: lsc-OST0156: Bulk IO read error with ea981a59-5970-b37c-8c49-7b6886344dc2 (at 172.16.66.53@tcp), client will retry: rc -110
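
On the server side, it helps to tally the messages per OST and to name the return code. On Linux, errno 110 is `ETIMEDOUT`, and "Resource temporarily unavailable" on the client is the text for `EAGAIN` (11). The sketch below is a rough illustration that assumes the two message shapes shown above; `summarize_server_log` is a hypothetical helper, and the small errno table is hard-coded with Linux values so the result does not depend on the machine running the script.

```python
import re
from collections import Counter

# Linux errno values relevant to this report (generic Linux errno numbering).
LINUX_ERRNO = {11: "EAGAIN", 110: "ETIMEDOUT"}

BULK_ERROR = re.compile(
    r"Lustre: (?P<ost>\S+): Bulk IO (?P<dir>read|write) error with "
    r"(?P<uuid>\S+) .*rc (?P<rc>-?\d+)"
)
REFUSED = re.compile(
    r"Lustre: (?P<ost>\S+): Client (?P<uuid>\S+) .*refused reconnection, "
    r"still busy with (?P<n>\d+) active RPCs"
)


def summarize_server_log(lines):
    """Return (bulk_errors, busy) where bulk_errors counts (ost, errno_name)
    pairs and busy maps each OST to the active-RPC counts it reported."""
    bulk_errors = Counter()
    busy = {}
    for line in lines:
        m = BULK_ERROR.search(line)
        if m:
            name = LINUX_ERRNO.get(-int(m.group("rc")), "unknown")
            bulk_errors[(m.group("ost"), name)] += 1
            continue
        m = REFUSED.search(line)
        if m:
            busy.setdefault(m.group("ost"), []).append(int(m.group("n")))
    return bulk_errors, busy


if __name__ == "__main__":
    uuid = "ea981a59-5970-b37c-8c49-7b6886344dc2"
    log = [
        f"Lustre: lsc-OST0156: Client {uuid} (at 172.16.66.53@tcp) refused reconnection, still busy with 5 active RPCs",
        f"Lustre: lsc-OST0156: Bulk IO read error with {uuid} (at 172.16.66.53@tcp), client will retry: rc -110",
        f"Lustre: lsc-OST013e: Bulk IO read error with {uuid} (at 172.16.66.53@tcp), client will retry: rc -110",
    ]
    bulk_errors, busy = summarize_server_log(log)
    print(dict(bulk_errors))
    print(busy)
```

A summary like this shows at a glance which OSTs keep refusing the reconnect because earlier RPCs from the same client have not drained.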
      

      LLNL-bug-id: TOSS-2006

      Attachments

        1. console.aztec3 (67 kB)
        2. console.sumom1 (227 kB)
        3. console.sumom12 (38 kB)
        4. console.sumom23 (20 kB)
        5. sysrq-t.aztec3 (1.25 MB)
        6. sysrq-t.sumom1 (1.63 MB)

          People

            laisiyao Lai Siyao
            nedbass Ned Bass (Inactive)