  1. Lustre
  2. LU-12774

Lustre client OST stuck in "Evicted" state

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.8
    • None
    • 3

    Description

      One of our Lustre clients (this one is acting as an NFS gateway) was evicted from an OST and seems to be stuck in that state; it never recovers. A reboot seems to be required to get it back into operation.

      We were getting these out of the kernel log:

      kernel: [ 7226.864597] LustreError: 11-0: lustre-OST0090-osc-ffff88103aeb4000: operation ost_read to node 10.11.200.13@o2ib failed: rc = -107
      kernel: [ 7226.864606] Lustre: lustre-OST0090-osc-ffff88103aeb4000: Connection to lustre-OST0090 (at 10.11.200.13@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      kernel: [ 7226.864772] LustreError: 167-0: lustre-OST0090-osc-ffff88103aeb4000: This client was evicted by lustre-OST0090; in progress operations using this service will fail.

      kernel: [ 7226.877968] LustreError: 6866:0:(ldlm_resource.c:1101:ldlm_resource_complain()) lustre-OST0090-osc-ffff88103aeb4000: namespace resource [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount nonzero (1) after lock cleanup; forcing cleanup.
      kernel: [ 7226.877972] LustreError: 6866:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x2680000400:0x2b70b4a:0x0].0x0 (ffff88203c95f880) refcount = 2
      lstgwbal837 kernel: [ 7226.877973] LustreError: 6866:0:(ldlm_resource.c:1686:ldlm_resource_dump()) Granted locks (in reverse order):
      lstgwbal837 kernel: [ 7226.877978] LustreError: 6866:0:(ldlm_resource.c:1689:ldlm_resource_dump()) ### ### ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x526400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1

      kernel: [ 7460.227838] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) extent ffff8807ba25aa98@{[6 -> 9/1023], [3|1|-|active|wiuY|ffff8816a6279180], [40960|4|+|-|ffff88100e7a1c00|1024| (null)]} lustre-OST0090-osc-ffff88103aeb4000: wait ext to 0 timedout, recovery in progress?
      kernel: [ 7460.227846] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) ### extent: ffff8807ba25aa98 ns: lustre-OST0090-osc-ffff88103aeb4000 lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW res: [0x2680000400:0x2b70b4a:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 36864->40959) flags: 0x426400020000 nid: local remote: 0xc69e1cc8fc9b178e expref: -99 pid: 5106 timeout: 0 lvb_type: 1
      kernel: [ 7460.227848] LustreError: 5106:0:(osc_cache.c:952:osc_extent_wait()) Skipped 1 previous similar message
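For anyone triaging similar logs, the first messages above can be summarized with a small script. This is only a sketch: the function name `summarize` and the two regular expressions are illustrative assumptions, not part of any Lustre tooling, and the sample text is copied from the log lines quoted in this report. On Linux, errno 107 is ENOTCONN, so `rc = -107` means the transport endpoint was not connected.

```python
import errno
import re

# Two lines copied verbatim from the kernel log quoted in this report.
LOG = """\
kernel: [ 7226.864597] LustreError: 11-0: lustre-OST0090-osc-ffff88103aeb4000: operation ost_read to node 10.11.200.13@o2ib failed: rc = -107
kernel: [ 7226.864772] LustreError: 167-0: lustre-OST0090-osc-ffff88103aeb4000: This client was evicted by lustre-OST0090; in progress operations using this service will fail.
"""

# Hypothetical patterns, written only to match the message wording shown above.
RC_RE = re.compile(r"operation (\w+) to node (\S+) failed: rc = (-?\d+)")
EVICT_RE = re.compile(r"This client was evicted by ([^;\s]+);")


def summarize(log_text):
    """Return a list of (kind, ...) tuples for failed RPCs and evictions."""
    events = []
    for line in log_text.splitlines():
        m = RC_RE.search(line)
        if m:
            rc = int(m.group(3))
            # Map the negated return code to a symbolic errno name.
            # The numbering is platform specific; on Linux 107 is ENOTCONN.
            name = errno.errorcode.get(-rc, "unknown")
            events.append(("rpc_failed", m.group(1), m.group(2), rc, name))
            continue
        m = EVICT_RE.search(line)
        if m:
            events.append(("evicted", m.group(1)))
    return events


if __name__ == "__main__":
    for event in summarize(LOG):
        print(event)
```

The sequence it reports (a failed `ost_read`, then an eviction by lustre-OST0090) matches the order of the timestamps in the log.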

       

      We aren't sure what's going on here, but it looks to us as though, after being evicted, the client tried to clean up its locks and failed to clean one of them up, and that this is preventing it from attempting recovery. Is that what is happening?
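One piece of evidence for this reading is that the lock dumped during cleanup at 7226 s and the lock the extent wait is blocked on at 7460 s appear to be the same object. The sketch below compares the two dumps textually. The strings are abridged from the log lines above (fields between `mode:` and `flags:` are omitted), and `parse_lock` is a hypothetical helper, not a Lustre utility. It does not try to interpret what the differing flag bit means.

```python
import re

# Abridged from the ldlm_resource_dump() message at 7226.877978.
DUMP_AT_EVICTION = (
    "lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW "
    "flags: 0x526400020000 pid: 5106"
)
# Abridged from the osc_extent_wait() message at 7460.227846.
DUMP_AT_TIMEOUT = (
    "lock: ffff88100e7a1c00/0x80fbdb776558f43 lrc: 4/0,1 mode: PW/PW "
    "flags: 0x426400020000 pid: 5106"
)

LOCK_RE = re.compile(
    r"lock: (?P<addr>[0-9a-f]+)/(?P<handle>0x[0-9a-f]+)"
    r".*?flags: (?P<flags>0x[0-9a-f]+)"
    r".*?pid: (?P<pid>\d+)"
)


def parse_lock(text):
    """Extract address, handle, flags and pid from one lock dump string."""
    m = LOCK_RE.search(text)
    if m is None:
        raise ValueError("no lock dump found in text")
    return {
        "addr": m.group("addr"),
        "handle": m.group("handle"),
        "flags": int(m.group("flags"), 16),
        "pid": int(m.group("pid")),
    }


before = parse_lock(DUMP_AT_EVICTION)
after = parse_lock(DUMP_AT_TIMEOUT)

# Same address, handle and pid in both dumps means the same lock is involved.
same_lock = all(before[k] == after[k] for k in ("addr", "handle", "pid"))
# XOR isolates the flag bits that changed between the two dumps.
changed_bits = before["flags"] ^ after["flags"]

if __name__ == "__main__":
    print("same lock:", same_lock)
    print("changed flag bits:", hex(changed_bits))
```

Both dumps name lock ffff88100e7a1c00 with pid 5106, and that address also appears inside the extent description, so the extent that times out is tied to the lock that survived cleanup.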

      We are currently running the 2.10.8 client and server.  Any help would be appreciated!

      Thanks!

          People

            wc-triage WC Triage
            mcmult Tim McMullan