Lustre / LU-649

client receives reply to lock cancel yet server evicts it 100 seconds later

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.2.0, Lustre 1.8.8
    • None
    • None
    • Environment: lustre 1.8.5
    • Severity: 3
    • Bugzilla: 24,531
    • 5617

    Description

      NASA Ames is testing a cluster in preparation for upgrades.

      "These errors are preventing us from completing a clean migration of the
      filesystem, and we're not going to put it into production until they're
      eliminated. We're planning to move filesystems on six more DDN 9900s to SFA
      hardware and we'd like to complete the process as soon as possible."

      Right now they are running a SLES10 SP3 variant with Lustre 1.8.2 plus patches
      on the client, and CentOS 5.5 with Lustre 1.8.5 plus patches on the server.
      I'll add the patch list when I get it.

      client: kernel 2.6.16.60-0.68.1.20100916-nasa, lustre-1.8.2-4nas_ofed151
      server: kernel 2.6.18-194.3.1.el5.20110322lustre185, lustre-1.8.5-4nas_ofed152

      The general problem is that they are seeing a lot of client evictions for no
      obvious reason.

      For this bug in particular, here is the customer's description:

      "client:

      Jun 21 11:04:36 hw-burly1 kernel: LustreError: 11-0:
      nbp1-OST004f-osc-ffff810c48fc7000: An error occurred while communicating with
      NID 10.151.26.34@o2ib; the ost_statfs operation failed with -107
      Jun 21 11:04:36 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
      Connection to nbp1-OST004f (at 10.151.26.34@o2ib) was lost; in progress
      operations using this service will wait for recovery to complete
      Jun 21 11:04:36 hw-burly1 kernel: LustreError: 167-0: This client was evicted
      by nbp1-OST004f; in progress operations using this service will fail.
      Jun 21 11:04:37 hw-burly1 kernel: LustreError:
      27834:0:(ldlm_resource.c:518:ldlm_namespace_cleanup()) Namespace
      nbp1-OST004f-osc-ffff810c48fc7000 resource refcount nonzero (2) after lock
      cleanup; forcing cleanup.
      Jun 21 11:04:37 hw-burly1 kernel: LustreError:
      330:0:(llite_lib.c:1752:ll_setattr_raw()) obd_setattr_async fails: rc=-5
      Jun 21 11:04:37 hw-burly1 kernel: LustreError:
      330:0:(llite_lib.c:1761:ll_setattr_raw()) extent unlock failed: err=-5
      Jun 21 11:04:37 hw-burly1 kernel: LustreError:
      27834:0:(ldlm_resource.c:523:ldlm_namespace_cleanup()) Resource:
      ffff81092f055500 (233731/0/0/0) (rc: 1)
      Jun 21 11:04:38 hw-burly1 kernel: LustreError:
      27834:0:(import.c:1279:ptlrpc_invalidate_import_thread()) dump the log upon
      eviction
      Jun 21 11:04:38 hw-burly1 kernel: LustreError: dumping log to
      /var/dumps/lustre-log.1308679478.27834
      Jun 21 11:04:38 hw-burly1 kernel: Lustre: nbp1-OST004f-osc-ffff810c48fc7000:
      Connection restored to nbp1-OST004f (at 10.151.26.34@o2ib)
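
      The rc values in these console messages are negated Linux errno codes. A
      throwaway Python snippet (for reference only, not Lustre code) to translate
      the two seen above:

      import errno
      import os

      def decode_rc(rc):
          # Map a negative kernel return code to its errno name and message.
          return "%d (%s: %s)" % (rc, errno.errorcode.get(-rc, "?"), os.strerror(-rc))

      for rc in (-107, -5):
          print(decode_rc(rc))
      # -107 -> ENOTCONN "Transport endpoint is not connected": the import to the
      #         OST was already disconnected when ost_statfs failed
      #   -5 -> EIO "Input/output error": what ll_setattr_raw saw once the eviction
      #         failed the in-progress operations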

      server:

      Jun 21 11:04:36 service178 kernel: LustreError:
      0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired
      after 100s: evicting client at 10.151.61.171@o2ib ns: filter-nbp1-OST004f_UUID
      lock: ffff810a03cb3a00/0x50a73ed8006fda0b lrc: 3/0,0 mode: PW/PW res: 233731/0
      rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x20 remote:
      0x4cdfae7ceeea95bc expref: 9364 pid: 13239 timeout 5323842234
      Jun 21 11:04:36 service178 kernel: LustreError: dumping log to
      /var/dumps/lustre-log.1308679476.13105

      I have debug logs with 'ioctl neterror net warning other dlmtrace error emerg
      ha rpctrace config console' enabled for both the client and server, which I'll
      upload once I have a case number. Unfortunately ntp wasn't running on the
      server at the time, so the times are a bit off. I have logs from several other
      events as well, though I have yet to examine them to see if the pattern is the
      same in every case.

      Here's a quick rundown of the events for this particular eviction:

      • The client sends a PW lock enqueue request at 1308679375.111977 in
        x1371481946874614, which the server grants.
      • At 1308679375.164893 it wants to upgrade the lock to a PW, in
        x1371481946874670. Since it conflicts with the existing lock, the server sends
        back a blocking AST at 1308679375.958555 in x1371171161806020, which the client
        responds to.
      • The client sends the lock cancellation at 1308679375.165789
        x1371481946874671. However, there's no record of the server receiving it.
        Oddly, the client seems to think it got a response:

      00000100:00000200:21:1308679375.166321:0:9771:0:(client.c:925:ptlrpc_check_reply())
        @@@ REPLIED: rc = 1 for req@ffff811f4080dc00 x1371481946874671/t0
        o103->nbp1-OST004f_UUID@10.151.26.34@o2ib:17/18 lens 296/384 e 0 to 1
        dl 1308679413 ref 1 fl Rpc:RN/0/0 rc 0/0
      00000100:00100000:21:1308679375.166329:0:9771:0:(client.c:2171:ptlrpc_queue_wait())
        Completed RPC pname:cluuid:pid:xid:nid:opc
        ldlm_bl_00:1ae3c145-b2f5-384f-a14f-5af93cbbb774:9771:x1371481946874671:10.151.26.34@o2ib:103

      (or I'm misinterpreting the message).
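
      To decode that line: opcode 103 is LDLM_CANCEL, and, as far as I can tell from
      the DEBUG_REQ flag letters, the 'R' in 'fl Rpc:RN' is the rq_replied bit, which
      is exactly what makes it look like the cancel got a reply. A quick Python
      sketch (not Lustre code; the field meanings are my reading of the format) that
      pulls out the interesting fields:

      import re

      # Field layout taken from the DEBUG_REQ line above; the comments are my
      # interpretation of each field, not an authoritative decoding.
      REQ_RE = re.compile(
          r"x(?P<xid>\d+)/t(?P<transno>\d+) "
          r"o(?P<opc>\d+)->(?P<target>\S+) "
          r"lens (?P<reqlen>\d+)/(?P<replen>\d+) "
          r"e \d+ to \d+ dl (?P<deadline>\d+) "
          r"ref \d+ fl (?P<flags>\S+)")

      LDLM_OPS = {101: "LDLM_ENQUEUE", 102: "LDLM_CONVERT", 103: "LDLM_CANCEL",
                  104: "LDLM_BL_CALLBACK", 105: "LDLM_CP_CALLBACK"}

      line = ("@@@ REPLIED: rc = 1 for req@ffff811f4080dc00 x1371481946874671/t0 "
              "o103->nbp1-OST004f_UUID@10.151.26.34@o2ib:17/18 lens 296/384 "
              "e 0 to 1 dl 1308679413 ref 1 fl Rpc:RN/0/0 rc 0/0")
      sent = 1308679375                      # timestamp in the debug-record prefix

      m = REQ_RE.search(line)
      print("opcode  :", LDLM_OPS.get(int(m["opc"])))                # LDLM_CANCEL
      print("xid     :", m["xid"])                                   # 1371481946874671
      print("deadline: %ds after send" % (int(m["deadline"]) - sent))  # 38s window
      print("flags   :", m["flags"])         # 'R' = replied, 'N' = no_resend
                                             # (my reading of the flag letters)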

      Since the server never received the lock cancellation for the blocking AST it
      issued, it evicts the client.
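
      For reference, the server-side rule at work is roughly this: once a blocking
      AST has been sent for a granted lock, the lock sits on the server's
      waiting-locks list, and if no cancel has been processed for it by the time
      the timer (100 seconds here) fires, waiting_locks_callback() evicts the
      client. A minimal Python model of that rule, illustrative only and obviously
      not the kernel code:

      import time

      LOCK_CALLBACK_TIMEOUT = 100     # seconds, per the server log above

      waiting_locks = []              # locks with an outstanding blocking AST

      class WaitingLock:
          def __init__(self, client_nid, resource):
              self.client_nid = client_nid
              self.resource = resource
              self.deadline = time.time() + LOCK_CALLBACK_TIMEOUT

      def blocking_ast_sent(lock):
          # Server sent a blocking AST: start waiting for the client's cancel.
          waiting_locks.append(lock)

      def cancel_received(lock):
          # Cancel processed in time: lock leaves the waiting list, no eviction.
          waiting_locks.remove(lock)

      def waiting_locks_callback(now=None):
          # Timer body: any lock still waiting past its deadline evicts its client.
          now = time.time() if now is None else now
          for lock in list(waiting_locks):
              if now >= lock.deadline:
                  print("lock callback timer expired after %ds: evicting client at %s"
                        % (LOCK_CALLBACK_TIMEOUT, lock.client_nid))
                  waiting_locks.remove(lock)

      # Example: blocking AST sent, cancel never processed, timer fires late.
      lock = WaitingLock("10.151.61.171@o2ib", "233731")
      blocking_ast_sent(lock)
      waiting_locks_callback(now=lock.deadline + 1)

      # In this incident the cancel (x1371481946874671) apparently never got
      # processed on the server, so cancel_received() never ran for res 233731 and
      # the timer path won, evicting the client.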

      I checked the ib fabric and HCAs on each end, and didn't find anything
      obviously wrong. I didn't catch any network-related errors in the Lustre logs.

      I upped at_min from the default to 45, then 150, in case the LNet messages were
      taking their time getting to the server. That didn't help.
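
      As far as I understand adaptive timeouts, at_min only raises the floor of the
      AT-derived RPC timeout on the node where it is set; the 100-second lock
      callback timer in the server log is computed from the server's own lock
      timeout settings, which may be why bumping at_min made no difference. A rough
      sketch of the clamping (approximating at_est2timeout(); not the actual Lustre
      code):

      AT_MAX = 600                    # default adaptive-timeout ceiling

      def rpc_timeout(service_estimate, at_min=0, at_max=AT_MAX):
          # Approximation of the AT timeout: estimate + 25% + 5s, then clamped
          # into the [at_min, at_max] range.
          timeout = service_estimate + service_estimate // 4 + 5
          return min(max(timeout, at_min), at_max)

      for at_min in (0, 45, 150):
          print("at_min=%3d -> %3ds RPC timeout for a 10s service estimate"
                % (at_min, rpc_timeout(10, at_min)))
      # at_min=  0 ->  17s, at_min= 45 ->  45s, at_min=150 -> 150s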

      I upgraded the client to 1.8.5-4nas about an hour ago, we'll see if that helps.
      If it doesn't, I'm going to consider downgrading the server to a 1.8.4-based
      build, which we've been running on a production filesystem for a while.

      Since the filesystem isn't in production, I have quite a bit of flexibility in
      terms of being able to apply patches - unlike the rest of our filesystems."

      I looked over this as well, and all I've come up with so far is that perhaps
      attachment 31479 from bug 23352 might be of some use.

      I'll attach the aforementioned lustre debug logs.

      ===see bugzilla 24531 for logs===


    People

      Assignee: Lai Siyao (laisiyao)
      Reporter: James Karellas (qm137) (Inactive)
      Votes: 0
      Watchers: 8
