LU-12953: LNet timeouts with restarted Lustre production file system

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.3
    • Fix Version/s: Lustre 2.12.3
    • Environment: Lustre OSS server running ZFS
    • Severity: 3

    Description

      When restarting our production Lustre file system, we encountered the following bug:

      [407608.498637] LNetError: 72335:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [407608.509681] LNetError: 72335:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.10.32.102@o2ib2 (5): c: 3, oc: 0, rc: 7
      [407608.526089] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8ca33db8a800
      [407608.537667] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff8ca33db8a800
      [407608.549244] LustreError: 167072:0:(ldlm_lib.c:3259:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8cabcfcba850 x1648066684855104/t0(0) o4->8d9c48a5-020d-9844-4aa4-57225c35d4e2@10.10.32.102@o2ib2:135/0 lens 608/448 e 0 to 0 dl 1573227610 ref 1 fl Interpret:/0/0 rc 0/0
      [407608.576219] Lustre: f2-OST001d: Bulk IO write error with 8d9c48a5-020d-9844-4aa4-57225c35d4e2 (at 10.10.32.102@o2ib2), client will retry: rc = -110
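
      A note on the error codes, since they tie these messages together: status -103 in the server_bulk_callback() lines is -ECONNABORTED (the bulk descriptors were aborted when the RDMA connection was torn down), and rc = -110 in the Bulk IO write error is -ETIMEDOUT, which is why the client retries. The "Timed out tx: active_txs, 0 seconds" message comes from kiblnd_check_txs_locked() finding a transmit whose deadline has passed on the connection's active list. The snippet below is a simplified userspace sketch of that deadline check; the struct layout and helper names are assumptions for illustration, not the actual o2iblnd code.

      /*
       * Simplified sketch of the deadline check behind
       * "Timed out tx: active_txs, 0 seconds".  All names here are
       * illustrative; the real kiblnd_check_txs_locked() in
       * o2iblnd_cb.c walks the connection's tx lists under a spinlock.
       */
      #include <stdio.h>
      #include <time.h>

      struct tx {
              time_t deadline;        /* absolute deadline for this transmit */
              const char *queue;      /* list the tx sits on, e.g. "active_txs" */
      };

      /* Report and return 1 if the tx has passed its deadline. */
      static int tx_timed_out(const struct tx *tx, time_t now)
      {
              if (now < tx->deadline)
                      return 0;
              /* "0 seconds" means the deadline expired under a second ago. */
              printf("Timed out tx: %s, %ld seconds\n",
                     tx->queue, (long)(now - tx->deadline));
              return 1;
      }

      int main(void)
      {
              time_t now = time(NULL);
              struct tx tx = { .deadline = now, .queue = "active_txs" };

              tx_timed_out(&tx, now);     /* deadline just hit: "0 seconds" */
              return 0;
      }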

      Eventually we ended up seeing stack traces like the following:

      [423015.676012] [<ffffffff98d5d28b>] queued_spin_lock_slowpath+0xb/0xf
      [423015.676017] [<ffffffff98d6b760>] _raw_spin_lock+0x20/0x30
      [423015.676026] [<ffffffffc19ddf39>] ofd_intent_policy+0x1d9/0x920 [ofd]
      [423015.676070] [<ffffffffc161dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [423015.676080] [<ffffffffc12f4033>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
      [423015.676085] [<ffffffffc12f77be>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [423015.676107] [<ffffffffc1646587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [423015.676130] [<ffffffffc166e6d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
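
      The queued_spin_lock_slowpath frame under ofd_intent_policy() suggests many OST service threads contending for a single spinlock in the intent/glimpse path, so CPUs end up spinning in the lock slowpath instead of making progress. Below is a minimal userspace illustration of that pattern, built on plain pthread spinlocks; it is nothing Lustre-specific, and all names in it are hypothetical.

      /* Build with: cc -O2 spin_demo.c -o spin_demo -lpthread */
      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 8
      #define NITERS   1000000

      static pthread_spinlock_t lock;
      static long counter;

      /* Every worker hammers the same spinlock; with enough threads,
       * most CPU time goes to spinning for the lock rather than to
       * the (tiny) critical section, mirroring the pile-up the stack
       * trace shows on the OSS. */
      static void *worker(void *arg)
      {
              (void)arg;
              for (int i = 0; i < NITERS; i++) {
                      pthread_spin_lock(&lock);
                      counter++;
                      pthread_spin_unlock(&lock);
              }
              return NULL;
      }

      int main(void)
      {
              pthread_t threads[NTHREADS];

              pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
              for (int i = 0; i < NTHREADS; i++)
                      pthread_create(&threads[i], NULL, worker, NULL);
              for (int i = 0; i < NTHREADS; i++)
                      pthread_join(threads[i], NULL);
              printf("counter = %ld\n", counter);
              return 0;
      }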

      This looks similar to the issues reported by NASA, but we are filing it here just to make sure.


    People

      Assignee: Amir Shehata (ashehata) (Inactive)
      Reporter: James A Simmons (simmonsja)
      Votes: 0
      Watchers: 6

    Dates

      Created:
      Updated:
      Resolved: