Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1694

ENOENT error encountered during HA tests

    XMLWordPrintable

Details

    • Bug
    • Resolution: Incomplete
    • Minor
    • None
    • Lustre 2.1.2
    • None
    • RHEL 6.2
    • 3
    • 10335

    Description

      During failover testing, we are occasionally (about 30% of the time) seeing ENOENT errors on the client. The way the test is performed is by untarring the kernel source on clients and then powering off an OSS. The OSTs failover to the partner OSS. After the state of OST that is being written to goes from RECOVERING to COMPLETE, the client then returns the ENOENT.

      There appear to be some similarities with LU-463, although this also features a client soft lockup. Here are the server and client logs:

      EPAM-ES-Server03
      ----------------
      es01a-OST0000/recovery_status:status: COMPLETE
      es01a-OST0001/recovery_status:status: COMPLETE
      es01a-OST0002/recovery_status:status: RECOVERING
      es01a-OST0003/recovery_status:status: RECOVERING

      es01a-OST0000/recovery_status:status: COMPLETE
      es01a-OST0001/recovery_status:status: COMPLETE
      es01a-OST0002/recovery_status:status: RECOVERING
      es01a-OST0003/recovery_status:status: COMPLETE
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) R
      ECOVERY: service es01a-OST0003, 2 recoverable clients, last_transno 11314055
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) S
      kipped 1 previous similar message
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Now serving es01a-OST0003 on /dev/m
      apper/ost_es01a_3 with recovery enabled
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Will be in recovery for at least 5:
      00, or until 2 clients reconnect
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): recovery complete
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mo
      de. Opts:
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): warning: maximal mount count reached, r
      unning e2fsck is recommended
      Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts:
      Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10205:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0003: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88018fcff800 cur 1340899376 last 1340899368
      Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10118:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0002: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88019fca8400 cur 1340899376 last 1340899368
      Jun 28 09:03:09 EPAM-ES-Server03 kernel: LustreError: 13394:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 522895: rc -2
      Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: sending delayed replies to recovered clients
      Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: received MDS connection from 192.168.3.10@o2ib
      Jun 28 09:03:46 EPAM-ES-Server03 kernel: LustreError: 13448:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 1047468: rc -2

      On client
      Jun 28 09:03:46 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message
      Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: es01a-OST0002-osc-ffff88018a031800: Connection to
      es01a-OST0002 (at 192.168.3.12@o2ib) was lost; in progress operations using this service will wait
      for recovery to complete
      Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1172:ldlm_cli_cancel_r
      eq()) Got rc -11 from cancel RPC: canceling anyway
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_l
      ist()) ldlm_cli_cancel_list: -11
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5921:0:(ldlm_request.c:1172:ldlm_cli_cancel_r
      eq()) Got rc -11 from cancel RPC: canceling anyway
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_l
      ist()) Skipped 1 previous similar message
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_l
      ist()) ldlm_cli_cancel_list: -11
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_r
      eq()) Got rc -11 from cancel RPC: canceling anyway
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_r
      eq()) Skipped 34 previous similar messages
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_l
      ist()) ldlm_cli_cancel_list: -11
      Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_l
      ist()) Skipped 33 previous similar messages
      Jun 28 09:08:20 EPAM-ES-Client01 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd-rcv:4203]
      Jun 28 09:08:20 EPAM-ES-Client01 kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) l
      quota(U) mdc(U) fid(U) fld(U) ptlrpc(U) autofs4 ib_uverbs(U) ib_srp(U) scsi_transport_srp scsi_tgt
      sunrpc ko2iblnd(U) rdma_cm(U) iw_cm(U) ib_addr(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_umad(U)
      ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 mlx4_en(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) ipmi_si
      ipmi_devintf ipmi_msghandler power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support sg
      i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic
      ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last un
      loaded: scsi_tgt]

      Are there any debug settings we could run with to help aid debugging?

      Attachments

        Activity

          People

            hongchao.zhang Hongchao Zhang
            ihara Shuichi Ihara (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: