During failover testing, we are occasionally (about 30% of the time) seeing ENOENT errors on the client. The test is performed by untarring the kernel source on the clients and then powering off an OSS, whose OSTs then fail over to the partner OSS. After the state of the OST being written to goes from RECOVERING to COMPLETE, the client returns ENOENT.
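For reference, the test sequence is roughly the following (a sketch; the tarball name, mount point, and BMC address/credentials are placeholders for our setup):

    # on the client: generate sustained writes by untarring the kernel source
    tar xf linux-kernel-source.tar.bz2 -C /mnt/es01/untar-test &

    # hard power-off one OSS (here via ipmitool; host and credentials are placeholders)
    ipmitool -H oss-bmc -U admin -P secret chassis power off

    # on the partner OSS: watch the failed-over OSTs go RECOVERING -> COMPLETE,
    # then check whether the untar on the client hit ENOENT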
There appear to be some similarities with LU-463, although this case also features a client soft lockup. Here are the server and client logs:
EPAM-ES-Server03
----------------
es01a-OST0000/recovery_status:status: COMPLETE
es01a-OST0001/recovery_status:status: COMPLETE
es01a-OST0002/recovery_status:status: RECOVERING
es01a-OST0003/recovery_status:status: RECOVERING

es01a-OST0000/recovery_status:status: COMPLETE
es01a-OST0001/recovery_status:status: COMPLETE
es01a-OST0002/recovery_status:status: RECOVERING
es01a-OST0003/recovery_status:status: COMPLETE
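The two blocks above are successive samples of the per-OST recovery_status files on the OSS, collected with something like the following (the proc path is the standard obdfilter location):

    cd /proc/fs/lustre/obdfilter && grep '^status:' */recovery_status

The first sample was taken while recovery was still running, the second after OST0003 had completed.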
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) RECOVERY: service es01a-OST0003, 2 recoverable clients, last_transno 11314055
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: 13393:0:(ldlm_lib.c:1973:target_recovery_init()) Skipped 1 previous similar message
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Now serving es01a-OST0003 on /dev/mapper/ost_es01a_3 with recovery enabled
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: Will be in recovery for at least 5:00, or until 2 clients reconnect
Jun 28 09:02:48 EPAM-ES-Server03 kernel: Lustre: Skipped 1 previous similar message
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): recovery complete
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts:
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): warning: maximal mount count reached, running e2fsck is recommended
Jun 28 09:02:48 EPAM-ES-Server03 kernel: LDISKFS-fs (dm-9): mounted filesystem with ordered data mode. Opts:
Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10205:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0003: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88018fcff800 cur 1340899376 last 1340899368
Jun 28 09:02:56 EPAM-ES-Server03 kernel: Lustre: 10118:0:(ldlm_lib.c:933:target_handle_connect()) es01a-OST0002: connection from es01a-MDT0000-mdtlov_UUID@192.168.3.10@o2ib recovering/t0 exp ffff88019fca8400 cur 1340899376 last 1340899368
Jun 28 09:03:09 EPAM-ES-Server03 kernel: LustreError: 13394:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 522895: rc -2
Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: sending delayed replies to recovered clients
Jun 28 09:03:46 EPAM-ES-Server03 kernel: Lustre: es01a-OST0003: received MDS connection from 192.168.3.10@o2ib
Jun 28 09:03:46 EPAM-ES-Server03 kernel: LustreError: 13448:0:(ldlm_resource.c:1088:ldlm_resource_get()) lvbo_init failed for resource 1047468: rc -2
Note that the lvbo_init failures above return rc -2, i.e. -ENOENT, which matches the error seen on the client.

EPAM-ES-Client01
----------------
Jun 28 09:03:46 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message
Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: es01a-OST0002-osc-ffff88018a031800: Connection to es01a-OST0002 (at 192.168.3.12@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jun 28 09:06:19 EPAM-ES-Client01 kernel: Lustre: Skipped 1 previous similar message
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5921:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5796:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Skipped 34 previous similar messages
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jun 28 09:07:07 EPAM-ES-Client01 kernel: LustreError: 5795:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) Skipped 33 previous similar messages
Jun 28 09:08:20 EPAM-ES-Client01 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd-rcv:4203]
Jun 28 09:08:20 EPAM-ES-Client01 kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) autofs4 ib_uverbs(U) ib_srp(U) scsi_transport_srp scsi_tgt sunrpc ko2iblnd(U) rdma_cm(U) iw_cm(U) ib_addr(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_umad(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 mlx4_en(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) ipmi_si ipmi_devintf ipmi_msghandler power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support sg i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_tgt]
Are there any debug settings we could run with to aid debugging?
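For example, we could run something like the following on the clients (a sketch assuming the standard lctl debug interface; the flag selection and buffer size are our guesses at what is relevant here):

    # enlarge the kernel debug buffer so the trace survives until we dump it
    lctl set_param debug_mb=512

    # add lock and RPC tracing to the current debug mask
    lctl set_param debug="+dlmtrace +rpctrace +ha"

    # after reproducing the ENOENT, dump the debug buffer to a file
    lctl dk /tmp/lustre-debug.log

Any advice on which flags would be most useful for this case would be appreciated.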