[LU-2822] softlockups, evictions during recovery-scale Created: 15/Feb/13 Updated: 29/Sep/15 Resolved: 29/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | CHAOS4/RHEL5, old Hyperion system |
| Severity: | 3 |
| Rank (Obsolete): | 6828 |
| Description |
|
Running the mds-recovery-scale test; not really sure what is going on. The test completes one failover, then activity ceases. The clients appear to be hitting soft lockups, pdflush is blocked, and this is preceded/followed by evictions from the OSTs.
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-MDT0000-mdc-ffff8101bf048400: tried all connections, increasing latency to 8s
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 16 previous similar messages
Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426985036308729 sent from lustre-OST0006-osc-ffff8101bf048400 to NID 192.168.120.86@o2ib 12s ago has timed out (12s prior to deadline).
Feb 15 09:59:11 ehyperion607 req@ffff8101a84e8400 x1426985036308729/t0 o8->lustre-OST0006_UUID@192.168.120.86@o2ib:28/4 lens 368/584 e 0 to 1 dl 1360951115 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
Feb 15 09:59:11 ehyperion607 BUG: soft lockup - CPU#0 stuck for 60s! [events/0:50]
Feb 15 09:59:11 ehyperion607 CPU 0:
Feb 15 09:59:11 ehyperion607 Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad mlx4_ib mlx4_core ipoib_helper ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev shpchp ahci i7core_edac edac_mc libata i2c_i801 pcspkr i2c_core scsi_mod uhci_hcd tpm_tis tpm tpm_bios nfs nfs_acl lockd sunrpc e1000e
Feb 15 09:59:11 ehyperion607 Pid: 50, comm: events/0 Tainted: G ---- 2.6.18-348.1.1.el5 #1
Feb 15 09:59:11 ehyperion607 RIP: 0010:[<ffffffff80077925>] [<ffffffff80077925>] __smp_call_function_many+0x96/0xbc
Feb 15 09:59:11 ehyperion607 RSP: 0018:ffff8101bf953d40 EFLAGS: 00000297
Feb 15 09:59:11 ehyperion607 RAX: 0000000000000006 RBX: 0000000000000007 RCX: 0000000000000000
Feb 15 09:59:11 ehyperion607 RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
Feb 15 09:59:11 ehyperion607 RBP: ffff8101bf953e10 R08: 0000000000000008 R09: 0000000000000038
Feb 15 09:59:11 ehyperion607 R10: ffff8101bf953ce0 R11: ffff8101c571c500 R12: 00000002ffffffff
Feb 15 09:59:11 ehyperion607 R13: ffffffff8008dedf R14: 00000001ffffffff R15: ffffffff80157cde
Feb 15 09:59:11 ehyperion607 FS: 0000000000000000(0000) GS:ffffffff80435000(0000) knlGS:0000000000000000
Feb 15 09:59:11 ehyperion607 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 15 09:59:11 ehyperion607 CR2: 00002aaaabb7c088 CR3: 00000001b73bb000 CR4: 00000000000006a0
Feb 15 09:59:11 ehyperion607
Feb 15 09:59:11 ehyperion607 Call Trace:
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80077a27>] smp_call_function_many+0x38/0x4c
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80077b18>] smp_call_function+0x4e/0x5e
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80073020>] mcheck_timer+0x0/0x6c
Feb 15 09:59:11 ehyperion607 [<ffffffff80096be9>] on_each_cpu+0x10/0x22
Feb 15 09:59:11 ehyperion607 [<ffffffff8007303c>] mcheck_timer+0x1c/0x6c
Feb 15 09:59:11 ehyperion607 [<ffffffff8004d8b9>] run_workqueue+0x9e/0xfb
Feb 15 09:59:11 ehyperion607 [<ffffffff8004a102>] worker_thread+0x0/0x122
Feb 15 09:59:11 ehyperion607 [<ffffffff8004a1f2>] worker_thread+0xf0/0x122
Feb 15 09:59:11 ehyperion607 [<ffffffff8008f38d>] default_wake_function+0x0/0xe
Feb 15 09:59:11 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
Feb 15 09:59:11 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
Feb 15 09:59:11 ehyperion607
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-OST0006-osc-ffff8101bf048400: tried all connections, increasing latency to 9s
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 25 previous similar messages
Feb 15 09:59:54 ehyperion607 INFO: task pdflush:543 blocked for more than 120 seconds.
Feb 15 09:59:54 ehyperion607 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 09:59:54 ehyperion607 pdflush D ffff8101c546a7a0 0 543 171 544 542 (L-TLB)
Feb 15 09:59:54 ehyperion607 ffff81033f57fb30 0000000000000046 ffff81033f57fa80 0000000000000000
Feb 15 09:59:54 ehyperion607 ffffffff88718220 000000000000000a ffff8101bf566820 ffff8101c56de7a0
Feb 15 09:59:54 ehyperion607 00009ea40c4605ce 00000000001a7575 ffff8101bf566a08 0000000705c2b6a0
Feb 15 09:59:54 ehyperion607 Call Trace:
Feb 15 09:59:54 ehyperion607 [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
Feb 15 09:59:54 ehyperion607 [<ffffffff80029173>] sync_page+0x0/0x43
Feb 15 09:59:54 ehyperion607 [<ffffffff800637de>] io_schedule+0x3f/0x67
Feb 15 09:59:54 ehyperion607 [<ffffffff800291b1>] sync_page+0x3e/0x43
Feb 15 09:59:54 ehyperion607 [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
Feb 15 09:59:54 ehyperion607 [<ffffffff8003ff85>] __lock_page+0x5e/0x64
Feb 15 09:59:54 ehyperion607 [<ffffffff800a3c31>] wake_bit_function+0x0/0x23
Feb 15 09:59:54 ehyperion607 [<ffffffff8001d54e>] mpage_writepages+0x14f/0x37d
Feb 15 09:59:54 ehyperion607 [<ffffffff88846c30>] :lustre:ll_writepage_26+0x0/0x10
Feb 15 09:59:54 ehyperion607 [<ffffffff8005aee2>] do_writepages+0x20/0x2f
Feb 15 09:59:54 ehyperion607 [<ffffffff8002ff0f>] __writeback_single_inode+0x1a2/0x31c
Feb 15 09:59:54 ehyperion607 [<ffffffff8002159a>] sync_sb_inodes+0x1b7/0x271
Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
Feb 15 09:59:54 ehyperion607 [<ffffffff8005130b>] writeback_inodes+0x82/0xd8
Feb 15 09:59:54 ehyperion607 [<ffffffff800cd802>] wb_kupdate+0xf0/0x16a
Feb 15 09:59:54 ehyperion607 [<ffffffff800568b2>] pdflush+0x0/0x1fb
Feb 15 09:59:54 ehyperion607 [<ffffffff80056a03>] pdflush+0x151/0x1fb
Feb 15 09:59:54 ehyperion607 [<ffffffff800cd712>] wb_kupdate+0x0/0x16a
Feb 15 09:59:54 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
Feb 15 09:59:54 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
Feb 15 09:59:54 ehyperion607
|
| Comments |
| Comment by Peter Jones [ 15/Feb/13 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Oleg Drokin [ 15/Feb/13 ] |
|
So, the pdflush blocking is expected - the logs show disconnection from the OST, so there is no way to flush those dirty pages until reconnect happens, and I assume it's not happening? Any further messages to clue us in on why? The soft lockup stack is strange; I don't believe we actually use the events API in Lustre at all, so I am not sure why it would lock up there. |
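To narrow this down when it recurs, it may help to sample the client-side import state and outstanding dirty data for the affected OSC/MDC imports while the node is stuck. This is only a sketch; the exact proc parameter names vary between Lustre releases, so treat them as assumptions to be verified on the 1.8.9 client:
    # Import/connection state per OSC (should return to FULL once reconnect succeeds)
    lctl get_param osc.*.ost_server_uuid osc.*.state
    # Dirty page cache still pinned on the client; should drain after reconnect
    lctl get_param osc.*.cur_dirty_bytes
    # Same for the MDC import that the "tried all connections" messages refer to
    lctl get_param mdc.*.mds_server_uuid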
| Comment by Jian Yu [ 16/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC1
The large-scale test 3a (failing over the MDS while running mdsrate on clients) also hung over the IB network.
The same test run by autotest passed over the TCP network: https://maloo.whamcloud.com/test_sets/ad2ba008-7683-11e2-bc2f-52540035b04c
The recovery-mds-scale test run by autotest also passed over the TCP network, with the MDS hard failed over 95 times in 24 hours: https://maloo.whamcloud.com/sub_tests/9caf800a-77dc-11e2-abae-52540035b04c |
| Comment by Niu Yawei (Inactive) [ 16/Feb/13 ] |
|
Cliff, are there any logs from the OSS and MDS? Thanks. |
| Comment by Jian Yu [ 18/Feb/13 ] |
|
Lustre Tag: v1_8_9_WC1_RC1
MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive)
\ /
1 combined MGS/MDT
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
6 OSTs
Client Nodes: fat-intel-2-ib,client-[1,3,6,17]-ib
I started a manual test run of recovery-mds-scale over the IB network with the above configuration on the Toro cluster (the number of clients is far smaller than on the Hyperion cluster). In the end, the MDS failed over 24 times in 6 hours without issue, and the OST failed over 1 time with hitting |
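For reference, a manual run like the one above is typically launched from lustre/tests on one of the clients. A minimal sketch, assuming the stock multi-client (ncli) config that the MDS log shows and hard failover; the variable names below are from the test framework and should be double-checked against the 1.8 branch:
    cd /usr/lib64/lustre/tests
    # 6-hour run, power-cycling ("hard") the MDS; pdsh is used to reach the other nodes
    NAME=ncli FAILURE_MODE=HARD DURATION=$((6*60*60)) PDSH="pdsh -S -w" \
        sh recovery-mds-scale.sh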
| Comment by Peter Jones [ 18/Feb/13 ] |
|
Could it be worth testing on the OpenSFS test cluster? That should give us more clients to play with (though still not quite as many as Hyperion) |
| Comment by Cliff White (Inactive) [ 18/Feb/13 ] |
|
MDS log:
Feb 15 09:56:52 hyperion-mds3 Lustre: MDT lustre-MDT0000 has stopped.
Feb 15 09:56:52 hyperion-mds3 LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (stopping)
Feb 15 09:56:52 hyperion-mds3 LustreError: Skipped 2 previous similar messages
Feb 15 09:56:52 hyperion-mds3 LustreError: 13540:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-19) req@ffff81023f2b2c50 x1426985035882333/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1360951112 ref 1 fl Interpret:/0/0 rc -19/0
Feb 15 09:56:52 hyperion-mds3 LustreError: 13540:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 1 previous similar message
Feb 15 09:56:52 hyperion-mds3 LustreError: 13549:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.64@o2ib
Feb 15 09:56:53 hyperion-mds3 LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
Feb 15 09:56:53 hyperion-mds3 LustreError: Skipped 2 previous similar messages
Feb 15 09:56:53 hyperion-mds3 Lustre: server umount lustre-MDT0000 complete
Feb 15 09:56:53 hyperion-mds3 mrshd[14785]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '");echo XXRETCODE:$?'
Feb 15 09:57:03 hyperion-mds3 mrshd[14795]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "hostname");echo XXRETCODE:$?'
Feb 15 09:57:03 hyperion-mds3 mrshd[14801]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "mkdir -p /p/mds; mount -t lustre -o user_xattr /dev/md1 /p/mds");echo XXRETCODE:$?'
Feb 15 09:57:04 hyperion-mds3 LDISKFS-fs (md1): mounted filesystem with ordered data mode
Feb 15 09:57:04 hyperion-mds3 JBD: barrier-based sync failed on md1-8 - disabling barriers
Feb 15 09:57:05 hyperion-mds3 LDISKFS-fs (md1): warning: maximal mount count reached, running e2fsck is recommended
Feb 15 09:57:06 hyperion-mds3 LDISKFS-fs (md1): mounted filesystem with ordered data mode
Feb 15 09:57:06 hyperion-mds3 Lustre: Enabling user_xattr
Feb 15 09:57:06 hyperion-mds3 Lustre: Skipped 1 previous similar message
Feb 15 09:57:06 hyperion-mds3 Lustre: Enabling user_xattr
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_fs.c:677:mds_init_server_data()) RECOVERY: service lustre-MDT0000, 70 recoverable clients, 0 delayed clients, last_transno 38654758668
Feb 15 09:57:06 hyperion-mds3 JBD: barrier-based sync failed on md1-8 - disabling barriers
Feb 15 09:57:06 hyperion-mds3 Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/md1 with recovery enabled
Feb 15 09:57:06 hyperion-mds3 Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 70 clients reconnect
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) Skipped 4 previous similar messages
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) MDS lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0000_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) Skipped 1 previous similar message
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0004_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) Skipped 3 previous similar messages
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) MDS lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0005_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) Skipped 4 previous similar messages
Feb 15 09:57:06 hyperion-mds3 mrshd[14929]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/tests:/opt/mdtest-1.7.4/bin:/opt/mib-1.9.10/bin:/opt/ior-2.10.2/bin:/opt/simul-1.15/bin:/opt/mvapich-gnu-gen2-1.2/bin:/admin/bin:/admin/sbin:/admin/scripts:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/bin/X11:/sbin:/usr/sbin:/usr/kerberos/bin:/usr/kerberos/sbin:/usr/lib64/lustre/utils:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey::/usr/lib64/lustre/tests/mpi NAME=ncli sh rpc.sh set_default_debug \"0x33f1504\" \" 0xffb7e3ff\" 16 ");echo XXRETCODE:$?'
Feb 15 09:57:06 hyperion-mds3 mrshd[15014]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "e2label /dev/md1");echo XXRETCODE:$?'
Feb 15 09:57:07 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 69 recoverable clients remain
Feb 15 09:57:07 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 2 previous similar messages
Feb 15 09:57:07 hyperion-mds3 mrshd[15022]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK");echo XXRETCODE:$?'
Feb 15 09:57:07 hyperion-mds3 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.10@o2ib
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(handler.c:1515:mds_handle()) Skipped 14 previous similar messages
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-107) req@ffff810f919ca000 x1426985034582787/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1360951043 ref 1 fl Interpret:H/0/0 rc -107/0
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 21 previous similar messages
Feb 15 09:57:11 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 25 recoverable clients remain
Feb 15 09:57:11 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 48 previous similar messages
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.78@o2ib
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(handler.c:1515:mds_handle()) Skipped 9 previous similar messages
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-107) req@ffff810837941400 x1426985038316154/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1360951052 ref 1 fl Interpret:H/0/0 rc -107/0
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 9 previous similar messages
Feb 15 09:57:19 hyperion-mds3 Lustre: 14850:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:19 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 5 recoverable clients remain
Feb 15 09:57:19 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 9 previous similar messages
Feb 15 09:57:21 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: e1802d77-7b80-d67a-2994-6269f294be6d reconnecting
Feb 15 09:57:27 hyperion-mds3 Lustre: 14877:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:35 hyperion-mds3 Lustre: 14860:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:35 hyperion-mds3 Lustre: 14860:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:44 hyperion-mds3 Lustre: 14852:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:44 hyperion-mds3 Lustre: 14852:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 4 previous similar messages
Feb 15 10:00:01 hyperion-mds3 mrshd[15031]: root@ehyperion0 as root: cmd='rdistd -S'
Feb 15 10:01:33 hyperion-mds3 Lustre: MGS: haven't heard from client 2d23bcad-fae7-5d3d-38a1-4db19f074eb8 (at 192.168.116.30@o2ib) in 228 seconds. I think it's dead, and I am evicting it.
Feb 15 10:01:33 hyperion-mds3 Lustre: MGS: haven't heard from client 1fdb7e20-6f21-c410-56a3-6c818466b316 (at 192.168.116.32@o2ib) in 226 seconds. I think it's dead, and I am evicting it.
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 31a45b78-d037-7471-33be-0b77d9441939 reconnecting
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 3 previous similar messages
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:03:40 hyperion-mds3 LustreError: 14857:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-16) req@ffff810c46390000 x1426985036859689/t0 o38->31a45b78-d037-7471-33be-0b77d9441939@NET_0x50000c0a8741b_UUID:0/0 lens 368/264 e 0 to 0 dl 1360951520 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 10:03:40 hyperion-mds3 LustreError: 14857:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 7 previous similar messages
Feb 15 10:03:48 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:03:56 hyperion-mds3 Lustre: 14870:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:04 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 31a45b78-d037-7471-33be-0b77d9441939 reconnecting
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 3 previous similar messages
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:12 hyperion-mds3 LustreError: 14872:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-16) req@ffff8105ddcde400 x1426985036859702/t0 o38->31a45b78-d037-7471-33be-0b77d9441939@NET_0x50000c0a8741b_UUID:0/0 lens 368/264 e 0 to 0 dl 1360951552 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 10:04:12 hyperion-mds3 LustreError: 14872:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 3 previous similar messages
Feb 15 10:04:20 hyperion-mds3 Lustre: 14856:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:36 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Let me know if you need more |
| Comment by Cliff White (Inactive) [ 18/Feb/13 ] |
|
Typical OST:
Feb 15 09:57:07 ehyperion1158 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:40 ehyperion1158 Lustre: 30886:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: e1802d77-7b80-d67a-2994-6269f294be6d reconnecting
Feb 15 09:57:41 ehyperion1158 Lustre: 30840:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0007: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:41 ehyperion1158 Lustre: 30840:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:43 ehyperion1158 Lustre: 30852:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:45 ehyperion1158 Lustre: 30935:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:45 ehyperion1158 Lustre: 30935:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:48 ehyperion1158 Lustre: 30932:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0007: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:48 ehyperion1158 Lustre: 30932:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:52 ehyperion1158 Lustre: 30887:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:52 ehyperion1158 Lustre: 30887:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 10:00:01 ehyperion1158 mrshd[7679]: root@ehyperion0 as root: cmd='rdistd -S'
Feb 15 10:01:42 ehyperion1158 Lustre: lustre-OST0000: haven't heard from client e1802d77-7b80-d67a-2994-6269f294be6d (at 192.168.116.32@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Feb 15 10:01:42 ehyperion1158 Lustre: Skipped 1 previous similar message
Feb 15 10:01:42 ehyperion1158 Lustre: lustre-OST0000: haven't heard from client d18624c2-39f3-9e9c-8a23-42b31c59c704 (at 192.168.116.28@o2ib) in 222 seconds. I think it's dead, and I am evicting it.
Feb 15 10:07:07 ehyperion1158 Lustre: lustre-OST0000: received MDS connection from 192.168.120.126@o2ib
Feb 15 10:07:07 ehyperion1158 Lustre: 30878:0:(filter.c:3129:filter_destroy_precreated()) lustre-OST0000: deleting orphan objects from 683938 to 683981, orphan objids won't be reused any more.
|
| Comment by Cliff White (Inactive) [ 18/Feb/13 ] |
|
One OST did report an issue during this time.
Feb 15 09:57:07 ehyperion1157 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:41 ehyperion1157 Lustre: 31000:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:41 ehyperion1157 Lustre: 31000:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:48 ehyperion1157 Lustre: 30938:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:48 ehyperion1157 Lustre: 30938:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:56 ehyperion1157 Lustre: 31047:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:56 ehyperion1157 Lustre: 31047:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8104da95e000
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8104da95e000
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8102d9e48000
Feb 15 09:58:07 ehyperion1157 LustreError: 32307:0:(ost_handler.c:1078:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff81031378f850 x1426985036338408/t0 o4->2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID:0/0 lens 448/416 e 1 to 0 dl 1360951137 ref 1 fl Interpret:/0/0 rc 0/0
Feb 15 09:58:07 ehyperion1157 Lustre: 32307:0:(ost_handler.c:1229:ost_brw_write()) lustre-OST0003: ignoring bulk IO comm error with 2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID id 12345-192.168.116.30@o2ib - client will retry
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8102d9e48000
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8102e61e6000
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8102e61e6000
Feb 15 09:58:08 ehyperion1157 LustreError: 32276:0:(ost_handler.c:1078:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff810625080000 x1426985036338409/t0 o4->2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID:0/0 lens 448/416 e 1 to 0 dl 1360951137 ref 1 fl Interpret:/0/0 rc 0/0
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff810544376000
|
| Comment by Johann Lombardi (Inactive) [ 18/Feb/13 ] |
|
What about the client's sysrq-t output (or, even better, a crash dump)? |
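For the record, the sysrq-t dump can be captured on the stuck client without rebooting it; a minimal sketch (assumes the kernel ring buffer/syslog is large enough to hold all task stacks, and that kdump is already configured if a full crash dump is wanted):
    # Enable sysrq and dump all task stacks to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t.$(hostname).log
    # Also flush the Lustre debug buffer from the same window
    lctl dk /tmp/lustre-debug.$(hostname).log
    # Or, with kdump configured, force a crash dump instead:
    # echo c > /proc/sysrq-trigger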
| Comment by Cliff White (Inactive) [ 19/Feb/13 ] |
|
I will attempt to reproduce today and get sysrq. |
| Comment by Andreas Dilger [ 29/Sep/15 ] |
|
Closing old 1.8 bug. |