[LU-2822] softlockups, evictions during recovery-scale Created: 15/Feb/13  Updated: 29/Sep/15  Resolved: 29/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CHAOS4/RHEL5 Old Hyperion system


Severity: 3
Rank (Obsolete): 6828

 Description   

Running the recovery-mds-scale test. Not really sure what is going on: the test completes one failover, then all activity ceases. The clients appear to be hitting soft lockups, pdflush is blocked, and this is preceded/followed by evictions from the OSTs.
Example, this is during an MDS failover event:

Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-MDT0000-mdc-ffff8101bf048400: tried all connections, increasing latency to 8s
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 16 previous similar messages
Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426985036308729 sent from lustre-OST0006-osc-ffff8101bf048400 to NID 192.168.120.86@o2ib 12s ago has timed out (12s prior to deadline).
Feb 15 09:59:11 ehyperion607 req@ffff8101a84e8400 x1426985036308729/t0 o8->lustre-OST0006_UUID@192.168.120.86@o2ib:28/4 lens 368/584 e 0 to 1 dl 1360951115 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
Feb 15 09:59:11 ehyperion607 BUG: soft lockup - CPU#0 stuck for 60s! [events/0:50]
Feb 15 09:59:11 ehyperion607 CPU 0:
Feb 15 09:59:11 ehyperion607 Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad mlx4_ib mlx4_core ipoib_helper ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev shpchp ahci i7core_edac edac_mc libata i2c_i801 pcspkr i2c_core scsi_mod uhci_hcd tpm_tis tpm tpm_bios nfs nfs_acl lockd sunrpc e1000e
Feb 15 09:59:11 ehyperion607 Pid: 50, comm: events/0 Tainted: G     ---- 2.6.18-348.1.1.el5 #1
Feb 15 09:59:11 ehyperion607 RIP: 0010:[<ffffffff80077925>]  [<ffffffff80077925>] __smp_call_function_many+0x96/0xbc
Feb 15 09:59:11 ehyperion607 RSP: 0018:ffff8101bf953d40  EFLAGS: 00000297
Feb 15 09:59:11 ehyperion607 RAX: 0000000000000006 RBX: 0000000000000007 RCX: 0000000000000000
Feb 15 09:59:11 ehyperion607 RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
Feb 15 09:59:11 ehyperion607 RBP: ffff8101bf953e10 R08: 0000000000000008 R09: 0000000000000038
Feb 15 09:59:11 ehyperion607 R10: ffff8101bf953ce0 R11: ffff8101c571c500 R12: 00000002ffffffff
Feb 15 09:59:11 ehyperion607 R13: ffffffff8008dedf R14: 00000001ffffffff R15: ffffffff80157cde
Feb 15 09:59:11 ehyperion607 FS:  0000000000000000(0000) GS:ffffffff80435000(0000) knlGS:0000000000000000
Feb 15 09:59:11 ehyperion607 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 15 09:59:11 ehyperion607 CR2: 00002aaaabb7c088 CR3: 00000001b73bb000 CR4: 00000000000006a0
Feb 15 09:59:11 ehyperion607
Feb 15 09:59:11 ehyperion607 Call Trace:
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80077a27>] smp_call_function_many+0x38/0x4c
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80077b18>] smp_call_function+0x4e/0x5e
Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
Feb 15 09:59:11 ehyperion607 [<ffffffff80073020>] mcheck_timer+0x0/0x6c
Feb 15 09:59:11 ehyperion607 [<ffffffff80096be9>] on_each_cpu+0x10/0x22
Feb 15 09:59:11 ehyperion607 [<ffffffff8007303c>] mcheck_timer+0x1c/0x6c
Feb 15 09:59:11 ehyperion607 [<ffffffff8004d8b9>] run_workqueue+0x9e/0xfb
Feb 15 09:59:11 ehyperion607 [<ffffffff8004a102>] worker_thread+0x0/0x122
Feb 15 09:59:11 ehyperion607 [<ffffffff8004a1f2>] worker_thread+0xf0/0x122
Feb 15 09:59:11 ehyperion607 [<ffffffff8008f38d>] default_wake_function+0x0/0xe
Feb 15 09:59:11 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
Feb 15 09:59:11 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
Feb 15 09:59:11 ehyperion607
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-OST0006-osc-ffff8101bf048400: tried all connections, increasing latency to 9s
Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 25 previous similar messages
Feb 15 09:59:54 ehyperion607 INFO: task pdflush:543 blocked for more than 120 seconds.
Feb 15 09:59:54 ehyperion607 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 09:59:54 ehyperion607 pdflush       D ffff8101c546a7a0     0   543    171           544   542 (L-TLB)
Feb 15 09:59:54 ehyperion607 ffff81033f57fb30 0000000000000046 ffff81033f57fa80 0000000000000000
Feb 15 09:59:54 ehyperion607 ffffffff88718220 000000000000000a ffff8101bf566820 ffff8101c56de7a0
Feb 15 09:59:54 ehyperion607 00009ea40c4605ce 00000000001a7575 ffff8101bf566a08 0000000705c2b6a0
Feb 15 09:59:54 ehyperion607 Call Trace:
Feb 15 09:59:54 ehyperion607 [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
Feb 15 09:59:54 ehyperion607 [<ffffffff80029173>] sync_page+0x0/0x43
Feb 15 09:59:54 ehyperion607 [<ffffffff800637de>] io_schedule+0x3f/0x67
Feb 15 09:59:54 ehyperion607 [<ffffffff800291b1>] sync_page+0x3e/0x43
Feb 15 09:59:54 ehyperion607 [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
Feb 15 09:59:54 ehyperion607 [<ffffffff8003ff85>] __lock_page+0x5e/0x64
Feb 15 09:59:54 ehyperion607 [<ffffffff800a3c31>] wake_bit_function+0x0/0x23
Feb 15 09:59:54 ehyperion607 [<ffffffff8001d54e>] mpage_writepages+0x14f/0x37d
Feb 15 09:59:54 ehyperion607 [<ffffffff88846c30>] :lustre:ll_writepage_26+0x0/0x10
Feb 15 09:59:54 ehyperion607 [<ffffffff8005aee2>] do_writepages+0x20/0x2f
Feb 15 09:59:54 ehyperion607 [<ffffffff8002ff0f>] __writeback_single_inode+0x1a2/0x31c
Feb 15 09:59:54 ehyperion607 [<ffffffff8002159a>] sync_sb_inodes+0x1b7/0x271
Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
Feb 15 09:59:54 ehyperion607 [<ffffffff8005130b>] writeback_inodes+0x82/0xd8
Feb 15 09:59:54 ehyperion607 [<ffffffff800cd802>] wb_kupdate+0xf0/0x16a
Feb 15 09:59:54 ehyperion607 [<ffffffff800568b2>] pdflush+0x0/0x1fb
Feb 15 09:59:54 ehyperion607 [<ffffffff80056a03>] pdflush+0x151/0x1fb
Feb 15 09:59:54 ehyperion607 [<ffffffff800cd712>] wb_kupdate+0x0/0x16a
Feb 15 09:59:54 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
Feb 15 09:59:54 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
Feb 15 09:59:54 ehyperion607
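
For reference, the 60s and 120s thresholds in the watchdog messages above are the stock RHEL5 sysctls; they can be inspected (or relaxed, to get cleaner data) on the clients. This is a sketch of standard kernel knobs, not anything specific to this run, and softlockup_thresh may not exist on every 2.6.18 kernel:

  # soft lockup watchdog threshold ("CPU#N stuck for 60s")
  sysctl kernel.softlockup_thresh
  # hung task watchdog ("blocked for more than 120 seconds"); setting it to 0 disables the message
  sysctl kernel.hung_task_timeout_secs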


 Comments   
Comment by Peter Jones [ 15/Feb/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Oleg Drokin [ 15/Feb/13 ]

So, pdflush blocking is expected - the logs show disconnection from the OST, so there is no way to flush those dirty pages until the reconnect happens, and I assume that's not happening? Any further messages to give us a clue why?

The soft lockup stack is strange; I don't believe we actually use the events workqueue API in Lustre at all, so I am not sure why it would lock up there.
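
For what it's worth, a quick way to check from a stuck client whether the imports ever recover (a sketch using the usual 1.8 client proc parameters; the names are from memory, not taken from this run):

  # import state per OSC: FULL means reconnected, DISCONN/CONNECTING means still trying
  lctl get_param osc.*.ost_server_uuid
  # dirty data each OSC is still holding, i.e. what pdflush is waiting to flush
  lctl get_param osc.*.cur_dirty_bytes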

Comment by Jian Yu [ 16/Feb/13 ]

Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64(server), RHEL6.3/x86_64(client)

The large-scale test 3a (failing over the MDS while running mdsrate on clients) also hung over an IB network (LU-2823). The logs likewise showed that the connection to the OST was lost.

The same test run by autotest passed over a TCP network: https://maloo.whamcloud.com/test_sets/ad2ba008-7683-11e2-bc2f-52540035b04c

The recovery-mds-scale test run by autotest also passed over a TCP network, hard-failing over the MDS 95 times in 24 hours: https://maloo.whamcloud.com/sub_tests/9caf800a-77dc-11e2-abae-52540035b04c

Comment by Niu Yawei (Inactive) [ 16/Feb/13 ]

Cliff, are there any logs from the OSS and MDS? Thanks.

Comment by Jian Yu [ 18/Feb/13 ]

Lustre Tag: v1_8_9_WC1_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/256
Distro/Arch: RHEL5.9/x86_64
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive)
                                  \  /
                            1 combined MGS/MDT

OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
                              \  /
                              6 OSTs

Client Nodes: fat-intel-2-ib,client-[1,3,6,17]-ib

I started a manual test run of recovery-mds-scale over an IB network with the above configuration on the Toro cluster (the number of clients is far fewer than on the Hyperion cluster). In the end, the MDS failed over 24 times in 6 hours without issue, and the OST failed over once, hitting LU-463:
https://maloo.whamcloud.com/test_sets/d7e82752-79db-11e2-8fd2-52540035b04c
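
For anyone repeating this, the run can be reproduced roughly as follows (a sketch only, assuming the stock lustre/tests config variables; everything not quoted above is illustrative):

  # hypothetical reproduction sketch -- node names as above, the rest is a generic lustre/tests setup
  export mds_HOST=fat-amd-1-ib mdsfailover_HOST=fat-amd-2-ib
  export ost_HOST=fat-amd-3-ib ostfailover_HOST=fat-amd-4-ib
  export OSTCOUNT=6
  export CLIENTS=fat-intel-2-ib,client-1-ib,client-3-ib,client-6-ib,client-17-ib
  export ENABLE_QUOTA=yes FAILURE_MODE=HARD
  export DURATION=$((6*3600))      # keep failing over for ~6 hours
  cd /usr/lib64/lustre/tests && sh recovery-mds-scale.sh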

Comment by Peter Jones [ 18/Feb/13 ]

Could it be worth testing on the OpenSFS test cluster? That should give us more clients to play with (though still not quite as many as Hyperion).

Comment by Cliff White (Inactive) [ 18/Feb/13 ]

MDS log:

Feb 15 09:56:52 hyperion-mds3 Lustre: MDT lustre-MDT0000 has stopped.
Feb 15 09:56:52 hyperion-mds3 LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (stopping)
Feb 15 09:56:52 hyperion-mds3 LustreError: Skipped 2 previous similar messages
Feb 15 09:56:52 hyperion-mds3 LustreError: 13540:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-19)  req@ffff81023f2b2c50 x1426985035882333/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1360951112 ref 1 fl Interpret:/0/0 rc -19/0
Feb 15 09:56:52 hyperion-mds3 LustreError: 13540:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 1 previous similar message
Feb 15 09:56:52 hyperion-mds3 LustreError: 13549:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.64@o2ib
Feb 15 09:56:53 hyperion-mds3 LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (no target)
Feb 15 09:56:53 hyperion-mds3 LustreError: Skipped 2 previous similar messages
Feb 15 09:56:53 hyperion-mds3 Lustre: server umount lustre-MDT0000 complete
Feb 15 09:56:53 hyperion-mds3 mrshd[14785]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '");echo XXRETCODE:$?'
Feb 15 09:57:03 hyperion-mds3 mrshd[14795]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "hostname");echo XXRETCODE:$?'
Feb 15 09:57:03 hyperion-mds3 mrshd[14801]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "mkdir -p /p/mds; mount -t lustre -o user_xattr  /dev/md1 /p/mds");echo XXRETCODE:$?'
Feb 15 09:57:04 hyperion-mds3 LDISKFS-fs (md1): mounted filesystem with ordered data mode
Feb 15 09:57:04 hyperion-mds3 JBD: barrier-based sync failed on md1-8 - disabling barriers
Feb 15 09:57:05 hyperion-mds3 LDISKFS-fs (md1): warning: maximal mount count reached, running e2fsck is recommended
Feb 15 09:57:06 hyperion-mds3 LDISKFS-fs (md1): mounted filesystem with ordered data mode
Feb 15 09:57:06 hyperion-mds3 Lustre: Enabling user_xattr
Feb 15 09:57:06 hyperion-mds3 Lustre: Skipped 1 previous similar message
Feb 15 09:57:06 hyperion-mds3 Lustre: Enabling user_xattr
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_fs.c:677:mds_init_server_data()) RECOVERY: service lustre-MDT0000, 70 recoverable clients, 0 delayed clients, last_transno 38654758668
Feb 15 09:57:06 hyperion-mds3 JBD: barrier-based sync failed on md1-8 - disabling barriers
Feb 15 09:57:06 hyperion-mds3 Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/md1 with recovery enabled
Feb 15 09:57:06 hyperion-mds3 Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 70 clients reconnect
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) Skipped 4 previous similar messages
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) MDS lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0000_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) Skipped 1 previous similar message
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0004_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 14918:0:(mds_lov.c:1155:mds_notify()) Skipped 3 previous similar messages
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) MDS lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0005_UUID
Feb 15 09:57:06 hyperion-mds3 Lustre: 3083:0:(mds_lov.c:1191:mds_notify()) Skipped 4 previous similar messages
Feb 15 09:57:06 hyperion-mds3 mrshd[14929]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/tests:/opt/mdtest-1.7.4/bin:/opt/mib-1.9.10/bin:/opt/ior-2.10.2/bin:/opt/simul-1.15/bin:/opt/mvapich-gnu-gen2-1.2/bin:/admin/bin:/admin/sbin:/admin/scripts:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/bin/X11:/sbin:/usr/sbin:/usr/kerberos/bin:/usr/kerberos/sbin:/usr/lib64/lustre/utils:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey::/usr/lib64/lustre/tests/mpi NAME=ncli sh rpc.sh set_default_debug \"0x33f1504\" \" 0xffb7e3ff\" 16 ");echo XXRETCODE:$?'
Feb 15 09:57:06 hyperion-mds3 mrshd[15014]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "e2label /dev/md1");echo XXRETCODE:$?'
Feb 15 09:57:07 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 69 recoverable clients remain
Feb 15 09:57:07 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 2 previous similar messages
Feb 15 09:57:07 hyperion-mds3 mrshd[15022]: root@ehyperion582 as root: cmd='PATH=/admin/scripts:/admin/bin:/bin:/usr/bin:/sbin:/usr/sbin;(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK");echo XXRETCODE:$?'
Feb 15 09:57:07 hyperion-mds3 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.10@o2ib
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(handler.c:1515:mds_handle()) Skipped 14 previous similar messages
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810f919ca000 x1426985034582787/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1360951043 ref 1 fl Interpret:H/0/0 rc -107/0
Feb 15 09:57:07 hyperion-mds3 LustreError: 14871:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 21 previous similar messages
Feb 15 09:57:11 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 25 recoverable clients remain
Feb 15 09:57:11 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 48 previous similar messages
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(handler.c:1515:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.116.78@o2ib
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(handler.c:1515:mds_handle()) Skipped 9 previous similar messages
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810837941400 x1426985038316154/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1360951052 ref 1 fl Interpret:H/0/0 rc -107/0
Feb 15 09:57:16 hyperion-mds3 LustreError: 14865:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 9 previous similar messages
Feb 15 09:57:19 hyperion-mds3 Lustre: 14850:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:19 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) lustre-MDT0000: 5 recoverable clients remain
Feb 15 09:57:19 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:1817:target_queue_last_replay_reply()) Skipped 9 previous similar messages
Feb 15 09:57:21 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: e1802d77-7b80-d67a-2994-6269f294be6d reconnecting
Feb 15 09:57:27 hyperion-mds3 Lustre: 14877:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:35 hyperion-mds3 Lustre: 14860:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:35 hyperion-mds3 Lustre: 14860:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:44 hyperion-mds3 Lustre: 14852:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:44 hyperion-mds3 Lustre: 14852:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 4 previous similar messages
Feb 15 10:00:01 hyperion-mds3 mrshd[15031]: root@ehyperion0 as root: cmd='rdistd -S'
Feb 15 10:01:33 hyperion-mds3 Lustre: MGS: haven't heard from client 2d23bcad-fae7-5d3d-38a1-4db19f074eb8 (at 192.168.116.30@o2ib) in 228 seconds. I think it's dead, and I am evicting it.
Feb 15 10:01:33 hyperion-mds3 Lustre: MGS: haven't heard from client 1fdb7e20-6f21-c410-56a3-6c818466b316 (at 192.168.116.32@o2ib) in 226 seconds. I think it's dead, and I am evicting it.
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 31a45b78-d037-7471-33be-0b77d9441939 reconnecting
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 3 previous similar messages
Feb 15 10:03:40 hyperion-mds3 Lustre: 14857:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:03:40 hyperion-mds3 LustreError: 14857:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810c46390000 x1426985036859689/t0 o38->31a45b78-d037-7471-33be-0b77d9441939@NET_0x50000c0a8741b_UUID:0/0 lens 368/264 e 0 to 0 dl 1360951520 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 10:03:40 hyperion-mds3 LustreError: 14857:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 7 previous similar messages
Feb 15 10:03:48 hyperion-mds3 Lustre: 14854:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:03:56 hyperion-mds3 Lustre: 14870:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:04 hyperion-mds3 Lustre: 14865:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-MDT0000: 31a45b78-d037-7471-33be-0b77d9441939 reconnecting
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 3 previous similar messages
Feb 15 10:04:12 hyperion-mds3 Lustre: 14872:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:12 hyperion-mds3 LustreError: 14872:0:(ldlm_lib.c:1921:target_send_reply_msg()) @@@ processing error (-16)  req@ffff8105ddcde400 x1426985036859702/t0 o38->31a45b78-d037-7471-33be-0b77d9441939@NET_0x50000c0a8741b_UUID:0/0 lens 368/264 e 0 to 0 dl 1360951552 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 10:04:12 hyperion-mds3 LustreError: 14872:0:(ldlm_lib.c:1921:target_send_reply_msg()) Skipped 3 previous similar messages
Feb 15 10:04:20 hyperion-mds3 Lustre: 14856:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs
Feb 15 10:04:36 hyperion-mds3 Lustre: 14866:0:(ldlm_lib.c:876:target_handle_connect()) lustre-MDT0000: refuse reconnection from 31a45b78-d037-7471-33be-0b77d9441939@192.168.116.27@o2ib to 0xffff8103bf910e00; still busy with 1 active RPCs

Let me know if you need more

Comment by Cliff White (Inactive) [ 18/Feb/13 ]

Typical OST:

Feb 15 09:57:07 ehyperion1158 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:40 ehyperion1158 Lustre: 30886:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: e1802d77-7b80-d67a-2994-6269f294be6d reconnecting
Feb 15 09:57:41 ehyperion1158 Lustre: 30840:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0007: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:41 ehyperion1158 Lustre: 30840:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:43 ehyperion1158 Lustre: 30852:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:45 ehyperion1158 Lustre: 30935:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:45 ehyperion1158 Lustre: 30935:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:48 ehyperion1158 Lustre: 30932:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0007: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:48 ehyperion1158 Lustre: 30932:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:52 ehyperion1158 Lustre: 30887:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0000: 2d202d86-7c6c-ef27-da63-3094a685e606 reconnecting
Feb 15 09:57:52 ehyperion1158 Lustre: 30887:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 10:00:01 ehyperion1158 mrshd[7679]: root@ehyperion0 as root: cmd='rdistd -S'
Feb 15 10:01:42 ehyperion1158 Lustre: lustre-OST0000: haven't heard from client e1802d77-7b80-d67a-2994-6269f294be6d (at 192.168.116.32@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Feb 15 10:01:42 ehyperion1158 Lustre: Skipped 1 previous similar message
Feb 15 10:01:42 ehyperion1158 Lustre: lustre-OST0000: haven't heard from client d18624c2-39f3-9e9c-8a23-42b31c59c704 (at 192.168.116.28@o2ib) in 222 seconds. I think it's dead, and I am evicting it.
Feb 15 10:07:07 ehyperion1158 Lustre: lustre-OST0000: received MDS connection from 192.168.120.126@o2ib
Feb 15 10:07:07 ehyperion1158 Lustre: 30878:0:(filter.c:3129:filter_destroy_precreated()) lustre-OST0000: deleting orphan objects from 683938 to 683981, orphan objids won't be reused any more.

Comment by Cliff White (Inactive) [ 18/Feb/13 ]

One OST did report an issue during this time.

Feb 15 09:57:07 ehyperion1157 Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Feb 15 09:57:41 ehyperion1157 Lustre: 31000:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:41 ehyperion1157 Lustre: 31000:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 2 previous similar messages
Feb 15 09:57:48 ehyperion1157 Lustre: 30938:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:48 ehyperion1157 Lustre: 30938:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:57:56 ehyperion1157 Lustre: 31047:0:(ldlm_lib.c:576:target_handle_reconnect()) lustre-OST0003: d18624c2-39f3-9e9c-8a23-42b31c59c704 reconnecting
Feb 15 09:57:56 ehyperion1157 Lustre: 31047:0:(ldlm_lib.c:576:target_handle_reconnect()) Skipped 1 previous similar message
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8104da95e000
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8104da95e000
Feb 15 09:58:07 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8102d9e48000
Feb 15 09:58:07 ehyperion1157 LustreError: 32307:0:(ost_handler.c:1078:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req@ffff81031378f850 x1426985036338408/t0 o4->2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID:0/0 lens 448/416 e 1 to 0 dl 1360951137 ref 1 fl Interpret:/0/0 rc 0/0
Feb 15 09:58:07 ehyperion1157 Lustre: 32307:0:(ost_handler.c:1229:ost_brw_write()) lustre-OST0003: ignoring bulk IO comm error with 2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID id 12345-192.168.116.30@o2ib - client will retry
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8102d9e48000
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff8102e61e6000
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 2, status -103, desc ffff8102e61e6000
Feb 15 09:58:08 ehyperion1157 LustreError: 32276:0:(ost_handler.c:1078:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req@ffff810625080000 x1426985036338409/t0 o4->2d202d86-7c6c-ef27-da63-3094a685e606@NET_0x50000c0a8741e_UUID:0/0 lens 448/416 e 1 to 0 dl 1360951137 ref 1 fl Interpret:/0/0 rc 0/0
Feb 15 09:58:08 ehyperion1157 LustreError: 30804:0:(events.c:387:server_bulk_callback()) event type 4, status -103, desc ffff810544376000

Comment by Johann Lombardi (Inactive) [ 18/Feb/13 ]

What about the client's sysrq-t output (or, even better, a crash dump)?
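
(For the record, the standard way to grab that on a hung client; these are generic RHEL5 commands, not something taken from this run:)

  echo 1 > /proc/sys/kernel/sysrq        # make sure sysrq is enabled
  echo t > /proc/sysrq-trigger           # dump all task stacks to the console/dmesg
  dmesg > /tmp/sysrq-t.$(hostname).log   # capture the output
  # or, with kdump configured, force a full crash dump instead:
  echo c > /proc/sysrq-trigger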

Comment by Cliff White (Inactive) [ 19/Feb/13 ]

I will attempt to reproduce today and get sysrq.

Comment by Andreas Dilger [ 29/Sep/15 ]

Closing old 1.8 bug.
