[LU-3940] Clients in recovery for a very long time - LustreError: (ldlm_lib.c:941:target_handle_connect()) Created: 12/Sep/13  Updated: 16/Oct/13  Resolved: 16/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Matteo Piccinini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Linux Kernel version 2.6.32-220.4.2.el6_lustre.x86_64 (jenkins@client-31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012

Lustre Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64

Processors: 2xIntel(R) Xeon(R) CPU E5645 @ 2.40GHz - Ram: 48GB


Attachments: File var.log.messages.tar.gz    
Severity: 3
Rank (Obsolete): 10422

 Description   

Dear support,

We have encountered a serious problem with a production Lustre installation that is negatively affecting our customers.
Specifically, some OSTs went offline after a cluster node fault, and these volumes tend to remain in recovery status forever, denying connections from new clients (XXX clients in recovery for 18446744073709550863s).
The only practical workaround we have found is to manually abort the recovery procedure in order to mount the volumes properly.
We would like to understand whether this situation is due to a bug or to a configuration problem, since we have not been able to find any useful information in the bug tracking system.
Below we attach a short analysis of the problem and some extracts from the /var/log/messages file.

Thanks in advance for your support.

Kernel Version installed on the OSS:

Linux version 2.6.32-220.4.2.el6_lustre.x86_64 (jenkins@client-31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012

Lustre version installed on the OSS and MGS/MDS:
Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64

Lustre version installed on the clients:
There is a mix of versions (we can be more precise if needed)
Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
Build Version: 2.4.0-RC2--CHANGED-2.6.32-358.6.2.el6.x86_64

Processors:
2xIntel(R) Xeon(R) CPU E5645 @ 2.40GHz
Ram:
48GB

Infrastructure info:
2 nodes for MDS/MGS in a pacemaker/corosync cluster
8 OSS nodes, arranged as 2-node pacemaker/corosync clusters, with one DotHill storage controller per pair
~1000 clients

Problem description:

The problem described below was found on a two-node cluster, n-oss07 and n-oss08.
Storage configuration:

n-oss07:
nero-OST0006 -> vd01
nero-OST000e -> vd03
nero-OST0016 -> vd05
nero-OST001e -> vd07

n-oss08:
nero-OST0007 -> vd02
nero-OST000f -> vd04
nero-OST0017 -> vd06
nero-OST001f -> vd08

On 27.08 at about 11:24 the node n-oss07 hung and was shut down by n-oss08, which successfully took over the n-oss07 volumes, putting the OSTs into recovery mode; in particular, the OST nero-OST001e showed a very long recovery time.
The logs were full of messages like the following extract:

-------------------------------------------------------------------------
Aug 27 11:34:25 n-oss08 kernel: LustreError: 7913:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (eb0a36de-8fcf-cfb5-e996-eb6968148594): 329 clients in recovery for 18446744073709551490s
-------------------------------------------------------------------------
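For context, durations like 18446744073709551490s are exactly what a small negative number looks like when rendered as an unsigned 64-bit integer: the recovery deadline is already in the past, and the "time remaining" has underflowed. A minimal sketch of the effect (illustrative variable names and timestamps, not Lustre's actual code):

```python
U64 = 2 ** 64

def as_u64(x: int) -> int:
    """Reinterpret a Python int as a C-style unsigned 64-bit value."""
    return x % U64

# Hypothetical numbers: the recovery deadline passed 126 seconds ago.
recovery_deadline = 1_377_594_739          # absolute deadline, epoch seconds (made up)
now = recovery_deadline + 126              # current time, 126 s past the deadline

remaining = as_u64(recovery_deadline - now)
print(f"{remaining}s")                     # 18446744073709551490s -- the value in the log above
```

Note that 2**64 - 18446744073709551490 = 126, i.e. the first log message corresponds to a recovery window that expired 126 seconds earlier.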

When the node n-oss07 came back online, it took its resources back, forcing the unmount and thus hanging the node n-oss08.

-------------------------------------------------------------------------
Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Running stop for /dev/mapper/vd07 on /lustre/vd07
Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Trying to unmount /lustre/vd07
Aug 27 11:41:21 n-oss08 kernel: Lustre: Failing over nero-OST001e
Aug 27 11:41:21 n-oss08 kernel: LustreError: 18866:0:(ldlm_lib.c:1978:target_stop_recovery_thread()) nero-OST001e: Aborting recovery
Aug 27 11:41:40 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)
Aug 27 11:41:40 n-oss08 kernel: LustreError: 4278:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (19) req@ffff8805a769d800 x1443813942539660/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377596600 ref 1 fl Interpret:/0/ffffffff rc -19/-1
Aug 27 11:41:40 n-oss08 kernel: LustreError: 4278:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 15 previous similar messages
Aug 27 11:42:30 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)
Aug 27 11:42:30 n-oss08 kernel: LustreError: Skipped 709 previous similar messages
Aug 27 11:43:35 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)

Aug 27 11:43:35 n-oss08 kernel: LustreError: Skipped 1053 previous similar messages
Aug 27 11:43:48 n-oss08 kernel: INFO: task umount:18866 blocked for more than 120 seconds.
Aug 27 11:43:48 n-oss08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 27 11:43:48 n-oss08 kernel: umount D 0000000000000008 0 18866 18812 0x00000080
Aug 27 11:43:48 n-oss08 kernel: ffff880198ce7998 0000000000000082 0000000000000000 ffffffffa06e6ee3
Aug 27 11:43:48 n-oss08 kernel: ffff880198ce7928 ffffffffa0436105 00000000000007ba ffffffffa06d9670
Aug 27 11:43:48 n-oss08 kernel: ffff8803f7467b38 ffff880198ce7fd8 000000000000f4e8 ffff8803f7467b38
Aug 27 11:43:48 n-oss08 kernel: Call Trace:
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0436105>] ? cfs_print_to_console+0x75/0xe0 [libcfs]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814edb95>] schedule_timeout+0x215/0x2e0
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814ed813>] wait_for_common+0x123/0x180
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814ed92d>] wait_for_completion+0x1d/0x20
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa06664a0>] target_stop_recovery_thread+0x50/0xa0 [ptlrpc]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa066728e>] target_recovery_fini+0x1e/0x30 [ptlrpc]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa07c2266>] filter_precleanup+0xa6/0x470 [obdfilter]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa051dd66>] ? class_disconnect_exports+0x126/0x220 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0531da9>] class_cleanup+0x199/0xa30 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa04404f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0519db6>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0533303>] class_process_config+0xcc3/0x1670 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0437823>] ? cfs_alloc+0x63/0x90 [libcfs]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa052fc4b>] ? lustre_cfg_new+0x31b/0x640 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0533dfc>] class_manual_cleanup+0x14c/0x560 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0519db6>] ? class_name2dev+0x56/0xe0 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa053f5ec>] server_put_super+0xaac/0xf40 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff81191376>] ? invalidate_inodes+0xf6/0x190
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811786cb>] generic_shutdown_super+0x5b/0xe0
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811787b6>] kill_anon_super+0x16/0x60
Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0535a06>] lustre_kill_super+0x36/0x60 [obdclass]
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff81179740>] deactivate_super+0x70/0x90
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811956cf>] mntput_no_expire+0xbf/0x110
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8119616b>] sys_umount+0x7b/0x3a0
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff810d4582>] ? audit_syscall_entry+0x272/0x2a0
Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
-------------------------------------------------------------------------

Both nodes were manually rebooted and recovered their own resources, but n-oss07 still showed a very long recovery time for the OST nero-OST001e:

-------------------------------------------------------------------------
Aug 27 12:05:49 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551063s
Aug 27 12:06:14 n-oss07 kernel: LustreError: 3785:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551038s
Aug 27 12:06:39 n-oss07 kernel: LustreError: 3762:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551013s
Aug 27 12:07:04 n-oss07 kernel: LustreError: 3760:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550988s
Aug 27 12:07:29 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550963s
Aug 27 12:07:54 n-oss07 kernel: LustreError: 3773:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550938s
Aug 27 12:08:19 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550913s
Aug 27 12:09:09 n-oss07 kernel: LustreError: 3762:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550863s
-------------------------------------------------------------------------
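A quick consistency check on the extract above: subtracting 2**64 from each reported value recovers a small negative number of seconds whose magnitude grows in lockstep with the message timestamps (25 s per 25 s gap, 50 s for the final gap), confirming that the counter is a negative time rendered as unsigned 64-bit:

```python
# "clients in recovery for ...s" values and timestamps copied from the log above.
reported = [
    ("12:05:49", 18446744073709551063),
    ("12:06:14", 18446744073709551038),
    ("12:06:39", 18446744073709551013),
    ("12:07:04", 18446744073709550988),
    ("12:07:29", 18446744073709550963),
    ("12:07:54", 18446744073709550938),
    ("12:08:19", 18446744073709550913),
    ("12:09:09", 18446744073709550863),
]

for ts, value in reported:
    # Undo the unsigned 64-bit wrap to recover the true (negative) value.
    print(ts, value - 2 ** 64)   # -553, -578, -603, ... down to -753
```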

The same happened for the OST nero-OST001f on the n-oss08 node:

-------------------------------------------------------------------------
Aug 27 12:09:09 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550945s
Aug 27 12:09:09 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 1 previous similar message
Aug 27 12:09:59 n-oss08 kernel: LustreError: 3739:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (16) req@ffff88049d05a400 x1444516067280684/t0(0) o8-><?>@<?>:0/0 lens 400/264 e 0 to 0 dl 1377598298 ref 1 fl Interpret:/0/0 rc -16/0
Aug 27 12:09:59 n-oss08 kernel: LustreError: 3739:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 10 previous similar messages
Aug 27 12:10:16 n-oss08 cib: [3236]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
Aug 27 12:10:24 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550870s
Aug 27 12:10:24 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 2 previous similar messages
Aug 27 12:12:54 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550720s
Aug 27 12:12:54 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 5 previous similar messages
Aug 27 12:17:29 n-oss08 kernel: LustreError: 3843:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550445s
Aug 27 12:17:29 n-oss08 kernel: LustreError: 3843:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 10 previous similar messages
Aug 27 12:20:16 n-oss08 cib: [3236]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
-------------------------------------------------------------------------

The only action that allowed the OSTs to be mounted again was to manually abort the recovery procedure for nero-OST001e and nero-OST001f (in Lustre this can be done with lctl --device <dev> abort_recovery, or by mounting the target with the abort_recov option):

-------------------------------------------------------------------------
Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Now serving nero-OST001e on /dev/mapper/vd07 with recovery enabled
Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Aborting recovery.
Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 [...]
Aug 27 12:54:45 n-oss07 kernel: Lustre: Failing over nero-OST001e
Aug 27 12:54:46 n-oss07 kernel: LustreError: 3814:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Aug 27 12:54:46 n-oss07 kernel: Lustre: nero-OST001e: shutting down for failover; client state will be preserved.
Aug 27 12:54:46 n-oss07 kernel: LustreError: 3814:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Aug 27 12:54:46 n-oss07 kernel: Lustre: OST nero-OST001e has stopped.
Aug 27 12:54:49 n-oss07 kernel: Lustre: server umount nero-OST001e complete
[...]
Aug 27 12:56:40 n-oss07 Filesystem[5002]: INFO: Running start for /dev/mapper/vd07 on /lustre/vd07
Aug 27 12:56:40 n-oss07 kernel: Lustre: nero-OST0016: Will be in recovery for at least 2:30, or until 746 clients reconnect
Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
Aug 27 12:56:40 n-oss07 kernel: Lustre: 5111:0:(ldlm_lib.c:2019:target_recovery_init()) RECOVERY: service nero-OST001e, 746 recoverable clients, last_transno 734586500
[...]
Aug 27 12:56:41 n-oss07 kernel: Lustre: nero-OST001e: Will be in recovery for at least 2:30, or until 746 clients reconnect
-------------------------------------------------------------------------

-------------------------------------------------------------------------
Aug 27 12:53:12 n-oss08 kernel: Lustre: nero-OST001f: Now serving nero-OST001f on /dev/mapper/vd08 with recovery enabled
Aug 27 12:53:13 n-oss08 kernel: Lustre: nero-OST001f: Aborting recovery.
Aug 27 12:53:13 n-oss08 kernel: Lustre: nero-OST001f: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Aug 27 12:53:20 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST000e_UUID' is not available for connect (no target)
Aug 27 12:53:20 n-oss08 kernel: LustreError: 3402:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (19) req@ffff8805ff3f4400 x1439334118204356/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377600899 ref 1 fl Interpret:/0/ffffffff rc -19/-1
Aug 27 12:53:20 n-oss08 kernel: LustreError: 3402:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 743 previous similar messages
Aug 27 12:53:20 n-oss08 kernel: LustreError: Skipped 676 previous similar messages
Aug 27 12:53:21 n-oss08 kernel: Lustre: 3430:0:(filter.c:2691:filter_connect_internal()) nero-OST001f: Received MDS connection for group 0
Aug 27 12:53:21 n-oss08 kernel: Lustre: nero-OST001f: received MDS connection from 10.201.62.14@o2ib
Aug 27 12:53:22 n-oss08 kernel: Lustre: 3412:0:(filter.c:2547:filter_llog_connect()) nero-OST001f: Recovery from log 0xf9f0621/0x0:c4b6c135
Aug 27 12:53:52 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST0006_UUID' is not available for connect (no target)
Aug 27 12:53:52 n-oss08 kernel: LustreError: 3458:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-19) req@ffff8805f7623800 x1443956471336464/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377600931 ref 1 fl Interpret:/0/ffffffff rc -19/-1
Aug 27 12:53:52 n-oss08 kernel: LustreError: 3458:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 2056 previous similar messages
Aug 27 12:53:52 n-oss08 kernel: LustreError: Skipped 2120 previous similar messages
[...]
Aug 27 12:54:37 n-oss08 kernel: Lustre: Failing over nero-OST001f
Aug 27 12:54:39 n-oss08 kernel: LustreError: 3734:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Aug 27 12:54:39 n-oss08 kernel: Lustre: nero-OST001f: shutting down for failover; client state will be preserved.
Aug 27 12:54:39 n-oss08 kernel: LustreError: 3734:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Aug 27 12:54:39 n-oss08 kernel: Lustre: OST nero-OST001f has stopped.
Aug 27 12:54:42 n-oss08 kernel: Lustre: server umount nero-OST001f complete
[...]
Aug 27 12:56:40 n-oss08 Filesystem[4734]: INFO: Running start for /dev/mapper/vd08 on /lustre/vd08
Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
Aug 27 12:56:40 n-oss08 kernel: Lustre: 4843:0:(ldlm_lib.c:2019:target_recovery_init()) RECOVERY: service nero-OST001f, 746 recoverable clients, last_transno 730833312
Aug 27 12:56:40 n-oss08 kernel: Lustre: nero-OST001f: Now serving nero-OST001f on /dev/mapper/vd08 with recovery enabled
Aug 27 12:56:40 n-oss08 kernel: Lustre: nero-OST001f: temporarily refusing client connection from 10.201.51.36@o2ib
Aug 27 12:56:40 n-oss08 kernel: Lustre: Skipped 42 previous similar messages
Aug 27 12:56:41 n-oss08 lrmd: [3762]: info: Managed fs_vd08:start process 4734 exited with return code 0.
Aug 27 12:56:41 n-oss08 crmd: [3765]: info: process_lrm_event: LRM operation fs_vd08_start_0 (call=17, rc=0, cib-update=24, confirmed=true) ok
Aug 27 12:56:41 n-oss08 kernel: Lustre: nero-OST001f: Will be in recovery for at least 2:30, or until 746 clients reconnect
-------------------------------------------------------------------------



 Comments   
Comment by Bruno Faccini (Inactive) [ 12/Sep/13 ]

Hello Matteo,
Do you have any idea what caused the original n-oss07 hang? Is there any information available about it (crash dump, stack traces, ...)?
Also, what is not clear to me: did n-oss08 fully take over before n-oss07 was rebooted and took control back?
Again, is any information (crash dump, stack traces, ...) available for the subsequent hangs/reboots?

Comment by Matteo Piccinini (Inactive) [ 13/Sep/13 ]

Hello Bruno,

just to clarify see my inline reply.

> Do you have any idea what caused the original n-oss07 hang? Is there any information available about it (crash dump, stack traces, ...)?

Unfortunately no.

> Also, what is not clear to me: did n-oss08 fully take over before n-oss07 was rebooted and took control back?

Yes, I can confirm that n-oss08 took over all the resources from n-oss07, as shown in the messages log extract below.
But when n-oss07 took control back, the first resource to migrate from n-oss08 was fs_vd07 (nero-OST001e), the one stuck in the very long recovery, and this hung the node.

---------------------------------------------------------------------------------------------------------------------------------------
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: group_print: Resource Group: oss_odd
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd01#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd03#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd05#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd07#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: group_print: Resource Group: oss_even
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd02#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd04#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd06#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch
Aug 27 11:41:20 n-oss08 pengine: [3794]: info: native_print: fs_vd08#011(ocf::heartbeat:Filesystem):#011Started n-oss08.hpc-net.ethz.ch

[...]

Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Move fs_vd01#011(Started n-oss08.hpc-net.ethz.ch -> n-oss07.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Move fs_vd03#011(Started n-oss08.hpc-net.ethz.ch -> n-oss07.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Move fs_vd05#011(Started n-oss08.hpc-net.ethz.ch -> n-oss07.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Move fs_vd07#011(Started n-oss08.hpc-net.ethz.ch -> n-oss07.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Leave fs_vd02#011(Started n-oss08.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Leave fs_vd04#011(Started n-oss08.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Leave fs_vd06#011(Started n-oss08.hpc-net.ethz.ch)
Aug 27 11:41:20 n-oss08 pengine: [3794]: notice: LogActions: Leave fs_vd08#011(Started n-oss08.hpc-net.ethz.ch)

[...]

Aug 27 11:41:21 n-oss08 crmd: [3795]: info: te_rsc_command: Initiating action 26: stop fs_vd07_stop_0 on n-oss08.hpc-net.ethz.ch (local)

[...]

Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Running stop for /dev/mapper/vd07 on /lustre/vd07
Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Trying to unmount /lustre/vd07
Aug 27 11:41:21 n-oss08 kernel: Lustre: Failing over nero-OST001e
Aug 27 11:41:21 n-oss08 kernel: LustreError: 18866:0:(ldlm_lib.c:1978:target_stop_recovery_thread()) nero-OST001e: Aborting recovery
Aug 27 11:41:40 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)
Aug 27 11:41:40 n-oss08 kernel: LustreError: 4278:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (19) req@ffff8805a769d800 x1443813942539660/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377596600 ref 1 fl Interpret:/0/ffffffff rc -19/-1
-----------------------------------------------------------------------------------------------------------------------------------------

> Again, is any information (crash dump, stack traces, ...) available for the subsequent hangs/reboots?

Again, unfortunately, no crash dumps are available.

Comment by Bruno Faccini (Inactive) [ 01/Oct/13 ]

I can't find the reason for such a long recovery, but it may be due to some issue on the client side.
BTW, your current configuration with 2.4 clients and 2.2 servers is neither supported nor tested, so we may be facing an interoperability issue there.

Comment by Bruno Faccini (Inactive) [ 15/Oct/13 ]

Hello Matteo,
Have you been able to find any information/messages in the client logs/consoles?
On the other hand, if the issue has not reproduced, can we at least downgrade this ticket's priority?

Comment by Matteo Piccinini (Inactive) [ 16/Oct/13 ]

Hello Bruno,

We could not find any other information on the system, and fortunately the issue has not reproduced.
We can close the ticket.
Thanks.

Comment by Bruno Faccini (Inactive) [ 16/Oct/13 ]

Thanks for the update, Matteo. Closing the ticket as not reproducible.

Generated at Sat Feb 10 01:38:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.