[LU-7520] OSTs are not available to client Created: 06/Dec/15  Updated: 08/Dec/15  Resolved: 08/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Haisong Cai (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-2.5.3.90-2.6.32_431.29.2.el6_lustre.gb8d9077.x86_64_gb8d9077.x86_64
kernel-2.6.32-431.29.2.el6_lustre.gb8d9077.x86_64


Attachments: File debug_kernel.123     Text File debug_kernel.20288.gz     File dmesg.4251    
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

After an OS failure and reinstall, 3 out of 4 OSTs are unavailable to clients.
We are getting various errors; dmesg is attached.



 Comments   
Comment by Haisong Cai (Inactive) [ 06/Dec/15 ]

What I find strange are the lines below, logged when I first tried to mount the OST. The OST was never configured for HA.
Why would the client think it was?

LustreError: 137-5: monkey-OST0016_UUID: not available for connect from 10.7.101.42@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 137-5: monkey-OST0016_UUID: not available for connect from 10.7.103.215@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 19 previous similar messages
LustreError: 137-5: monkey-OST0016_UUID: not available for connect from 10.7.100.88@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 8 previous similar messages
LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 132.249.107.14@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: 137-5: monkey-OST0016_UUID: not available for connect from 132.249.107.14@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.

Comment by Haisong Cai (Inactive) [ 06/Dec/15 ]

Client:

LustreError: 11-0: monkey-OST0016-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 68 previous similar messages
LustreError: 11-0: monkey-OST0036-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 72 previous similar messages
LustreError: 11-0: monkey-OST0016-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 73 previous similar messages
LustreError: 11-0: monkey-OST0076-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 73 previous similar messages
LustreError: 11-0: monkey-OST0016-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 72 previous similar messages
LustreError: 11-0: monkey-OST0036-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 72 previous similar messages
LustreError: 11-0: monkey-OST0016-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 73 previous similar messages
LustreError: 11-0: monkey-OST0076-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 73 previous similar messages
LustreError: 11-0: monkey-OST0016-osc-ffff880339b5d400: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
LustreError: Skipped 72 previous similar messages

Comment by Haisong Cai (Inactive) [ 06/Dec/15 ]

I saw this in dmesg. How do I clear it?

...
LustreError: 2853:0:(obd_mount_server.c:1120:server_register_target()) monkey-OST0016: error registering with the MGS: rc = -5 (not fatal)

...

Comment by Haisong Cai (Inactive) [ 06/Dec/15 ]

Further testing shows that the MGS registration error above occurs when the first OST is mounted on the OSS:

Dec 5 21:58:35 monkey-oss-16-1 kernel: LNet: HW CPU cores: 16, npartitions: 4
Dec 5 21:58:35 monkey-oss-16-1 kernel: alg: No test for crc32 (crc32-table)
Dec 5 21:58:35 monkey-oss-16-1 kernel: alg: No test for adler32 (adler32-zlib)
Dec 5 21:58:36 monkey-mds-10-4 kernel: Lustre: 3644:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1449381480/real 1449381480] req@ffff88031983b000 x1504076313551084/t0(0) o8->monkey-OST0076-osc@172.25.32.234@tcp:28/4 lens 400/544 e 0 to 1 dl 1449381516 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 5 21:58:36 monkey-mds-10-4 kernel: Lustre: 3644:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Dec 5 21:58:43 monkey-oss-16-1 kernel: Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-31844-gb8d9077-PRISTINE-2.6.32-431.29.2.el6_lustre.gb8d9077.x86_64
Dec 5 21:58:43 monkey-oss-16-1 kernel: LNet: Added LNI 172.25.32.234@tcp [8/256/0/180]
Dec 5 21:58:43 monkey-oss-16-1 kernel: LNet: Accept secure, port 988
Dec 5 22:00:28 monkey-oss-1-1 kernel: Lustre: 4566:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1449381582/real 1449381582] req@ffff8806056a7000 x1504076377940368/t0(0) o8->monkey-OST0036-osc-ffff880339b5d400@172.25.32.234@tcp:28/4 lens 400/544 e 0 to 1 dl 1449381628 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 5 22:00:28 monkey-oss-1-1 kernel: Lustre: 4566:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Dec 5 22:00:36 monkey-oss-16-1 kernel: LDISKFS-fs (sde): mounted filesystem with ordered data mode. quota=off. Opts:
Dec 5 22:00:36 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0056_UUID: not available for connect from 10.7.100.219@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:36 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 10.7.103.221@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:36 monkey-oss-16-1 kernel: LustreError: Skipped 104 previous similar messages
Dec 5 22:00:37 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 132.249.107.85@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:37 monkey-oss-16-1 kernel: LustreError: Skipped 213 previous similar messages
Dec 5 22:00:39 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0016_UUID: not available for connect from 10.7.101.213@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:39 monkey-oss-16-1 kernel: LustreError: Skipped 321 previous similar messages
Dec 5 22:00:41 monkey-oss-16-1 kernel: Lustre: 10081:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1449381636/real 1449381636] req@ffff8805f0cacc00 x1519786679861252/t0(0) o250->MGC172.25.32.253@tcp@172.25.32.253@tcp:26/25 lens 400/544 e 0 to 1 dl 1449381641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 5 22:00:43 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 10.7.102.72@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:43 monkey-oss-16-1 kernel: LustreError: Skipped 695 previous similar messages
Dec 5 22:00:47 monkey-oss-16-1 kernel: LustreError: 10124:0:(client.c:1096:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8805f0cac800 x1519786679861256/t0(0) o253->MGC172.25.32.253@tcp@172.25.32.253@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Dec 5 22:00:47 monkey-oss-16-1 kernel: LustreError: 10124:0:(obd_mount_server.c:1120:server_register_target()) monkey-OST0056: error registering with the MGS: rc = -5 (not fatal)
Dec 5 22:00:52 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 10.7.101.159@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:00:52 monkey-oss-16-1 kernel: LustreError: Skipped 393 previous similar messages
Dec 5 22:00:53 monkey-oss-16-1 kernel: LustreError: 10124:0:(client.c:1096:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8805f0cac800 x1519786679861260/t0(0) o101->MGC172.25.32.253@tcp@172.25.32.253@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Dec 5 22:00:59 monkey-oss-16-1 kernel: LustreError: 10124:0:(client.c:1096:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8805f0cac800 x1519786679861264/t0(0) o101->MGC172.25.32.253@tcp@172.25.32.253@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Dec 5 22:00:59 monkey-oss-16-1 kernel: Lustre: 10218:0:(ofd_dev.c:255:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
Dec 5 22:01:01 monkey-oss-16-1 kernel: Lustre: monkey-OST0056: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
Dec 5 22:01:01 monkey-oss-16-1 kernel: Lustre: monkey-OST0056: Will be in recovery for at least 2:30, or until 1210 clients reconnect
Dec 5 22:01:06 monkey-oss-16-1 kernel: Lustre: 10081:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1449381661/real 1449381661] req@ffff8805eaed6c00 x1519786679861372/t0(0) o38->monkey-MDT0000-lwp-OST0056@172.25.32.253@tcp:12/10 lens 400/544 e 0 to 1 dl 1449381666 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 5 22:01:07 oasis-monkey sshd[27267]: Set /proc/self/oom_score_adj to 0
Dec 5 22:01:07 oasis-monkey sshd[27267]: Connection from 192.31.21.156 port 59248
Dec 5 22:01:07 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381667.986:112163): user pid=27268 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=e0:d7:8e:6c:f6:a4:83:fd:33:cd:ec:c3:fb:f3:1c:b3 direction=? spid=27268 suid=0 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:07 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381667.986:112164): user pid=27268 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=67:7b:43:dc:e9:d8:b5:30:6e:b5:93:7d:97:ac:94:50 direction=? spid=27268 suid=0 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:08 oasis-monkey sshd[27268]: Postponed keyboard-interactive for cai from 192.31.21.156 port 59248 ssh2
Dec 5 22:01:08 monkey-oss-16-1 kernel: LustreError: 137-5: monkey-OST0036_UUID: not available for connect from 10.7.101.144@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 5 22:01:18 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381678.853:112165): user pid=27267 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=session fp=? direction=both spid=27268 suid=74 rport=59248 laddr=192.168.111.6 lport=22 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:18 oasis-monkey sshd[27267]: pam_unix(sshd:session): session opened for user cai by (uid=0)
Dec 5 22:01:18 oasis-monkey sshd[27267]: User child is on pid 27278
Dec 5 22:01:18 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381678.866:112166): user pid=27267 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=session fp=? direction=both spid=27267 suid=0 rport=59248 laddr=192.168.111.6 lport=22 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:18 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381678.867:112167): user pid=27278 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=e0:d7:8e:6c:f6:a4:83:fd:33:cd:ec:c3:fb:f3:1c:b3 direction=? spid=27278 suid=0 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:18 oasis-monkey audispd: node=oasis-monkey.sdsc.edu type=CRYPTO_KEY_USER msg=audit(1449381678.868:112168): user pid=27278 uid=0 auid=4294967295 ses=4294967295 msg='op=destroy kind=server fp=67:7b:43:dc:e9:d8:b5:30:6e:b5:93:7d:97:ac:94:50 direction=? spid=27278 suid=0 exe="/usr/sbin/sshd" hostname=? addr=192.31.21.156 terminal=? res=success'
Dec 5 22:01:21 monkey-mds-10-4 kernel: Lustre: 3644:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1449381630/real 1449381630] req@ffff88031b0af400 x1504076313567188/t0(0) o8->monkey-OST0076-osc@172.25.32.234@tcp:28/4 lens 400/544 e 0 to 1 dl 1449381681 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 5 22:01:21 monkey-mds-10-4 kernel: Lustre: 3644:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 9 previous similar messages

Comment by Peter Jones [ 06/Dec/15 ]

Bobijam

Could you please assist with this issue?

Thanks

Peter

Comment by Oleg Drokin [ 06/Dec/15 ]

When you get errors about MGS registration, are there any problems reported on the MGS itself?

The failover pair error can arise even if you do not have failover configured. It simply means that this server received a request for a service that is not started there. That could be due to a failure to register with the MGS (and therefore a server mount failure), or because the OST had not been started yet (also visible in your logs, e.g. for OST0036), and similar causes.

The error -16 means that a client tried to reconnect to a server, but the server is already handling a request from this client. It's not clear from the logs what the reason for this one might be; possibly the recovery is just taking too long.
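
For reference, the numeric codes in these messages are negated Linux errno values, so they can be decoded mechanically. The following is a minimal sketch, assuming it is run with Python on a Linux host (the helper name describe_rc is hypothetical, and the message text returned by os.strerror varies by platform):

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Return 'NAME (message)' for a negative kernel-style return code.

    Assumption: rc is a negated errno value, as printed in the
    'failed with -16' and 'rc = -5' log lines above.
    """
    code = abs(rc)
    # errno.errorcode maps the numeric value to its symbolic name.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


# Decode the two codes seen in the logs so far.
for rc in (-5, -16):
    print(rc, describe_rc(rc))
```

On Linux, -16 decodes to EBUSY, which matches the description above, and -5 decodes to EIO.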

I see you aborted recovery later on, so after that do you only see problems related to OSTs that failed to start?

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

Oleg,

1) Here are the errors on MDS/MGS:

Dec 5 21:09:59 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 21:09:59 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 21:20:24 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 21:20:24 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 21:30:49 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 21:30:49 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 21:41:14 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.
Dec 5 21:41:14 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 21:51:45 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.
Dec 5 21:51:45 monkey-mds-10-4 kernel: LustreError: Skipped 62 previous similar messages
Dec 5 22:02:10 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.
Dec 5 22:02:10 monkey-mds-10-4 kernel: LustreError: Skipped 49 previous similar messages
Dec 5 22:12:35 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.
Dec 5 22:12:35 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 22:22:49 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 22:22:49 monkey-mds-10-4 kernel: LustreError: Skipped 63 previous similar messages
Dec 5 22:33:14 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 22:33:14 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 22:43:39 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 22:43:39 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 22:54:04 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 22:54:04 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:04:29 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:04:29 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:14:54 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:14:54 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:25:19 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:25:19 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:35:44 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:35:44 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:46:09 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:46:09 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 5 23:56:09 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 5 23:56:09 monkey-mds-10-4 kernel: LustreError: Skipped 71 previous similar messages
Dec 6 00:06:34 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0036-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 6 00:06:34 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 6 00:16:59 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 6 00:16:59 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 6 00:27:24 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 6 00:27:24 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages
Dec 6 00:37:49 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.
Dec 6 00:37:49 monkey-mds-10-4 kernel: LustreError: Skipped 74 previous similar messages

2) The OSTs always start on the server side. It is the clients that can't connect to them.
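
The MDS log above switches between -16 and -19 over time, which is easier to see in a tally. The following is a rough sketch for counting ost_connect failures per target and return code; the regular expression and the function name tally_failures are assumptions based only on the line format shown in this ticket, not an official log parser:

```python
import re
from collections import Counter

# Matches lines such as:
#   LustreError: 11-0: monkey-OST0016-osc: Communicating with ..., operation ost_connect failed with -16.
# The optional suffix after "-osc" covers the client form (e.g. -osc-ffff880339b5d400).
CONNECT_FAIL = re.compile(
    r"LustreError: 11-0: (?P<target>\S+?)-osc\S*: .*"
    r"operation (?P<op>\w+) failed with (?P<rc>-\d+)"
)


def tally_failures(lines):
    """Count logged failures keyed by (target, return code).

    Note: 'Skipped N previous similar messages' lines are ignored,
    so the counts reflect logged lines only, not every failed attempt.
    """
    counts = Counter()
    for line in lines:
        match = CONNECT_FAIL.search(line)
        if match:
            counts[(match.group("target"), int(match.group("rc")))] += 1
    return counts


# Three lines copied from the MDS log above.
sample = [
    "Dec 5 21:30:49 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -16.",
    "Dec 5 21:41:14 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0016-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.",
    "Dec 5 22:02:10 monkey-mds-10-4 kernel: LustreError: 11-0: monkey-OST0076-osc: Communicating with 172.25.32.234@tcp, operation ost_connect failed with -19.",
]
for (target, rc), n in sorted(tally_failures(sample).items()):
    print(target, rc, n)
```

Feeding the full syslog through tally_failures would show which of the targets dominate each error code.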

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

How long does a typical recovery take? The filesystem has about 1200 clients. The OSTs have been mounted on the server side since 22:00 last night. After about 18 hours, clients still can't access the OSTs.

I also want to point out that the OSS hosts 4 OSTs. 3 are inaccessible from the client side (via "df", "lfs df") and 1 is accessible.
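
For comparison, the OSS log earlier in this ticket reports a recovery window of 150-450 for monkey-OST0056. The following is a small sketch that extracts that window and compares it with the 18 hours elapsed. It assumes the values are seconds (consistent with the "at least 2:30" in the next log line) and that the other OSTs use the same window; the function name recovery_window is hypothetical:

```python
import re

# Pattern taken from the "Imperative Recovery enabled" log line in this ticket.
WINDOW = re.compile(
    r"recovery window shrunk from (\d+)-(\d+) down to (\d+)-(\d+)"
)


def recovery_window(line):
    """Return the (min, max) recovery window after shrinking, or None."""
    match = WINDOW.search(line)
    if match is None:
        return None
    return int(match.group(3)), int(match.group(4))


line = (
    "Lustre: monkey-OST0056: Imperative Recovery enabled, "
    "recovery window shrunk from 300-900 down to 150-450"
)
low, high = recovery_window(line)

# 18 hours of waiting, expressed in seconds (assumed unit of the window).
elapsed = 18 * 3600
print(f"window {low}-{high}s, elapsed {elapsed}s, ratio {elapsed / high:.0f}x")
```

Under these assumptions, the elapsed time is far beyond the upper bound of the logged window, which suggests the recovery was stuck rather than slow.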

Comment by Zhenyu Xu [ 07/Dec/15 ]

Can you check the MGS logs (a debug log would be more useful) to see what caused the -5 error when OST16/36/76 tried to register with the MGS? It seems that OST16/36/76 did not successfully register with the MGS as available devices.

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

Zhenyu,

Do you mean to run "debug_kernel > /tmp/log" on MDS/MGS server?

Haisong

Comment by Zhenyu Xu [ 07/Dec/15 ]

The debug message buffer could have been recycled. You can mount one OST again (such as OST16) and collect the debug log from the MGS (lctl dk, or similar).

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

Zhenyu,

I uploaded 2 files, both from the MGS server:

debug_kernel.20228.gz was taken before I unmounted OST0016
debug_kernel.123 was taken after I unmounted and then remounted OST0016

Comment by Zhenyu Xu [ 07/Dec/15 ]

OST16/36/76 cannot recover successfully. You can mount them with "-o abort_recov" to abort the recovery process.

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

Zhenyu,

Mounting with abort_recov has made the OSTs accessible to the clients.
We are running sanity checks.
I will update this ticket tomorrow morning with the results.

Thank you and Oleg very much for your help,
Haisong

Comment by Haisong Cai (Inactive) [ 07/Dec/15 ]

Zhenyu,

The filesystem is operating normally. You may close this ticket.

Thanks again for all the help,
Haisong

Comment by John Fuchs-Chesney (Inactive) [ 08/Dec/15 ]

Thank you Haisong.
~ jfc.
