[LU-2460] Failed to mount ost: Transport endpoint is not connected Created: 10/Dec/12  Updated: 21/Dec/12  Resolved: 21/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Chris Gearing (Inactive)
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 5804

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/3b49fb7c-4241-11e2-adcf-52540035b04c.

Transport endpoint is not connected



 Comments   
Comment by Sarah Liu [ 13/Dec/12 ]

I resubmitted the test and the error is gone.
https://maloo.whamcloud.com/test_sessions/94231f70-44a4-11e2-8c8b-52540035b04c

Comment by Peter Jones [ 13/Dec/12 ]

Chris

Any idea why this occurred?

Peter

Comment by Frank Heckes (Inactive) [ 14/Dec/12 ]

The last entries in the Maloo log files indicate that the MDT on client-20 was created and mounted,
but mounting the first OST created on client-21 failed. The error listed indicates a connection problem:

---- snip ----
...
11:41:20:Started lustre-MDT0000
11:41:20:CMD: client-21-ib mkdir -p /mnt/ost1
11:41:20:CMD: client-21-ib test -b /dev/lvm-OSS/P1
11:41:21:Starting ost1: /dev/lvm-OSS/P1 /mnt/ost1
11:41:21:CMD: client-21-ib mkdir -p /mnt/ost1; mount -t lustre /dev/lvm-OSS/P1 /mnt/ost1
11:41:42:client-21-ib: mount.lustre: mount /dev/mapper/lvm--OSS-P1 at /mnt/ost1 failed: Transport endpoint is not connected
...
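A minimal sketch for pulling such mount failures out of a saved test log, assuming the log lines follow the shape of the one failure line quoted above. The pattern and the function name find_mount_failures are illustrative only and are not part of the Lustre test framework:

```python
import re

# Pattern modelled on the single failure line quoted in this ticket;
# other Maloo log lines may well be formatted differently.
MOUNT_FAILURE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<node>\S+): mount\.lustre: "
    r"mount (?P<device>\S+) at (?P<mountpoint>\S+) failed: (?P<reason>.+)$"
)


def find_mount_failures(lines):
    """Return one dict per mount.lustre failure line found in `lines`."""
    failures = []
    for line in lines:
        match = MOUNT_FAILURE.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


# Lines copied from the log excerpt above.
sample = [
    "11:41:20:Started lustre-MDT0000",
    "11:41:21:Starting ost1: /dev/lvm-OSS/P1 /mnt/ost1",
    "11:41:42:client-21-ib: mount.lustre: mount /dev/mapper/lvm--OSS-P1 "
    "at /mnt/ost1 failed: Transport endpoint is not connected",
]

for failure in find_mount_failures(sample):
    print(failure["time"], failure["node"], failure["mountpoint"], failure["reason"])
```

On the excerpt this reports only the ost1 mount on client-21-ib, which matches the reading that the MDT came up and the first OST did not.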

This could indicate a problem with IB, LNET, or a missing module, but it might also be a problem with the storage backend.
Unfortunately, the message files of the nodes client-{20,21} are gone, since the nodes were re-installed by 12.12.
I will check with Chris on Monday whether the message files can be recovered.

The other track is the IB health.

The opensmd might have been 'moved' to client-6; this might be the reason:

root 6727 1 11 Dec11 ? 06:49:20 /usr/sbin/opensm -B -F /etc/rdma/opensm.conf.[0-9]*
[root@client-6 ~]# uptime
09:39:51 up 2 days, 11:35, 1 user, load average: 2.01, 2.01, 1.98
[root@client-6 ~]# date
Fri Dec 14 09:39:55 PST 2012

Therefore no opensm.log entries are available for the time the tests were running (09.12.2012 ~ 11:40).
At the moment I can see many port state changes for 2 switches, which could indicate some problems.
I will contact Joshua.
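The timing argument can be checked with a little date arithmetic. This is only a sketch: it assumes the test log timestamp and the clock on client-6 are in the same time zone, and it takes the test time as the approximate 11:40 quoted above.

```python
from datetime import datetime, timedelta

# Values copied from the `uptime` output on client-6 (Fri Dec 14 2012).
now = datetime(2012, 12, 14, 9, 39, 51)
uptime = timedelta(days=2, hours=11, minutes=35)

# Approximate time of the failing test run (09.12.2012 ~ 11:40).
# Assumption: same time zone as the clock on client-6.
test_run = datetime(2012, 12, 9, 11, 40)

boot_time = now - uptime
print(boot_time)             # 2012-12-11 22:04:51
print(boot_time > test_run)  # True
```

Under these assumptions client-6 booted roughly two and a half days after the failing run, so the opensm instance shown in the process listing cannot have logged anything about it.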

I have to check this with Chris on Monday, since I don't have access to some tools on mgmt and brent.
I couldn't find the output of inspectdianet on client-6. I have to contact Joshua on Monday; I don't
want to mess something up.
This is another track for analysis...

Comment by Frank Heckes (Inactive) [ 14/Dec/12 ]

The root FS ran full on client-6. The reason is most likely the VM feature enabled for the HCA on client-9.
opensmd isn't capable of handling VM requests, which leads to tons of error messages of the form:

Dec 11 22:12:45 701384 [5A875700] 0x01 -> osm_gir_rcv_process: ERR 5105: Unsupported Method (SubnAdmSet).
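To confirm that one message dominates the log, the entries can be tallied per function and error code. A minimal sketch, assuming every error entry has the same `-> function: ERR code:` shape as the line above; tally_errors is a hypothetical helper name, not an opensm tool:

```python
import re
from collections import Counter

# Matches the "-> <function>: ERR <code>:" part of an opensm.log entry,
# modelled on the one entry quoted in this ticket.
ERR_ENTRY = re.compile(r"-> (?P<func>\w+): ERR (?P<code>[0-9A-Fa-f]+):")


def tally_errors(lines):
    """Count log entries per (function, error code) pair."""
    counts = Counter()
    for line in lines:
        match = ERR_ENTRY.search(line)
        if match:
            counts[(match.group("func"), match.group("code"))] += 1
    return counts


# The entry quoted above, repeated, plus one made-up non-error line.
entry = ("Dec 11 22:12:45 701384 [5A875700] 0x01 -> osm_gir_rcv_process: "
         "ERR 5105: Unsupported Method (SubnAdmSet).")
sample = [entry, entry, "hypothetical line without an error code"]

for (func, code), count in tally_errors(sample).most_common():
    print(func, code, count)
```

Because tally_errors only iterates over its argument, an open file object can be passed in directly, so a very large log is read line by line rather than loaded into memory.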

opensm.log size:
[root@client-6 log]# ll opensm.log
-rw-r--r-- 1 root root 19136745472 Dec 14 11:51 opensm.log
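For scale, the byte count from the listing converts as follows. The line estimate is a rough sketch that assumes every entry is as long as the ERR 5105 message quoted in this comment, which will not hold exactly:

```python
# Size in bytes, copied from the `ll opensm.log` output.
size_bytes = 19136745472

print(f"{size_bytes / 2**30:.1f} GiB")  # 17.8 GiB
print(f"{size_bytes / 10**9:.1f} GB")   # 19.1 GB

# Rough entry count, assuming each entry is as long as the quoted
# ERR 5105 line plus a trailing newline.
sample_line = ("Dec 11 22:12:45 701384 [5A875700] 0x01 -> osm_gir_rcv_process: "
               "ERR 5105: Unsupported Method (SubnAdmSet).")
approx_lines = size_bytes // (len(sample_line) + 1)
print(approx_lines)
```

A single log file of this size is enough on its own to explain a full root FS on a typical node.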

Saved a copy of opensm.log to /home/frank/logs, stopped the HCA on client-9, and cleaned up the logfile.

The subnet manager is operational. Ping tests of the ib-interface were successful, before and after the cleanup.

Comment by Chris Gearing (Inactive) [ 21/Dec/12 ]

Resolved by restarting opensm and we have a ticket in place (TT-992) to properly resolve the opensm issue.

Generated at Sat Feb 10 01:25:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.