[LU-187] umount hang when running conf-sanity test 29 Created: 02/Apr/11 Updated: 30/May/13 Resolved: 01/Jun/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Jian Yu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8477 |
| Description |
|
Unmounting the MDS hangs when running conf-sanity test 29. On the MDS: INFO: task umount:24781 blocked for more than 120 seconds. |
| Comments |
| Comment by Peter Jones [ 02/Apr/11 ] |
|
Yu Jian, can you please try to understand the cause of this hang? Thanks, Peter |
| Comment by Jian Yu [ 20/Apr/11 ] |
|
From the syslog of the MDS node, we can see:
Lustre: Permanently reactivating lustre-OST0001
Lustre: lustre-OST0001-osc-MDT0000: Connection to service lustre-OST0001 via nid 192.168.4.129@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
Lustre: lustre-OST0001-osc-MDT0000: Connection restored to service lustre-OST0001 using nid 192.168.4.129@o2ib.
LustreError: 11-0: an error occurred while communicating with 192.168.4.129@o2ib. The ost_set_info operation failed with -107
After lustre-OST0001 was reactivated, communication between the MDS and the OSS failed with -107 (-ENOTCONN, Transport endpoint is not connected).
Sarah, do you have the dmesg/syslog of the OSS node? I could not reproduce this issue on the Toro cluster, so I could not determine what happened on the OSS node. |
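For readers decoding return codes such as the -107 in the log above, here is a minimal sketch in Python. The helper name `describe_errno` is hypothetical and not part of Lustre. The numeric-to-name mapping comes from the host's errno table, so 107 resolves to ENOTCONN on Linux but may differ on other platforms.

```python
import errno
import os


def describe_errno(rc: int) -> str:
    """Return a readable description of a kernel-style negative return code.

    Hypothetical helper for illustration only. The lookup uses the errno
    table of the host running this script, which matches the Linux values
    seen in Lustre logs only when run on Linux.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The value reported for the failed ost_set_info operation.
    print(describe_errno(-107))
```

Run on the MDS or any other Linux host, this gives a quick way to check what a numeric code in a LustreError line means.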
| Comment by Sarah Liu [ 20/Apr/11 ] |
|
Sorry, I don't have those logs. I will try to reproduce this bug and gather the logs. |
| Comment by Jian Yu [ 01/Jun/11 ] |
|
The issue has not been reproduced for two months. Let's close the ticket. If you hit the issue again, please feel free to reopen it. |
| Comment by Colin Faber [X] (Inactive) [ 22/Jun/11 ] |
|
Hi, I'm able to reproduce this pretty reliably: INFO: task touch:22268 blocked for more than 120 seconds. |
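The hung-task warnings quoted in this ticket share one format, so they can be pulled out of a syslog or dmesg capture mechanically. The following is a rough sketch in Python. The names `HUNG_TASK_RE` and `find_hung_tasks` are hypothetical, and the pattern assumes the message wording shown in this ticket and a task name without spaces.

```python
import re

# Assumed format, taken from the messages quoted in this ticket:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text: str) -> list[tuple[str, int, int]]:
    """Return (task name, pid, timeout in seconds) for each hung-task warning."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "INFO: task umount:24781 blocked for more than 120 seconds.\n"
        "INFO: task touch:22268 blocked for more than 120 seconds.\n"
    )
    for comm, pid, secs in find_hung_tasks(sample):
        print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Listing the blocked tasks this way makes it easier to see whether separate reports involve the same process (umount in the original report, touch here) before comparing stack traces.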
| Comment by Stephen Champion [ 30/May/13 ] |
|
I tripped over this while running acceptance on a SLES11SP1 / Lustre 2.1.5 client:
crash> bt 24884
I doubt I'll have time to spare chasing it down, but I can make the core and logs available to anyone who wants to take a look. |