[LU-1135] connection between MDS and OSS constantly being dropped and reestablished. Created: 24/Feb/12 Updated: 08/Mar/12 Resolved: 08/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Oleg Drokin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 2.1.56 servers with Lustre 2.1.56 clients on a Cray system. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6448 |
| Description |
|
After an IOR job was launched on our client nodes, if one attempted to go into the directory where the many files were being created, the OSTs would start going into recovery every few minutes. |
| Comments |
| Comment by James A Simmons [ 24/Feb/12 ] |
|
I also placed the debug logs from the servers at the ftp site in uploads/ |
| Comment by Ian Colle (Inactive) [ 24/Feb/12 ] |
|
This must be fixed prior to the IR test at ORNL. |
| Comment by Peter Jones [ 24/Feb/12 ] |
|
Oleg, could you please look into this one as your top priority? Thanks, Peter |
| Comment by James A Simmons [ 24/Feb/12 ] |
|
Okay, been busy bisecting and I think I found the source of the problem. It was commit 0204171fd3e1b393c53bd374aff228e80080a55a from |
| Comment by James A Simmons [ 24/Feb/12 ] |
|
Uploading more logs to uploads/ |
| Comment by James A Simmons [ 02/Mar/12 ] |
|
Doing some more testing, I discovered that the problem went away when I upgraded to OFED 1.5.4 on RHEL5.7. It is unknown at this point whether this is a Lustre bug or an OFED bug. Will investigate with an image with an older OFED. |
| Comment by James A Simmons [ 07/Mar/12 ] |
|
Testing with the RHEL6 image with the default OFED stack shows the same problem. On the OSS: Lustre: 4473:0:(ldlm_lib.c:634:target_handle_reconnect()) lustre-OST0018: lustre-MDT0000-mdtlov_UUID reconnecting. And on the MDS. Oleg has an idea that it's a race condition in the ptlrpc layer. I observed that on the RHEL5 distro with OFED 1.5.4 the problem was reduced. |
| Comment by James A Simmons [ 07/Mar/12 ] |
|
This error also happened on the MDS: 2012-03-07 13:32:17 Lustre: 10224:0:(import.c:525:import_select_connection()) lustre-OST0002-osc-MDT0000: tried all connections, increasing latency to 10s |
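For anyone wanting to gauge how often a target goes through this reconnect cycle, the following is a minimal sketch (not part of Lustre) that counts the OSS-side "reconnecting" messages per target and peer. It assumes console log lines shaped like the two quoted in this ticket; the function name `count_reconnects`, the regular expression, and the `SAMPLE` list are illustrative assumptions, and other Lustre versions may format these messages differently.

```python
import re
from collections import Counter

# Assumed pattern, modelled only on the OSS message quoted in this ticket:
# "...target_handle_reconnect()) <target>: <peer> reconnecting"
RECONNECT_RE = re.compile(
    r"target_handle_reconnect\(\)\)\s+(?P<target>\S+):\s+(?P<peer>\S+)\s+reconnecting"
)


def count_reconnects(lines):
    """Count reconnect messages, keyed by (target, peer)."""
    counts = Counter()
    for line in lines:
        match = RECONNECT_RE.search(line)
        if match:
            counts[(match.group("target"), match.group("peer"))] += 1
    return counts


# The two log lines quoted in the comments above; only the first is a reconnect.
SAMPLE = [
    "Lustre: 4473:0:(ldlm_lib.c:634:target_handle_reconnect()) "
    "lustre-OST0018: lustre-MDT0000-mdtlov_UUID reconnecting",
    "2012-03-07 13:32:17 Lustre: 10224:0:(import.c:525:import_select_connection()) "
    "lustre-OST0002-osc-MDT0000: tried all connections, increasing latency to 10s",
]

if __name__ == "__main__":
    for (target, peer), n in count_reconnects(SAMPLE).items():
        print(f"{target} <- {peer}: {n} reconnect(s)")
```

In practice the list would be replaced by lines read from the OSS console log; a count that keeps climbing every few minutes for the same target would match the behaviour described in this ticket.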
| Comment by James A Simmons [ 08/Mar/12 ] |
|
The errors seen with the RHEL6 image were due to the file system not being rebuilt. Previously I had built the file system using the RHEL5 image. After moving to a RHEL6 image the problem was still present. I attempted to test IR but it toppled my client, so the next time I reformatted the file system. After the reformat all the problems went away. |
| Comment by Peter Jones [ 08/Mar/12 ] |
|
ok thanks for letting us know James. |
| Comment by James A Simmons [ 08/Mar/12 ] |
|
Just as a note: anyone migrating from a RHEL5 environment to RHEL6 with Lustre pre-2.2 should reformat their file system before use.