[LU-1135] connection between MDS and OSS constantly being dropped and reestablished. Created: 24/Feb/12  Updated: 08/Mar/12  Resolved: 08/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Oleg Drokin
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre 2.1.56 servers with Lustre 2.1.56 clients on a Cray system.


Attachments: File dmesg-mds     File dmesg-mgs     File dmesg-oss1     File dmesg-oss2     File dmesg-oss3     File dmesg-oss4    
Severity: 3
Rank (Obsolete): 6448

 Description   

After an IOR job was launched on our client nodes, if one attempts to go into the directory where the many files are being created, the OSTs start going into recovery every few minutes.
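
A hypothetical reproduction along these lines (the exact IOR invocation, process count, and mount point below are assumptions, not taken from this report) is a file-per-process IOR write while another client lists the target directory:

# assumed example only; the real job used site-specific parameters
mpirun -np 64 ior -a POSIX -w -F -b 1g -t 1m -o /mnt/lustre/ior_test/file
# on a separate client node, walk the directory while the writes are in flight
ls -l /mnt/lustre/ior_test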



 Comments   
Comment by James A Simmons [ 24/Feb/12 ]

I also placed the debug logs from the servers at the ftp site in uploads/LU-1135/logs.tar

Comment by Ian Colle (Inactive) [ 24/Feb/12 ]

This must be fixed prior to the IR test at ORNL.

Comment by Peter Jones [ 24/Feb/12 ]

Oleg

Could you please look into this one as your top priority?

Thanks

Peter

Comment by James A Simmons [ 24/Feb/12 ]

Okay, I've been busy bisecting and I think I found the source of the problem. It was commit 0204171fd3e1b393c53bd374aff228e80080a55a from LU-1028. Running some final tests to determine if this is the cause.
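
For context, a minimal sketch of that kind of bisect against the Lustre git tree (the good/bad revisions and the rebuild/test steps are placeholders, not the exact commands used here):

# sketch only; revisions and the test procedure are placeholders
git bisect start
git bisect bad HEAD            # build that shows the reconnect storm
git bisect good v2_1_55        # placeholder for the last known-good tag
# at each step: rebuild, reinstall on the servers, rerun the IOR test,
# then mark the result until the offending commit is reported
git bisect good    # or: git bisect bad
git bisect reset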

Comment by James A Simmons [ 24/Feb/12 ]

Uploading more logs to uploads/LU-1135/bug.tbz

Comment by James A Simmons [ 02/Mar/12 ]

Doing some more testing, I discovered that the problem went away when I upgraded to OFED 1.5.4 on RHEL5.7. It is unknown at this point whether this is a Lustre bug or an OFED bug. Will investigate with an image with an older OFED.

Comment by James A Simmons [ 07/Mar/12 ]

Testing with the RHEL6 image with the default OFED stack shows the same problem.

On the OSS

Lustre: 4473:0:(ldlm_lib.c:634:target_handle_reconnect()) lustre-OST0018: lustre-MDT0000-mdtlov_UUID reconnecting
Lustre: 4473:0:(filter.c:2692:filter_connect_internal()) lustre-OST0018: Received MDS connection for group 0
Lustre: lustre-OST0018: received MDS connection from 10.37.248.61@o2ib1
Lustre: 4473:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880432273000 - ffff8803f7954800
Lustre: 4473:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880432273000 - ffff8803f7954800
Lustre: 4473:0:(filter.c:2548:filter_llog_connect()) lustre-OST0018: Recovery from log 0x3d7c29a/0x0:a1d23d49

And on the MDS

LustreError: 10220:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc 107 from async create: failing oscc req@ffff880391376800 x1395817301711434/t0(0) o5->lustre-OST0001-osc-MDT0000@10.37.248.63@o2ib1:7/4 lens 400/400 e 0 to 1 dl 1331144444 ref 1 fl Interpret:RXN/0/ffffffff rc -107/-1
Lustre: 10220:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1331144519/real 1331144519] req@ffff8803f9d06000 x1395817301712253/t0(0) o400->lustre-OST0005-osc-MDT0000@10.37.248.63@o2ib1:28/4 lens 192/192 e 0 to 1 dl 1331144526 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Lustre: lustre-OST0005-osc-MDT0000: Connection to service lustre-OST0005 via nid 10.37.248.63@o2ib1 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc-MDT0000: Connection restored to service lustre-OST0005 using nid 10.37.248.63@o2ib1.
LustreError: 18947:0:(lov_obd.c:1069:lov_clear_orphans()) error in orphan recovery on OST idx 1/28: rc = -5
LustreError: 18947:0:(mds_lov.c:882:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -5
LustreError: 18947:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -5, deactivating

Oleg has an idea that it's a race condition in the ptlrpc layer. I observed that on the RHEL5 distro with OFED 1.5.4 the problem was reduced.

Comment by James A Simmons [ 07/Mar/12 ]

This error also happened on the MDS

2012-03-07 13:32:17 Lustre: 10224:0:(import.c:525:import_select_connection()) lustre-OST0002-osc-MDT0000: tried all connections, increasing latency to 10s
2012-03-07 13:32:17 Lustre: 10224:0:(import.c:525:import_select_connection()) Skipped 2 previous similar messages
2012-03-07 13:33:32 LustreError: 10221:0:(lov_request.c:558:lov_update_create_set()) error creating fid 0x4c sub-object on OST idx 2/25: rc = -11
2012-03-07 13:33:32 LustreError: 10221:0:(lov_request.c:558:lov_update_create_set()) Skipped 15 previous similar messages
2012-03-07 13:33:32 LustreError: 10218:0:(lov_request.c:558:lov_update_create_set()) error creating fid 0x4c sub-object on OST idx 2/25: rc = -11
2012-03-07 13:33:33 LustreError: 10216:0:(lov_request.c:558:lov_update_create_set()) error creating fid 0x4c sub-object on OST idx 26/25: rc = -11
2012-03-07 13:33:33 LustreError: 10216:0:(lov_request.c:558:lov_update_create_set()) Skipped 53 previous similar messages
2012-03-07 13:33:33 Lustre: 11252:0:(lov_qos.c:472:qos_shrink_lsm()) using fewer stripes for object 76: old 25 new 19

Comment by James A Simmons [ 08/Mar/12 ]

The reason for the errors when using the RHEL6 image was that the file system had not been rebuilt. Previously I had built the file system using the RHEL5 image, and after moving to a RHEL6 image the problem was still present. I attempted to test IR but it toppled my clients, so the next time I reformatted the file system. After the reformat all the problems went away.

Comment by Peter Jones [ 08/Mar/12 ]

OK, thanks for letting us know, James.

Comment by James A Simmons [ 08/Mar/12 ]

Just as a note: anyone migrating from a RHEL5 environment to RHEL6 with pre-2.2 Lustre should reformat their file system before use.
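
A minimal sketch of such a reformat (fsname, indexes, device paths, and the MGS NID below are placeholders, not this system's actual values) uses mkfs.lustre --reformat on every target:

# placeholders only; adjust fsname, index, MGS NID, and devices per site
mkfs.lustre --reformat --fsname=lustre --mgs --mdt --index=0 /dev/sdb   # on the MDS/MGS
mkfs.lustre --reformat --fsname=lustre --ost --index=0 --mgsnode=10.37.248.61@o2ib1 /dev/sdc   # on each OSS, once per OST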
