[LU-953] OST connection lost Created: 29/Dec/11  Updated: 12/Sep/13  Resolved: 12/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-1.8.6.81
OFED1.5.3.1
NASA AMES


Attachments: File new-build.sh     File new-lustre_start.sh    
Severity: 3
Rank (Obsolete): 7037

 Description   

After upgrading to Lustre 1.8.6 and OFED 1.5.3.1 we have started to see OST<->MDT connection issues.
We have checked the IB fabric for errors and have found none.
Are there any known issues with Lustre 1.8.6 and OFED 1.5.3?

=== ERROR ON MDS ===
Dec 28 07:04:56 service100 kernel: Lustre: 6149:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1389011653751232 sent from nbp6-OST0002-osc to NID 10.151.25.157@o2ib 7s ago has timed out (7s prior to deadline).
Dec 28 07:04:56 service100 kernel: req@ffff81071b30ac00 x1389011653751232/t0 o13->nbp6-OST0002_UUID@10.151.25.157@o2ib:7/4 lens 192/528 e 0 to 1 dl 1325084696 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 28 07:04:56 service100 kernel: Lustre: 6149:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 258 previous similar messages
Dec 28 07:04:56 service100 kernel: Lustre: nbp6-OST0002-osc: Connection to service nbp6-OST0002 via nid 10.151.25.157@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 28 07:04:56 service100 kernel: Lustre: Skipped 2 previous similar messages
Dec 28 07:05:04 service100 kernel: Lustre: 6151:0:(import.c:517:import_select_connection()) nbp6-OST000a-osc: tried all connections, increasing latency to 11s
Dec 28 07:05:04 service100 kernel: Lustre: 6151:0:(import.c:517:import_select_connection()) Skipped 220 previous similar messages
Dec 28 07:05:06 service100 kernel: Lustre: nbp6-OST0042-osc: Connection restored to service nbp6-OST0042 using nid 10.151.25.157@o2ib.
Dec 28 07:05:06 service100 kernel: Lustre: Skipped 14 previous similar messages
Dec 28 07:05:06 service100 kernel: LustreError: 30626:0:(quota_ctl.c:473:lov_quota_ctl()) ost 75 is inactive
Dec 28 07:05:06 service100 kernel: LustreError: 30626:0:(quota_ctl.c:473:lov_quota_ctl()) Skipped 5 previous similar messages
Dec 28 07:05:06 service100 kernel: Lustre: MDS nbp6-MDT0000: nbp6-OST0042_UUID now active, resetting orphans
Dec 28 07:05:06 service100 kernel: Lustre: Skipped 29 previous similar messages
Dec 28 07:05:07 service100 kernel: LustreError: 30630:0:(quota_master.c:1698:qmaster_recovery_main()) nbp6-MDT0000: qmaster recovery failed for uid 11631 rc:-11)
Dec 28 07:05:07 service100 kernel: LustreError: 30630:0:(quota_master.c:1698:qmaster_recovery_main()) Skipped 52 previous similar messages



 Comments   
Comment by Peter Jones [ 29/Dec/11 ]

Lsi

Can you please comment on this issue?

Thanks

Peter

Comment by Cliff White (Inactive) [ 03/Jan/12 ]

We need more information. Can you please attach the MDS system log for the 5 hours prior to this event, and the system logs from the node providing nid 10.151.25.157@o2ib for the same time period?

Comment by James A Simmons [ 23/Mar/12 ]

I also saw this problem with Lustre pre-2.2 and OFED 1.5.3 on RHEL5. Upgrading to OFED 1.5.4 made the problem go away for me. Lustre pre-2.2 without OFED on RHEL6 also shows this error, but there it appears to be a minor problem. Have you seen severe problems with this?

Comment by James A Simmons [ 27/Mar/12 ]

Okay I just moved to OFED 1.5.4.1 on rhel6 and I still see this issue.

Comment by James A Simmons [ 27/Mar/12 ]

Also I want to point out that this is affecting stripe placement on our OSSs. For example, on our test file system I set the stripe count to 28, which is the total number of OSTs I have (each OSS has 7 OSTs). Doing an lfs getstripe on a file in this case yields:

testfile.out.00000009
lmm_magic: 0x0BD10BD0
lmm_seq: 0x2000013aa
lmm_object_id: 0x16
lmm_stripe_count: 7
lmm_stripe_size: 1048576
lmm_stripe_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 12
obdidx objid objid group
12 6928 0x1b10 0
16 6928 0x1b10 0
20 7185 0x1c11 0
24 7504 0x1d50 0
0 7184 0x1c10 0
4 6992 0x1b50 0
8 6928 0x1b10 0

All of those object creates happen on one OSS, and this is the case for all files. Mahmoud, can you verify that you are seeing this behavior as well? Because of this, the best performance I get for writing out a file is 250 MB/s versus the 2.5 GB/s I got before. This is a blocker for deploying in a production environment.
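
As a point of reference, a minimal sketch of this kind of check from a client (the mount point /mnt/lustre and the file name are placeholders, not taken from this report):

# Ask for a stripe across all 28 OSTs, write some data, then see how many OSTs were actually used.
lfs setstripe -c 28 /mnt/lustre/testfile.out
dd if=/dev/zero of=/mnt/lustre/testfile.out bs=1M count=128
lfs getstripe /mnt/lustre/testfile.out   # lmm_stripe_count should come back as 28, not 7
lfs df /mnt/lustre                       # lists space per OST; OSTs missing here are not usable from this client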

Comment by Ian Colle (Inactive) [ 05/Apr/12 ]

Also observed in Lustre 2.2 at ORNL.

Comment by Peter Jones [ 05/Apr/12 ]

Liang

Could you please help with this one?

Thanks

Peter

Comment by Liang Zhen (Inactive) [ 06/Apr/12 ]

Did OFED complain about anything while you saw this issue? Also, could you please turn on neterror printing so we can check whether there is any LNet/o2iblnd problem (echo +1 > /proc/sys/lnet/printk)?
I feel it's more likely an issue in the ptlrpc layer, so it would be helpful to get the debug log, dmesg, and console output from both the MDS and the OSS.
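
A minimal sketch of one way to collect that (assuming lctl is available on both the MDS and the OSS; the symbolic +neterror form is used here in place of the numeric mask mentioned above, and the output file names are placeholders):

# Turn on network error console messages and widen the kernel debug mask.
echo +neterror > /proc/sys/lnet/printk
lctl set_param debug=+net
lctl set_param debug=+rpctrace
# After reproducing the connection loss, dump the debug log and dmesg on each node.
lctl dk /tmp/lustre-debug.$(hostname).log
dmesg > /tmp/dmesg.$(hostname).log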

Liang

Comment by James A Simmons [ 10/Apr/12 ]

Mahmoud, can you try the following patch against your 1.8 source?

http://review.whamcloud.com/#change,1797

For me it seems to have helped. Let me know if it helps with your problem as well. I will be doing more testing on my side. I still have the strange striping pattern, though.

Comment by James A Simmons [ 11/Apr/12 ]

Managed to collect logs for this problem and hand them off to Oleg.

Comment by Oleg Drokin [ 11/Apr/12 ]

Looking at the ORNL logs from yesterday, I see that the MDS is constantly trying to connect to OST0001 at the address of oss1 (I assume, because that's where the connection attempts end up, reported as "no such OST here").
However, OST0001 is started on oss2, where no connection attempts are made, as the oss1 address is apparently the only one listed for OST0001.
As such, the problem seems to be some sort of configuration error.

Comment by James A Simmons [ 12/Apr/12 ]

I attached my build scripts to see if it is indeed a config error. I will test with the llmount.sh script as well.

Comment by James A Simmons [ 12/Apr/12 ]

Mahmoud, do you format your OSTs with --index="some number"? We do that at the lab to allow parallel mounting of the OSTs, and it appears to be causing problems. I'm going to do a test format without using the index to see if we still have the problem.
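
For context, a minimal sketch of the formatting being described (the file system name, index, MGS NID, and device below are placeholders, not taken from either site):

# Give each OST a fixed index at format time so the OSTs can be mounted in parallel, in any order.
mkfs.lustre --ost --fsname=testfs --index=2 --mgsnode=192.168.1.10@o2ib /dev/sdb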

Comment by Mahmoud Hanafi [ 18/Apr/12 ]

Sorry for the late reply.
Yes, we do use --index. We set it to the OST number.

Comment by James A Simmons [ 19/Apr/12 ]

No problem about the delay. I have done some tracking down of the problem and discovered how to replicate this issue. The problem only shows up when formatting the OSTs with --index="number". What causes the problem is the mount order. If you mount MGS > MDS > OSS(s), no problems show up. If you mount MGS > OSS(s) > MDS, then you will experience this problem.

Now here is an extra bit of info. If you format with an OST index and mount in the MGS > MDS > OSS order, then unmount the file system and remount in the order MGS > OSS(s) > MDS, you will not run into the connection problem. This tells you the problem is wrong data being written to the llog on the MDS. For some reason, the data sent by the OSS to the OSC layer on the MDS is different when the MDS signals the OSS to send its configuration data versus when the OSS has already successfully sent its configuration data to an available MDS.
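
A minimal sketch of the two mount orders being compared (device paths and mount points are placeholders; each mount runs on its own server):

# Order that comes up clean: MGS first, then the MDT, then the OSTs.
mount -t lustre /dev/mgs_dev /mnt/mgs    # on the MGS node
mount -t lustre /dev/mdt_dev /mnt/mdt    # on the MDS node
mount -t lustre /dev/ost_dev /mnt/ost    # on each OSS node
# Order that reproduces the lost connections with --index-formatted OSTs:
# mount the MGS, then every OST, and only then the MDT.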

Comment by James A Simmons [ 19/Apr/12 ]

Another interesting clue: on the MDS, if you run for i in $(ls /proc/fs/lustre/osc/*-OST*/ost_conn_uuid); do cat $i; done, you will see that all the NIDs are exactly the same.
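
A slightly expanded form of that check, printing each OSC device on the MDS next to the connection NID it reports (this assumes the 1.8-style /proc layout referenced above):

# Each line should show a different OST paired with the NID of the OSS actually serving it.
for f in /proc/fs/lustre/osc/*-OST*/ost_conn_uuid; do
    echo "$(basename $(dirname $f)): $(cat $f)"
done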

Comment by James A Simmons [ 15/Nov/12 ]

Okay, I just tested this again on Lustre 2.3.54 and it still exists.

Comment by James A Simmons [ 28/Feb/13 ]

Tested this bug on Lustre 2.3.61 and the problem seems to have been fixed.
