[LU-1308] 2.2 clients unable to mount upgraded MDT Created: 11/Apr/12 Updated: 07/Jun/12 Resolved: 03/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Marek Magrys | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Environment: |
Scientific Linux 5.5, Lustre 2.2.0 on servers, patchless 2.1.1 and 2.2.0 on clients |
||
| Attachments: |
|
| Severity: | 3 |
| Epic: | client, interop |
| Rank (Obsolete): | 4641 |
| Description |
|
We are hitting a strange bug while upgrading to 2.2.0. We have already moved all the servers and some clients to 2.2; however, our TCP clients are unable to mount the filesystem because they cannot find a suitable NID to connect to the MDT. 2.1.1 clients work fine. are the first networks listed in all configs (MGS/MDT/OST config), and the tcp one is listed third. All the clients which use o2ib work fine, as the first MDT NID they get from the MGS works for them; the TCP ones fail, however (at least that's what we suppose). Servers have: And the client gets this: [root@n1-4-1 ~]# lctl ping 172.16.126.1@tcp [root@n1-4-1 ~]# mount -t lustre 172.16.126.1@tcp:/scratch /mnt/lustre/scratch/ Dmesg says: I'm also attaching two debug dumps (lctl dk), one for a 2.1.1 client (works fine) and one for a 2.2.0 client (fails). |
| Comments |
| Comment by Marek Magrys [ 11/Apr/12 ] |
|
I should have mentioned this in the description: the previous version on the servers was 2.1.0, and the clients were mostly running 2.1.1. |
| Comment by Jinshan Xiong (Inactive) [ 13/Apr/12 ] |
|
Hi Marek, can you please dump the log on both client and MGS and post it here? Please set debug and subsystem_debug to -1 before you mount the client. thanks. |
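For anyone reproducing this, the log collection requested above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `capture_mount_debug`, its argument order, and the `LCTL`/`MOUNT` override variables are made up for illustration, and the sketch clears the kernel debug buffer before mounting so that the dump contains only the mount attempt.

```shell
#!/bin/sh
# Sketch: capture a full Lustre debug log around a client mount attempt.
# LCTL and MOUNT are hypothetical overrides (e.g. LCTL=echo MOUNT=echo
# prints the commands instead of running them).
LCTL="${LCTL:-lctl}"
MOUNT="${MOUNT:-mount}"

# capture_mount_debug MGS_NID FSNAME MOUNTPOINT LOGFILE  (hypothetical helper)
capture_mount_debug() {
    mgs_nid="$1"
    fsname="$2"
    mountpoint="$3"
    logfile="$4"

    # Enable every debug flag and every subsystem, as requested in the comment.
    "$LCTL" set_param debug=-1
    "$LCTL" set_param subsystem_debug=-1

    # Drop older entries from the kernel debug buffer (assumption: we only
    # want messages produced by this mount attempt).
    "$LCTL" clear

    # Try the mount; remember the exit status but keep going so the log
    # is dumped even when the mount fails.
    "$MOUNT" -t lustre "$mgs_nid:/$fsname" "$mountpoint"
    rc=$?

    # Dump the kernel debug buffer to a file (lctl dk = debug_kernel).
    "$LCTL" dk "$logfile"
    return "$rc"
}
```

On a client it would be called as, for example, `capture_mount_debug 172.16.126.1@tcp scratch /mnt/lustre/scratch /tmp/mount-debug.log`, run as root. The MGS-side log would be gathered with the same `set_param` and `dk` calls, without the mount step.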
| Comment by Marek Magrys [ 13/Apr/12 ] |
|
Here are the logs you asked for. The MGS/MDS one is rather long, as the system is in production. |
| Comment by Peter Jones [ 16/Apr/12 ] |
|
Oleg will be working on this one |
| Comment by Oleg Drokin [ 16/Apr/12 ] |
|
The problem was introduced by this patch: http://review.whamcloud.com/1189. It is a copy-and-paste error in class_add_uuid(). A patch is in testing at Cyfronet. |
| Comment by Lukasz Flis [ 16/Apr/12 ] |
|
Client debug log with the test patch by Oleg. |
| Comment by Lukasz Flis [ 16/Apr/12 ] |
|
Client with the test patch is still unable to mount via tcp. |
| Comment by Supporto Lustre Jnet2000 (Inactive) [ 17/Apr/12 ] |
|
Oleg, which Lustre versions are affected by this bug? Is Lustre server version 2.1.x affected? Thanks in advance. |
| Comment by Peter Jones [ 17/Apr/12 ] |
|
This is a 2.2 issue only so it will not be present in 2.1.x. |
| Comment by Oleg Drokin [ 17/Apr/12 ] |
|
Please try this patch; I think it fixes all issues: http://review.whamcloud.com/2561 |
| Comment by Lukasz Flis [ 17/Apr/12 ] |
|
Oleg, thanks for the patch. I am attaching client logs: mount-debug-patch2.log.gz |
| Comment by Lukasz Flis [ 17/Apr/12 ] |
|
Debug log from clients with patch#2 applied. |
| Comment by Build Master (Inactive) [ 22/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Marek Magrys [ 25/Apr/12 ] |
|
This patch doesn't fix the problem and more and more people are starting to hit this one (see *-discuss lists). |
| Comment by Andrei Maslennikov [ 25/Apr/12 ] |
|
I can confirm that this patch does not solve the issue. Had to roll back to 2.1.1. |
| Comment by Oleg Drokin [ 25/Apr/12 ] |
|
http://review.whamcloud.com/2599 is a follow-on patch that should, I hope, fix the rest of the issues you are seeing. |
| Comment by Cory Spitz [ 26/Apr/12 ] |
|
This patch works, thank you. |
| Comment by Andrei Maslennikov [ 26/Apr/12 ] |
|
Worked for me as well. Thanks! |
| Comment by Build Master (Inactive) [ 26/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Marek Magrys [ 26/Apr/12 ] |
|
I can confirm that with both patches the problem is solved, thanks. |
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
Oleg Drokin : f176db6b88f2e932d1cf7e018e42f9a995301e76
|
| Comment by Christian Schausberger [ 14/May/12 ] |
|
I have seen, that the patch that caused this ( Christian |
| Comment by Peter Jones [ 14/May/12 ] |
|
Thanks Christian - good point! |