[LU-1252] Imperative recovery bugs go here Created: 22/Mar/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Sub-Tasks: |
|
||||||||||
| Severity: | 3 | ||||||||||
| Rank (Obsolete): | 4601 |
| Description |
|
Container for imperative recovery bugs. |
| Comments |
| Comment by James A Simmons [ 23/Mar/12 ] |
|
First patch is here http://review.whamcloud.com/#change,2371 |
| Comment by James A Simmons [ 26/Mar/12 ] |
|
While testing with the patch I saw two bugs. The first was a bogus recovery timeout, as seen below; I don't think the clients are in recovery for 18446744073709551615s:
LustreError: 4486:0:(ldlm_lib.c:941:target_handle_connect()) lustre-OST000c: denying connection for new client 12@gni (fde45892-b2d3-a6d0-0ff6-b0e9b7d0740b): 18 clients in recovery for 18446744073709551615s
The second is this report:
[ 945.256721] Lustre: lustre-OST0004: Recovery over after 0:01, of 24 clients 23 recovered and 1 was evicted.
Judging by the timestamps, it took longer than 1 second to recover. |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/12 ] |
|
Hi James, can you please apply this patch the next time you run an IR test: http://review.whamcloud.com/#change,1797 ? I found this patch helps a lot to reduce reconnection time. |
| Comment by James A Simmons [ 02/Apr/12 ] |
|
Merged it to our build system. Will test tomorrow. |
| Comment by Build Master (Inactive) [ 07/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Jinshan Xiong (Inactive) [ 10/Apr/12 ] |
|
Another patch is at: http://review.whamcloud.com/#change,2410 |
| Comment by James A Simmons [ 11/Apr/12 ] |
|
Is this needed for my testing? |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/12 ] |
|
It's only needed if your cluster is composed of heterogeneous nodes, so I don't think you need to apply it. |
| Comment by James A Simmons [ 17/Apr/12 ] |
|
I just got over 100GB of logs to look at. It's for one test run. In the test I attempted to powerman one OSS node. Well, thanks to memory bugs in the debug daemon, all the OSS servers went pop. The logs cover the entire length of recovery. Server side, I have system logs pre- and post-crash. |
| Comment by Jinshan Xiong (Inactive) [ 25/Apr/12 ] |
|
It'll be really fun to look at 100G logs |
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Cory Spitz [ 14/Aug/12 ] |
|
The b2_2 version of change #2410 is at http://review.whamcloud.com/#change,3008. |