[LU-5599] Lustre Error: Impossible state: 4 Created: 09/Sep/14 Updated: 24/Jul/18 Resolved: 24/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Bernel (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | Fedora 19 x86_64 on Washington Pass nodes, 1GbE & FDR IB |
| Issue Links: |
| Epic/Theme: | Kernel, Lustre-2.4, Panic |
| Severity: | 3 |
| Project: | Fast Forward |
| Rank (Obsolete): | 15656 |
| Description |
|
Lustre Version: 2.4.52
2nd Instance:
Message from syslogd@bar1 at Sep 8 16:08:51 ...
1st Instance:
Message from syslogd@bar4 at Aug 14 15:34:57 ...
Message from syslogd@bar4 at Aug 14 15:34:57 ...
Message from syslogd@bar4 at Aug 14 15:34:57 ... |
| Comments |
| Comment by Cliff White (Inactive) [ 09/Sep/14 ] |
|
There should be a stack dump to go along with the ASSERTION - can you please acquire it and attach it to the ticket? In addition, it would be useful to have logs for some time before the actual assertion - are there any LustreErrors previous to this? |
| Comment by Peter Jones [ 10/Sep/14 ] |
|
Bob, could you please help with this issue? Thanks, Peter |
| Comment by Bob Glossman (Inactive) [ 10/Sep/14 ] |
|
The reported Lustre version is "Lustre Version: 2.4.52". This suggests it was built from source or derived from a review build in between our standard release versions, which have names like 2.4.2 or 2.4.3. Could we get details about how this Lustre build was generated or obtained? Knowing the exact origin is very important to help us understand the problem. |
| Comment by Bob Glossman (Inactive) [ 10/Sep/14 ] |
|
This may not be known, but I wonder if there were any particular applications or load running near the time of the 2 reported panic instances. |
| Comment by Brian Bernel (Inactive) [ 10/Sep/14 ] |
|
Hi Bob, Thanks so much for your help on this. The Lustre kernel for X-Stack was put together by Gabrielle Paciucci. …As for what might have been going on at the time of the problem, the only thing I have to go on at this point is that git was involved. No kdump available, I’m afraid. A cursory look at the logs doesn’t show anything glaringly amiss, but it does corroborate that garret/git was in play in the minute leading up to the Lustre error. (Thanks Josh, for directing me to reply by comment vs email) Regards, Brian |
| Comment by Bob Glossman (Inactive) [ 10/Sep/14 ] |
|
Brian, |
| Comment by Keith Mannthey (Inactive) [ 02/Oct/14 ] |
|
We do not need a kernel dump to get this information. Can you upload /var/log/messages or "dmesg" right after this issue is hit? |
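One way to gather this, as a minimal sketch: the Python below scans a saved copy of /var/log/messages or of dmesg output and returns each Lustre error line together with the lines that led up to it. The function name `lustre_error_context`, the default of 20 lines of context, and the marker strings are illustrative assumptions, not something specified in this ticket.

```python
import re
from collections import deque

# Marker strings are assumed for illustration; adjust them to the
# messages actually seen on the affected node.
MARKERS = re.compile(r"LustreError|ASSERTION|LBUG|Impossible state")


def lustre_error_context(lines, before=20):
    """Return a list of (context, hit) pairs.

    `hit` is a line matching MARKERS; `context` holds up to `before`
    lines that immediately preceded it in the input.
    """
    recent = deque(maxlen=before)  # rolling window of preceding lines
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if MARKERS.search(line):
            hits.append((list(recent), line))
        recent.append(line)
    return hits
```

To use it on a real log, open the file (for example `open("/var/log/messages", errors="replace")`) and pass the file object as `lines`; the returned pairs can then be printed and attached to the ticket.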
| Comment by Brian Bernel (Inactive) [ 10/Nov/14 ] |
|
Recent kernel dump information from bar1 on X-Stack:
[ 3.911516] hpgmg-fv[96927]: segfault at d ip 00007fd22c315d3e sp 00007fd20796bd80 error 6 in libocr.so[7fd22c300000+2e000] |
| Comment by Brian Bernel (Inactive) [ 10/Nov/14 ] |
|
Related correspondence and screen errors:
Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump.
Romain
From: Romain Cledat <romain.e.cledat@intel.com>
Hello,
It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah).
Nov 8 04:51:12 bar1 systemd-logind[748]: New session 17884 of user vincentc.
I think the Nov 7 date is because the clock gets reset to a bad value when rebooting. At the end of the reboot, you have:
Nov 7 22:03:11 bar1 systemd[1]: Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s.
So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines:
What do you know, it happens to be Lustre again.
Romain
PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though… |
| Comment by Jay Lan (Inactive) [ 10/Nov/14 ] |
|
NASA Ames hit this bug this morning on its 2.4.3-7nasC client running a sles11sp3 kernel. |
| Comment by Zhenyu Xu [ 11/Nov/14 ] |
|
It looks like 2.5 already has them. |
| Comment by Jay Lan (Inactive) [ 11/Nov/14 ] |
|
We had both of them in our 2.4.3-7nasC. |
| Comment by Zhenyu Xu [ 11/Nov/14 ] |
|
Can you try applying these patches? http://review.whamcloud.com/#/c/8530/ |
| Comment by Peter Jones [ 11/Nov/14 ] |
|
Jay, I think it makes sense to open a separate ticket for the NASA issue, because the environments are really quite different and the root cause could be different. Peter |
| Comment by Jay Lan (Inactive) [ 11/Nov/14 ] |
|
Peter, Will do. Jay |
| Comment by Brian Bernel (Inactive) [ 18/May/16 ] |
|
Is there a set of instructions somewhere on how to apply these patches? |
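As a rough sketch only, and assuming the review link given earlier in this ticket (change 8530) is a Gerrit change: Gerrit publishes each patch set under a ref of the form `refs/changes/<last two digits>/<change>/<patch set>`, which can be fetched into a Lustre source tree and cherry-picked before rebuilding. The Python below just builds those git commands. The remote URL and the patch set number are not given in this ticket, so both are placeholders to be taken from the review page, and the helper names are hypothetical.

```python
def gerrit_ref(change, patchset):
    """Build the Gerrit ref name: refs/changes/<last two digits>/<change>/<patchset>."""
    return "refs/changes/{:02d}/{}/{}".format(change % 100, change, patchset)


def fetch_commands(remote, change, patchset):
    """Return the git commands that fetch one patch set and cherry-pick it."""
    return [
        "git fetch {} {}".format(remote, gerrit_ref(change, patchset)),
        "git cherry-pick FETCH_HEAD",
    ]


# Placeholder values: the remote URL and patch set 1 are NOT taken from
# this ticket. Only the change number 8530 comes from the link above.
for cmd in fetch_commands("<gerrit-remote-url>", 8530, 1):
    print(cmd)
```

The commands would be run inside a checkout of the same source the running build was made from; whether the change applies cleanly to a 2.4.52 tree is not something this ticket establishes.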
| Comment by Peter Jones [ 24/Jul/18 ] |
|
Out of date ticket |