[LU-9015] Meaning of revalidate FID rc=-4 Created: 12/Jan/17 Updated: 19/Jan/17 Resolved: 19/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Peter Bortas | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre ee client-2.5.42.4-2.6.32_642.6.2 |
||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have a curious problem where a user is managing to wedge a few COS6 compute nodes every day, to the point that they don't even respond to the console. The only thing we see logged before this happens is:

Jan 12 15:59:14 n602 kernel: [3463048.109323] LustreError: 21705:0:(file.c:3256:ll_inode_revalidate_fini()) fouo5: revalidate FID [0x2000068c4:0x16fd0:0x0] error: rc = -4

I have no idea whether this is a symptom of the node breaking down in some way unrelated to Lustre or whether it is part of the cause, so I'd like to figure out what this error means. After a quick look at the code, this should be the return code from either md_intent_lock or md_getattr, but I can't find where the error code -4 is defined. Any tips? |
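For reference (not part of the original report): the negative return codes in Lustre log messages are negated Linux errno values, defined in errno-base.h in the kernel tree, so "rc = -4" is just errno 4. A minimal C sketch, assuming only standard libc, that translates such a code:

```c
/* Hypothetical helper, not from the ticket: translate a Lustre
 * "rc = -N" value into the usual errno description. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    /* Lustre logs kernel-style negated errno values, e.g. rc = -4. */
    int rc = (argc > 1) ? atoi(argv[1]) : -4;
    int err = rc < 0 ? -rc : rc;

    /* strerror() maps the positive errno to its description;
     * 4 is EINTR ("Interrupted system call"). */
    printf("rc = %d -> errno %d (%s)\n", rc, err, strerror(err));
    return 0;
}
```

Run with "-4" it prints "Interrupted system call", which matches the answer in the first comment below.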
| Comments |
| Comment by Oleg Drokin [ 13/Jan/17 ] |
|
-4 is EINTR, and it likely has nothing to do with the hang at hand. This particular message was silenced in a later change.

Now, if you want to investigate the hang, you probably want to do something like sysrq-t. Do you have other debugging enabled: NMI watchdogs, a serial console attached? Are you set up to collect crash dumps? |
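For anyone following along: a sysrq-t task dump is normally requested with the break sequence on a serial console or a shell one-liner writing to /proc/sysrq-trigger. The following is only an illustrative C sketch of the same procfs interface (not from the ticket); the dump appears in the kernel log:

```c
/* Illustrative sketch only (not from the ticket): enable the magic
 * sysrq interface and request a task-state dump ("sysrq-t").
 * The output goes to the kernel log (dmesg / serial console). */
#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    /* "1" enables all sysrq functions; must be run as root. */
    write_file("/proc/sys/kernel/sysrq", "1");
    /* "t" dumps the state and stack trace of every task. */
    write_file("/proc/sysrq-trigger", "t");
    return 0;
}
```

The equivalent from a root shell is writing "t" to /proc/sysrq-trigger, or sending the sysrq break sequence over the serial console when the node no longer accepts logins.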
| Comment by Peter Jones [ 13/Jan/17 ] |
|
Yang Sheng, could you please assist with this one as further information is supplied? Thanks, Peter |
| Comment by Peter Bortas [ 16/Jan/17 ] |
|
Thanks Oleg, that makes sense. It looks like (probably) all occurrences of this hang correspond with the user's jobs getting preempted, and thus killed, by Slurm. We do not currently have sysrq-t enabled in the node images, but I'll enable that and roll out new images during the week. A serial console is available via IPMI, but we are not set up to collect crash dumps; making a device available for them might not be possible with the limited space available. |
| Comment by Peter Bortas [ 19/Jan/17 ] |
|
We pushed out sysrq-enabled images yesterday, and unfortunately this hang does not look like it can be escaped with sysrq on the console or a break sent via SOL. Lustre is now only one of many suspects, since the error message probably has nothing to do with it, so I'm fine with closing this ticket for now. I can reopen it if further testing on our side points more fingers at Lustre. |
| Comment by Yang Sheng [ 19/Jan/17 ] |
|
Many thanks, Peter. |
| Comment by Oleg Drokin [ 19/Jan/17 ] |
|
Just a late note in case you did not know: you don't need a dedicated crashdump device for every node. I highly recommend looking into setting something like that up (it does not even need to be crazy fast or have redundant storage, just a cheap, huge, multi-terabyte HDD in a node that's up all the time and has an NFS server on it). Having this set up early would save you tons of trouble later when you actually need a crashdump to debug some other problem. |
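As a concrete illustration (not from the ticket, and the directive names should be verified against kdump.conf(5) for the distribution in use), a client-side kdump configuration that sends vmcores to a central NFS collector might look roughly like this, assuming a hypothetical server crashdump.example.com exporting /export/crash:

```
# /etc/kdump.conf (sketch only; verify directives for your RHEL/CentOS release)

# Send the vmcore to an NFS export on a central "dump collector" node.
nfs crashdump.example.com:/export/crash

# Directory under the NFS export where per-node dumps are written.
path /var/crash

# Compress and filter the dump so a big shared HDD goes a long way:
# -d 31 drops zero, cache, and free pages; -c enables compression.
core_collector makedumpfile -c --message-level 1 -d 31
```

The collector node only needs to export that directory over NFS and have enough space; each client also needs crashkernel= memory reserved on its kernel command line and the kdump service restarted after the config change.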
| Comment by Peter Bortas [ 19/Jan/17 ] |
|
I did not know you could crash-dump over network filesystems. Very interesting, thanks! Dump capability is now added to the roadmap for our cluster environment development. |