[LU-9015] Meaning of revalidate FID rc=-4 Created: 12/Jan/17  Updated: 19/Jan/17  Resolved: 19/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Peter Bortas Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre ee client-2.5.42.4-2.6.32_642.6.2


Rank (Obsolete): 9223372036854775807

 Description   

We have a curious problem where a user is managing to wedge a few COS6 compute nodes every day to the point they don't even respond to the console. The only thing we see logged before this happens is:

Jan 12 15:59:14 n602 kernel: [3463048.109323] LustreError: 21705:0:(file.c:3256:ll_inode_revalidate_fini()) fouo5: revalidate FID [0x2000068c4:0x16fd0:0x0] error: rc = -4

I have no idea if this is a symptom of the node breaking down in some way unrelated to Lustre or if this is part of the cause, so I'd like to figure out what this error means. After a quick look at the code this should be the return code from either md_intent_lock or md_getattr, but I can't find where the error code -4 is defined.

Any tips?



 Comments   
Comment by Oleg Drokin [ 13/Jan/17 ]

-4 is EINTR, and it likely has nothing to do with the hang at hand. This particular one was silenced in LU-6627 I believe.
It means that an application got some sort of a signal (like somebody pressing ^C, or from some other source) in the middle of waiting for an RPC reply from the MDS.
The application PID was 21705, if you can somehow track down what it was back then.
It's defined in linux kernel source in include/uapi/asm-generic/errno-base.h file.
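The same mapping is visible from the Python standard library, which mirrors those kernel errno values; a minimal check:

```python
import errno
import os

# Lustre logs the negated errno, so "rc = -4" corresponds to errno 4.
rc = -4
print(errno.errorcode[-rc])  # EINTR
print(os.strerror(-rc))      # Interrupted system call
```

This matches Oleg's reading: errno 4 is EINTR, "Interrupted system call".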

Now if you want to investigate the hang, you probably want to do something like sysrq-t. Do you have other debugging enabled? NMI watchdogs, a serial console attached? Are you set up to collect crashdumps?

Comment by Peter Jones [ 13/Jan/17 ]

Yang Sheng

Could you please assist with this one as further information is supplied?

Thanks

Peter

Comment by Peter Bortas [ 16/Jan/17 ]

Thanks Oleg,

That makes sense. It looks like (probably) all occurrences of this hang correspond with the user's jobs getting preempted and thus killed by Slurm.

We do not currently have sysrq-t enabled in the node images, but I'll enable that and roll out new images during the week. A serial console is available via IPMI, but we are not set up to collect crash dumps. Making a device available for crash dumps might not be possible with the limited space available.

Comment by Peter Bortas [ 19/Jan/17 ]

We pushed out sysrq-enabled images yesterday, and unfortunately this hang does not look like it can be escaped with sysrq on the console or a break via SOL.

Lustre is now only one of many suspects, since the error message probably has nothing to do with it. So I'm fine with closing this ticket for now. I can reopen it if further testing on our side points more fingers at Lustre.

Comment by Yang Sheng [ 19/Jan/17 ]

Many thanks, Peter.

Comment by Oleg Drokin [ 19/Jan/17 ]

Just a late note in case you did not know: you don't need a dedicated crashdump device for every node.
You can have an NFS share that all nodes dump to when the need arises; this kind of setup is very common, and I use it too.

I highly recommend looking into setting something like that up (it does not even need to be crazy fast or have redundant storage; just a cheap, huge multi-terabyte HDD in a node that's up all the time and runs an NFS server).

Having this setup early would save you tons of trouble later when you actually need a crashdump from something to debug some other problem.
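For what it's worth, a kdump-over-NFS setup along those lines might look roughly like the following on an EL6-era node. The server name and export path here are made up, and directive names vary between kdump releases (EL6 used `net` for network targets, where later releases use `nfs`), so treat this as a sketch rather than a drop-in config:

```
# /etc/kdump.conf (sketch; "dumpserver" and the export path are hypothetical)
# EL6-era kdump uses "net" for NFS dump targets; newer releases use "nfs".
net dumpserver:/export/crashdumps
path /var/crash
# Compress the dump and filter out pages not needed for analysis
core_collector makedumpfile -c --message-level 1 -d 31
```

With this in place, each crashing node writes its vmcore under the shared export instead of needing local dump space, which addresses the limited-disk concern mentioned above.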

Comment by Peter Bortas [ 19/Jan/17 ]

I did not know you could crash-dump over network filesystems. Very interesting, thanks!

Dump capability is now added to the roadmap for our cluster environment development.

Generated at Sat Feb 10 02:22:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.