[LU-12063] mktemp fails with ENOENT and MDS log reports lod_gen_component_ea() Can not locate [0x700000bd9:0x56:0x0]: rc = -2 Created: 12/Mar/19  Updated: 15/Jul/19  Resolved: 15/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Olaf Faaland
Assignee: Patrick Farrell (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: llnl
Environment:

Lustre 2.12.0_1.chaos_2_g3ee692e
kernel 3.10.0-957.5.1.3chaos.ch6.x86_64
distro RHEL 7.6 derivative
backend zfs v0.7.11-5llnl


Attachments: lu-12063-2.tar.gz, lu-12063-3.tar.gz
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

File creation fails intermittently with errno ENOENT.

bash-4.2$ mktemp /p/lquake/faaland1/make-busy/mdt7/mdtest.enoent.XXXX
mktemp: failed to create file via template '/p/lquake/faaland1/make-busy/mdt7/mdtest.enoent.XXXX': No such file or directory
bash-4.2$ ls -l /p/lquake/faaland1/make-busy/mdt7
total 65
drwx------ 3 faaland1 faaland1 33280 Mar 12 12:24 mdtest.6qUohi
drwx------ 3 faaland1 faaland1 33280 Mar 12 12:24 mdtest.ldXmqW

MDS console log reports:

[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2
[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) Skipped 6 previous similar messages


 Comments   
Comment by Olaf Faaland [ 12/Mar/19 ]

The lustre patch stack on both client and server is:

* 3ee692e (HEAD, 2.12.0-llnl) TOSS-4431 build: build ldiskfs only for x86_64
* e3844bf LU-11827 llog: protect cathandle in llog_cat_declare_add_rec
* 13a3da2 (tag: 2.12.0_1.chaos, llnlstash/2.12.0-llnl) llnl: disable ldiskfs build under rpmbuild
* 7308687 build: no zlib check during configure --enable-dist
Comment by Olaf Faaland [ 12/Mar/19 ]

lfs getdirstripe and getstripe output:

bash-4.2$ lfs getdirstripe /p/lquake/faaland1/make-busy/mdt7
lmv_stripe_count: 0 lmv_stripe_offset: 7 lmv_hash_type: none
bash-4.2$ lfs getstripe /p/lquake/faaland1/make-busy/mdt7
/p/lquake/faaland1/make-busy/mdt7
stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1

/p/lquake/faaland1/make-busy/mdt7/mdtest.6qUohi
stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1

/p/lquake/faaland1/make-busy/mdt7/mdtest.ldXmqW
stripe_count:  1 stripe_size:   1048576 pattern:       0 stripe_offset: -1
Comment by Olaf Faaland [ 12/Mar/19 ]

Note that neither of the two subdirs of mdt7 is the one mktemp tried to create; both already existed, as shown by the mdtest artifacts they contain:

bash-4.2$ ls /p/lquake/faaland1/make-busy/mdt7/*/
/p/lquake/faaland1/make-busy/mdt7/mdtest.6qUohi/:
#test-dir.0

/p/lquake/faaland1/make-busy/mdt7/mdtest.ldXmqW/:
#test-dir.0
Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ]

Olaf,

Anything special about that mktemp script?

And can you provide dmesg from the MDSes serving up the root (MDT0) and MDT0007?

Comment by Olaf Faaland [ 12/Mar/19 ]

Hi Patrick,

The mktemp used is the utility packaged with RHEL. The tar file attached, lu-12063-2.tar.gz, has dmesg for the client (opal110), MDS with MDT0 (jet1), and MDT0007 (jet8), as well as the debug logs for each of those. The debug mask was default on the client and -1 on the servers, I believe.

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Thanks, Olaf!

Can you check your other MDS/MDT dmesg logs for this sort of error?  Searching for LustreError and then lod_gen_component_ea should do the trick.

 [Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2

The MDT0 dmesg leads me to think we've got failures that are probably on other MDTs (i.e., other than MDT7). It would be interesting to see.
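
A minimal sketch of that search, assuming the dmesg output has been saved to a text file and that the error lines follow the same layout as the one quoted above. The names `LodError` and `find_lod_errors` are hypothetical helpers for illustration, not part of Lustre.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class LodError(NamedTuple):
    """One 'Can not locate' error reported by lod_gen_component_ea()."""
    device: str  # e.g. "lquake-MDT0007-mdtlov"
    fid: str     # FID as printed, e.g. "[0x700000bd9:0x56:0x0]"
    rc: int      # return code, e.g. -2 (ENOENT)


# Assumption: the layout matches the console line quoted in this ticket.
_PATTERN = re.compile(
    r"LustreError:.*lod_gen_component_ea\(\)\)\s+"
    r"(?P<device>\S+): Can not locate (?P<fid>\[[^\]]+\]): rc = (?P<rc>-?\d+)"
)


def find_lod_errors(lines: Iterable[str]) -> Iterator[LodError]:
    """Yield one LodError per matching line; other lines are ignored."""
    for line in lines:
        m = _PATTERN.search(line)
        if m:
            yield LodError(m["device"], m["fid"], int(m["rc"]))
```

Running it over each MDS's saved dmesg (for example `find_lod_errors(open("dmesg.txt"))`, where the file name is a placeholder) and grouping by `device` would show whether MDTs other than MDT0007 report the same error. "Skipped N previous similar messages" lines are deliberately not matched, since they carry no FID.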

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Hmm, actually, we probably don't have such errors on the other MDTs, but I'd love to know if we do.

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Olaf,

mktemp without the -d option is not mkdir; it creates a file.  That matches some of what I'm seeing in the logs, and your previous report in the earlier ticket for the EINVAL (sorry for missing that).

Do you have other cause to think you've seen this with mkdir?

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Sorry for the flurry of updates...

00000004:00080000:8.0:1552423369.592170:0:57771:0:(osp_object.c:1592:osp_create()) lquake-OST0004-osc-MDT0007: Wrote last used FID: [0x700000bd9:0x56:0x0], index 4: 0 

Do you have logs, even just dmesg, from OST0004?

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Can you do fid2path and getstripe on a file on OST0004, and on a few other OSTs in the file system?  (If possible, files on both MDT0 and another MDT.  MDT7 would be great.)

Basically, at least some of the time, we're failing to find the sequence associated with certain OSTs.  Might just be OST0004 - I'm still working on decoding the FID.
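
As a rough aid for comparing FIDs across logs, here is a small sketch that splits a FID printed as `[seq:oid:ver]` into its three hexadecimal fields. The `Fid` class is a hypothetical helper; it only parses the printed form and does not say which target owns a given sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fid:
    """A FID as printed in the logs: [sequence:object id:version]."""
    seq: int
    oid: int
    ver: int

    @classmethod
    def parse(cls, text: str) -> "Fid":
        # Strip surrounding whitespace and brackets, then split the fields.
        fields = text.strip().strip("[]").split(":")
        if len(fields) != 3:
            raise ValueError(f"expected [seq:oid:ver], got {text!r}")
        seq, oid, ver = (int(field, 16) for field in fields)
        return cls(seq, oid, ver)

    def __str__(self) -> str:
        return f"[{self.seq:#x}:{self.oid:#x}:{self.ver:#x}]"


# The FID from the lod_gen_component_ea() error and the one from the
# osp_create() debug line are the same value.
error_fid = Fid.parse("[0x700000bd9:0x56:0x0]")
last_used_fid = Fid.parse("[0x700000bd9:0x56:0x0]")
same_object = error_fid == last_used_fid
```

Parsing into integers rather than comparing strings avoids mismatches from differences in zero padding or letter case between log sources.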

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

New theory, based on errors seen so far:
Problem is specific to OST0004.  Curious to know if you have persistent issues creating files there from some or all MDTs.
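
One way to test that theory is to create a single-stripe file on each OST in turn and record which creates fail. The sketch below assumes a Lustre client with `lfs` on the PATH; the function names and the `ost-create-test` file prefix are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path
from typing import Dict, Iterable, List


def setstripe_cmd(path, ost_index: int) -> List[str]:
    """Build an lfs command that creates `path` with one stripe on a given OST."""
    return ["lfs", "setstripe", "-c", "1", "-i", str(ost_index), str(path)]


def try_creates(directory, ost_indices: Iterable[int],
                prefix: str = "ost-create-test") -> Dict[int, int]:
    """Attempt one create per OST index; return {ost_index: exit status}."""
    results = {}
    for idx in ost_indices:
        target = Path(directory) / f"{prefix}.{idx}"
        proc = subprocess.run(setstripe_cmd(target, idx),
                              capture_output=True, text=True)
        results[idx] = proc.returncode  # non-zero means the create failed
    return results
```

Running `try_creates` against a directory on MDT0 and again against one on MDT0007, with the same list of OST indices, would show whether failures follow OST index 4 regardless of MDT. Because the failure is intermittent, several passes may be needed before drawing a conclusion.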

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

OK, one more... Let's dump the FID sequence tables as viewed from MDT0, and also at least one remote MDT.

 

lctl get_param seq.*.*

That's going to give a lot of output, sorry.

Comment by Olaf Faaland [ 13/Mar/19 ]

Hi Patrick,

I lost the cluster again for a little while. I hope to get it back within a couple days, and I'll fetch the FID sequence tables and try creates using specified individual OSTs then.

I was mistaken about mkdir failing. All slurm job logs report create failures, no mkdir errors. I mixed up the two problems. I've updated the summary to reflect that.

I'm attaching dmesg and lctl dk output from jet21, where OST0004 was running, as lu-12063-3.tar.gz.

There is some noise in the logs from two routers which are down, NIDs with IPs 172.19.1.22 and 172.19.1.23. They are for a system not actually running LNet or Lustre at the moment and they are not between jet and opal, so should be unrelated to this issue.

thanks,
Olaf

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Thanks, Olaf.  Too bad you're not able to get those tables & do that testing, it would tell us a lot.  I'll look at the logs, but I suspect they're going to be clean.  I think it's more likely there's something wrong on the MDS(es), but it's a little tricky to say what.

Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ]

Olaf,

Have you seen this issue recently and/or had another chance to run this test?

Comment by Olaf Faaland [ 15/Jul/19 ]

I am not seeing this anymore on Lustre 2.12.2. Closing.
