[LU-12063] mktemp fails with ENOENT and MDS log reports lod_gen_component_ea() Can not locate [0x700000bd9:0x56:0x0]: rc = -2
Created: 12/Mar/19 Updated: 15/Jul/19 Resolved: 15/Jul/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Lustre 2.12.0_1.chaos_2_g3ee692e |
||
| Attachments: |
|
| Severity: | 3 |
| Description |
|
file create fails intermittently, errno is ENOENT.
bash-4.2$ mktemp /p/lquake/faaland1/make-busy/mdt7/mdtest.enoent.XXXX
MDS console log reports:
[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2
[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) Skipped 6 previous similar messages |
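Because the failure is intermittent, a repeat loop can help estimate how often it occurs. The following is a minimal sketch, not part of the original report: `count_create_failures` is a hypothetical helper name, and the template only mirrors the one used in the description.

```shell
# count_create_failures DIR ATTEMPTS
# Hypothetical helper (not part of Lustre or mdtest): runs mktemp ATTEMPTS
# times inside DIR, removes each file it manages to create, and prints the
# number of attempts that failed.
count_create_failures() {
    dir=$1
    attempts=$2
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if f=$(mktemp "$dir/mdtest.enoent.XXXX" 2>/dev/null); then
            rm -f "$f"                      # clean up the successful create
        else
            failures=$((failures + 1))      # count any failed create
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_create_failures /p/lquake/faaland1/make-busy/mdt7 1000` would report how many of 1000 creates failed in the directory from the description. The sketch counts every failure, not only ENOENT, so the errno should still be checked separately.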
| Comments |
| Comment by Olaf Faaland [ 12/Mar/19 ] |
|
The lustre patch stack on both client and server is:
* 3ee692e (HEAD, 2.12.0-llnl) TOSS-4431 build: build ldiskfs only for x86_64
* e3844bf LU-11827 llog: protect cathandle in llog_cat_declare_add_rec
* 13a3da2 (tag: 2.12.0_1.chaos, llnlstash/2.12.0-llnl) llnl: disable ldiskfs build under rpmbuild
* 7308687 build: no zlib check during configure --enable-dist |
| Comment by Olaf Faaland [ 12/Mar/19 ] |
|
lfs getdirstripe and getstripe output:
bash-4.2$ lfs getdirstripe /p/lquake/faaland1/make-busy/mdt7
lmv_stripe_count: 0 lmv_stripe_offset: 7 lmv_hash_type: none
bash-4.2$ lfs getstripe /p/lquake/faaland1/make-busy/mdt7
/p/lquake/faaland1/make-busy/mdt7
stripe_count: 1 stripe_size: 1048576 pattern: 0 stripe_offset: -1
/p/lquake/faaland1/make-busy/mdt7/mdtest.6qUohi
stripe_count: 1 stripe_size: 1048576 pattern: 0 stripe_offset: -1
/p/lquake/faaland1/make-busy/mdt7/mdtest.ldXmqW
stripe_count: 1 stripe_size: 1048576 pattern: 0 stripe_offset: -1 |
| Comment by Olaf Faaland [ 12/Mar/19 ] |
|
Note that neither of the two subdirs of mdt7 is the one mktemp tried to create - they both already existed, as shown by the mdtest artifacts they contain:
bash-4.2$ ls /p/lquake/faaland1/make-busy/mdt7/*/
/p/lquake/faaland1/make-busy/mdt7/mdtest.6qUohi/:
#test-dir.0
/p/lquake/faaland1/make-busy/mdt7/mdtest.ldXmqW/:
#test-dir.0 |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
Olaf, Anything special about that mktemp script? And can you provide dmesg from the MDSes serving up the root (MDT0) and MDT0007? |
| Comment by Olaf Faaland [ 12/Mar/19 ] |
|
Hi Patrick, The mktemp used is the utility packaged with RHEL. The tar file attached, lu-12063-2.tar.gz, has dmesg for the client (opal110), MDS with MDT0 (jet1), and MDT0007 (jet8), as well as the debug logs for each of those. The debug mask was default on the client and -1 on the servers, I believe. |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Thanks, Olaf! Can you check your other MDS/MDT dmesg logs for this sort of error? Searching for LustreError and then lod_gen_component_ea should do the trick.
[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2
The MDT0 dmesg leads me to think we've got failures that are probably on other MDTs (i.e., other than MDT7); it would be interesting to see. |
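The search described above could be run as a small shell filter. This is a sketch only: `find_lod_errors` is a hypothetical helper name, and it reads from stdin so it works on live `dmesg` output or on a saved log.

```shell
# find_lod_errors: hypothetical helper that keeps only the log lines that
# contain both "LustreError" and "lod_gen_component_ea".
# Input is read from stdin; matching lines are written to stdout.
find_lod_errors() {
    grep 'LustreError' | grep 'lod_gen_component_ea'
}

# Example usage on each MDS node (reading dmesg may require root):
#   dmesg | find_lod_errors
```

Running it on every MDS and comparing the MDT names in the matching lines would show whether the error is confined to MDT0007.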
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Hmm, actually, we probably don't have such errors on the other MDTs - But I'd love to know if we do. |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Olaf, mktemp without an option is not mkdir - it's creating a file. That matches with some of what I'm seeing in the logs, and your previous report in the earlier ticket for the EINVAL (sorry for missing that). Do you have other cause to think you've seen this with mkdir? |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Sorry for the flurry of updates...
00000004:00080000:8.0:1552423369.592170:0:57771:0:(osp_object.c:1592:osp_create()) lquake-OST0004-osc-MDT0007: Wrote last used FID: [0x700000bd9:0x56:0x0], index 4: 0
Do you have logs, even just dmesg, from OST0004? |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Can you do fid2path and getstripe on a file on OST0004, and on a few other OSTs in the file system? (If possible, files on both MDT0 and another MDT. MDT7 would be great.) Basically, at least some of the time, we're failing to find the sequence associated with certain OSTs. Might just be OST0004 - I'm still working at decoding the FID. |
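As a first step in decoding, note that Lustre prints a FID as `[sequence:object ID:version]`. The sketch below only splits the FID from the log into those three fields; `parse_fid` is a hypothetical helper name, not a Lustre tool, and it does not determine which OST or MDT owns the sequence (that would need the sequence tables).

```shell
# parse_fid '[seq:oid:ver]'
# Hypothetical helper: strips the surrounding brackets and splits the FID
# on ':' into sequence, object ID, and version. The body runs in a
# subshell so its variables do not leak into the caller.
parse_fid() (
    fid=$1
    fid=${fid#\[}        # drop the leading '['
    fid=${fid%\]}        # drop the trailing ']'
    seq=${fid%%:*}       # text before the first ':'
    rest=${fid#*:}       # text after the first ':'
    oid=${rest%%:*}      # text before the second ':'
    ver=${rest#*:}       # text after the second ':'
    printf 'seq=%s oid=%s ver=%s\n' "$seq" "$oid" "$ver"
)

parse_fid '[0x700000bd9:0x56:0x0]'
# prints: seq=0x700000bd9 oid=0x56 ver=0x0
```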
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
New theory, based on errors seen so far: |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
OK, one more... Let's dump the FID sequence tables as viewed from MDT0, and also at least one remote MDT.
lctl get_param seq.*.*
That's going to give a lot of output, sorry. |
| Comment by Olaf Faaland [ 13/Mar/19 ] |
|
Hi Patrick,
I lost the cluster again for a little while. I hope to get it back within a couple days, and I'll fetch the FID sequence tables and try creates using specified individual OSTs then.
I was mistaken about mkdir failing. All slurm job logs report create failures, no mkdir errors. I mixed up the two problems. I've updated the summary to reflect that.
I'm attaching dmesg and lctl dk output from jet21, where OST0004 was running, as lu-12063-3.tar.gz. There is some noise in the logs from two routers which are down, NIDs with IPs 172.19.1.22 and 172.19.1.23. They are for a system not actually running LNet or Lustre at the moment and they are not between jet and opal, so should be unrelated to this issue.
thanks, |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Thanks, Olaf. Too bad you're not able to get those tables & do that testing, it would tell us a lot. I'll look at the logs, but I suspect they're going to be clean. I think it's more likely there's something wrong on the MDS(es), but it's a little tricky to say what. |
| Comment by Patrick Farrell (Inactive) [ 12/Jul/19 ] |
|
Olaf, Have you seen this issue recently and/or had another chance to run this test? |
| Comment by Olaf Faaland [ 15/Jul/19 ] |
|
I am not seeing this anymore on Lustre 2.12.2. Closing. |