[LU-12063] mktemp fails with ENOENT and MDS log reports lod_gen_component_ea() Can not locate [0x700000bd9:0x56:0x0]: rc = -2

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.12.0
    • Lustre 2.12.0_1.chaos_2_g3ee692e
      kernel 3.10.0-957.5.1.3chaos.ch6.x86_64
      distro RHEL 7.6 derivative
      backend zfs v0.7.11-5llnl
    • 3

    Description

      File creation fails intermittently; errno is ENOENT.

      bash-4.2$ mktemp /p/lquake/faaland1/make-busy/mdt7/mdtest.enoent.XXXX
      mktemp: failed to create file via template '/p/lquake/faaland1/make-busy/mdt7/mdtest.enoent.XXXX': No such file or directory
      bash-4.2$ ls -l /p/lquake/faaland1/make-busy/mdt7
      total 65
      drwx------ 3 faaland1 faaland1 33280 Mar 12 12:24 mdtest.6qUohi
      drwx------ 3 faaland1 faaland1 33280 Mar 12 12:24 mdtest.ldXmqW
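      Because the failure is intermittent, repeating the create in a loop and counting failures can help estimate how often it occurs. The following is a minimal sketch and not part of the original report: `count_create_failures` is a hypothetical helper, and the template name simply mirrors the one used above.

```shell
# count_create_failures DIR N (hypothetical helper):
# try N mktemp file creates in DIR and print how many of them failed.
# The body runs in a subshell so its variables do not leak into the caller.
count_create_failures() (
    dir=$1
    n=$2
    failures=0
    i=0
    while [ "$i" -lt "$n" ]; do
        if f=$(mktemp "$dir/mdtest.enoent.XXXX" 2>/dev/null); then
            rm -f "$f"                      # created fine, clean up
        else
            failures=$((failures + 1))      # mktemp reported an error
        fi
        i=$((i + 1))
    done
    echo "$failures"
)

# Example (path taken from the report above):
#   count_create_failures /p/lquake/faaland1/make-busy/mdt7 1000
```

      The helper only counts failures; to see the errno for each one, drop the `2>/dev/null` redirection.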

      MDS console log reports:

      [Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2
      [Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) Skipped 6 previous similar messages
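      The `rc = -2` in this message is a negated errno, and errno 2 is ENOENT, which matches the error mktemp reports on the client. The bracketed value is a FID in `[sequence:object ID:version]` form. As a minimal sketch for pulling those fields out of a console log line (the helper names `parse_fid_rc` and `describe_fid_rc` are hypothetical, and `sed -E` is assumed to be available):

```shell
# Sample line copied from the MDS console log above.
line='[Tue Mar 12 13:42:47 2019] LustreError: 57771:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0007-mdtlov: Can not locate [0x700000bd9:0x56:0x0]: rc = -2'

# parse_fid_rc LINE (hypothetical helper): print "seq oid ver rc",
# or nothing when LINE has no "[seq:oid:ver]: rc = N" pattern.
parse_fid_rc() {
    printf '%s\n' "$1" | sed -nE \
        's/.*\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]: rc = (-?[0-9]+).*/\1 \2 \3 \4/p'
}

# describe_fid_rc LINE (hypothetical helper): same fields, with the
# object ID and version converted to decimal. Returns 1 on no match.
describe_fid_rc() (
    set -- $(parse_fid_rc "$1")
    [ $# -eq 4 ] || exit 1
    printf 'seq=%s oid=%d ver=%d rc=%d\n' "$1" "$2" "$3" "$4"
)

describe_fid_rc "$line"
```

      For the sample line this prints `seq=0x700000bd9 oid=86 ver=0 rc=-2`. It only splits the FID into its fields; it does not say which OST or MDT owns the sequence.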
      

        Activity

          ofaaland Olaf Faaland added a comment -

          I am not seeing this anymore on Lustre 2.12.2. Closing.


          pfarrell Patrick Farrell (Inactive) added a comment -

          Olaf,

          Have you seen this issue recently and/or had another chance to run this test?

          pfarrell Patrick Farrell (Inactive) added a comment -

          Thanks, Olaf.  Too bad you're not able to get those tables & do that testing, it would tell us a lot.  I'll look at the logs, but I suspect they're going to be clean.  I think it's more likely there's something wrong on the MDS(es), but it's a little tricky to say what.

          ofaaland Olaf Faaland added a comment -

          Hi Patrick,

          I lost the cluster again for a little while. I hope to get it back within a couple of days, and I'll fetch the FID sequence tables and try creates using specified individual OSTs then.

          I was mistaken about mkdir failing. All Slurm job logs report create failures, no mkdir errors. I mixed up the two problems. I've updated the summary to reflect that.

          I'm attaching dmesg and lctl dk output from jet21, where OST0004 was running, as lu-12063-3.tar.gz.

          There is some noise in the logs from two routers which are down, NIDs with IPs 172.19.1.22 and 172.19.1.23. They are for a system not actually running LNet or Lustre at the moment and they are not between jet and opal, so should be unrelated to this issue.

          thanks,
          Olaf


          pfarrell Patrick Farrell (Inactive) added a comment -

          OK, one more... Let's dump the FID sequence tables as viewed from MDT0, and also at least one remote MDT.

          lctl get_param seq.*.*

          That's going to give a lot of output, sorry.

          pfarrell Patrick Farrell (Inactive) added a comment -

          New theory, based on errors seen so far:
          Problem is specific to OST0004.  Curious to know if you have persistent issues creating files there from some or all MDTs.

          pfarrell Patrick Farrell (Inactive) added a comment - edited

          Can you do fid2path and getstripe on a file on OST0004, and on a few other OSTs in the file system?  (If possible, files on both MDT0 and another MDT.  MDT7 would be great.)

          Basically, at least some of the time, we're failing to find the sequence associated with certain OSTs.  Might just be OST0004 - I'm still working at decoding the FID.

          pfarrell Patrick Farrell (Inactive) added a comment -

          Sorry for the flurry of updates...

          00000004:00080000:8.0:1552423369.592170:0:57771:0:(osp_object.c:1592:osp_create()) lquake-OST0004-osc-MDT0007: Wrote last used FID: [0x700000bd9:0x56:0x0], index 4: 0

          Do you have logs, even just dmesg, from OST0004?

          pfarrell Patrick Farrell (Inactive) added a comment -

          Olaf,

          mktemp without an option is not mkdir - it's creating a file.  That matches with some of what I'm seeing in the logs, and your previous report in the earlier ticket for the EINVAL (sorry for missing that).

          Do you have other cause to think you've seen this with mkdir?
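
          The distinction is easy to check on any local file system: plain `mktemp` creates a regular file, while `mktemp -d` creates a directory. A minimal sketch, where `kind_of` is a hypothetical helper and the scratch directory stands in for the Lustre path:

```shell
# kind_of PATH (hypothetical helper): report what PATH is.
kind_of() {
    if [ -d "$1" ]; then
        echo directory
    elif [ -f "$1" ]; then
        echo file
    else
        echo missing
    fi
}

scratch=$(mktemp -d)                               # local scratch directory
f=$(mktemp "$scratch/mdtest.enoent.XXXX")          # no option: regular file
d=$(mktemp -d "$scratch/mdtest.XXXXXX")            # -d: directory

echo "mktemp    -> $(kind_of "$f")"
echo "mktemp -d -> $(kind_of "$d")"
```

          So the failing command in the description exercises the file create path, not mkdir. Remove the scratch directory with `rm -rf "$scratch"` when done.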

          pfarrell Patrick Farrell (Inactive) added a comment -

          Hmm, actually, we probably don't have such errors on the other MDTs - but I'd love to know if we do.

          People

            pfarrell Patrick Farrell (Inactive)
            ofaaland Olaf Faaland