Intermittent file create or rm fail with EINVAL

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • Lustre 2.12.0

    Description

      mdtest intermittently fails, reporting an EINVAL error when it tries to create or remove a file.

      mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
      Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v
      Path: /p/lquake/faaland1
      FS: 1867.3 TiB   Used FS: 34.2%   Inodes: 765.8 Mi   Used Inodes: 57.1%
      1024 tasks, 1048576 files
      
       Operation               Duration              Rate
         ---------               --------              ----
       * iteration 1 02/20/2019 13:37:43 *                 
         Tree creation     :      0.076 sec,     13.191 ops/sec
      
      02/20/2019 13:39:00: Process 158(opal119): FAILED in create_remove_items_helper, unable to unlink file file.mdtest.158.223 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.158.0): Invalid argument
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 158 in communicator MPI_COMM_WORLD with errorcode 1.
      

      Seen with:
      no DoM
      no PFL
      16 MDTs in the file system, but the directory mdtest is using is not striped.
      64 nodes x 16 ppn
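
The failing pattern in the run above (create a batch of files, then unlink them, and report the errno of any call that fails) can be approximated by a minimal single-process sketch. It is not mdtest and does not reproduce the 1024-task MPI concurrency of the original run. The function names, the `file.sketch.` prefix, and the use of a local temporary directory in place of the Lustre test directory are all assumptions made for illustration.

```python
import errno
import os
import tempfile


def create_file(path):
    # Create an empty file, failing if it already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)


def create_remove(directory, count, prefix="file.sketch."):
    """Create `count` empty files, then unlink them all.

    Returns a list of (operation, path, errno name) tuples, one per failed call.
    """
    paths = [os.path.join(directory, f"{prefix}{i}") for i in range(count)]
    failures = []
    for op, func in (("create", create_file), ("unlink", os.unlink)):
        for path in paths:
            try:
                func(path)
            except OSError as exc:
                # Translate the numeric errno into its symbolic name, e.g. EINVAL.
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((op, path, name))
    return failures


if __name__ == "__main__":
    # A local temporary directory stands in for the Lustre test directory.
    with tempfile.TemporaryDirectory() as workdir:
        for op, path, name in create_remove(workdir, 1024):
            print(f"{op} failed for {path}: {name}")
```

Pointing `create_remove` at a directory on the affected file system, from several processes at once, would be closer to the reported workload, but whether that is enough to trigger the intermittent EINVAL is not established by this ticket.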

      Attachments

        1. dk.jet3.1550709103.gz
          3.45 MB
          Olaf Faaland
        2. dk.opal67.1550709082.gz
          9.17 MB
          Olaf Faaland

        Issue Links

          Activity

            [LU-11984] Intermittent file create or rm fail with EINVAL
            pfarrell Patrick Farrell (Inactive) made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Duplicate of LU-11827
            pfarrell Patrick Farrell (Inactive) made changes -
            Link New: This issue duplicates LU-11827 [ LU-11827 ]
            pfarrell Patrick Farrell (Inactive) made changes -
            Link Original: This issue duplicates LU-1182 [ LU-1182 ]
            pfarrell Patrick Farrell (Inactive) made changes -
            Link New: This issue duplicates LU-1182 [ LU-1182 ]

            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Whoops, missed this update...  But this most recent report is (your new ticket) LU-12063.  Let's close this one out as a duplicate of LU-11827 and move discussion there.
            ofaaland Olaf Faaland added a comment - edited

            Hi Patrick,

            I got the cluster back.  I applied the patch from LU-11827 to Lustre 2.12.0 and am using that build on both client and server. Creates now fail, but much more consistently and with different symptoms.  The user process gets back ENOENT instead of EINVAL. The server's console shows a Lustre error that did not occur before:

            LustreError: 49910:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0001-mdtlov: Can not locate [0x700000bd5:0x16:0x0]: rc = -2
            

            This was produced running a single-node x 16tpp mdtest, without DoM or PFL.

            This seems to me like a different problem entirely, so I am not uploading the debug logs. If you agree it's distinct, I can create a new ticket and put them there.
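
The console message quoted above carries the same error the user process sees: `rc = -2` is the negated errno value for ENOENT, and the bracketed triple is a Lustre FID in `[sequence:object id:version]` form. A minimal sketch of pulling those fields out of such a line follows. The regular expression is inferred only from the single line quoted in this comment, not from any specification of the Lustre console log format, and the `parse_lustre_error` name and returned dictionary layout are hypothetical.

```python
import errno
import re

# The console line exactly as quoted in the comment above.
LINE = (
    "LustreError: 49910:0:(lod_lov.c:896:lod_gen_component_ea()) "
    "lquake-MDT0001-mdtlov: Can not locate [0x700000bd5:0x16:0x0]: rc = -2"
)

# Assumed layout: (file:line:function()) device: message [seq:oid:ver]: rc = N
PATTERN = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<device>\S+):\s+.*?"
    r"\[(?P<seq>0x[0-9a-fA-F]+):(?P<oid>0x[0-9a-fA-F]+):(?P<ver>0x[0-9a-fA-F]+)\]"
    r".*?rc = (?P<rc>-?\d+)"
)


def parse_lustre_error(line):
    """Return the source location, device, FID, and return code, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "func": match.group("func"),
        "device": match.group("device"),
        "fid": tuple(int(match.group(k), 16) for k in ("seq", "oid", "ver")),
        "rc": rc,
        # Kernel return codes are negated errno values; None if unknown.
        "errno": errno.errorcode.get(-rc),
    }


if __name__ == "__main__":
    print(parse_lustre_error(LINE))
```

On Linux the `errno` field for this line resolves to `ENOENT`, which matches what the user process reports.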


            pfarrell Patrick Farrell (Inactive) added a comment

            Alex,

            Thank you very much!  This does indeed look like LU-11827.  ofaaland, when you get a chance, it would be good to try that out.
            aboyko Alexander Boyko added a comment - edited

            This is probably a duplicate of LU-11827. We saw unlink fail with "Invalid argument" during regular mdtest runs.

            @Olaf Faaland, could you check with the LU-11827 patch? It landed on master yesterday.


            pfarrell Patrick Farrell (Inactive) added a comment

            No problem.

            All right, I'll wait for more from you.  Thanks.

            People

              pfarrell Patrick Farrell (Inactive)
              ofaaland Olaf Faaland