Lustre / LU-11984

Intermittent file create or rm fail with EINVAL

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      mdtest intermittently fails, reporting an EINVAL error when trying to create or remove a file.

      mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
      Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v
      Path: /p/lquake/faaland1                                                                                                    
      FS: 1867.3 TiB   Used FS: 34.2%   Inodes: 765.8 Mi   Used Inodes: 57.1%                                                     
      1024 tasks, 1048576 files
      
       Operation               Duration              Rate
         ---------               --------              ----
       * iteration 1 02/20/2019 13:37:43 *                 
         Tree creation     :      0.076 sec,     13.191 ops/sec
      
      02/20/2019 13:39:00: Process 158(opal119): FAILED in create_remove_items_helper, unable to unlink file file.mdtest.158.223 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.158.0): Invalid argument                                                                         
      --------------------------------------------------------------------------                                                                       
      MPI_ABORT was invoked on rank 158 in communicator MPI_COMM_WORLD with errorcode 1.
      

      Seen with:
      no DoM
      no PFL
      16 MDTs in the file system, but the directory mdtest is using is not striped.
      64 nodes x 16 ppn
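
      For reference, a launch matching the failing run above might look like the sketch below under Slurm; the srun invocation is an assumption, and only the mdtest binary path, flags, and target directory are taken from the report (-F: operate on files only, -u: unique working directory per task, -n: items per task, -v: verbose).

      # assumed Slurm launch: 64 nodes x 16 tasks per node = 1024 ranks, as reported
      srun -N 64 --ntasks-per-node=16 \
          /g/g0/faaland1/projects/mdtest/mdtest/mdtest \
          -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v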

      Attachments

        Issue Links

          Activity

            [LU-11984] Intermittent file create or rm fail with EINVAL

            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Duplicate of LU-11827

            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Whoops, missed this update...  But this most recent report is (your new ticket) LU-12063.  Let's close this one out as a duplicate of LU-11827 and move discussion there.
            ofaaland Olaf Faaland added a comment - edited

            Hi Patrick,

            I got the cluster back.  I applied the patch from LU-11827 to Lustre 2.12.0 and am using that build on both client and server. Creates now fail, but much more consistently and with different symptoms.  The user process gets back ENOENT instead of EINVAL. On the server's console is a lustre error, which did not occur before. It is:

            LustreError: 49910:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0001-mdtlov: Can not locate [0x700000bd5:0x16:0x0]: rc = -2
            

            This was produced running a single-node x 16tpp mdtest, without DoM or PFL.

            This seems to me like a different problem entirely, so I am not uploading the debug logs. If you agree it's distinct, I can create a new ticket and put them there.
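
            (For reference: rc = -2 in that message is -ENOENT. The FID it names can usually be mapped back to a pathname from a client with lfs fid2path; the command below is only a sketch, using the /p/lquake mount point from the description.)

            # run on a Lustre client; /p/lquake is the client mount point from the description
            # if the object is genuinely missing, this lookup is expected to fail as well
            lfs fid2path /p/lquake '[0x700000bd5:0x16:0x0]'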


            pfarrell Patrick Farrell (Inactive) added a comment

            Alex,

            Thank you very much!  This does indeed look like LU-11827.  ofaaland, when you get a chance, it would be good to try that out.
            aboyko Alexander Boyko added a comment - edited

            This is probably a duplicate of LU-11827. We saw unlink fail with Invalid argument during a regular mdtest run.

            @Olaf Faaland, could you check with the LU-11827 patch? It was landed to master yesterday.


            pfarrell Patrick Farrell (Inactive) added a comment

            No problem.

            All right, I'll wait for more from you.  Thanks.

            ofaaland Olaf Faaland added a comment

            Unfortunately, the MDS dk log starts just a few seconds after that failure:

            Argh, I should have seen that.

            There's nothing in dmesg from the MDS anywhere near the time of the failure.  I should be able to get the cluster back later today or tomorrow to try again.

            pfarrell Patrick Farrell (Inactive) added a comment

            Unfortunately, the MDS dk log starts just a few seconds after that failure:

             00000100:00100000:6.0:1550708942.344015:0:56668:0:(ptlrpcd.c:409:ptlrpcd_check()) transfer 1 async RPCs [1->0]
            1550708942 <--
            
            00000002:00100000:13.0:1550708843.710600:0:172721:0:(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -22 req@ffff888518842400 x1625568471279792/t0(0) o101->lquake-MDT0002-mdc-ffff8895b3f6f800@172.19.1.113@o2ib100:12/10 lens 720/560 e 0 to 0 dl 1550708949 ref 1 fl Complete:R/0/0 rc 301/301
            
            1550708843 <--

            So it contains no useful info.

            Dmesg from the MDS might have something...?
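
            (As an aside, one way to keep the dk buffer from wrapping before it is dumped is sketched below; the buffer size and debug flags are illustrative examples, not the exact ones requested earlier in this ticket.)

            # on the MDS, before reproducing: enlarge the debug buffer and add tracing flags (example values)
            lctl set_param debug_mb=1024
            lctl set_param debug=+rpctrace
            lctl set_param debug=+dlmtrace
            # ... reproduce the mdtest failure ...
            # then dump the kernel debug log to a file
            lctl dk /tmp/dk.$(hostname).$(date +%s)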

            ofaaland Olaf Faaland added a comment

            Patrick,

            I attached dk.jet3.1550709103.gz which is from the node running MDT0002 at the time.  This is from yesterday, so does not have any of the debug flags you asked for set.  When I get time on that machine again, hopefully today, I'll try again with your flags.  Thank you.


            People

              pfarrell Patrick Farrell (Inactive)
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 5
