
Single client job with -o flock leaves zombie flock on file; -o localflock works fine

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      A single-node MPI data analysis job (8 CPUs) runs on the client. The job completes fine, but subsequent runs fail because a file in the dataset remains locked. A remount clears the file lock.

      After applying LU-17692 and LU-17589 on the client, the error message below is issued on the client when the file lock status is checked after a successful job run.

      00000080:00020000:7.0F:1722023688.289109:0:2533:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2318/2533, start=0/0, end=0/9223372036854775807,type=0/2

      After remounting with -o localflock, the job runs fine and no zombie flock remains.

      After remounting the client with '-o flock' and re-running the application job, the file remains locked. Any attempt to release the lock triggers the "Flock LR mismatch!" error on the client.

      00000080:00020000:7.0:1722024764.304245:0:3100:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2778/3100, start=0/0, end=0/9223372036854775807,type=0/2
      00000080:00020000:7.0:1722025303.376488:0:3179:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2778/3179, start=0/0, end=0/9223372036854775807,type=0/2
      Debug log: 390 lines, 390 kept, 0 dropped, 0 bad.
      00000080:00020000:7.0F:1722025407.995185:0:3184:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2778/3184, start=0/0, end=0/9223372036854775807,type=0/2
      Debug log: 1 lines, 1 kept, 0 dropped, 0 bad.
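
For comparing the mismatch lines above across runs, a small parser can pull out the fields. This is only a sketch: the regular expression is written against the message text exactly as quoted in this ticket, and the `_a`/`_b` field names are placeholders, since the meaning of each side of the `x/y` pairs is not stated here.

```python
import re

# Matches only the message part of the "Flock LR mismatch!" lines quoted above.
# The _a/_b suffixes are placeholder names for the two sides of each "x/y" pair.
FLOCK_MISMATCH = re.compile(
    r"Flock LR mismatch! inode=\[(?P<fid>[^\]]+)\], "
    r"flags=(?P<flags>0x[0-9a-fA-F]+), mode=(?P<mode>\d+), "
    r"pid=(?P<pid_a>\d+)/(?P<pid_b>\d+), "
    r"start=(?P<start_a>\d+)/(?P<start_b>\d+), "
    r"end=(?P<end_a>\d+)/(?P<end_b>\d+),\s*"
    r"type=(?P<type_a>\d+)/(?P<type_b>\d+)"
)


def parse_flock_mismatch(line):
    """Return the mismatch fields as a dict, or None if the line does not match."""
    match = FLOCK_MISMATCH.search(line)
    if match is None:
        return None
    # Keep the FID as text; convert every other field to an integer
    # (base 0 lets int() accept both "0x80000" and plain decimals).
    return {
        key: (value if key == "fid" else int(value, 0))
        for key, value in match.groupdict().items()
    }
```

Applied to the four lines above, the FID, flags, mode, start, end and type fields are identical, and only the two pid values change. The large end value, 9223372036854775807, equals 2**63 - 1.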

      Also used a test program to check the lock (attached).
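
A lock check of this kind can also be sketched in a few lines of Python. This is not the attached wholocked.c, only an illustration under two assumptions: that the stale lock is a POSIX byte-range lock (the kind taken through `fcntl`), and that the caller may open the file read-write. The helper name `is_locked` is hypothetical. If the file is free, the probe takes the lock briefly and releases it again.

```python
import errno
import fcntl
import os


def is_locked(path):
    """Return True if another process holds a conflicting POSIX lock on path."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking exclusive lock request over the whole file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # A conflicting lock is reported as EACCES or EAGAIN.
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return True
            raise
        # The request succeeded, so nobody else held the lock; drop it again.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Run against the affected dataset file once the job has exited, a `True` result with no job process left on the node would be consistent with the zombie lock described above. Locks held by the calling process itself never conflict, so the probe has to run in a separate process from the job.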

       

      Attachments

        1. debug02.flock.4core.txt
          16.05 MB
        2. LU18071-dk-client.txt
          11 kB
        3. LU18071-dk-server.txt
          720 kB
        4. r2u05n1-ldebug.txt
          19 kB
        5. r2u31n1-ldebug.txt
          8 kB
        6. wholocked.c
          3 kB

        Issue Links

          Activity

            [LU-18071] Single client job with -o flock leaves zombie flock on file; -o localflock works fine
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-17871 [ LU-17871 ]
            aeonjeff Jeff Johnson made changes -
            Attachment New: r2u05n1-ldebug.txt [ 55918 ]
            Attachment New: r2u31n1-ldebug.txt [ 55919 ]
            arshad512 Arshad Hussain made changes -
            Attachment New: debug02.flock.4core.txt [ 55747 ]
            aeonjeff Jeff Johnson made changes -
            Attachment New: LU18071-dk-client.txt [ 55593 ]
            Attachment New: LU18071-dk-server.txt [ 55594 ]
            arshad512 Arshad Hussain made changes -
            Description: log excerpts moved from {{...}} inline markup into {noformat} blocks; wording unchanged.
            aeonjeff Jeff Johnson created issue -

            People

              Assignee: wc-triage WC Triage
              Reporter: aeonjeff Jeff Johnson
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated: