Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1353

mdt_reint_open() @@@ OPEN & CREAT not in open replay

Details

    • 3
    • 9748

    Description

      We occasionally see the message in the summary show up in the MDS console log during server recovery. What might cause this?

      Attachments

        Issue Links

          Activity

            [LU-1353] mdt_reint_open() @@@ OPEN & CREAT not in open replay

            Close old ticket.

            adilger Andreas Dilger added a comment - Close old ticket.
            laisiyao Lai Siyao added a comment -

            The code looks correct in all open, unlink handling, we need to find a simple way to reproduce this to collect more debug logs with 'vfstrace,inode' enabled on both MDS and clients.

            laisiyao Lai Siyao added a comment - The code looks correct in all open, unlink handling, we need to find a simple way to reproduce this to collect more debug logs with 'vfstrace,inode' enabled on both MDS and clients.

            Here you go:

            # grove513 /root > stat /p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]
            stat: cannot stat `/p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]': No such file or directory
            
            # grove-mds2 /mnt/snap > ls -Rlan PENDING/
            PENDING/:
            total 29
            drwxr-xr-x 2     0     0 2 Dec 31  1969 .
            drwxr-xr-x 2     0     0 2 Dec 31  1969 .
            drwxr-xr-x 4     0     0 2 Sep 28 15:39 ..
            drwx------ 2 37693 37693 2 Nov 28 18:21 0000000200018ee8:0000000f:00000000: 0
            
            PENDING/0000000200018ee8:0000000f:00000000: 0:
            total 16
            drwx------ 2 37693 37693 2 Nov 28 18:21 .
            drwx------ 2 37693 37693 2 Nov 28 18:21 .
            drwxr-xr-x 2     0     0 2 Dec 31  1969 ..
            

            Would it still be around a few days later?

            BTW, do you know how simul runs in during this MDS recovery? Is there any hint on whether these not found files been unlinked?

            I'm not sure off the top of my head..

            prakash Prakash Surya (Inactive) added a comment - Here you go: # grove513 /root > stat /p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0] stat: cannot stat `/p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]': No such file or directory # grove-mds2 /mnt/snap > ls -Rlan PENDING/ PENDING/: total 29 drwxr-xr-x 2 0 0 2 Dec 31 1969 . drwxr-xr-x 2 0 0 2 Dec 31 1969 . drwxr-xr-x 4 0 0 2 Sep 28 15:39 .. drwx------ 2 37693 37693 2 Nov 28 18:21 0000000200018ee8:0000000f:00000000: 0 PENDING/0000000200018ee8:0000000f:00000000: 0: total 16 drwx------ 2 37693 37693 2 Nov 28 18:21 . drwx------ 2 37693 37693 2 Nov 28 18:21 . drwxr-xr-x 2 0 0 2 Dec 31 1969 .. Would it still be around a few days later? BTW, do you know how simul runs in during this MDS recovery? Is there any hint on whether these not found files been unlinked? I'm not sure off the top of my head..
            laisiyao Lai Siyao added a comment -

            Could you help verify the fid not found really doesn't exist on MDS?

            Firstly you can use `stat <mountpoint>/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]` to verify whether fid exists.

            If this fid exists, then you can use `lfs fid2path <fsname> [0x2000182dc:0x1f2d1:0x0]` to check the path of this fid.

            Else you can mount MDS filesystem as ldiskfs, and check <mountpoint>/PENDING, is it empty?

            BTW, do you know how simul runs in during this MDS recovery? Is there any hint on whether these not found files been unlinked?

            laisiyao Lai Siyao added a comment - Could you help verify the fid not found really doesn't exist on MDS? Firstly you can use `stat <mountpoint>/.lustre/fid/ [0x2000182dc:0x1f2d1:0x0] ` to verify whether fid exists. If this fid exists, then you can use `lfs fid2path <fsname> [0x2000182dc:0x1f2d1:0x0] ` to check the path of this fid. Else you can mount MDS filesystem as ldiskfs, and check <mountpoint>/PENDING, is it empty? BTW, do you know how simul runs in during this MDS recovery? Is there any hint on whether these not found files been unlinked?

            I got a hit of this with the debug patch applied yesterday. Here's the relevant console output:

            Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 265 clients reconnect.
            LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f2d1:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd78c5050 x1419179046425207/t0(201876066869) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:06 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_close.0->[0x2000182dc:0x1f2d2:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd7fe1c50 x1419179046425214/t0(201876066900) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f380:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd80c9450 x1419179046426510/t0(201876068937) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975968 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 43 previous similar messages
            2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_lseek.0->[0x2000182dc:0x1f45a:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd675f850 x1419179046428033/t0(201876070976) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975969 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 56 previous similar messages
            2012-11-26 16:25:09 Lustre: lstest-MDT0000: Recovery over after 1:15, of 265 clients 256 recovered and 9 were evicted.
            
            prakash Prakash Surya (Inactive) added a comment - I got a hit of this with the debug patch applied yesterday. Here's the relevant console output: Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 265 clients reconnect. LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f2d1:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd78c5050 x1419179046425207/t0(201876066869) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:06 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_close.0->[0x2000182dc:0x1f2d2:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd7fe1c50 x1419179046425214/t0(201876066900) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f380:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd80c9450 x1419179046426510/t0(201876068937) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975968 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 43 previous similar messages 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_lseek.0->[0x2000182dc:0x1f45a:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd675f850 x1419179046428033/t0(201876070976) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975969 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 56 previous similar messages 2012-11-26 16:25:09 Lustre: lstest-MDT0000: Recovery over after 1:15, of 265 clients 256 recovered and 9 were evicted.

            2.1.1-13chaos (or newer) will likely start rolling out into production about 2 weeks from now.

            morrone Christopher Morrone (Inactive) added a comment - 2.1.1-13chaos (or newer) will likely start rolling out into production about 2 weeks from now.
            pjones Peter Jones added a comment -

            Chris

            When do you expect that version of chaos to be deployed into production?

            Peter

            pjones Peter Jones added a comment - Chris When do you expect that version of chaos to be deployed into production? Peter

            I've pulled it into the LLNL 2.1 branch too. It will first appear int 2.1.1-12chaos.

            morrone Christopher Morrone (Inactive) added a comment - I've pulled it into the LLNL 2.1 branch too. It will first appear int 2.1.1-12chaos.

            I observed this same issue during recovery on the Orion branch last night, I'll pull the debug patch is to this branch as well.

            behlendorf Brian Behlendorf added a comment - I observed this same issue during recovery on the Orion branch last night, I'll pull the debug patch is to this branch as well.
            laisiyao Lai Siyao added a comment -

            The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679.

            laisiyao Lai Siyao added a comment - The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679 .

            People

              laisiyao Lai Siyao
              nedbass Ned Bass (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: