Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1353

mdt_reint_open() @@@ OPEN & CREAT not in open replay

Details

    • 3
    • 9748

    Description

      We occasionally see the message in the summary show up in the MDS console log during server recovery. What might cause this?

      Attachments

        Issue Links

          Activity

            [LU-1353] mdt_reint_open() @@@ OPEN & CREAT not in open replay

            I got a hit of this with the debug patch applied yesterday. Here's the relevant console output:

            Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 265 clients reconnect.
            LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f2d1:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd78c5050 x1419179046425207/t0(201876066869) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:06 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_close.0->[0x2000182dc:0x1f2d2:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd7fe1c50 x1419179046425214/t0(201876066900) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f380:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd80c9450 x1419179046426510/t0(201876068937) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975968 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 43 previous similar messages
            2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_lseek.0->[0x2000182dc:0x1f45a:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay.  req@ffff881fd675f850 x1419179046428033/t0(201876070976) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975969 ref 1 fl Complete:/4/0 rc 0/0
            2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 56 previous similar messages
            2012-11-26 16:25:09 Lustre: lstest-MDT0000: Recovery over after 1:15, of 265 clients 256 recovered and 9 were evicted.
            
            prakash Prakash Surya (Inactive) added a comment - I got a hit of this with the debug patch applied yesterday. Here's the relevant console output: Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 265 clients reconnect. LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f2d1:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd78c5050 x1419179046425207/t0(201876066869) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:06 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_close.0->[0x2000182dc:0x1f2d2:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd7fe1c50 x1419179046425214/t0(201876066900) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f380:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd80c9450 x1419179046426510/t0(201876068937) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975968 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 43 previous similar messages 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_lseek.0->[0x2000182dc:0x1f45a:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd675f850 x1419179046428033/t0(201876070976) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975969 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 56 previous similar messages 2012-11-26 16:25:09 Lustre: lstest-MDT0000: Recovery over after 1:15, of 265 clients 256 recovered and 9 were evicted.

            2.1.1-13chaos (or newer) will likely start rolling out into production about 2 weeks from now.

            morrone Christopher Morrone (Inactive) added a comment - 2.1.1-13chaos (or newer) will likely start rolling out into production about 2 weeks from now.
            pjones Peter Jones added a comment -

            Chris

            When do you expect that version of chaos to be deployed into production?

            Peter

            pjones Peter Jones added a comment - Chris When do you expect that version of chaos to be deployed into production? Peter

            I've pulled it into the LLNL 2.1 branch too. It will first appear int 2.1.1-12chaos.

            morrone Christopher Morrone (Inactive) added a comment - I've pulled it into the LLNL 2.1 branch too. It will first appear int 2.1.1-12chaos.

            I observed this same issue during recovery on the Orion branch last night, I'll pull the debug patch is to this branch as well.

            behlendorf Brian Behlendorf added a comment - I observed this same issue during recovery on the Orion branch last night, I'll pull the debug patch is to this branch as well.
            laisiyao Lai Siyao added a comment -

            The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679.

            laisiyao Lai Siyao added a comment - The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679 .

            If a patch is needed to debug further, then please do work on one for b2_1.

            morrone Christopher Morrone (Inactive) added a comment - If a patch is needed to debug further, then please do work on one for b2_1.

            We see this on production servers, not during recovery testing. We can include a debug patch in our tree and wait for it to happen again. If we learn that this is a valid case under Lustre's consistency protocol then perhaps the message should not go to the console.

            nedbass Ned Bass (Inactive) added a comment - We see this on production servers, not during recovery testing. We can include a debug patch in our tree and wait for it to happen again. If we learn that this is a valid case under Lustre's consistency protocol then perhaps the message should not go to the console.
            laisiyao Lai Siyao added a comment -

            The syslog messages doesn't provide much information. IMHO a possible cause may be that a file was created, and then opened, but MDS failed and unfortunately this inode was not synced to disk yet, then during MDS recovery, client tried to replay open, but MDS couldn't find this file, and printed this error message. If it's all that has happened, current handling looks reasonable.

            Will you do such recovery test again? If so, I can provide a debug patch to print more information.

            laisiyao Lai Siyao added a comment - The syslog messages doesn't provide much information. IMHO a possible cause may be that a file was created, and then opened, but MDS failed and unfortunately this inode was not synced to disk yet, then during MDS recovery, client tried to replay open, but MDS couldn't find this file, and printed this error message. If it's all that has happened, current handling looks reasonable. Will you do such recovery test again? If so, I can provide a debug patch to print more information.

            I wasn't able to get the debug log files but the attachment has syslog messages from the MDS. The OPEN & CREAT messages appear starting at May 3 12:16:33. We don't see any problems obviously connected with this error, although we are running into other recovery-related bugs in 2.1, namely LU-1352 and LU-1368.

            nedbass Ned Bass (Inactive) added a comment - I wasn't able to get the debug log files but the attachment has syslog messages from the MDS. The OPEN & CREAT messages appear starting at May 3 12:16:33. We don't see any problems obviously connected with this error, although we are running into other recovery-related bugs in 2.1, namely LU-1352 and LU-1368 .
            nedbass Ned Bass (Inactive) added a comment - - edited

            Attaching Lustre syslog messages from MDS.

            nedbass Ned Bass (Inactive) added a comment - - edited Attaching Lustre syslog messages from MDS.

            People

              laisiyao Lai Siyao
              nedbass Ned Bass (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: