mdt_reint_open() @@@ OPEN & CREAT not in open replay

Details

    • 3
    • 9748

    Description

      We occasionally see the message in the summary show up in the MDS console log during server recovery. What might cause this?

          Activity

            [LU-1353] mdt_reint_open() @@@ OPEN & CREAT not in open replay
            pjones Peter Jones added a comment -

            Chris

            When do you expect that version of chaos to be deployed into production?

            Peter

            morrone Christopher Morrone (Inactive) added a comment -

            I've pulled it into the LLNL 2.1 branch too. It will first appear in 2.1.1-12chaos.

            behlendorf Brian Behlendorf added a comment -

            I observed this same issue during recovery on the Orion branch last night. I'll pull the debug patch into this branch as well.
            laisiyao Lai Siyao added a comment -

            The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679.

            morrone Christopher Morrone (Inactive) added a comment -

            If a patch is needed to debug further, then please do work on one for b2_1.

            nedbass Ned Bass (Inactive) added a comment -

            We see this on production servers, not during recovery testing. We can include a debug patch in our tree and wait for it to happen again. If we learn that this is a valid case under Lustre's consistency protocol, then perhaps the message should not go to the console.
            laisiyao Lai Siyao added a comment -

            The syslog messages don't provide much information. IMHO, one possible cause is that a file was created and then opened, but the MDS failed before the inode was synced to disk. Then, during MDS recovery, the client tried to replay the open, but the MDS couldn't find the file and printed this error message. If that is all that happened, the current handling looks reasonable.

            Will you run such a recovery test again? If so, I can provide a debug patch to print more information.
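
The suspected sequence can be illustrated with a toy model. This is only a sketch of the hypothesis above, not Lustre code: `struct toy_mds` and the `toy_*` functions are hypothetical names, and the model assumes an object exists in memory as soon as it is created but survives a crash only if it was synced to disk first.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_OBJS 8

/* Hypothetical toy MDS: tracks which object ids exist in memory and on disk. */
struct toy_mds {
    bool in_memory[TOY_MAX_OBJS];
    bool on_disk[TOY_MAX_OBJS];
};

static bool toy_valid_id(int id)
{
    return id >= 0 && id < TOY_MAX_OBJS;
}

/* Create an object; it is visible immediately but not yet persistent. */
static int toy_create(struct toy_mds *m, int id)
{
    if (!toy_valid_id(id))
        return -EINVAL;
    m->in_memory[id] = true;
    return 0;
}

/* Flush everything currently in memory to disk. */
static void toy_sync(struct toy_mds *m)
{
    for (int i = 0; i < TOY_MAX_OBJS; i++)
        m->on_disk[i] = m->in_memory[i];
}

/* Crash and restart: only what reached the disk comes back. */
static void toy_crash_and_restart(struct toy_mds *m)
{
    for (int i = 0; i < TOY_MAX_OBJS; i++)
        m->in_memory[i] = m->on_disk[i];
}

/* Replayed open looks the object up by id and never creates it. */
static int toy_replay_open(const struct toy_mds *m, int id)
{
    if (!toy_valid_id(id))
        return -EINVAL;
    return m->in_memory[id] ? 0 : -ENOENT;
}
```

In this model, an object created after the last `toy_sync()` is gone after `toy_crash_and_restart()`, so a replayed open of it returns `-ENOENT`, while an object synced before the crash replays successfully.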

            nedbass Ned Bass (Inactive) added a comment -

            I wasn't able to get the debug log files, but the attachment has syslog messages from the MDS. The OPEN & CREAT messages appear starting at May 3 12:16:33. We don't see any problems obviously connected with this error, although we are running into other recovery-related bugs in 2.1, namely LU-1352 and LU-1368.
            nedbass Ned Bass (Inactive) added a comment - - edited

            Attaching Lustre syslog messages from MDS.
            laisiyao Lai Siyao added a comment -

            During MDS recovery, all files that clients have open need to be opened again; this is called open replay. In your case, the replayed open failed with -ENOENT and the open was not called with O_CREAT (meaning this open should not create a new file). This is abnormal, so this error message is printed. Did you see anything wrong after this?

            If you hit this again, could you collect /var/log/messages and dump the debug log from both the client and the server?
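
The condition described above can be sketched as a small predicate. This is an illustrative sketch only, not the actual mdt_reint_open() code: the function name `replay_open_is_abnormal` and its parameters are hypothetical, and the POSIX `O_CREAT` flag stands in for whatever create flag the MDS checks internally.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when a replayed open hits the
 * abnormal case from the comment above, i.e. the object lookup failed
 * with -ENOENT and the open did not ask to create the file.
 *
 * lookup_rc  - result of looking up the object (0 or a negative errno)
 * open_flags - open flags from the request (O_CREAT is an assumption)
 * in_replay  - whether the request is being replayed during recovery
 */
static bool replay_open_is_abnormal(int lookup_rc, int open_flags, bool in_replay)
{
    if (!in_replay)
        return false;            /* normal opens are handled elsewhere */
    if (lookup_rc != -ENOENT)
        return false;            /* object found, or a different error */
    return (open_flags & O_CREAT) == 0;  /* missing object, no create flag */
}
```

Under these assumptions, a replayed open with `O_CREAT` set can simply re-create the missing file, so only the no-create case is flagged.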
            pjones Peter Jones added a comment -

            Lai

            Could you please comment on this one?

            Thanks

            Peter

            People

              laisiyao Lai Siyao
              nedbass Ned Bass (Inactive)