Details

    • Story
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.6
    • None
    • EL5
    • 5768

    Description

      I have received notice of an issue a user at TACC is having:

      I am having a problem where my git repositories on TACC systems get corrupted, leading to errors like this:

      login4$ git status
      fatal: index file smaller than expected
      login4$ git status
      error: object file .git/objects/01/ee9f4bfe74aaee027a3e8418d70d337e1235d3
      is empty
      fatal: loose object 01ee9f4bfe74aaee027a3e8418d70d337e1235d3 (stored in
      .git/objects/01/ee9f4bfe74aaee027a3e8418d70d337e1235d3) is corrupt
      login4$ git status
      error: object file .git/objects/8d/6083737dae5cb67906ac26702465ca2d70bc95
      is empty
      fatal: loose object 8d6083737dae5cb67906ac26702465ca2d70bc95 (stored in
      .git/objects/8d/6083737dae5cb67906ac26702465ca2d70bc95) is corrupt
      login4$ git status
      error: object file .git/objects/bc/61c57143652fbf198de898ca7bb9d5659a5de0
      is empty
      fatal: loose object bc61c57143652fbf198de898ca7bb9d5659a5de0 (stored in
      .git/objects/bc/61c57143652fbf198de898ca7bb9d5659a5de0) is corrupt
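
      The errors above all report loose object files that exist but have zero length. A minimal sketch for listing every such file in one pass, rather than discovering them one `git status` at a time, might look like this. The function name `list_empty_objects` is hypothetical and not part of git.

```shell
#!/bin/sh
# Hypothetical helper: list zero-length files under <repo>/.git/objects.
# A healthy loose object is never empty, so any hit is worth inspecting.
list_empty_objects() {
    # -size 0 matches files that occupy zero 512-byte units, i.e. empty files.
    find "$1/.git/objects" -type f -size 0
}

# Example (run from the top of a working tree):
#   list_empty_objects .
```

      An empty result does not prove the repository is healthy; it only rules out this particular symptom.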
      

          Activity

            [LU-2440] git repositories get corrupted

            Close old issue.

            adilger Andreas Dilger added a comment

            Hi Richard,
            /bin/sync can be executed by others on our system.

            Best,

            Maxime

            mboisson Maxime Boissonneault added a comment

            On the topic of sync: I've seen clients on some systems where /bin/sync is executable only by root.

            rhenwood Richard Henwood (Inactive) added a comment

            Hi Richard,
            Here is the version information:
            [mboisson@colosse2 ~]$ cat /proc/fs/lustre/version
            lustre: 1.8.8
            kernel: patchless_client
            build: jenkins-wc1-gbc88c4c-PRISTINE-2.6.18-308.16.1.el5

            The mount options are:
            mds2-ib0@o2ib,mds1-ib0@o2ib:/lustre1 on /lustre type lustre (rw,noauto,localflock)
            mds4-ib0@o2ib,mds3-ib0@o2ib:/lustre2 on /lustre2 type lustre (rw,noauto,localflock)
            10.225.16.3@o2ib0:/fs1 on /lustre3 type lustre (rw,noauto,localflock)

            How do I know if sync is enabled?
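
            If "sync enabled" here means whether an ordinary user is allowed to execute /bin/sync (an assumption; the question may have meant something else), a quick check could be sketched as follows. The function name `report_sync_access` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user may execute sync.
# ASSUMPTION: "sync enabled" refers to execute permission on /bin/sync.
report_sync_access() {
    sync_bin=${1:-/bin/sync}
    if [ -x "$sync_bin" ]; then
        echo "$sync_bin: executable by $(id -un)"
    else
        echo "$sync_bin: NOT executable by $(id -un)"
        return 1
    fi
}

# Example:
#   report_sync_access            # checks /bin/sync
#   ls -l /bin/sync               # shows the permission bits directly
```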

            We cannot readily reproduce the problem. First, it no longer seems to happen, and second, when it did happen, it was at random times.

            Best,

            Maxime

            mboisson Maxime Boissonneault added a comment

            Thanks for this info Maxime.

            Can you provide the version of Lustre you observed this problem with, the mount options, and whether sync is enabled on your machine?

            If you can readily reproduce this problem - and can share a reproducible configuration - that would be very helpful.

            cheers,
            Richard

            rhenwood Richard Henwood (Inactive) added a comment

            Hi,
            Just as a note, we had one user who had this problem. About 6 weeks ago, our sysadmin increased two parameters on the Lustre clients:
            LRU
            MaxDirtyMegabytes

            He increased the LRU to 10 000, and MaxDirtyMegabytes to 256MB.

            Since then, our user did not get an error.

            Might be worth investigating those parameters.
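
            For anyone wanting to look at these two settings on a client, a rough sketch follows. It assumes that "LRU" refers to `ldlm.namespaces.*.lru_size` and "MaxDirtyMegabytes" to `osc.*.max_dirty_mb`; that mapping is a guess and is not confirmed anywhere in this ticket. The function names are hypothetical.

```shell
#!/bin/sh
# ASSUMPTION: "LRU" = ldlm.namespaces.*.lru_size and
# "MaxDirtyMegabytes" = osc.*.max_dirty_mb. Verify on your own client.

# Print the current values of both tunables.
show_client_tunables() {
    # The patterns are quoted so that lctl, not the shell, expands them.
    lctl get_param 'ldlm.namespaces.*.lru_size' &&
    lctl get_param 'osc.*.max_dirty_mb'
}

# Apply the values reported in this comment (needs root; not persistent).
apply_reported_tunables() {
    lctl set_param 'ldlm.namespaces.*.lru_size=10000' &&
    lctl set_param 'osc.*.max_dirty_mb=256'
}
```

            Values set this way are lost on remount, so recording the output of `show_client_tunables` before and after any change makes later comparison easier.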

            Regards,

            Maxime Boissonneault

            mboisson Maxime Boissonneault added a comment

            I've seen a transient corruption:

            $ git fsck
            error: 6419686540529fe8937aa6a7f01989109c7be7c6: object corrupt or missing
            error: af074dd53d04eda6d5db0f2368f0390d4060ca70: object corrupt or missing
            error: ffdec01d82af9ca59bec7b1fdf941e4a8d84db2e: object corrupt or missing
            fatal: index file smaller than expected
            login2$
            
            $ diff ~/git_before.md5sum ~/git_after.md5sum
            0a1
            > 2b1bc9e225f27e10a228974b47281ff9  .git/refs/heads/master
            20a22,24
            > 122257de7cf6016e026760c7791d1d5a  .git/objects/af/074dd53d04eda6d5db0f2368f0390d4060ca70
            > be5e8c66fd02b1d9662a6deb2d4d325a  .git/objects/64/19686540529fe8937aa6a7f01989109c7be7c6
            > 9963ce604411643548921a66fc0a67d2  .git/objects/ff/dec01d82af9ca59bec7b1fdf941e4a8d84db2e
            29,32c33,36
            < 1414af68fbd29b3dafa9152a49453010  .git/logs/refs/heads/master
            < 1414af68fbd29b3dafa9152a49453010  .git/logs/HEAD
            < fde41e17523926db4a56131b9a313c54  .git/COMMIT_EDITMSG
            < 99e1f2253855d6cf020dce0fff06fdfd  .git/index
            ---
            > 1f8f81bd507eac9467924aab7cbe9995  .git/logs/refs/heads/master
            > 1f8f81bd507eac9467924aab7cbe9995  .git/logs/HEAD
            > 42bee3bb1f71aec0b3e61f0fcf4f65d6  .git/COMMIT_EDITMSG
            > a73a56d1dc1af89cbcb0abc836864f82  .git/index
            

            and then try git fsck again:

            login2$ git fsck
            login2$
            

            I notice that the error here is "object corrupt or missing", whereas the originally reported git status error was "loose object ... is corrupt".

            NOTE: These results are from a machine where sync is not available to the user.

            rhenwood Richard Henwood (Inactive) added a comment

            Richard,
            "overnight" might just be the time it takes for idle DLM locks to be cancelled. What would be useful is:

            • enable full debug logging, like lctl set_param debug=-1
            • do "git update" or "git gc" or whatever is the trigger
            • dump debug logs, like lctl dk /tmp/git_update.log
            • verify repository is not corrupted
            • get checksums of all of the files under .git, like find .git -type f | xargs md5sum > git_before.md5sum
            • cancel all of the DLM locks on the client, like lctl set_param ldlm.namespaces.*.lru_size=clear
            • dump debug logs, like lctl dk /tmp/git_dlm_cancel.log
            • get checksums of all the .git files again (into a new file)
            • compare checksums of before and after lock cancel

            If the checksums are different, then there is some problem with the cache flushing or similar.

            However, without a more specific reproducer, it won't be very easy to isolate when this is happening.
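
            The steps listed above could be scripted roughly as below. This is only a sketch: the function names are hypothetical, it assumes GNU find, sort, xargs and md5sum, and `lock_cancel_check` must run as root on a Lustre client because it calls lctl. The file listing is sorted so that the two snapshots diff cleanly.

```shell
#!/bin/sh
# Sketch of the before/after checksum comparison around a DLM lock cancel.
# All function names are hypothetical; GNU userland is assumed.

# Write a sorted md5sum listing of every file under <repo>/.git to <outfile>.
snapshot_git() {
    ( cd "$1" && find .git -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$2"
}

# Exit 0 if both snapshots match; otherwise print the diff and exit non-zero.
compare_snapshots() {
    diff "$1" "$2"
}

# lock_cancel_check <repo> <workdir> <trigger command...>
# Returns 0 if checksums survive the lock cancel, 1 if they differ,
# 2 if a step could not be run, 3 if the repo was already corrupt.
lock_cancel_check() {
    repo=$1
    work=$2
    shift 2
    lctl set_param debug=-1 || return 2                   # full debug logging
    ( cd "$repo" && "$@" ) || return 2                    # trigger, e.g. git gc
    lctl dk "$work/git_update.log" > /dev/null || return 2
    ( cd "$repo" && git fsck ) || return 3                # must be clean here
    snapshot_git "$repo" "$work/git_before.md5sum" || return 2
    lctl set_param 'ldlm.namespaces.*.lru_size=clear' || return 2
    lctl dk "$work/git_dlm_cancel.log" > /dev/null || return 2
    snapshot_git "$repo" "$work/git_after.md5sum" || return 2
    compare_snapshots "$work/git_before.md5sum" "$work/git_after.md5sum"
}

# Example:
#   lock_cancel_check /lustre/path/to/repo /tmp git gc
```

            A non-zero result from the final comparison is the interesting case: nothing modified the repository between the two snapshots, so any difference points at the client cache.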

            adilger Andreas Dilger added a comment

            Thanks for this update. I have been able to reproduce this issue; leaving a repo overnight did it in my case.

            I'm following up to see if this is also an issue with Master, and trying to shorten the time to reproduce.

            rhenwood Richard Henwood (Inactive) added a comment

            One of our users has what seems to be the exact same problem with git. We are running Lustre 1.8.8 clients with 1.8.4 and 1.8.5 servers. The user is able to reproduce the problem every now and then by running "git gc" from a crontab. The problem seems to appear once or twice per week.
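
            To narrow down when the corruption appears, the cron job could check the repository right after each "git gc" and record a timestamp on failure. The wrapper below is a hypothetical sketch, not the user's actual crontab entry; the function name and log format are made up for illustration.

```shell
#!/bin/sh
# Hypothetical cron wrapper: run "git gc", then "git fsck", and append a
# timestamped line to <logfile> whenever either step fails.
gc_and_check() {
    repo=$1
    log=$2
    if ( cd "$repo" && git gc --quiet && git fsck > /dev/null 2>&1 ); then
        return 0
    fi
    printf '%s corruption suspected in %s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$repo" >> "$log"
    return 1
}

# Example crontab command, assuming this file is saved as gc_and_check.sh:
#   . /path/to/gc_and_check.sh && gc_and_check /lustre/path/to/repo /tmp/git_gc.log
```

            The logged timestamps can then be matched against client debug logs or lock activity from the same period.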

            mboisson Maxime Boissonneault added a comment

            People

              rhenwood Richard Henwood (Inactive)
              rhenwood Richard Henwood (Inactive)