  1. Lustre
  2. LU-3019

Files are corrupted after OSS unmount.

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.4
    • None
    • Rocks 5.3 cluster distributive (CentOS5 based).
    • 4
    • 7342

    Description

      Our small Lustre system recently ran into a strange problem.
      We run Lustre 2.1.4 from the precompiled binaries on the Whamcloud site (our system is CentOS 5 compatible, so we must stick to 2.1.x), upgraded from 2.0.0.1 several weeks ago. The system consisted of one MDS and one OSS (a single 8 TB OST).

      After we added a second, empty OSS (also 8 TB), a slow, steady migration of files began. Then the problem arose. When the second OSS was unmounted (on one particular occasion), the files our users were working with at that moment became corrupted, along with many unrelated, randomly scattered files that had been untouched for months (maybe those that were being migrated when the unmount happened?).

      These files appear to be of zero size and can only be deleted.

      When we try to access one of these files, we see messages like this on the new OSS:
      kernel: LustreError: 18600:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 1501826: rc -2
      (No new messages in MDS or old OSS logs.)

      Even a graceful unmount (MDS, then both OSSs, then MGS) leads to file corruption. (Or is this the wrong order for stopping the system?)

      After running the check described in section 27.2, "Recovering from Corruption in the Lustre File System", of the manual, we get millions of (harmless?) messages like:

      [1] zero-length orphan objid 0:8371035

      and hundreds of messages like:

      Failed to find fid [0x20000a041:0x3f99:0x0]: DB_NOTFOUND: No matching key/data pair found
      [0]: MDS FID [0x20000a041:0x3f99:0x0] object 0:984246 deleted?

      After correcting these errors with lfsck -l -c, I checked the filesystem once more and got many more errors of the same type. (The system was mounted and unmounted in order to perform the check; apart from a few reads with cd/ls/cat, there were no other accesses.)
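      To compare the checks before and after the fix, the two message classes quoted above can be tallied per log. The following is only a minimal sketch: the function name tally_lfsck is hypothetical, the match patterns are taken from the messages shown in this report, and it assumes each message sits on its own line in the log.

```shell
# Count the two lfsck message classes quoted in this report.
# Reads a log on stdin; tally_lfsck is a hypothetical helper name.
tally_lfsck() {
    awk '
        /zero-length orphan objid/ { orphans++ }
        /Failed to find fid/       { missing++ }
        END {
            printf "zero-length orphans: %d\n", orphans
            printf "missing FIDs: %d\n", missing
        }
    '
}

# Demo on the sample lines from the report.
printf '%s\n' \
    '[1] zero-length orphan objid 0:8371035' \
    'Failed to find fid [0x20000a041:0x3f99:0x0]: DB_NOTFOUND: No matching key/data pair found' \
    '[0]: MDS FID [0x20000a041:0x3f99:0x0] object 0:984246 deleted?' |
    tally_lfsck
```

      For the attached logs, something like "gzip -dc chklog_before_fix.gz | tally_lfsck" followed by the same command for chklog_after_fix.gz would give two pairs of counts to compare.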

      Both OSSs are in failout mode.
      For historical reasons the old OSS is OSS1 (half a year ago we had to migrate off a degrading OSS), so the new OSS became OSS0.

      For one of the corrupted files, /lustre0/users/kglukhov/Calcs/Abinit/SPS/para/fo+Cr_par/Sn2P2S6o_DS3_WFK:

      [root@compute-0-7 fo+Cr_par]# lfs getstripe Sn2P2S6o_DS3_WFK
      Sn2P2S6o_DS3_WFK
      lmm_stripe_count: 1
      lmm_stripe_size: 1048576
      lmm_stripe_offset: 0
      obdidx    objid      objid       group
      0         1500007    0x16e367    0
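      When many files need checking, the OST index and object ID can be pulled out of this output mechanically. This is a rough sketch only: parse_getstripe is a hypothetical helper name, and it assumes the output layout shown above (a header row starting with "obdidx" followed by one row per stripe).

```shell
# Extract "obdidx objid" pairs from `lfs getstripe` output read on stdin.
# parse_getstripe is a hypothetical helper; it assumes the table layout
# shown in this report (header row starting with "obdidx").
parse_getstripe() {
    awk '
        $1 == "obdidx" { in_table = 1; next }
        in_table && NF >= 4 && $1 ~ /^[0-9]+$/ { print $1, $2 }
    '
}

# Demo on the output quoted in the report.
printf '%s\n' \
    'Sn2P2S6o_DS3_WFK' \
    'lmm_stripe_count: 1' \
    'lmm_stripe_size: 1048576' \
    'lmm_stripe_offset: 0' \
    'obdidx objid objid group' \
    '0 1500007 0x16e367 0' |
    parse_getstripe
```

      On a client this would be used as "lfs getstripe Sn2P2S6o_DS3_WFK | parse_getstripe"; for a file striped over several OSTs it prints one pair per stripe.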

      On the corresponding (new) OSS node:
      [root@lustre-compute-0-3 mnt]# debugfs -c -R "stat O/0/d$((1500007 % 32))/1500007" /dev/md3
      debugfs 1.42.6.wc2 (10-Dec-2012)
      /dev/md3: catastrophic mode - not reading inode or group bitmaps
      O/0/d7/1500007: File not found by ext2_lookup
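      The object path used in the debugfs call above follows the pattern O/<group>/d<objid % 32>/<objid>. As a small sketch of that mapping (the helper names ost_object_path and debugfs_stat_cmd are hypothetical, and the second one only prints the command rather than running it):

```shell
# Map an OST object ID to its path inside the OST's backing filesystem,
# using the O/<group>/d<objid % 32>/<objid> layout from the command above.
# Both function names are hypothetical helpers for illustration.
ost_object_path() {
    local objid=$1
    local group=${2:-0}
    printf 'O/%s/d%d/%d\n' "$group" "$((objid % 32))" "$objid"
}

# Print (do not execute) the read-only debugfs command for one object.
debugfs_stat_cmd() {
    local objid=$1
    local device=$2
    printf 'debugfs -c -R "stat %s" %s\n' "$(ost_object_path "$objid")" "$device"
}

ost_object_path 1500007
debugfs_stat_cmd 1500007 /dev/md3
```

      For objid 1500007 the remainder modulo 32 is 7, which matches the O/0/d7/1500007 path that debugfs reports as missing.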

      Can you help us understand what is going on and how to tackle it?

      I am attaching the output of all commands from the first check (following section 27.2 of the manual), plus the lfsck output for the first check (-n), the error fixing (-l -c), and the second full check.

      Attachments

        1. chklog_after_fix.gz
          1.27 MB
        2. chklog_before_fix.gz
          1.27 MB
        3. chklog_fixing.gz
          1.26 MB
        4. first_check_logs.tar.gz
          2 kB

          People

            wc-triage WC Triage
            indrekis Oleg Farenyuk