
MDT attempted to access beyond the disk

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • Fix Version/s: Lustre 2.4.0
    • Environment: Lustre 2.3.62 running on the servers as well as the clients.
    • 4
    • 7318

    Description

      While running mdtest, the file system went into read-only mode and reported corruption on the MDS. The error is:

      [56736.791445] attempt to access beyond end of device
      [56736.804198] md5: rw=0, want=10484831811994872656, limit=282775552
      [56736.818141] LDISKFS-fs error (device md5): ldiskfs_xattr_delete_inode: inode 22016621: block 3616446985713053033 read error
      [56736.837277] Aborting journal on device md5-8.
      [56736.860700] LDISKFS-fs (md5): Remounting filesystem read-only
      [56736.881388] LDISKFS-fs error (device md5) in ldiskfs_free_inode: Journal has aborted
      [56736.896935] LustreError: 8976:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff880414777b80 commit error: 2
      [56736.915693] LustreError: 8976:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff880113757680 commit error: 2
      [56741.072627] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) lustre-OST0000-osc-MDT0000: fail to write header for llog #0x81#0x1#00000000: rc = -30
      [56741.101191] LustreError: 9039:0:(llog_cat.c:535:llog_cat_cancel_records()) lustre-OST0000-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      [56741.129587] LustreError: 9039:0:(osp_sync.c:720:osp_sync_process_committed()) lustre-OST0000-osc-MDT0000: can't cancel record: -30
      [56741.156530] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) lustre-OST0000-osc-MDT0000: fail to write header for llog #0x81#0x1#00000000: rc = -30
      [56741.157240] LustreError: 9041:0:(llog_cat.c:535:llog_cat_cancel_records()) lustre-OST0001-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      [56741.157245] LustreError: 9041:0:(osp_sync.c:720:osp_sync_process_committed()) lustre-OST0001-osc-MDT0000: can't cancel record: -30
      [56741.241180] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) Skipped 2 previous similar messages
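
      For scale, the "want" and "limit" values in the md message are 512-byte sector counts, so the request lands roughly 3.7e10 device-lengths past the end of md5. A quick way to confirm the device side of that comparison (a sketch; it assumes /dev/md5 is the MDT device named in the log and that the message uses the usual 512-byte sector units):

      # Device size in 512-byte sectors; should match limit=282775552 in the message above.
      blockdev --getsz /dev/md5
      # How many whole device-lengths past the end the request was (roughly 3.7e10).
      echo '10484831811994872656 / 282775552' | bc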

      Attachments

        Issue Links

          Activity

            [LU-3005] MDT attempted to access beyond the disk
            pjones Peter Jones added a comment -

            Let's focus discussion under the original ticket - LU-2980

            simmonsja James A Simmons added a comment -

            This is a duplicate of LU-2980. If I encounter this bug again, what other data should I get?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Also, the output of a debugfs "stat <22016621>" sub-command, run on the MDT mounted as ldiskfs, could give more info and help to see whether the corruption was on-disk or not.
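
            A minimal sketch of that check, run against the MDT device (assumed here to be /dev/md5, as in the log); debugfs only reads the device, and -c opens it in read-only catastrophic mode:

            # Dump the on-disk inode 22016621 (block/extent pointers, xattr block, link count)
            # to see whether the bad block number is actually stored on disk.
            debugfs -c -R 'stat <22016621>' /dev/md5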

            simmonsja James A Simmons added a comment -

            First time today, and no other info showed up in the logs. I rebuilt the file system and now it seems to have gone away. I have a DDN 9900 attached to 4 OSSes. Each OSS has 7 OSTs. The MGS has a simple SATA disk and the MDS has an md device. The servers are attached to the clients with DDR InfiniBand.

            keith Keith Mannthey (Inactive) added a comment -

            It seems to be WAAAY past the end.

            "md5: rw=0, want=10484831811994872656, limit=282775552"

            Are there any other relevant md messages in your logs?

            How often have you seen this?

            Can you describe your test configuration a bit more? Can you share your mdtest values and mount info?
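
            If the problem reappears, the information being asked for here could be collected with something along these lines (a sketch; the device name and the md5 array are taken from the log above):

            # MD array state and any other md/raid kernel messages on the MDS.
            cat /proc/mdstat
            dmesg | grep -iE 'md5|raid'
            # Lustre mounts and their options on the servers.
            mount -t lustre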

            People

              Assignee:
              bfaccini Bruno Faccini (Inactive)
              Reporter:
              simmonsja James A Simmons
              Votes:
              0
              Watchers:
              5
