[LU-3005] MDT attempted to access beyond the disk Created: 21/Mar/13  Updated: 22/Mar/13  Resolved: 22/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: HB
Environment:

Lustre 2.3.62 running on the servers as well as the clients.


Issue Links:
Duplicate
duplicates LU-2980 sanity.sh test_17b: Read-only file sy... Resolved
Related
is related to LU-2980 sanity.sh test_17b: Read-only file sy... Resolved
Severity: 4
Rank (Obsolete): 7318

 Description   

While running mdtest, the file system went into read-only mode and reported corruption on the MDS. The error is:

[56736.791445] attempt to access beyond end of device
[56736.804198] md5: rw=0, want=10484831811994872656, limit=282775552
[56736.818141] LDISKFS-fs error (device md5): ldiskfs_xattr_delete_inode: inode 22016621: block 3616446985713053033 read error
[56736.837277] Aborting journal on device md5-8.
[56736.860700] LDISKFS-fs (md5): Remounting filesystem read-only
[56736.881388] LDISKFS-fs error (device md5) in ldiskfs_free_inode: Journal has aborted
[56736.896935] LustreError: 8976:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff880414777b80 commit error: 2
[56736.915693] LustreError: 8976:0:(osd_handler.c:636:osd_trans_commit_cb()) transaction @0xffff880113757680 commit error: 2
[56741.072627] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) lustre-OST0000-osc-MDT0000: fail to write header for llog #0x81#0x1#00000000: rc = -30
[56741.101191] LustreError: 9039:0:(llog_cat.c:535:llog_cat_cancel_records()) lustre-OST0000-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
[56741.129587] LustreError: 9039:0:(osp_sync.c:720:osp_sync_process_committed()) lustre-OST0000-osc-MDT0000: can't cancel record: -30
[56741.156530] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) lustre-OST0000-osc-MDT0000: fail to write header for llog #0x81#0x1#00000000: rc = -30
[56741.157240] LustreError: 9041:0:(llog_cat.c:535:llog_cat_cancel_records()) lustre-OST0001-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
[56741.157245] LustreError: 9041:0:(osp_sync.c:720:osp_sync_process_committed()) lustre-OST0001-osc-MDT0000: can't cancel record: -30
[56741.241180] LustreError: 9039:0:(llog.c:161:llog_cancel_rec()) Skipped 2 previous similar messages
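A quick sanity check on those numbers (illustrative, assuming the usual kernel convention that want/limit count 512-byte sectors and that the MDT uses 4 KiB ldiskfs blocks, i.e. 8 sectors per block): the end sector of a one-block read of the garbage xattr block 3616446985713053033 reported above works out, modulo 2^64, to exactly the "want" value, so the out-of-range access and the bad xattr block pointer appear to be the same corruption:

echo '(3616446985713053033 * 8 + 8) % 2^64' | bc
10484831811994872656

For scale, limit=282775552 sectors is roughly 135 GiB (the size of md5), and the rc = -30 (EROFS) failures that follow are the expected fallout of the journal abort and read-only remount rather than separate problems.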



 Comments   
Comment by Keith Mannthey (Inactive) [ 21/Mar/13 ]

It seems to be WAAAY past the end.

"md5: rw=0, want=10484831811994872656, limit=282775552"

Are there any other relevant md messages in your logs?

How often have you seen this?

Can you describe your test configuration a bit more? Can you share your mdtest values and mount info?

Comment by James A Simmons [ 21/Mar/13 ]

This is the first time I have seen it; it happened today, and no other info showed up in the logs. I rebuilt the file system and now it seems to have gone away. I have a DDN 9900 attached to 4 OSSes; each OSS has 7 OSTs. The MGS has a simple SATA disk and the MDS uses an md device. Everything is attached to the clients with DDR InfiniBand.

Comment by Bruno Faccini (Inactive) [ 21/Mar/13 ]

Also, the output of a debugfs "stat <22016621>" sub-command run on the MDT (mounted as ldiskfs) could give more info and help determine whether the corruption was on-disk or not.
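For reference, a minimal sketch of that check (assuming the MDT is on /dev/md5 as in the logs above; debugfs from e2fsprogs reads the device directly and opens it read-only by default, though results are most reliable with the target quiescent):

# Dump the on-disk inode; <N> addresses an inode by number.
debugfs -R 'stat <22016621>' /dev/md5

In the stat output, the "File ACL:" field is the inode's xattr block pointer; if it shows an absurd value such as 3616446985713053033 instead of 0 or a block number within the device, the corruption is on-disk rather than in-memory.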

Comment by James A Simmons [ 22/Mar/13 ]

This is a duplicate of LU-2980. If I encounter this bug again, what other data should I gather?

Comment by Peter Jones [ 22/Mar/13 ]

Let's focus discussion under the original ticket - LU-2980
