[LU-1232] Input/Output error during large lun test Created: 19/Mar/12  Updated: 21/Mar/12  Resolved: 21/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Yang Sheng
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

2.2-RC1-RHEL6 server and client


Attachments: File dmesg     File dmesg_partial     Text File large_lun.log     Text File large_lun_partial.log    
Severity: 3
Rank (Obsolete): 6430

 Description   

While running the large-lun test with a 24T OST on the Juelich cluster, I got this error when running llverfs in full mode on the OST ldiskfs filesystem:

write filename: /mnt/ost1/dir00157/file025, current 787.039 MB/s, overall 100.624 MB/s, est 4294967248:4294967257:4294967237 left
write filename: /mnt/ost1/dir00157/file026, current 794.458 MB/s, overall 100.642 MB/s, est 4294967248:4294967252:429496724
llverfs: Open '/mnt/ost1/dir00172/file002' failed:Input/output error

Please see the attached files for the console log and dmesg.
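
For context, the sketch below shows the kind of write-then-verify pass that full-mode llverfs performs on the mounted ldiskfs filesystem. It is only a conceptual Python sketch, not llverfs itself; the mountpoint, directory/file names, and sizes are placeholders rather than the values used by the large-lun test.

import os, struct

MOUNTPOINT = "/mnt/ost1"      # hypothetical mountpoint of the OST ldiskfs filesystem
FILE_SIZE = 4 << 20           # 4 MiB per file for the sketch; the real test writes far more data
CHUNK = 1 << 20               # write/verify in 1 MiB chunks

def pattern(offset, length):
    # Encode each 8-byte word's own offset into the data so corruption or
    # misplaced writes show up on the read-back pass.
    blob = bytearray()
    for off in range(offset, offset + length, 8):
        blob += struct.pack("<Q", off)
    return bytes(blob[:length])

def write_and_verify(path):
    with open(path, "wb") as f:
        for off in range(0, FILE_SIZE, CHUNK):
            f.write(pattern(off, CHUNK))
    with open(path, "rb") as f:   # an open() or read() failure here is the EIO reported above
        for off in range(0, FILE_SIZE, CHUNK):
            if f.read(CHUNK) != pattern(off, CHUNK):
                raise IOError("data mismatch in %s at offset %d" % (path, off))

d = os.path.join(MOUNTPOINT, "dir00000")
os.makedirs(d, exist_ok=True)
write_and_verify(os.path.join(d, "file000"))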



 Comments   
Comment by Peter Jones [ 19/Mar/12 ]

Yang Sheng,

Could you please advise on this one?

Thanks

Peter

Comment by Yang Sheng [ 20/Mar/12 ]

This issue looks like it was caused by a hardware problem.

sd 6:0:27:0: rejecting I/O to offline device
LDISKFS-fs error (device dm-1): ldiskfs_find_entry: reading directory #22708225 offset 0
sd 6:0:27:0: rejecting I/O to offline device
LDISKFS-fs error (device dm-1): ldiskfs_read_inode_bitmap: Cannot read inode bitmap - block_group = 177408, inode_bitmap = 5813305600
LDISKFS-fs error (device dm-1) in ldiskfs_new_inode: IO failure
sd 6:0:27:0: rejecting I/O to offline device
LDISKFS-fs (dm-1): delayed block allocation failed for inode 22544416 at logical offset 995328 with max blocks 2048 with error -5

This should not happen!!  Data will be lost
JBD2: Detected IO errors while flushing file data on dm-1-8
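
Assuming the usual RHEL6 sysfs layout (an assumption about the node, not something captured in this ticket), the state of the rejected device can be confirmed directly: each SCSI device exposes "running"/"offline" under /sys/class/scsi_device/<host:channel:target:lun>/device/state, and the H:C:T:L below is taken from the messages above.

def scsi_device_state(hctl="6:0:27:0"):
    # Read the kernel's view of the device state from sysfs.
    path = "/sys/class/scsi_device/%s/device/state" % hctl
    with open(path) as f:
        return f.read().strip()

print(scsi_device_state())   # expected to print "offline" on the affected OSS
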
Comment by Sarah Liu [ 20/Mar/12 ]

I reran this test in partial mode and it failed again. Please see the attached files for the console log and dmesg.

Comment by Yang Sheng [ 20/Mar/12 ]

From dmesg_partial, it is obviously a storage error:

Buffer I/O error on device dm-0, logical block 65598925
lost page write due to I/O error on dm-0
sd 6:0:22:0: [sdu] Unhandled error code
sd 6:0:22:0: [sdu] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 6:0:22:0: [sdu] CDB: Write(10): 2a 00 1f 61 89 08 00 00 08 00
end_request: I/O error, dev sdu, sector 526485768
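
As a cross-check, the failed command can be decoded from the log itself: in a SCSI WRITE(10) CDB, bytes 2-5 are the big-endian logical block address and bytes 7-8 the transfer length in blocks, and decoding the CDB above lands exactly on the sector reported by end_request. A small Python sanity check, nothing more:

cdb = bytes.fromhex("2a001f61890800000800")   # the CDB from the log above
assert cdb[0] == 0x2A                         # WRITE(10) opcode
lba = int.from_bytes(cdb[2:6], "big")         # 0x1f618908
blocks = int.from_bytes(cdb[7:9], "big")      # 8 x 512-byte blocks, i.e. one 4 KiB write
print(lba, blocks)                            # 526485768 8 -- matches "sector 526485768" on sdu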

I'll look into the RHEL6 bugzilla to see whether this kind of issue has been reported with this driver.

mpt2sas0: LSISAS2008: FWVersion(11.00.00.00), ChipRevision(0x03), BiosVersion(07.21.00.00)

Thanks for the whole dmesg log.

Comment by Sarah Liu [ 20/Mar/12 ]

I tried to install the tag-2.1.56 build again and got the following error:
-------------------
BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
IP: [<ffffffffa00b9f01>] ses_intf_add+0x2f1/0x5e0 [ses]
PGD 630ec3067 PUD 631ed5067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/device
CPU 0
Modules linked in: ses enclosure mlx4_ib ib_mad ib_core mlx4_en mlx4_core scsi_wait_scan igb iTCO_wdt i2c_i801 i2c_core i7core_edac ioatdma iTCO_vendor_support dca edac_core microcode serio_raw shpchp ext3 jbd mbcache sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ahci dm_mirror dm_region_hash dm_log dm_mod

Pid: 2080, comm: modprobe Not tainted 2.6.32-220.4.2.el6_lustre.gddd1a7c.x86_64 #1 SGI.COM C1104-2TY9/X8DTT-IBQF
RIP: 0010:[<ffffffffa00b9f01>] [<ffffffffa00b9f01>] ses_intf_add+0x2f1/0x5e0 [ses]
RSP: 0018:ffff8803340bfe38 EFLAGS: 00010246
RAX: ffff88032db03800 RBX: ffff88032f5ae800 RCX: 0000000000000017
RDX: 0000000000000000 RSI: ffffffff8126cdd0 RDI: 0000000000000000
RBP: ffff8803340bfe98 R08: ffffffff81c00280 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880637d78920
R13: 0000000000000000 R14: ffff880330ec0400 R15: ffff88063006cdc0
FS: 00007f9665162700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000006 CR3: 0000000637094000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 2080, threadinfo ffff8803340be000, task ffff880336dd74c0)
Stack:
ffff8803340bfe78 ffffffff814d3aef ffff880330048800 ffff88032f5aeb58
<0> ffff88032f5ae938 0000000000000010 0000000000000000 ffffffffa00ba460
<0> ffffffff81b01ee0 ffff8803340bfea8 0000000000000000 0000000000000000
Call Trace:
[<ffffffff814d3aef>] ? klist_next+0x7f/0xf0
[<ffffffff81347599>] class_interface_register+0xa9/0xe0
[<ffffffffa00fe000>] ? ses_init+0x0/0x3c [ses]
[<ffffffff81367366>] scsi_register_interface+0x16/0x20
[<ffffffffa00fe014>] ses_init+0x14/0x3c [ses]
[<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
[<ffffffff810af4e1>] sys_init_module+0xe1/0x250
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 29 e1 48 85 c0 75 13 eb 51 90 48 8b 3b 48 89 c6 e8 d5 f0 29 e1 48 85 c0 74 40 8b b8 84 00 00 00 85 ff 75 e6 48 8b 90 a8 00 00 00 <f6> 42 06 40 75 d9 48 89 c6 4c 89 f7 48 89 45 b0 e8 aa fa ff ff
RIP [<ffffffffa00b9f01>] ses_intf_add+0x2f1/0x5e0 [ses]
RSP <ffff8803340bfe38>
CR2: 0000000000000006
---[ end trace c653e9e779d07a3e ]---
Kernel panic - not syncing: Fatal exception
Pid: 2080, comm: modprobe Tainted: G D ---------------- 2.6.32-220.4.2.el6_lustre.gddd1a7c.x86_64 #1
Call Trace:
[<ffffffff814ec61a>] ? panic+0x78/0x143
[<ffffffff814f07a4>] ? oops_end+0xe4/0x100
[<ffffffff8104234b>] ? no_context+0xfb/0x260
[<ffffffff81250984>] ? __do_page_fault+0x49/0x60
[<ffffffff........>] ? do_page_fault+0x.../0x190
[<ffffffff........>] ? page_fault+0x25/0x30
[<ffffffff8126cdd0>] ? kobject_release+0x0/0x240
[<ffffffffa00b9f01>] ? ses_intf_add+0x2f1/0x5e0 [ses]
[<ffffffffa00b9f25>] ? ses_intf_add+0x315/0x5e0 [ses]
[<ffffffff814d3aef>] ? klist_next+0x7f/0xf0
[<ffffffff81347599>] ? class_interface_register+0xa9/0xe0
[<ffffffffa00fe000>] ? ses_init+0x0/0x3c [ses]
[<ffffffff81367366>] ? scsi_register_interface+0x16/0x20
[<ffffffffa00fe014>] ? ses_init+0x14/0x3c [ses]
[<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
[<ffffffff810af4e1>] ? sys_init_module+0xe1/0x250
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
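
The oops above can be read off the dump itself: the byte in angle brackets on the "Code:" line is the faulting instruction, and hand-decoding f6 42 06 40 gives testb $0x40,0x6(%rdx); with RDX = 0 in the register dump, the load touches address 0x6, which matches both CR2 and the "NULL pointer dereference at 0000000000000006". The crash is in the ses module while modprobe loads it, with no Lustre modules in the loaded-modules list. A minimal sketch of that arithmetic (a reading of the dump, not a confirmed root cause):

code = bytes.fromhex("f6420640")   # f6 /0 = TEST r/m8, imm8; ModRM 0x42 = [rdx] + disp8
rdx = 0x0                          # RDX from the register dump
disp8 = code[2]                    # 8-bit displacement = 0x06
print(hex(rdx + disp8))            # 0x6 == CR2, the reported fault address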

Comment by Frank Heckes (Inactive) [ 21/Mar/12 ]

Hi Sarah,

you're right, two disks of the pool assigned to the OSS nodes are broken:

ID 5000c50040cf7d9d /dev/sdu ST2000NM0001 (2TB disk)
ID 5000c50034003265 /dev/sdz ST33000650SS (3TB disk)

I removed them from the JBOD. Could you remove them from the autotest resource file until we receive the spare parts?

These are too many HW failures in 3 months for such a small environment. I'll get in touch with our supplier to find out whether there's a quality issue with the disks, or maybe some problem with the MPT driver, disk firmware, or ...
I'm very sorry for the delay caused by these failures.

Comment by Peter Jones [ 21/Mar/12 ]

Thanks Frank. I am closing this ticket because it is now clear that this is not a Lustre software issue.

Comment by Sarah Liu [ 21/Mar/12 ]

Thanks Frank, I will remove them from the script.
