[LU-2171] ZFS: sanity test 77b cpu lockup Created: 13/Oct/12  Updated: 19/Apr/13  Resolved: 28/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: NFBlocker

Severity: 3
Rank (Obsolete): 5203

 Description   

Running FSTYPE=zfs REFORMAT=yes SLOW=yes ONLY=77 sh sanity.sh, I hit this 100% of the time:

[ 2736.335827] Lustre: DEBUG MARKER: == sanity test 77b: checksum error on client write ====================== 19:00:51 (1350169251)
[ 2736.378967] Lustre: *** cfs_fail_loc=409, val=0***
[ 2736.381366] LustreError: 30619:0:(ost_handler.c:1075:ost_brw_write()) client csum 91839737, server csum 91839736
[ 2820.092007] BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_001:30619]
[ 2820.092504] Modules linked in: lustre ofd osp lod ost mdt mdd mds mgs osd_zfs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
[ 2820.092504] CPU 2
[ 2820.092504] Modules linked in: lustre ofd osp lod ost mdt mdd mds mgs osd_zfs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
[ 2820.092504]
[ 2820.092504] Pid: 30619, comm: ll_ost_io00_001 Tainted: P           ---------------    2.6.32-debug #5 Bochs Bochs
[ 2820.092504] RIP: 0010:[<ffffffffa0e69568>]  [<ffffffffa0e69568>] __adler32+0x78/0x1f0 [libcfs]
[ 2820.092504] RSP: 0018:ffff88028fa11990  EFLAGS: 00010206
[ 2820.092504] RAX: ffff88026eac0000 RBX: ffff88028fa119a8 RCX: 000000000000d624
[ 2820.092504] RDX: ffff88026eac1000 RSI: ffff88026eac0000 RDI: 00000000000024b3
[ 2820.092504] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 000f00e10d2fc5cd
[ 2820.092504] R10: 00000000000015b0 R11: 0000000000001000 R12: 00000000000000ff
[ 2820.092504] R13: 0000000000000000 R14: ffff8801f2fa1308 R15: 0000000000000001
[ 2820.092504] FS:  00007f462fff2700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
[ 2820.092504] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2820.092504] CR2: ffff88026eac0000 CR3: 0000000001a25000 CR4: 00000000000006e0
[ 2820.092504] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2820.092504] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2820.092504] Process ll_ost_io00_001 (pid: 30619, threadinfo ffff88028fa10000, task ffff88028fa0a300)
[ 2820.092504] Stack:
[ 2820.092504]  ffff8801f2fa18b8 ffff8801f2fa18b8 0000000000000000 ffff88028fa119c0
[ 2820.092504] <d> ffffffffa0e69767 ffff88028fa119e0 ffff88028fa119d0 ffffffff812444b8
[ 2820.092504] <d> ffff88028fa11a20 ffffffff8124450e ffff88026eac0000 0000000000000000
[ 2820.092504] Call Trace:
[ 2820.092504]  [<ffffffffa0e69767>] ? adler32_update+0x17/0x20 [libcfs]
[ 2820.092504]  [<ffffffff812444b8>] ? crypto_shash_update+0x18/0x30
[ 2820.092504]  [<ffffffff8124450e>] ? shash_compat_update+0x3e/0x60
[ 2820.092504]  [<ffffffffa0e68bcf>] ? cfs_crypto_hash_update_page+0x3f/0x50 [libcfs]
[ 2820.092504]  [<ffffffffa070d567>] ? ost_checksum_bulk+0x127/0x6d0 [ost]
[ 2820.092504]  [<ffffffffa070ff5b>] ? ost_brw_write+0xe2b/0x15d0 [ost]
[ 2820.092504]  [<ffffffff8127ca56>] ? vsnprintf+0x2b6/0x5f0
[ 2820.092504]  [<ffffffffa11242f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 2820.092504]  [<ffffffffa0715250>] ? ost_handle+0x3120/0x4550 [ost]
[ 2820.092504]  [<ffffffffa0e6b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 2820.092504]  [<ffffffffa1171483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 2820.092504]  [<ffffffffa0e5f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 2820.092504]  [<ffffffffa116a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 2820.092504]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 2820.092504]  [<ffffffffa117401a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 2820.092504]  [<ffffffffa1173480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2820.092504]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 2820.092504]  [<ffffffffa1173480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2820.092504]  [<ffffffffa1173480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2820.092504]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 2820.092504] Code: d8 45 29 d8 41 83 fb 0f 0f 8e ed 00 00 00 41 8d 5b f0 c1 eb 04 41 89 dc 4c 89 e0 48 c1 e0 04 48 8d 54 06 10 48 89 f0 0f 1f 40 00 <44> 0f b6 28 4c 01 e9 44 0f b6 68 01 48 8d 3c 39 4c 01 e9 44 0f


 Comments   
Comment by Alex Zhuravlev [ 28/Oct/12 ]

This is caused by the second call to ost_checksum_bulk(), which happens after obd_commitrw(), which in turn releases the pages. Previously it worked because the pages were pinned by the bulk descriptor. Given the different cache implementations, we should not access pages once obd_commitrw() has been called.

At the moment I am not sure how to fix this properly. The call to ost_checksum_bulk() doesn't seem to be required from a functional point of view.
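
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the Lustre code: PagePool, checksum_bulk() and write_checked() are hypothetical names, commit() merely stands in for the point where obd_commitrw() releases the pages, and Python's zlib.adler32 stands in for the kernel Adler-32 routine seen in the stack trace. It shows a single checksum pass done while the pages are still valid, and that any later pass over released pages is an error.

```python
import zlib


class PagePool:
    """Hypothetical stand-in for the bulk pages owned by the backend."""

    def __init__(self, pages):
        self._pages = [bytearray(p) for p in pages]
        self.released = False

    def pages(self):
        # Touching pages after release is the bug pattern from this ticket.
        if self.released:
            raise RuntimeError("pages accessed after commit released them")
        return self._pages

    def commit(self):
        # Models the commit step: afterwards the pages are no longer ours.
        self.released = True
        self._pages = None


def checksum_bulk(pool):
    """Adler-32 over all pages, carried across pages as a running value."""
    csum = 1  # Adler-32 initial value
    for page in pool.pages():
        csum = zlib.adler32(page, csum)
    return csum


def write_checked(pool, client_csum):
    """Checksum once, before commit; never re-read pages afterwards."""
    server_csum = checksum_bulk(pool)
    pool.commit()
    return server_csum == client_csum, server_csum
```

In this model, a second checksum_bulk() call after commit() raises instead of reading stale memory, which is the access the real server should avoid once the pages have been released.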

Comment by Alex Zhuravlev [ 29/Oct/12 ]

Andreas just told me this check was used to catch bugs and can now be removed. So here is the patch: http://review.whamcloud.com/4400

Comment by Andreas Dilger [ 20/Nov/12 ]

Alex, can you please update the patch to move the "nice" LCONSOLE_ERROR_MSG() from the deleted checksum code into the remaining checksum code, since the existing error message there isn't very good?

Comment by Alex Zhuravlev [ 20/Nov/12 ]

Yes, yes, I saw your message in the patch and will update it as suggested.

Comment by Alex Zhuravlev [ 28/Nov/12 ]

Landed on master.
