[LU-1699] Test failure on test suite parallel-scale, subtest test_iorssf Created: 01/Aug/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4185

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ed8617f2-db59-11e1-9afb-52540035b04c.

The sub-test test_iorssf failed with the following error:

test failed to respond and timed out

Found the following logs on the client console:

12:25:37:Lustre: DEBUG MARKER: == parallel-scale test iorssf: iorssf == 12:25:35 (1343762735)
12:29:27:BUG: soft lockup - CPU#0 stuck for 67s! [IOR:12426]
12:29:28:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
12:29:28:CPU 0 
12:29:28:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
12:29:28:
12:29:28:Pid: 12426, comm: IOR Not tainted 2.6.32-220.17.1.el6.x86_64 #1 Red Hat KVM
12:29:28:RIP: 0010:[<ffffffff814ef996>]  [<ffffffff814ef996>] _spin_lock+0x26/0x30
12:29:28:RSP: 0018:ffff88001c3458e8  EFLAGS: 00000206
12:29:29:RAX: 0000000000000001 RBX: ffff88001c3458e8 RCX: ffff88007252dc00
12:29:29:RDX: 0000000000000000 RSI: ffff8800769cc440 RDI: ffff88005ef1865c
12:29:29:RBP: ffffffff8100bc0e R08: ffff8800769dc6f8 R09: 0000000000000001
12:29:29:R10: ffff880040057900 R11: 0000000000000000 R12: ffff8800769dc580
12:29:29:R13: 00000000000108ff R14: ffffea0000d760d0 R15: ffff88001c345868
12:29:29:FS:  00007f5c6989d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
12:29:29:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
12:29:29:CR2: 00007f938197c000 CR3: 000000001a692000 CR4: 00000000000006f0
12:29:29:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
12:29:29:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
12:29:29:Process IOR (pid: 12426, threadinfo ffff88001c344000, task ffff8800769c7500)
12:29:29:Stack:
12:29:29: ffff88001c345948 ffffffffa0650a8f ffff880040057900 ffffffffa06fc4f1
12:29:29:<0> ffff88001c345958 ffff8800769eb990 ffff880078328b40 0000000000000100
12:29:29:<0> 0000000000000002 ffff8800769cc440 ffff8800769dcbc0 ffff880018760078
12:29:29:Call Trace:
12:29:29: [<ffffffffa0650a8f>] ? osc_trigger_group_io+0x5f/0x180 [osc]
12:29:30: [<ffffffffa06fc4f1>] ? llap_from_page_with_lockh.clone.8+0x2a1/0x1120 [lustre]
12:29:30: [<ffffffffa06736c1>] ? lov_trigger_group_io+0x121/0x530 [lov]
12:29:30: [<ffffffffa0702403>] ? ll_readpage+0x263/0x1b10 [lustre]
12:29:30: [<ffffffffa0533d22>] ? ldlm_lock_fast_match+0xc2/0x130 [ptlrpc]
12:29:30: [<ffffffff8111266c>] ? generic_file_aio_read+0x1fc/0x700
12:29:30: [<ffffffffa06dc5d6>] ? ll_file_aio_read+0xc96/0x2620 [lustre]
12:29:30: [<ffffffffa07492ce>] ? trace_put_tcd+0xe/0x10 [libcfs]
12:29:30: [<ffffffffa06de030>] ? ll_file_read+0xd0/0xf0 [lustre]
12:29:30: [<ffffffff81090d30>] ? autoremove_wake_function+0x0/0x40
12:29:30: [<ffffffffa06de48b>] ? ll_file_seek+0x9b/0x330 [lustre]
12:29:30: [<ffffffff8120c646>] ? security_file_permission+0x16/0x20
12:29:30: [<ffffffff811767fd>] ? rw_verify_area+0x5d/0xc0
12:29:30: [<ffffffff81177155>] ? vfs_read+0xb5/0x1a0
12:29:30: [<ffffffff810d4932>] ? audit_syscall_entry+0x272/0x2a0
12:29:30: [<ffffffff81177291>] ? sys_read+0x51/0x90
12:29:30: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
12:29:31:Code: 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f 1f 44 00 00 83 3f 00 <75> f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 3e 81 
12:29:31:Call Trace:


 Comments   
Comment by Peter Jones [ 02/Aug/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 06/Aug/12 ]

Hi Sarah, the backtrace shows 1.8 code, yet this test is against 2.3. Could you tell me how the system is deployed?

Comment by Sarah Liu [ 06/Aug/12 ]

This is an interop test between b1_8 and master: the servers are running master while the clients are running b1_8.

Comment by Lai Siyao [ 08/Aug/12 ]

The client is either slow or deadlocked on client_obd_list_lock(&cli->cl_loi_list_lock) in osc_trigger_group_io(), but Maloo didn't capture backtraces of all processes. I need to reproduce it to confirm.
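
(For reference, a minimal userspace sketch of the suspected pattern; this is illustrative only and not Lustre code, with loi_list_lock merely standing in for cli->cl_loi_list_lock: one thread holds the spinlock while doing slow work, and a second thread busy-spins in the acquire path, which is what the soft-lockup watchdog reports as a CPU stuck in _spin_lock.)

/* Illustrative sketch only, not Lustre code: a long-held spinlock makes the
 * waiting thread burn a CPU, analogous to IOR spinning in _spin_lock()
 * inside osc_trigger_group_io(). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t loi_list_lock;    /* stand-in for cli->cl_loi_list_lock */

static void *holder(void *arg)
{
	pthread_spin_lock(&loi_list_lock);
	sleep(5);                           /* in the real failure the lock was effectively
	                                     * held for >67s, tripping the watchdog */
	pthread_spin_unlock(&loi_list_lock);
	return NULL;
}

static void *waiter(void *arg)
{
	pthread_spin_lock(&loi_list_lock);  /* busy-spins here the whole time */
	printf("waiter finally got the lock\n");
	pthread_spin_unlock(&loi_list_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_spin_init(&loi_list_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&a, NULL, holder, NULL);
	sleep(1);                           /* ensure the holder grabs the lock first */
	pthread_create(&b, NULL, waiter, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}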

Comment by Lai Siyao [ 08/Aug/12 ]

Hey Jay, is this a known issue for 1.8 client IO?

Comment by Jinshan Xiong (Inactive) [ 08/Aug/12 ]

Hi Lai, there is nice debug code for client_obd_list_lock() in 2.x that you can port to 1.8. I guess you could find the problem in seconds.
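
(For reference, a rough userspace sketch of that debugging idea, assuming a wrapper that times the lock acquire and reports who held the lock when the wait was long; names such as debug_spinlock and LOCK_WAIT_WARN_NS are hypothetical and this is not the actual 2.x client_obd_list_lock implementation.)

/* Hypothetical sketch of lock-wait instrumentation, not actual Lustre code:
 * wrap the acquire path, measure how long we spun, and complain loudly when
 * the wait exceeds a threshold so the offender shows up in the logs. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define LOCK_WAIT_WARN_NS (1000 * 1000 * 1000L)   /* warn after 1s of spinning */

struct debug_spinlock {
	pthread_spinlock_t lock;
	const char *last_holder;     /* caller that last took the lock */
};

static long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

/* Acquire the lock, reporting whom we were stuck behind if it took too long. */
static void debug_lock(struct debug_spinlock *l, const char *caller)
{
	long start = now_ns();

	pthread_spin_lock(&l->lock);
	if (now_ns() - start > LOCK_WAIT_WARN_NS)
		fprintf(stderr, "%s waited %ldms for lock last held by %s\n",
			caller, (now_ns() - start) / 1000000, l->last_holder);
	l->last_holder = caller;
}

static void debug_unlock(struct debug_spinlock *l)
{
	pthread_spin_unlock(&l->lock);
}

int main(void)
{
	static struct debug_spinlock l;

	pthread_spin_init(&l.lock, PTHREAD_PROCESS_PRIVATE);
	l.last_holder = "init";
	debug_lock(&l, "main");      /* uncontended, so no warning is printed */
	debug_unlock(&l);
	return 0;
}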

Comment by Lai Siyao [ 09/Aug/12 ]

Hi Sarah, I've committed a patch to add debugging to client_obd_list_lock; could you apply this patch and test again?

http://review.whamcloud.com/#change,3586

Comment by Sarah Liu [ 09/Aug/12 ]

OK, will keep you updated.

Comment by Sarah Liu [ 10/Aug/12 ]

I reran this test twice and could not reproduce it.

The servers are running lustre-master-tag2.2.29 and the clients are running the RHEL6 build of http://build.whamcloud.com/job/lustre-reviews/8243/

Here is the Maloo link to the results; please tell me if you need anything else:
https://maloo.whamcloud.com/test_sets/46034e2a-e2a5-11e1-9f91-52540035b04c

Comment by Lai Siyao [ 10/Aug/12 ]

I can't reproduce it either. Could you run it in a loop for one day, until it fails?

Comment by Sarah Liu [ 15/Aug/12 ]

Lai, I'm running it now and will keep you updated.

Comment by Sarah Liu [ 15/Aug/12 ]

I ran parallel-scale --only iorssf in a loop for one day, about 20 times, and could not reproduce it.

Comment by Peter Jones [ 19/Aug/12 ]

We'll see if this strikes again with the 2.2.93 tag

Comment by Sarah Liu [ 20/Aug/12 ]

Due to LU-1775, parallel-scale failed on the first subtest, compilebench, so it didn't get a chance to run iorssf.

Comment by Sarah Liu [ 21/Aug/12 ]

I manually ran iorssf 5 times on tag 2.2.93 and could not reproduce it:
https://maloo.whamcloud.com/test_sessions/93e96f4a-ebea-11e1-9db4-52540035b04c

Comment by Peter Jones [ 22/Aug/12 ]

Unfortunately manual runs do not help with this one because it seems to be another issue where the problem is isolated to runs on autotest. LU-1775 was an interop issue. Has LU-1699 only occurred when testing interop with 1.8.8?

Comment by Sarah Liu [ 22/Aug/12 ]

I have only hit this error in interop testing with 1.8.8.

Comment by Peter Jones [ 22/Aug/12 ]

This has only occurred once, so dropping the priority.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
