[LU-1699] Test failure on test suite parallel-scale, subtest test_iorssf Created: 01/Aug/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 4185 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ed8617f2-db59-11e1-9afb-52540035b04c. The sub-test test_iorssf failed with the following error:
Found following logs from client console:
12:25:37:Lustre: DEBUG MARKER: == parallel-scale test iorssf: iorssf == 12:25:35 (1343762735)
12:29:27:BUG: soft lockup - CPU#0 stuck for 67s! [IOR:12426]
12:29:28:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
12:29:28:CPU 0
12:29:28:Modules linked in: lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
12:29:28:
12:29:28:Pid: 12426, comm: IOR Not tainted 2.6.32-220.17.1.el6.x86_64 #1 Red Hat KVM
12:29:28:RIP: 0010:[<ffffffff814ef996>] [<ffffffff814ef996>] _spin_lock+0x26/0x30
12:29:28:RSP: 0018:ffff88001c3458e8 EFLAGS: 00000206
12:29:29:RAX: 0000000000000001 RBX: ffff88001c3458e8 RCX: ffff88007252dc00
12:29:29:RDX: 0000000000000000 RSI: ffff8800769cc440 RDI: ffff88005ef1865c
12:29:29:RBP: ffffffff8100bc0e R08: ffff8800769dc6f8 R09: 0000000000000001
12:29:29:R10: ffff880040057900 R11: 0000000000000000 R12: ffff8800769dc580
12:29:29:R13: 00000000000108ff R14: ffffea0000d760d0 R15: ffff88001c345868
12:29:29:FS: 00007f5c6989d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
12:29:29:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
12:29:29:CR2: 00007f938197c000 CR3: 000000001a692000 CR4: 00000000000006f0
12:29:29:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
12:29:29:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
12:29:29:Process IOR (pid: 12426, threadinfo ffff88001c344000, task ffff8800769c7500)
12:29:29:Stack:
12:29:29: ffff88001c345948 ffffffffa0650a8f ffff880040057900 ffffffffa06fc4f1
12:29:29:<0> ffff88001c345958 ffff8800769eb990 ffff880078328b40 0000000000000100
12:29:29:<0> 0000000000000002 ffff8800769cc440 ffff8800769dcbc0 ffff880018760078
12:29:29:Call Trace:
12:29:29: [<ffffffffa0650a8f>] ? osc_trigger_group_io+0x5f/0x180 [osc]
12:29:30: [<ffffffffa06fc4f1>] ? llap_from_page_with_lockh.clone.8+0x2a1/0x1120 [lustre]
12:29:30: [<ffffffffa06736c1>] ? lov_trigger_group_io+0x121/0x530 [lov]
12:29:30: [<ffffffffa0702403>] ? ll_readpage+0x263/0x1b10 [lustre]
12:29:30: [<ffffffffa0533d22>] ? ldlm_lock_fast_match+0xc2/0x130 [ptlrpc]
12:29:30: [<ffffffff8111266c>] ? generic_file_aio_read+0x1fc/0x700
12:29:30: [<ffffffffa06dc5d6>] ? ll_file_aio_read+0xc96/0x2620 [lustre]
12:29:30: [<ffffffffa07492ce>] ? trace_put_tcd+0xe/0x10 [libcfs]
12:29:30: [<ffffffffa06de030>] ? ll_file_read+0xd0/0xf0 [lustre]
12:29:30: [<ffffffff81090d30>] ? autoremove_wake_function+0x0/0x40
12:29:30: [<ffffffffa06de48b>] ? ll_file_seek+0x9b/0x330 [lustre]
12:29:30: [<ffffffff8120c646>] ? security_file_permission+0x16/0x20
12:29:30: [<ffffffff811767fd>] ? rw_verify_area+0x5d/0xc0
12:29:30: [<ffffffff81177155>] ? vfs_read+0xb5/0x1a0
12:29:30: [<ffffffff810d4932>] ? audit_syscall_entry+0x272/0x2a0
12:29:30: [<ffffffff81177291>] ? sys_read+0x51/0x90
12:29:30: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
12:29:31:Code: 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 3e 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f 1f 44 00 00 83 3f 00 <75> f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 3e 81
12:29:31:Call Trace: |
| Comments |
| Comment by Peter Jones [ 02/Aug/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 06/Aug/12 ] |
|
Hi Sarah, the backtrace shows 1.8 code, but this test is supposed to be against 2.3. Could you tell me how the system is deployed? |
| Comment by Sarah Liu [ 06/Aug/12 ] |
|
This is the interop test between b1_8 and master: the servers are running master while the clients are running b1_8. |
| Comment by Lai Siyao [ 08/Aug/12 ] |
|
The client is slow or maybe deadlocked on client_obd_list_lock(&cli->cl_loi_list_lock) in osc_trigger_group_io(), but Maloo didn't take backtraces of all processes. I need to reproduce it to confirm. |
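The scenario Lai is describing matches the console log above: _spin_lock at the top of the stack means the IOR thread is busy-waiting on a spinlock that some other context is holding, and after roughly a minute the kernel's watchdog reports it as a soft lockup. Below is a minimal userspace sketch of that pattern, not Lustre code; the names (loi_list_lock, lock_holder, io_submitter) are hypothetical stand-ins for cl_loi_list_lock and the two competing contexts.

/*
 * Userspace analogue of the suspected situation: one thread holds a
 * spinlock far too long while another thread spins on it, burning 100%
 * of a CPU -- which is what the soft-lockup watchdog flags when
 * _spin_lock never returns.
 *
 * Build: gcc -O2 -pthread spin_demo.c -o spin_demo
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t loi_list_lock;   /* stand-in for cli->cl_loi_list_lock */

/* Stand-in for whatever path is holding the lock too long. */
static void *lock_holder(void *arg)
{
        pthread_spin_lock(&loi_list_lock);
        sleep(10);                      /* lock held much longer than it should be */
        pthread_spin_unlock(&loi_list_lock);
        return NULL;
}

/* Stand-in for the IOR thread stuck in osc_trigger_group_io(). */
static void *io_submitter(void *arg)
{
        /* This call busy-waits: the CPU appears "stuck" here until the
         * holder releases the lock, which the watchdog reports after ~60s. */
        pthread_spin_lock(&loi_list_lock);
        puts("io_submitter finally got the lock");
        pthread_spin_unlock(&loi_list_lock);
        return NULL;
}

int main(void)
{
        pthread_t holder, submitter;

        pthread_spin_init(&loi_list_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&holder, NULL, lock_holder, NULL);
        sleep(1);                       /* let the holder grab the lock first */
        pthread_create(&submitter, NULL, io_submitter, NULL);

        pthread_join(holder, NULL);
        pthread_join(submitter, NULL);
        pthread_spin_destroy(&loi_list_lock);
        return 0;
}

Without backtraces of the other processes there is no way to tell from this log alone whether the holder was simply slow or deadlocked, which is why a reproduction was needed.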
| Comment by Lai Siyao [ 08/Aug/12 ] |
|
Hey Jay, is this a known issue for 1.8 client IO? |
| Comment by Jinshan Xiong (Inactive) [ 08/Aug/12 ] |
|
Hi Lai, there is nice debug code for client_obd_list_lock() in 2.x that you can port to 1.8. I guess you will find the problem in seconds. |
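The kind of instrumentation being suggested replaces the blocking lock acquisition with a trylock loop that measures how long the caller has been spinning and complains once it crosses a threshold, so the offending waiter shows up in the logs before the watchdog fires. The sketch below is a userspace illustration of that idea only, not the actual Lustre 2.x debug code or the patch Lai later committed; debug_spin_lock and SPIN_WARN_SECONDS are hypothetical names.

/*
 * Sketch of a lock-debugging wrapper: try-lock in a loop, track elapsed
 * time, and warn when the wait exceeds a threshold.
 *
 * Build: gcc -O2 -pthread lock_debug_demo.c -o lock_debug_demo
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define SPIN_WARN_SECONDS 5     /* warn if we spin longer than this */

static double now_seconds(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Debugging replacement for a plain spin-lock acquisition. */
static void debug_spin_lock(pthread_spinlock_t *lock, const char *name)
{
        double start = now_seconds();
        int warned = 0;

        while (pthread_spin_trylock(lock) != 0) {
                if (!warned && now_seconds() - start > SPIN_WARN_SECONDS) {
                        fprintf(stderr,
                                "WARNING: spinning on %s for more than %d seconds\n",
                                name, SPIN_WARN_SECONDS);
                        warned = 1;     /* in the kernel you would also dump stacks here */
                }
                sched_yield();
        }
}

int main(void)
{
        pthread_spinlock_t lock;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        debug_spin_lock(&lock, "cl_loi_list_lock");     /* uncontended: returns immediately */
        pthread_spin_unlock(&lock);
        pthread_spin_destroy(&lock);
        return 0;
}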
| Comment by Lai Siyao [ 09/Aug/12 ] |
|
Hi Sarah, I've committed a patch to debug client_obd_list_lock; could you apply this patch and test again? |
| Comment by Sarah Liu [ 09/Aug/12 ] |
|
ok, will keep you updated |
| Comment by Sarah Liu [ 10/Aug/12 ] |
|
I reran this test twice and could not reproduce it. The servers are running lustre-master-tag2.2.29 and the clients are running the RHEL6 build of http://build.whamcloud.com/job/lustre-reviews/8243/. Here is the Maloo link to the result; please tell me if you need anything else. |
| Comment by Lai Siyao [ 10/Aug/12 ] |
|
I can't reproduce it either. Could you run it in a loop for 1 day, until it fails? |
| Comment by Sarah Liu [ 15/Aug/12 ] |
|
Lai, running it now, will keep you updated. |
| Comment by Sarah Liu [ 15/Aug/12 ] |
|
I ran parallel-scale --only iorssf for one day, about 20 times, and could not reproduce it. |
| Comment by Peter Jones [ 19/Aug/12 ] |
|
We'll see if this strikes again with the 2.2.93 tag |
| Comment by Sarah Liu [ 20/Aug/12 ] |
|
due to |
| Comment by Sarah Liu [ 21/Aug/12 ] |
|
Manually ran iorssf 5 times and could not reproduce it on the tag-2.2.93. |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
Unfortunately manual runs do not help with this one because it seems to be another issue where the problem is isolated to runs on autotest. |
| Comment by Sarah Liu [ 22/Aug/12 ] |
|
Only hit this error in the interop test with 1.8.8. |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
Only occurred once, so dropping in priority. |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |