[LU-4544] parallel-scale test write_disjoint: Oops: IP: cl_lock_cancel0+0x60/0x160 [obdclass]
Created: 26/Jan/14  Updated: 01/Jun/15  Resolved: 01/Jun/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Jian Yu
Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/16/
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=2


Severity: 3
Rank (Obsolete): 12421

 Description   

While running parallel-scale test write_disjoint with MDSCOUNT=2, one of the two client nodes crashed:

21:21:45:Lustre: DEBUG MARKER: == parallel-scale test write_disjoint: write_disjoint == 20:49:14 (1390625354)
21:21:46:BUG: unable to handle kernel paging request at fffffffffffffff8
21:21:46:IP: [<ffffffffa0593920>] cl_lock_cancel0+0x60/0x160 [obdclass]
21:21:46:PGD 1a87067 PUD 1a88067 PMD 0 
21:21:47:Oops: 0000 [#1] SMP 
21:21:47:last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0/infiniband/mlx4_0/ports/1/gids/0
21:21:47:CPU 1 
21:21:47:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic crc32c_intel nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 igb ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
21:21:47:
21:21:49:Pid: 15022, comm: ldlm_bl_01 Not tainted 2.6.32-358.18.1.el6.x86_64 #1 Supermicro X8DTT/X8DTT
21:21:49:RIP: 0010:[<ffffffffa0593920>]  [<ffffffffa0593920>] cl_lock_cancel0+0x60/0x160 [obdclass]
21:21:49:RSP: 0018:ffff880324ba5d60  EFLAGS: 00010286
21:21:49:RAX: 0000000000000000 RBX: ffff880304bb3f08 RCX: 0000000000000000
21:21:49:RDX: 000000000000128e RSI: ffff8802e1ec8cc0 RDI: ffffffffa09f2e00
21:21:49:RBP: ffff880324ba5d80 R08: ffff8802dd94cb28 R09: 0000000000000001
21:21:50:R10: 0000000000000000 R11: 0000000000000400 R12: ffff8802ebdbd8f0
21:21:51:R13: ffffffffffffffe8 R14: ffff8802e1ec8cc0 R15: ffff880324ba5dd0
21:21:52:FS:  0000000000000000(0000) GS:ffff880032e20000(0000) knlGS:0000000000000000
21:21:52:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
21:21:52:CR2: fffffffffffffff8 CR3: 0000000304bed000 CR4: 00000000000007e0
21:21:52:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
21:21:53:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
21:21:53:Process ldlm_bl_01 (pid: 15022, threadinfo ffff880324ba4000, task ffff88032018d540)
21:21:53:Stack:
21:21:53: ffff8802ebdbd8e8 ffff8802ebdbd8e8 ffff880304bb3f08 ffff880304bb3f08
21:21:53:<d> ffff880324ba5da0 ffffffffa05944eb ffff88021ca2ed80 ffff8802e1ec8cc0
21:21:53:<d> ffff880324ba5e10 ffffffffa09c17aa ffff880324ba5e10 ffff8802ebdbd8e8
21:21:55:Call Trace:
21:21:55: [<ffffffffa05944eb>] cl_lock_cancel+0x13b/0x140 [obdclass]
21:21:55: [<ffffffffa09c17aa>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
21:21:55: [<ffffffffa084a020>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
21:21:55: [<ffffffffa084a551>] ldlm_bl_thread_main+0x261/0x3c0 [ptlrpc]
21:21:56: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
21:21:56: [<ffffffffa084a2f0>] ? ldlm_bl_thread_main+0x0/0x3c0 [ptlrpc]
21:21:56: [<ffffffff81096a36>] kthread+0x96/0xa0
21:21:57: [<ffffffff8100c0ca>] child_rip+0xa/0x20
21:21:57: [<ffffffff810969a0>] ? kthread+0x0/0xa0
21:21:57: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
21:21:57:Code: 00 a8 01 75 48 48 83 c8 01 49 89 84 24 b8 00 00 00 49 8b 44 24 10 49 83 c4 08 49 39 c4 4c 8d 68 e8 74 2a 0f 1f 84 00 00 00 00 00 <49> 8b 45 10 48 8b 40 30 48 85 c0 74 08 4c 89 ee 48 89 df ff d0 
21:21:57:RIP  [<ffffffffa0593920>] cl_lock_cancel0+0x60/0x160 [obdclass]
21:21:57: RSP <ffff880324ba5d60>
21:21:57:CR2: fffffffffffffff8

Maloo report: https://maloo.whamcloud.com/test_sets/ae2ace56-85c4-11e3-8903-52540035b04c
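
One possible reading of the register dump above, offered as a sketch rather than a confirmed diagnosis: the faulting address in CR2 equals R13 plus 0x10, and R13 itself equals zero minus 0x18. That pattern is consistent with a list walk that picked up a NULL link pointer, subtracted a fixed member offset from it, and then dereferenced the result. The two displacements below were decoded by hand from the bytes on the "Code:" line and are an assumption; the helper name `to_signed64` is hypothetical, and no Lustre source was consulted.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Interpret a 64-bit register value as a signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


# Values copied from the register dump in the oops.
cr2 = 0xfffffffffffffff8  # faulting address
r13 = 0xffffffffffffffe8  # base register of the faulting load
rax = 0x0                 # still zero, because the load into RAX faulted

# Displacements hand-decoded from the "Code:" bytes (an assumption).
load_disp = 0x10          # 49 8b 45 10 -> mov 0x10(%r13),%rax (faulting instruction)
lea_disp = -0x18          # 4c 8d 68 e8 -> lea -0x18(%rax),%r13

# Address the faulting load tried to read, wrapped to 64 bits.
fault_addr = (r13 + load_disp) & MASK64

# Value R13 would hold if the pointer in RAX was NULL when the lea ran.
r13_from_null = (rax + lea_disp) & MASK64

print(hex(fault_addr))     # compare with CR2
print(hex(r13_from_null))  # compare with R13
print(to_signed64(r13))    # R13 as a signed offset from NULL
```

If both comparisons hold, the crash was a read at a small negative offset from NULL, not a read of an arbitrary corrupted pointer. That narrows the search to whatever left a NULL link in the list being walked in cl_lock_cancel0.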



 Comments   
Comment by Jian Yu [ 26/Jan/14 ]

This is a regression introduced in Lustre b2_5 build #16, or possibly in build #15 or #14 (neither was tested). The failure did not occur on Lustre b2_5 build #13 or earlier builds.

Comment by Jinshan Xiong (Inactive) [ 27/Jan/14 ]

Only a few patches landed between these two builds:

4d0e47b LU-4222 mdt: extra checking for getattr RPC.
c8297ef LU-4360 Fix use after free in ksocknal_send
428cd4b LU-3680 ptlrpc: Fix assertion failure of null_alloc_rs()
9e36160 LU-4221 osd: add case LCFG_PARAM to osd_process_config

None of them looks related to this crash.

Comment by Jian Yu [ 28/Jan/14 ]

The failure does not occur consistently on the Lustre b2_5 branch. Two more test sessions showed that the same test passed on builds #16 and #15 with MDSCOUNT=2:
https://maloo.whamcloud.com/test_sets/f0c41e44-8762-11e3-bab6-52540035b04c (build #16)
https://maloo.whamcloud.com/test_sets/8200d444-877a-11e3-bc1c-52540035b04c (build #15)

Comment by Jian Yu [ 10/Feb/14 ]

The same test also passed on Lustre b2_5 builds #17 and #19 with MDSCOUNT=2:
https://maloo.whamcloud.com/test_sets/4bf29958-8ebb-11e3-8d06-52540035b04c
https://maloo.whamcloud.com/test_sets/5062e0e4-9111-11e3-91ee-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

This problem may be a duplicate of LU-4591. I will take a look at it.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

Not related to DNE.

Comment by Jinshan Xiong (Inactive) [ 01/Jun/15 ]

This issue has been open for a long time.

Generated at Sat Feb 10 01:43:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.