[LU-5460] Lustre client crash
Created: 07/Aug/14
Updated: 08/Feb/18
Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Daire Byrne (Inactive)
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 15209

 Description   

Hi,

One of our clients, which exports Lustre over NFS, crashed, dumped a vmcore and rebooted overnight. I'm including the vmcore-dmesg here in case it is useful to you. I don't think we've seen this one before, so it must be rare. The full vmcore is available on request.

<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/system/node/node1/numastat
<4>CPU 21 
<4>Modules linked in: tcp_diag inet_diag mptctl mptbase ipmi_devintf dell_rbu nfsd exportfs autofs4 lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding 8021q garp stp llc ipv6 uinput raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx power_meter sg bnx2x libcrc32c mdio bnx2 dcdbas microcode serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 17387, comm: ldlm_bl_40 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/0F0XJ6
<4>RIP: 0010:[<ffffffffa058ba3e>]  [<ffffffffa058ba3e>] cl_lock_mutex_get+0x2e/0xd0 [obdclass]
<4>RSP: 0018:ffff88046ec63c30  EFLAGS: 00010203
<4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff880572210d10 RCX: ffff880a789650b8
<4>RDX: ffff8808e88c5448 RSI: ffff880a58758a18 RDI: ffff880572210d10
<4>RBP: ffff88046ec63c50 R08: ffffffffa05ab7ee R09: 0000000000000000
<4>R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff880a58758a18
<4>R13: ffff88079d568678 R14: ffff880952923b70 R15: ffff880a58758a18
<4>FS:  00007ff150e51700(0000) GS:ffff880028340000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007ffbd6c4a9d4 CR3: 0000000c235b4000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ldlm_bl_40 (pid: 17387, threadinfo ffff88046ec62000, task ffff880bb62b0aa0)
<4>Stack:
<4> ffff880b4f421740 ffff880572210d10 ffff880a58758a18 ffff88079d568678
<4><d> ffff88046ec63c70 ffffffffa0a014b9 ffff880572210d10 ffff8808e88c5420
<4><d> ffff88046ec63cc0 ffffffffa0a01c39 ffff880b4f421740 ffff8808e88c5448
<4>Call Trace:
<4> [<ffffffffa0a014b9>] lovsub_parent_lock+0x49/0x120 [lov]
<4> [<ffffffffa0a01c39>] lovsub_lock_state+0x79/0x1b0 [lov]
<4> [<ffffffffa0589718>] cl_lock_state_signal+0x68/0x160 [obdclass]
<4> [<ffffffffa0589865>] cl_lock_state_set+0x55/0x190 [obdclass]
<4> [<ffffffffa058a8b3>] cl_lock_delete0+0x53/0x1d0 [obdclass]
<4> [<ffffffffa058ab83>] cl_lock_delete+0x153/0x1a0 [obdclass]
<4> [<ffffffffa0968ac6>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
<4> [<ffffffffa06b91bc>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
<4> [<ffffffffa06d341a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
<4> [<ffffffffa06d670e>] ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
<4> [<ffffffffa06dc1b0>] ldlm_bl_thread_main+0x100/0x3d0 [ptlrpc]
<4> [<ffffffff81063410>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa06dc0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa06dc0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffffa06dc0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>Code: e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 65 48 8b 04 25 c0 cb 00 00 48 39 86 90 00 00 00 48 89 fb 49 89 f4 74 56 48 8b 46 28 <4c> 8b 28 e8 ba 58 ff ff 41 0f b6 b5 96 00 00 00 85 f6 74 23 8b 
<1>RIP  [<ffffffffa058ba3e>] cl_lock_mutex_get+0x2e/0xd0 [obdclass]
<4> RSP <ffff88046ec63c30>
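
One possible reading of the register dump above, offered as an assumption rather than a confirmed diagnosis: RAX, R10 and R11 all hold 5a5a5a5a5a5a5a5a, and the faulting bytes in the Code line (`48 8b 46 28 <4c> 8b 28`) decode to a load into RAX from offset 0x28 of RSI followed by a dereference of RAX. That would be consistent with a pointer read out of memory that had already been freed and filled with a poison byte, although the trace alone does not prove it. The sketch below scans an oops excerpt for registers filled with such a pattern. The helper names are hypothetical, and treating 0x5a as the fill byte for freed memory in this build is an assumption.

```python
import re

# Assumption: 0x5a is the byte used to fill freed memory in this build.
POISON_BYTE = 0x5A

# Matches 64-bit general-purpose registers such as "RAX: 5a5a5a5a5a5a5a5a".
# Segment-prefixed values (e.g. "RSP: 0018:ffff...") and CRn/DRn are skipped.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def parse_registers(oops_text):
    """Return {register_name: int_value} for registers found in an oops dump."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def poisoned_registers(regs, poison_byte=POISON_BYTE):
    """Return the sorted names of registers whose 8 bytes all equal poison_byte."""
    pattern = int.from_bytes(bytes([poison_byte]) * 8, "big")
    return sorted(name for name, value in regs.items() if value == pattern)


# Register lines copied from the first trace in this ticket.
SAMPLE = """\
<4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff880572210d10 RCX: ffff880a789650b8
<4>RDX: ffff8808e88c5448 RSI: ffff880a58758a18 RDI: ffff880572210d10
<4>RBP: ffff88046ec63c50 R08: ffffffffa05ab7ee R09: 0000000000000000
<4>R10: 5a5a5a5a5a5a5a5a R11: 5a5a5a5a5a5a5a5a R12: ffff880a58758a18
"""

if __name__ == "__main__":
    print(poisoned_registers(parse_registers(SAMPLE)))
```

Run against a saved vmcore-dmesg, the same two functions would list any register carrying the pattern, which may help when comparing several dumps of the same crash.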


 Comments   
Comment by Jinshan Xiong (Inactive) [ 07/Aug/14 ]

Try this patch: http://review.whamcloud.com/#/c/9876/

Comment by Daire Byrne (Inactive) [ 21/Aug/14 ]

We have not yet had an opportunity to try this patch, but it looks like we just hit this again. Here is a more complete vmcore-dmesg this time:

<3>LustreError: 16371:0:(dir.c:433:ll_get_dir_page()) read cache page: [0x200009459:0x129a8:0x0] at 3437219588190122278: rc -5
<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:06:03.0/local_cpus
<4>CPU 19 
<4>Modules linked in: tcp_diag inet_diag mptctl mptbase ipmi_devintf dell_rbu nfsd exportfs autofs4 lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding 8021q garp stp llc ipv6 uinput power_meter sg bnx2x libcrc32c mdio bnx2 dcdbas microcode serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 5694, comm: ldlm_bl_24 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R610/01W9FG
<4>RIP: 0010:[<ffffffffa053ba3e>]  [<ffffffffa053ba3e>] cl_lock_mutex_get+0x2e/0xd0 [obdclass]
<4>RSP: 0018:ffff8809688ffdf0  EFLAGS: 00010203
<4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff88060a790338 RCX: 0000000000000000
<4>RDX: 0000000000000981 RSI: ffff880be73c6858 RDI: ffff88060a790338
<4>RBP: ffff8809688ffe10 R08: 0000000000000001 R09: 00000000ffffffff
<4>R10: 0000000000000000 R11: 0000000000000000 R12: ffff880be73c6858
<4>R13: ffff88060a790338 R14: ffff88074323d6c0 R15: ffff8809688ffe40
<4>FS:  00007f2c87cec700(0000) GS:ffff880028320000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007f3b43a71000 CR3: 00000006048f9000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process ldlm_bl_24 (pid: 5694, threadinfo ffff8809688fe000, task ffff8808b1bbeaa0)
<4>Stack:
<4> ffff8809688ffe10 ffff880aee93b6c0 ffff88074323d6c0 ffff88060a790338
<4><d> ffff8809688ffe80 ffffffffa09189fa ffff8808b1bbf058 ffff8809688fffd8
<4><d> ffff880be73c6858 00000001b1bbf058 ffff880600000001 0000000000000000
<4>Call Trace:
<4> [<ffffffffa09189fa>] osc_ldlm_blocking_ast+0x7a/0x350 [osc]
<4> [<ffffffffa068bde0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
<4> [<ffffffffa068c331>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
<4> [<ffffffff81063410>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa068c0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa068c0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffffa068c0b0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>Code: e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 65 48 8b 04 25 c0 cb 00 00 48 39 86 90 00 00 00 48 89 fb 49 89 f4 74 56 48 8b 46 28 <4c> 8b 28 e8 ba 58 ff ff 41 0f b6 b5 96 00 00 00 85 f6 74 23 8b 
<1>RIP  [<ffffffffa053ba3e>] cl_lock_mutex_get+0x2e/0xd0 [obdclass]
<4> RSP <ffff8809688ffdf0>
<0>LustreError: 21790:0:(lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed: 
<0>LustreError: 21790:0:(lovsub_lock.c:103:lovsub_lock_state()) LBUG
<4>Pid: 21790, comm: ldlm_bl_76
<4>
<4>Call Trace:
<4> [<ffffffffa03b7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa03b7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa09b1d61>] lovsub_lock_state+0x1a1/0x1b0 [lov]
<4> [<ffffffffa03ccd94>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
<4> [<ffffffffa0539718>] cl_lock_state_signal+0x68/0x160 [obdclass]
<4> [<ffffffffa0539865>] cl_lock_state_set+0x55/0x190 [obdclass]
<4> [<ffffffffa053a8b3>] cl_lock_delete0+0x53/0x1d0 [obdclass]
<4> [<ffffffffa053ab83>] cl_lock_delete+0x153/0x1a0 [obdclass]
<4> [<ffffffffa0918ac6>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
<4> [<ffffffffa06691bc>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]

Comment by Daire Byrne (Inactive) [ 03/Dec/14 ]

We have not hit this again since applying the suggested patch. You can resolve the ticket. Cheers.

Generated at Sat Feb 10 01:51:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.