[LU-604] 1.8<->2.1 interop: RIP: ptlrpc:lustre_msg_buf+0x4/0x90 Created: 18/Aug/11  Updated: 31/May/12  Resolved: 12/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 1.8.8

Type: Bug
Priority: Major
Reporter: Jian Yu
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Lustre Clients:
Tag: 1.8.6-wc1
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.12.1.el5)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el5,ib_stack=ofa/
Network: IB (OFED 1.5.3.1)

Lustre Servers:
Branch: master
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.19.1.el5_lustre.gd4ea36c)
Build: http://newbuild.whamcloud.com/job/lustre-master/262/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
Network: IB (inkernel OFED)


Severity: 3
Rank (Obsolete): 6568

 Description   

While running the racer test, a Lustre 1.8.6-wc1 client (fat-amd-3) hit a kernel panic as follows:

Lustre: DEBUG MARKER: -----============= acceptance-small: racer ============----- Thu Aug 18 02:28:59 PDT 2011
general protection fault: 0000 [1] SMP 
last sysfs file: /block/lloop15/range
CPU 7 
Modules linked in: llite_lloop(U) lustre(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand powernow_k8 freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio iw_cxgb3(U) cxgb3(U) libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi rds(U) ib_sdp(U) ib_ipoib(U) ipoib_helper(U) rdma_ucm(U) rdma_cm(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) ib_cm(U) iw_cm(U) ib_addr(U) ipv6 xfrm_nalgo crypto_api ib_sa(U) loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp mlx4_ib(U) ib_mad(U) ib_core(U) igb 8021q tpm_tis tpm k10temp tpm_bios i2c_piix4 serio_raw sg dca hwmon pcspkr i2c_core mlx4_core(U) amd64_edac_mod edac_mc dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 27776, comm: rm Tainted: G      2.6.18-238.12.1.el5 #1
RIP: 0010:[]  [] :ptlrpc:lustre_msg_buf+0x4/0x90
RSP: 0018:ffff8100cdb65cc8  EFLAGS: 00010292
RAX: ffff81041e84f8e8 RBX: ffff81021f43b680 RCX: ffff81021adfe940
RDX: 00000000000000a8 RSI: 0000000000000002 RDI: 5a5a5a5a5a5a5a5a
RBP: ffff81021adfe940 R08: 0000000000000000 R09: 0000000000000000
R10: ffff810214393c00 R11: 0000000000000248 R12: ffff810214393c00
R13: ffff8100cdb65df8 R14: ffff81041a2c7b80 R15: ffff81041a2c7c50
FS:  00002afe193106e0(0000) GS:ffff810123aac2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000001770393c CR3: 00000002158c0000 CR4: 00000000000006e0
Process rm (pid: 27776, threadinfo ffff8100cdb64000, task ffff8100d39e7080)
Stack:  ffff8100d1098000 ffff81041a2c7ce0 0000000200000400 0000000000000a03
 0000000200000400 ffffffff88b074e7 ffff81021f43b680 ffff81021e5a5540
 ffff81021adfe940 ffff8100cdb65df8 0000000000000000 ffffffff88b0a4a3
Call Trace:

 [] :lustre:ll_och_fill+0x67/0x100
 [] :lustre:ll_local_open+0xe3/0x190
 [] :libcfs:cfs_alloc+0x68/0xc0
 [] :lustre:ll_file_open+0x956/0xd10
 [] :lustre:ll_file_open+0x0/0xd10
 [] __dentry_open+0xd9/0x1dc
 [] do_filp_open+0x2a/0x38
 [] do_sys_open+0x44/0xbe
 [] tracesys+0xd5/0xe0

Code: 8b 47 08 3d d0 0b d0 0b 74 09 3d d3 0b d0 0b 75 1b eb 0e 83 
RIP  [] :ptlrpc:lustre_msg_buf+0x4/0x90
 RSP 
 <0>Kernel panic - not syncing: Fatal exception
 <7>APIC error on CPU13: 00(04)
[root@fat-amd-3 ~]# gdb /lib/modules/2.6.18-238.12.1.el5/updates/kernel/fs/lustre/ptlrpc.ko
(gdb) l *(lustre_msg_buf+0x4)
0x47584 is in lustre_msg_buf (/var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/ofa/BUILD/BUILD/lustre-1.8.6/lustre/ptlrpc/pack_generic.c:603).
598     /var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/ofa/BUILD/BUILD/lustre-1.8.6/lustre/ptlrpc/pack_generic.c: No such file or directory.
        in /var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/ofa/BUILD/BUILD/lustre-1.8.6/lustre/ptlrpc/pack_generic.c
(gdb) 

[root@fat-amd-3 ~]# vi /usr/src/lustre-1.8.6/lustre/ptlrpc/pack_generic.c
    601 void *lustre_msg_buf(struct lustre_msg *m, int n, int min_size)
    602 {
    603         switch (m->lm_magic) {
    604         case LUSTRE_MSG_MAGIC_V1:
    605                 return lustre_msg_buf_v1(m, n - 1, min_size);
    606         case LUSTRE_MSG_MAGIC_V2:
    607                 return lustre_msg_buf_v2(m, n, min_size);
    608         default:
    609                 CERROR("incorrect message magic: %08x\n", m->lm_magic);
    610                 return NULL;
    611         }
    612 }
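
A note on reading the dump against the listing above: on x86_64 the first integer argument is passed in RDI, so at lustre_msg_buf+0x4 the register RDI holds the pointer m, and the oops shows RDI = 5a5a5a5a5a5a5a5a. That value is not a canonical x86_64 address, which would explain a general protection fault (rather than a page fault) when m->lm_magic is read at line 603. A repeated 0x5a byte is also a common poison fill for freed or uninitialised memory, so the pattern is consistent with a stale message pointer; that interpretation is an assumption, the ticket itself does not state it. The sketch below is only an illustration of those checks. The helper names (is_byte_fill, is_canonical_x86_64, le32) and the MSG_MAGIC_* macro names are hypothetical and are not Lustre APIs; the two magic values are taken from the immediates in the "Code:" bytes of the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Immediates seen in the oops "Code:" line (3d d0 0b d0 0b / 3d d3 0b d0 0b),
 * read as little-endian 32-bit values. Macro names are illustrative only. */
#define MSG_MAGIC_V1 0x0BD00BD0u
#define MSG_MAGIC_V2 0x0BD00BD3u

/* True when every byte of value equals fill (e.g. a 0x5a poison pattern). */
static bool is_byte_fill(uint64_t value, uint8_t fill)
{
    for (int i = 0; i < 8; i++) {
        if (((value >> (8 * i)) & 0xffu) != fill)
            return false;
    }
    return true;
}

/* True when addr is canonical for 48-bit x86_64 virtual addressing:
 * bits 63..47 must be all zeros or all ones. */
static bool is_canonical_x86_64(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffffu;
}

/* Decode a little-endian 32-bit immediate from raw instruction bytes. */
static uint32_t le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Applied to this dump, is_byte_fill(0x5a5a5a5a5a5a5a5a, 0x5a) holds and is_canonical_x86_64() rejects the same value, while an ordinary kernel pointer such as RAX (ffff81041e84f8e8) passes the canonical check.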

Maloo report: https://maloo.whamcloud.com/test_sets/4468ed3a-c97f-11e0-8d02-52540025f9af



 Comments   
Comment by Peter Jones [ 18/Aug/11 ]

Hongchao is looking into this one

Comment by Hongchao Zhang [ 22/Aug/11 ]

http://review.whamcloud.com/#change,1271

This patch has been tested successfully 5 times on Toro.

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Peter Jones [ 12/Dec/11 ]

Landed to b1_8. Please reopen if an equivalent fix is also needed for master.

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #160
LU-604 open non-exist object should return ENOENT (Revision 2fef7fd122f3f97bbf12339da70c4025bceb336e)

Result = SUCCESS
Johann Lombardi : 2fef7fd122f3f97bbf12339da70c4025bceb336e
Files :

  • lustre/llite/file.c
Comment by Peter Jones [ 12/Dec/11 ]

As per Hongchao, this is not needed on master.

Comment by Jay Lan (Inactive) [ 30/May/12 ]

After NASA upgraded our Lustre servers to 2.1.1, a front-end node running a 1.8.6 client hit this problem.

Comment by Jay Lan (Inactive) [ 31/May/12 ]

We have been hit four times already, on three different front-end nodes, since yesterday. Do you have confidence in this patch? Is it safe to take the commit?

Comment by Peter Jones [ 31/May/12 ]

Jay

The fix is included in our 1.8.8-wc1 release so I would say that we are confident in it.

Peter
