[LU-1020] 1.8.7 <-->2.1.54 racer OOPS Created: 20/Jan/12  Updated: 11/Aug/15  Resolved: 11/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.1.1
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Sarah Liu
Assignee: Hongchao Zhang
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

Client: 1.8.7, rhel5-x86_64
Server: lustre-master build #423 (2.1.54), rhel6-x86_64


Attachments: File log    
Severity: 3
Rank (Obsolete): 9736

 Description   

Lustre: DEBUG MARKER: ----============= acceptance-small: racer ============---- Fri Jan 20 18:54:29 PST 2012
general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 3
Modules linked in: nfs nfs_acl mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf be2iscsi iscsi_tcp bnx2i cnic uio cxgb3i libcxgbi iw_cxgb3 cxgb3 libiscsi_tcp ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en sg igb mlx4_core shpchp joydev i2c_i801 i7core_edac i2c_core tpm_tis tpm edac_mc 8021q tpm_bios pcspkr serio_raw dca dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 17088, comm: rm Tainted: G ---- 2.6.18-274.3.1.el5 #1
RIP: 0010:[<ffffffff88924584>] [<ffffffff88924584>] :ptlrpc:lustre_msg_buf+0x4/0x90
RSP: 0000:ffff8102fce0fcc8 EFLAGS: 00010292
RAX: ffff8102f9e29d20 RBX: ffff81033f80d9c0 RCX: 0000000000000001
RDX: 00000000000000a8 RSI: 0000000000000002 RDI: 5a5a5a5a5a5a5a5a
RBP: ffff810321a09dc0 R08: 4f1a28e800000003 R09: 0000000000000177
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8102fe497800
R13: ffff8102fce0fdf8 R14: ffff81032032f400 R15: ffff81032032f4d0
FS: 00002b7833e086e0(0000) GS:ffff81010b7636c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006b32a0 CR3: 000000031cc0a000 CR4: 00000000000006a0
Process rm (pid: 17088, threadinfo ffff8102fce0e000, task ffff810320dc2820)
Stack: ffff810327bec000 ffff81032032f560 0000000200000401 000000000000da06
0000000200000401 ffffffff88b094d7 ffff81033f80d9c0 ffff81032c7e1a80
ffff810321a09dc0 ffff8102fce0fdf8 0000000000000000 ffffffff88b0c493
Call Trace:
[<ffffffff88b094d7>] :lustre:ll_och_fill+0x67/0x100
[<ffffffff88b0c493>] :lustre:ll_local_open+0xe3/0x190
[<ffffffff887b8378>] :libcfs:cfs_alloc+0x68/0xc0
[<ffffffff88b0dae6>] :lustre:ll_file_open+0x956/0xd10
[<ffffffff88b0d190>] :lustre:ll_file_open+0x0/0xd10
[<ffffffff8001ec04>] __dentry_open+0xd9/0x1dc
[<ffffffff80027729>] do_filp_open+0x2a/0x38
[<ffffffff800eb454>] do_rmdir+0xcd/0xde
[<ffffffff8001a089>] do_sys_open+0x44/0xbe
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 8b 47 08 3d d0 0b d0 0b 74 09 3d d3 0b d0 0b 75 1b eb 0e 83
RIP [<ffffffff88924584>] :ptlrpc:lustre_msg_buf+0x4/0x90
RSP <ffff8102fce0fcc8>
<0>Kernel panic - not syncing: Fatal exception
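
The `Code:` bytes in the oops can be read by hand to see what `lustre_msg_buf` was doing when it faulted. The sketch below is a deliberately tiny decoder, not a general disassembler: it only understands the handful of opcodes present in this dump, and it assumes the byte dump starts at the faulting instruction. The magic-value names are taken from Lustre's wire-protocol header as recalled and should be checked against the actual source tree. Under those assumptions, the first instruction (`8b 47 08`) loads a 32-bit field at offset 8 from the pointer in RDI, and the two following compares test it against the two message magics.

```python
import struct

# Opcode bytes copied from the "Code:" line of the oops.
CODE = "8b 47 08 3d d0 0b d0 0b 74 09 3d d3 0b d0 0b 75 1b eb 0e 83"

# Assumption: these names/values match the Lustre headers; verify in your tree.
KNOWN_MAGICS = {
    0x0BD00BD0: "LUSTRE_MSG_MAGIC_V1",
    0x0BD00BD3: "LUSTRE_MSG_MAGIC_V2",
}


def cmp_eax_immediates(code_hex):
    """Return the imm32 operand of each `cmp eax, imm32` (opcode 0x3d).

    Only the opcodes seen in this dump are handled; decoding stops at the
    first unknown or truncated instruction.
    """
    data = bytes.fromhex(code_hex.replace(" ", ""))
    i, found = 0, []
    while i < len(data):
        op = data[i]
        if op == 0x3D:  # cmp eax, imm32 (5 bytes)
            if i + 5 > len(data):
                break
            found.append(struct.unpack_from("<I", data, i + 1)[0])
            i += 5
        elif (op == 0x8B and i + 2 < len(data)
              and data[i + 1] >> 6 == 1 and data[i + 1] & 7 != 4):
            i += 3  # mov r32, [reg + disp8] (opcode, ModRM, disp8)
        elif op in (0x74, 0x75, 0xEB):  # je / jne / jmp with rel8
            i += 2
        else:
            break  # unknown opcode or truncated tail
    return found


immediates = cmp_eax_immediates(CODE)
for imm in immediates:
    print(f"cmp eax, {imm:#010x}  ({KNOWN_MAGICS.get(imm, 'unknown')})")
```

If this reading is right, the fault happens on the very first access through RDI, before either magic comparison can run.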



 Comments   
Comment by Peter Jones [ 06/Feb/12 ]

Hongchao

Could you please look into this one?

Peter

Comment by Hongchao Zhang [ 08/Feb/12 ]

The ptlrpc_request stored in lookup_intent->d.lustre.it_data has been freed (note that the DISP_ENQ_OPEN_REF flag is set in the lookup_intent!),
which causes this general protection fault. It is difficult to determine where the request is freed just by reading the code,
so I will try to create a debug patch to collect more debug info.
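
The freed-request reading is consistent with the register dump in the description: RDI, which carries the first function argument under the x86_64 System V calling convention, holds `5a5a5a5a5a5a5a5a`. A value made of one repeated byte is typical of memory poisoning, and it is not a canonical x86_64 address, so dereferencing it raises a general protection fault rather than a page fault. The sketch below shows one way to spot such registers mechanically. The helper names are hypothetical, and the claim that `0x5a` is the fill byte used when the request was freed is an assumption, not something the log proves.

```python
import re

# Register lines copied from the oops in the description.
OOPS_REGS = """\
RAX: ffff8102f9e29d20 RBX: ffff81033f80d9c0 RCX: 0000000000000001
RDX: 00000000000000a8 RSI: 0000000000000002 RDI: 5a5a5a5a5a5a5a5a
RBP: ffff810321a09dc0 R08: 4f1a28e800000003 R09: 0000000000000177
"""


def parse_registers(text):
    """Map register name -> value for 'RAX: <16 hex digits>' style fields."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def is_repeated_byte(value, width=8):
    """True if every byte of the value is the same non-zero byte."""
    raw = value.to_bytes(width, "little")
    return raw[0] != 0 and raw == raw[:1] * width


def is_canonical(addr):
    """True if bits 63..47 are all equal (48-bit x86_64 canonical form)."""
    top = addr >> 47
    return top in (0, 0x1FFFF)


def find_poisoned(regs):
    """Registers that look like a poison fill and cannot be valid pointers."""
    return {n: v for n, v in regs.items()
            if is_repeated_byte(v) and not is_canonical(v)}


suspects = find_poisoned(parse_registers(OOPS_REGS))
for name, value in suspects.items():
    print(f"{name} = {value:#018x} looks like poisoned memory")
```

Requiring both conditions keeps ordinary non-pointer values such as R08 from being flagged.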

Comment by Jian Yu [ 15/Feb/12 ]

Lustre Clients:
Tag: 1.8.7-wc1
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.3.1.el5)
Build: http://build.whamcloud.com/job/lustre-b1_8/171/
Network: TCP (1GigE)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_1_1_0_RC2
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.12.1.el5_lustre.g4554b65)
Build: http://build.whamcloud.com/job/lustre-b2_1/41/
Network: TCP (1GigE)

The same issue occurred while running racer test: https://maloo.whamcloud.com/test_sets/1c5098c0-579b-11e1-99fa-5254004bbbd3

Comment by Hongchao Zhang [ 16/Feb/12 ]

the debug patch is tracked at http://review.whamcloud.com/#change,2152

Comment by Andreas Dilger [ 16/Feb/12 ]

Please note that fixes for 1.8/2.x interop bugs almost always need to be made on the 2.x server side, since it is not possible to retroactively fix the existing 1.8.x releases.

Comment by Sarah Liu [ 16/Feb/12 ]

Please find the log attached. If you need any other information, please let me know.

Comment by Hongchao Zhang [ 16/Feb/12 ]

The attached log is a little confusing: there is a kernel panic at __audit_syscall_exit, then the system
is rebooted and it hits an OOM while loading the modules at boot. This looks much more like an issue related to
the Linux system environment.

Comment by Jay Lan (Inactive) [ 30/May/12 ]

How does this differ from LU-604? Does the client include the fix of LU-604?
We at NASA got hit by this after we upgraded servers to 2.1.1 last week. The stack
trace of this looks similar to that of LU-604...

Comment by Hongchao Zhang [ 14/Aug/12 ]

Yes, it does look similar to LU-604. As per the comment in LU-604, the patch could not be included (it is included in 1.8.8-wc1).

Generated at Sat Feb 10 01:12:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.