[LU-2970] ASSERTION( !list_empty(&h->loh_layers) ) failed, followed by a kernel panic Created: 15/Mar/13 Updated: 28/Mar/13 Resolved: 28/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | ETHz Support (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64) |
||
| Severity: | 3 |
| Rank (Obsolete): | 7239 |
| Description |
|
One of our Lustre clients crashed yesterday with the following kernel panic:

2013-03-14T16:24:25+01:00 brutus3 LustreError: 4488:0:(lu_object.h:759:lu_object_top()) ASSERTION( !list_empty(&h->loh_layers) ) failed:

Unfortunately I have no idea which process triggered the panic: the affected node is a login node and about 50 people were logged in at the time, so I have no easy way to reproduce the crash. The Lustre kernel module was compiled from the v2_3_61_0 git tag. |
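For context on the assertion itself: lu_object_top() returns the top layer of a compound Lustre object by taking the first entry of the header's loh_layers list, so an empty list means the object has already been torn down (or was never fully set up) when something still tried to use it. The user-space sketch below models that pattern; the struct and field names mirror the Lustre ones, but the list helpers and setup code are illustrative only, not the actual kernel implementation.

    /* Minimal user-space model of the lu_object_top() pattern; names mirror
     * Lustre, everything else is illustrative. */
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    static void list_init(struct list_head *h) { h->next = h->prev = h; }
    static int list_empty(const struct list_head *h) { return h->next == h; }
    static void list_add_tail(struct list_head *n, struct list_head *h)
    {
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
    }

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    /* Compound object header: every layer (vvp/lov/osc, ...) links itself here. */
    struct lu_object_header { struct list_head loh_layers; };
    struct lu_object {
        struct list_head  lo_linkage;   /* entry in loh_layers */
        const char       *lo_name;
    };

    /* The accessor the assertion lives in: the top layer is simply the first
     * list entry, so an empty list means the object is already gone - which
     * is exactly what the LBUG on brutus3 reports. */
    static struct lu_object *lu_object_top(struct lu_object_header *h)
    {
        assert(!list_empty(&h->loh_layers));   /* ASSERTION( !list_empty(...) ) */
        return container_of(h->loh_layers.next, struct lu_object, lo_linkage);
    }

    int main(void)
    {
        struct lu_object_header hdr;
        struct lu_object vvp = { .lo_name = "vvp" };

        list_init(&hdr.loh_layers);
        list_add_tail(&vvp.lo_linkage, &hdr.loh_layers);
        printf("top layer: %s\n", lu_object_top(&hdr)->lo_name);

        /* Simulate the failure mode: all layers removed, header still used. */
        list_init(&hdr.loh_layers);
        lu_object_top(&hdr);                   /* assertion fires here */
        return 0;
    }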
| Comments |
| Comment by Peter Jones [ 17/Mar/13 ] |
|
Adrian

Do I understand correctly that you are running a pre-release version of 2.4 in production?

Peter |
| Comment by Adrian Ulrich (Inactive) [ 17/Mar/13 ] |
|
Yes, our compute nodes/clients are running git versions of the Lustre client (the servers are running stock 2.2.0 - we will upgrade them to 2.4.0 after the release). I am aware that this might not be a good idea (well, someone has to test it). |
| Comment by Peter Jones [ 17/Mar/13 ] |
|
Adrian

As long as you are aware of the risks of running pre-release software then of course I am delighted that we are able to get feedback from a real production environment - 2.4 will be a better release for it. While the focus of feature releases is always the new features provided, we also include all known bug fixes, and the vast majority of the issues exposed by sites running 2.x releases have been issues in the underlying 2.0 code that we have built upon, rather than regressions associated with the new features. So, while I am disappointed to hear that you have had poor stability with 2.2 and 2.3 (others have reported a far better experience), I am not surprised to hear that things have been improving.

Do you mind if I mention publicly (in updates to the mailing lists, in presentations about Lustre 2.4) that ETHZ is doing this?

Oleg

Could you please review this report and advise on next steps? Is there enough to work with here? If not, can you advise what Adrian should collect in the event of a future recurrence?

Thanks

Peter |
| Comment by Adrian Ulrich (Inactive) [ 17/Mar/13 ] |
|
> Do you mind if I mention publicly (on updates to the mailing lists, in presentations about Lustre 2.4) that ETHZ is doing this?

I don't mind: that's fine with me. |
| Comment by Peter Jones [ 17/Mar/13 ] |
|
Great - thanks Adrian! |
| Comment by Andreas Dilger [ 18/Mar/13 ] |
|
Jinshan, can you please take a look at this to see if anything is obvious? |
| Comment by Jinshan Xiong (Inactive) [ 18/Mar/13 ] |
|
Obviously the object had already been freed when this issue happened. Hmm, did you set up crashdump on that machine, or is it impossible to collect a Lustre log? |
| Comment by Oleg Drokin [ 19/Mar/13 ] |
|
I hit a very similar bug last Sunday. I have a crashdump in /exports/crashdumps/192.168.10.218-2013-03-17-21\:29\:49/

[363112.577950] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[363112.578318] last sysfs file: /sys/devices/system/cpu/possible
[363112.578589] CPU 1
[363112.578637] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[363112.580600]
[363112.580600] Pid: 451, comm: ldlm_bl_45 Not tainted 2.6.32-debug #6 Bochs Bochs
[363112.580600] RIP: 0010:[<ffffffffa0f9c90b>]  [<ffffffffa0f9c90b>] cl_object_top+0x1b/0x150 [obdclass]
[363112.580600] RSP: 0018:ffff88009e0edb90  EFLAGS: 00010206
[363112.580600] RAX: 000130b38d4c0000 RBX: ffff88000bfb1db0 RCX: ffff880080abef60
[363112.580600] RDX: 000130b38d4c0000 RSI: ffffffffa04b1940 RDI: ffff88004ee8deb0
[363112.580600] RBP: ffff88009e0edba0 R08: 0000000000000000 R09: 0000000000000000
[363112.580600] R10: 0000000000000003 R11: 000000000000000f R12: ffff8800a89e6f30
[363112.580600] R13: ffff88003e04df50 R14: ffff88004ee8deb0 R15: ffff8800790bbc18
[363112.580600] FS:  00007f8c05205700(0000) GS:ffff880006280000(0000) knlGS:0000000000000000
[363112.580600] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[363112.580600] CR2: 00007ff2a83e2cf6 CR3: 0000000072edd000 CR4: 00000000000006e0
[363112.580600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[363112.580600] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[363112.587593] Process ldlm_bl_45 (pid: 451, threadinfo ffff88009e0ec000, task ffff880088694440)
[363112.587593] Stack:
[363112.587593]  ffff88009e0edbb0 ffff88000bfb1db0 ffff88009e0edbb0 ffffffffa0f9ca4e
[363112.587593] <d> ffff88009e0edbf0 ffffffffa0488788 0000000000000002 ffff88003e04df50
[363112.587593] <d> ffff88003e04df50 ffff8800a89e6f30 ffff8800a89e6f30 ffff88009e0edca0
[363112.587593] Call Trace:
[363112.587593]  [<ffffffffa0f9ca4e>] cl_object_attr_lock+0xe/0x20 [obdclass]
[363112.587593]  [<ffffffffa0488788>] osc_lock_detach+0xe8/0x1a0 [osc]
[363112.587593]  [<ffffffffa0488888>] osc_lock_delete+0x48/0xc0 [osc]
[363112.587593]  [<ffffffffa0fa4ce5>] cl_lock_delete0+0xb5/0x1d0 [obdclass]
[363112.587593]  [<ffffffffa0fa4f53>] cl_lock_delete+0x153/0x1a0 [obdclass]
[363112.587593]  [<ffffffffa048a4f6>] osc_ldlm_blocking_ast+0x146/0x350 [osc]
[363112.587593]  [<ffffffffa10c906c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
[363112.587593]  [<ffffffffa10e30da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
[363112.587593]  [<ffffffffa10e7bd0>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
[363112.587593]  [<ffffffffa0488ede>] osc_lock_cancel+0xfe/0x1c0 [osc]
[363112.587593]  [<ffffffffa0fa37c5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[363112.587593]  [<ffffffffa0fa436b>] cl_lock_cancel+0x13b/0x140 [obdclass]
[363112.587593]  [<ffffffffa048a4ea>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
[363112.587593]  [<ffffffffa10eb970>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[363112.587593]  [<ffffffffa10ebec9>] ldlm_bl_thread_main+0x289/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffffa10ebc40>] ? ldlm_bl_thread_main+0x0/0x3e0 [ptlrpc]
[363112.587593]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[363112.587593] Code: c7 a0 e2 fe a0 e8 e6 95 e9 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 07 0f 1f 80 00 00 00 00 48 89 c2 <48> 8b 80 b0 00 00 00 48 85 c0 75 f1 48 8b 42 48 48 83 c2 48 48 |
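This trace looks like the same underlying problem as the original LBUG, just caught later: by the time cl_object_top() walks loh_layers from the blocking-AST path (osc_ldlm_blocking_ast -> osc_lock_detach -> cl_object_attr_lock), the header memory appears to have been freed and reused, so instead of a clean LASSERT on an empty-but-valid list the kernel dereferences a wild pointer and takes a general protection fault. A small, purely illustrative sketch of the two failure modes (not Lustre code; the 0x5a value is just a stand-in for slab poisoning):

    /* Illustrative only: why a torn-down object can show up either as the
     * LASSERT in the original report or as the GPF in this trace. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct list_head { struct list_head *next, *prev; };
    struct lu_object_header { struct list_head loh_layers; };

    int main(void)
    {
        struct lu_object_header *h = malloc(sizeof(*h));

        /* Case 1: all layers removed but the header is still valid memory.
         * The list is empty, so a lu_object_top()-style accessor trips its
         * LASSERT (the brutus3 crash). */
        h->loh_layers.next = h->loh_layers.prev = &h->loh_layers;
        printf("empty list -> LASSERT: %s\n",
               h->loh_layers.next == &h->loh_layers ? "yes" : "no");

        /* Case 2: the header itself was freed and the memory reused or
         * poisoned. loh_layers.next is now a wild pointer, and walking it
         * from cl_object_top() means a general protection fault (this
         * trace). We only print the pointer instead of dereferencing it. */
        memset(h, 0x5a, sizeof(*h));
        printf("poisoned next pointer: %p (dereferencing this would fault)\n",
               (void *)h->loh_layers.next);

        free(h);
        return 0;
    }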
| Comment by Oleg Drokin [ 19/Mar/13 ] |
|
Adrian, with panics happening about once per week, do you have any other interesting panics that you could share with us? |
| Comment by Adrian Ulrich (Inactive) [ 20/Mar/13 ] |
|
@Jinshan Xiong: Unfortunately, crashdump was not enabled on this kind of node. It is now enabled, and I should be able to provide a crashdump if it happens again.

@Oleg Drokin: No, I don't have any other interesting panics right now. |
| Comment by Jinshan Xiong (Inactive) [ 20/Mar/13 ] |
|
Hi Adrian, I'm going to work out a debug patch. From the symptoms so far, the top object was freed while a sublock was still being canceled. This must be a race, but I need more information. |
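To make the suspected race easier to picture: if the cancel path reaches the top object through a raw pointer while another thread drops the last reference, the free can land in the middle of the cancellation. The usual way to close such a window is to pin the top object with its own reference for the duration of the cancel. The sketch below shows that general pattern with a plain C11 refcount and a pthread (build with -pthread); it is only an illustration of the technique, not the actual fix that ended up in http://review.whamcloud.com/5812.

    /* Illustrative reference-pinning sketch, not Lustre code: the cancel path
     * holds its own reference, so a concurrent release cannot free the object
     * underneath the sublock cancellation. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct top_object {
        atomic_int refcount;
        int        attr;               /* stands in for cl_object attributes */
    };

    static void obj_get(struct top_object *o)
    {
        atomic_fetch_add(&o->refcount, 1);
    }

    static void obj_put(struct top_object *o)
    {
        if (atomic_fetch_sub(&o->refcount, 1) == 1) {
            printf("last reference dropped, freeing object\n");
            free(o);
        }
    }

    /* Sublock cancellation: works on an object that the caller pinned for it,
     * so the owner dropping its reference concurrently is harmless. */
    static void *cancel_thread(void *arg)
    {
        struct top_object *o = arg;    /* pinned before the thread started */

        printf("cancel path reads attr = %d safely\n", o->attr);
        obj_put(o);                    /* may be the put that frees it */
        return NULL;
    }

    int main(void)
    {
        struct top_object *o = calloc(1, sizeof(*o));
        pthread_t t;

        atomic_init(&o->refcount, 1);  /* owner's reference */
        o->attr = 42;

        obj_get(o);                    /* pin for the cancel path up front */
        pthread_create(&t, NULL, cancel_thread, o);
        obj_put(o);                    /* owner may release concurrently */
        pthread_join(t, NULL);
        return 0;
    }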
| Comment by Jinshan Xiong (Inactive) [ 21/Mar/13 ] |
|
I've found the root cause of this problem and will compose a patch. |
| Comment by Jinshan Xiong (Inactive) [ 22/Mar/13 ] |
|
patch is at: http://review.whamcloud.com/5812 |
| Comment by Adrian Ulrich (Inactive) [ 25/Mar/13 ] |
|
Thanks! I'll rebuild our client-RPM with the patch included ASAP. |
| Comment by Peter Jones [ 28/Mar/13 ] |
|
Landed for 2.4 |