[LU-1280] kernel BUG at .../lustre-2.1.0/lustre/lvfs/fsfilt-ldiskfs.c:978 Created: 03/Apr/12  Updated: 22/May/12  Resolved: 22/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.3.0, Lustre 2.1.2

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

kernel 2.6.32-220.7.1.3chaos.ch5.x86_64
https://github.com/chaos/lustre/branches/2.1.0-llnl


Severity: 3
Rank (Obsolete): 4617

 Description   

We hit this assertion in ldiskfs_ext_new_extent_cb() on a production OSS.

 978         LASSERTF(i == path->p_depth ||
 979                  EXT_GENERATION(base) != path[0].p_generation,
 980                  "base vs path extent depth:%d != %d, generation:%lu == %lu\n",
 981                  i, path->p_depth, EXT_GENERATION(base), path[0].p_generation);

Here is the stack trace. We didn't capture the message from LASSERTF, and the crash dump failed.

2012-04-02 16:06:58 kernel BUG at /builddir/build/BUILD/lustre-2.1.0/lustre/lvfs/fsfilt-ldiskfs.c:978!
2012-04-02 16:06:58 invalid opcode: 0000 [#1] SMP
2012-04-02 16:06:58 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
2012-04-02 16:06:58 CPU 4
2012-04-02 16:06:58 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) lustre(U) lov(U) osc(U)
lquota(U) m
2012-04-02 16:06:58
2012-04-02 16:06:58 Pid: 8224, comm: ll_ost_io_146 Not tainted 2.6.32-220.7.1.3chaos.ch5.x86_64 #1 Supermicro
X8DTH-i/6/iF/6F/X8
2012-04-02 16:06:58 RIP: 0010:[<ffffffffa0aab4c1>]  [<ffffffffa0aab4c1>] ldiskfs_ext_new_extent_cb+0x5a1/0x660
[fsfilt_ldiskfs]
2012-04-02 16:06:59 RSP: 0018:ffff880270add5d0  EFLAGS: 00010246
2012-04-02 16:06:59 RAX: 0000000000000038 RBX: ffff880270add730 RCX: 0000000000000008
2012-04-02 16:06:59 RDX: 0000000000000001 RSI: ffff8804cd11aec0 RDI: ffff88012cf7a5c0
2012-04-02 16:06:59 RBP: ffff880270add670 R08: ffff88012cf7a4c8 R09: ffff8804cd11aec0
2012-04-02 16:06:59 R10: ffff88023b5ec000 R11: 0000000000000fff R12: ffff880270add6b0
2012-04-02 16:06:59 R13: ffff88012cf7a5c0 R14: ffff8804cd11aec0 R15: ffff88012cf7a5c0
2012-04-02 16:06:59 FS:  00002aaaab06eb20(0000) GS:ffff88034ac00000(0000) knlGS:0000000000000000
2012-04-02 16:06:59 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-04-02 16:06:59 CR2: 00002aaaac9ce000 CR3: 0000000001a85000 CR4: 00000000000006e0
2012-04-02 16:06:59 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-04-02 16:06:59 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-04-02 16:06:59 Process ll_ost_io_146 (pid: 8224, threadinfo ffff880270adc000, task ffff880270adb580)
2012-04-02 16:06:59 Stack:
2012-04-02 16:06:59  ffff880270add620 ffffffff81273646 0000000000000282 ffff88012cf7a5b0
2012-04-02 16:06:59 <0> ffff88012cf7a4c8 000000000000216c ffff8804cd11aec0 ffff88012cf7a4c8
2012-04-02 16:06:59 <0> 0000000000000000 ffff88012cf7a5c0 ffff880270add670 ffffffff81278c08
2012-04-02 16:06:59 Call Trace:
2012-04-02 16:06:59  [<ffffffff81273646>] ? rwsem_wake+0x76/0x170
2012-04-02 16:06:59  [<ffffffff81278c08>] ? call_rwsem_wake+0x18/0x30
2012-04-02 16:06:59  [<ffffffffa04d639f>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs]
2012-04-02 16:06:59  [<ffffffffa0aaaf20>] ? ldiskfs_ext_new_extent_cb+0x0/0x660 [fsfilt_ldiskfs]
2012-04-02 16:06:59  [<ffffffffa0aaac83>] fsfilt_map_nblocks+0xa3/0xd0 [fsfilt_ldiskfs]
2012-04-02 16:06:59  [<ffffffffa0aaadc3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x1e0 [fsfilt_ldiskfs]
2012-04-02 16:06:59  [<ffffffff811285e7>] ? ____pagevec_lru_add+0x167/0x180
2012-04-02 16:06:59  [<ffffffffa0aaaf15>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
2012-04-02 16:06:59  [<ffffffff81111778>] ? add_to_page_cache_lru+0x68/0x80
2012-04-02 16:06:59  [<ffffffffa0ae62d9>] filter_do_bio+0xec9/0x14a0 [obdfilter]
2012-04-02 16:06:59  [<ffffffff81112388>] ? find_or_create_page+0x68/0xb0
2012-04-02 16:06:59  [<ffffffff81012b59>] ? read_tsc+0x9/0x20
2012-04-02 16:06:59  [<ffffffffa0ade8bb>] filter_preprw_write+0x12db/0x1870 [obdfilter]
2012-04-02 16:06:59  [<ffffffffa040bc21>] ? lnet_ni_send+0x51/0xe0 [lnet]
2012-04-02 16:06:59  [<ffffffffa06d9193>] ? null_alloc_rs+0x173/0x2b0 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa0adf9f8>] filter_preprw+0x68/0x80 [obdfilter]
2012-04-02 16:06:59  [<ffffffffa0a7d33e>] obd_preprw+0x11e/0x2f0 [ost]
2012-04-02 16:06:59  [<ffffffffa0a84827>] ost_brw_write+0x807/0x1420 [ost]
2012-04-02 16:06:59  [<ffffffffa06a6544>] ? lustre_msg_get_version+0x54/0x90 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa06a65f8>] ? lustre_msg_check_version+0x78/0x90 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa0a880e8>] ost_handle+0x2608/0x3950 [ost]
2012-04-02 16:06:59  [<ffffffffa06a6944>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa06b27a1>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-04-02 16:06:59  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-04-02 16:06:59  [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-04-02 16:06:59  [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-04-02 16:06:59  [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Oleg Drokin [ 03/Apr/12 ]

Do you have this patch from LU-216: http://review.whamcloud.com/491 ?

Comment by Ned Bass [ 03/Apr/12 ]

Hi Oleg,

Yes we do have that patch.

Comment by Alex Zhuravlev [ 03/Apr/12 ]

I'm sorry - this is my fault, I left this LASSERTF for debugging purposes to make sure the race really happens
and forgot to remove from the final patch.

ldiskfs_ext_walk_space() takes the semaphore protecting the tree, finds an extent for required offset, makes
a copy of the extent and releases the semaphore. then process the extent and call the callback. at this point
the tree is not owned by this thread, so another thread can change it. the callback will grab the semaphore
again if allocation is required. once the semaphore is taken, the callback have to check whether the tree is
the same, if not - repeat search.

the solution is to remove LASSERTF(). sorry again.

Comment by Alex Zhuravlev [ 03/Apr/12 ]

btw, is it possible to find what LASSERTF() did print ?

Comment by Ned Bass [ 03/Apr/12 ]

Good to know there's an easy explanation.

I did look for the LASSERTF message in the console log and on our syslog server, but somehow it seems to have gotten lost.

Comment by Peter Jones [ 03/Apr/12 ]

Yu Jian

Could you please make the necessary correction here?

Thanks

Peter

Comment by Jian Yu [ 05/Apr/12 ]

Patch for master branch is in http://review.whamcloud.com/2452.
The patch is also needed on b2_1 and b2_2 branches.

Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,client,el6,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,server,el5,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,client,el5,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Build Master (Inactive) [ 29/Apr/12 ]

Integrated in lustre-master » i686,server,el6,ofa #497
LU-1280 ldiskfs: remove LASSERTF from ext3_ext_new_extent_cb() (Revision b86b6eedc2e0d125238b6f8cbc176c19697477eb)

Result = SUCCESS
Oleg Drokin : b86b6eedc2e0d125238b6f8cbc176c19697477eb
Files :

  • lustre/lvfs/fsfilt_ext3.c
Comment by Jian Yu [ 03/May/12 ]

Patch for b2_1 branch: http://review.whamcloud.com/2639
Patch for b2_2 branch: http://review.whamcloud.com/2640

Comment by Ned Bass [ 03/May/12 ]

Actually it appears that we were hitting the EXT_ASSERT on lustre/lvfs/fsfilt_ext3.c:977

976         i = EXT_DEPTH(base);
977         EXT_ASSERT(path[i].p_hdr);                                     

This explains why we never saw the LASSERTF message. We hit this again after applying http://review.whamcloud.com/2639.

Comment by Alex Zhuravlev [ 15/May/12 ]

Ned, thanks for the report. this assertion is invalid as well (since new locking was introduced in ext4_ext_walk_space())

Comment by Jian Yu [ 17/May/12 ]

this assertion is invalid as well (since new locking was introduced in ext4_ext_walk_space())

Patch for master branch: http://review.whamcloud.com/2827
Patch for b2_1 branch: http://review.whamcloud.com/2828
Patch for b2_2 branch was also updated.

Comment by Peter Jones [ 22/May/12 ]

Landed for 2.1.2 and 2.3

Generated at Sat Feb 10 01:15:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.