[LU-481] sanity test_119d fails (ASSERTION((struct cl_page *)vmpage->private != slice->cpl_page) failed) Created: 04/Jul/11  Updated: 13/Jul/11  Resolved: 13/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.0

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Niu Yawei (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4261

 Description   

This issue was created by maloo for bobijam <bobijam@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/fe768762-a677-11e0-bd2a-52540025f9af.

11:29:15:Lustre: DEBUG MARKER: == sanity test 119d: The DIO path should try to send a new rpc once one is completed ================= 11:29:13 (1309804153)
11:29:18:LustreError: 23078:0:(vvp_page.c:77:vvp_page_fini()) ASSERTION((struct cl_page *)vmpage->private != slice->cpl_page) failed
11:29:18:LustreError: 23078:0:(vvp_page.c:77:vvp_page_fini()) LBUG
11:29:18:Pid: 23078, comm: cat
11:29:18:
11:29:18:Call Trace:
11:29:18: [<ffffffff887815f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
11:29:18: [<ffffffff88781b2a>] lbug_with_loc+0x7a/0xd0 [libcfs]
11:29:18: [<ffffffff8878cd70>] cfs_tracefile_init+0x0/0x10a [libcfs]
11:29:18: [<ffffffff88be49c0>] vvp_page_fini+0x30/0x40 [lustre]
11:29:18: [<ffffffff888ac53e>] cl_page_free+0x32e/0x490 [obdclass]
11:29:18: [<ffffffff88b34e7a>] lov_page_init+0xaa/0xd0 [lov]
11:29:18: [<ffffffff888ad3c4>] cl_page_find0+0x544/0x8a0 [obdclass]
11:29:18: [<ffffffff88be471d>] vvp_page_disown+0x6d/0x90 [lustre]
11:29:19: [<ffffffff888ac84b>] cl_page_put+0x1ab/0x3c0 [obdclass]
11:29:19: [<ffffffff88bbb4a6>] ll_cl_init+0x336/0x440 [lustre]
11:29:19: [<ffffffff800ce6ab>] zone_statistics+0x3e/0x6d
11:29:19: [<ffffffff80154f48>] radix_tree_node_alloc+0x18/0x57
11:29:19: [<ffffffff88bbbac3>] ll_readpage+0x83/0x1d0 [lustre]
11:29:19: [<ffffffff8000c8cd>] add_to_page_cache+0xaa/0xc1
11:29:20: [<ffffffff8000c4bb>] do_generic_mapping_read+0x20d/0x359
11:29:20: [<ffffffff8000d279>] file_read_actor+0x0/0x159
11:29:20: [<ffffffff8000c753>] __generic_file_aio_read+0x14c/0x198
11:29:20: [<ffffffff800c8b79>] generic_file_readv+0x8f/0xa8
11:29:20: [<ffffffff800a28ec>] autoremove_wake_function+0x0/0x2e
11:29:20: [<ffffffff888bb638>] lu_ref_del+0x118/0x270 [obdclass]
11:29:20: [<ffffffff8000ba7c>] touch_atime+0x67/0xaa
11:29:20: [<ffffffff88be706b>] vvp_io_read_start+0x2bb/0x400 [lustre]
11:29:20: [<ffffffff888b429f>] cl_wait+0x19f/0x240 [obdclass]
11:29:20: [<ffffffff888b577c>] cl_io_start+0xbc/0x160 [obdclass]
11:29:21: [<ffffffff888b916d>] cl_io_loop+0xad/0x1a0 [obdclass]
11:29:21: [<ffffffff80064604>] __down_read+0x12/0x92
11:29:21: [<ffffffff88b903ce>] ll_file_io_generic+0x36e/0x4b0 [lustre]
11:29:21: [<ffffffff80008ff9>] __handle_mm_fault+0x890/0x1039
11:29:21: [<ffffffff888a8ec5>] cl_env_get+0x25/0x300 [obdclass]
11:29:21: [<ffffffff88b9096d>] ll_file_readv+0x1dd/0x280 [lustre]
11:29:21: [<ffffffff888a9002>] cl_env_get+0x162/0x300 [obdclass]
11:29:21: [<ffffffff88b9f0a4>] ll_file_read+0x164/0x210 [lustre]
11:29:22: [<ffffffff8000e2a5>] do_mmap_pgoff+0x615/0x780
11:29:22: [<ffffffff8000b78d>] vfs_read+0xcb/0x171
11:29:22: [<ffffffff80011d34>] sys_read+0x45/0x6e
11:29:23: [<ffffffff8005d28d>] tracesys+0xd5/0xe0



 Comments   
Comment by Peter Jones [ 05/Jul/11 ]

Niu will look into this one

Comment by Niu Yawei (Inactive) [ 05/Jul/11 ]

In cl_page_alloc(), when o->co_ops->coo_page_init() fails, we call cl_page_free() to free the cl_page without first calling cl_page_delete0(). This can trigger the ASSERT in cl_page_free(), because the linkage between the cl_page and the vmpage has not yet been broken by cl_page_delete0().
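
To illustrate the ordering problem described above, here is a minimal userspace sketch. It is not the Lustre code: every toy_* name is hypothetical, the structures are reduced to the single pointer that matters, and the failing coo_page_init() is only simulated. It assumes the assertion amounts to "the vmpage must no longer point back at the cl_page when the page is freed".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures; not the real Lustre definitions. */
struct toy_vmpage {
        void *private_data;             /* models vmpage->private */
};

struct toy_cl_page {
        struct toy_vmpage *vmpage;
};

/* Models cl_page_delete0(): break the vmpage -> cl_page linkage. */
static void toy_page_delete0(struct toy_cl_page *pg)
{
        if (pg->vmpage != NULL && pg->vmpage->private_data == pg)
                pg->vmpage->private_data = NULL;
}

/*
 * Models cl_page_free(). Returns 0 on success, or -1 where the kernel
 * would hit the assertion because the vmpage still points at the page.
 */
static int toy_page_free(struct toy_cl_page *pg)
{
        if (pg->vmpage != NULL && pg->vmpage->private_data == pg)
                return -1;
        free(pg);
        return 0;
}

/*
 * Models the error path of cl_page_alloc() after a simulated
 * coo_page_init() failure. With apply_fix set, the linkage is broken
 * before the page is freed. Returns the result of toy_page_free(),
 * or -2 if the allocation itself failed.
 */
static int toy_page_alloc_init_fails(struct toy_vmpage *vm, int apply_fix)
{
        struct toy_cl_page *pg = calloc(1, sizeof(*pg));
        int rc;

        if (pg == NULL)
                return -2;
        pg->vmpage = vm;
        vm->private_data = pg;          /* linkage set up during init */

        /* ... pretend coo_page_init() failed here ... */

        if (apply_fix)
                toy_page_delete0(pg);
        rc = toy_page_free(pg);
        if (rc != 0) {
                /* Clean up so the sketch itself does not leak. */
                vm->private_data = NULL;
                free(pg);
        }
        return rc;
}
```

Calling toy_page_alloc_init_fails() with apply_fix set to 0 reports the assertion case, and with apply_fix set to 1 the page is freed cleanly.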

I think adding cl_page_delete0() before cl_page_free() in cl_page_alloc() should fix this ASSERT. However, when I looked into the code to see why coo_page_init() failed, the following piece of code in cl_page_find0() confused me:

                        if (page->cp_type == CPT_TRANSIENT &&
                            type == CPT_CACHEABLE) {
                                /* XXX: We should make sure that inode sem
                                 * keeps being held in the lifetime of
                                 * transient pages, so it is impossible to
                                 * have conflicting transient pages.
                                 */
                                cfs_spin_unlock(&hdr->coh_page_guard);
                                cl_page_put(env, page);
                                cfs_spin_lock(&hdr->coh_page_guard);
                                page = ERR_PTR(-EBUSY);
                        }

I don't see why we should return an error here; in my opinion, this is a legal race between concurrent DIO and buffered read. Xiong, any comments? Thank you.

Comment by Alex Zhuravlev [ 05/Jul/11 ]

Does the master branch hit this as well?

Comment by Niu Yawei (Inactive) [ 05/Jul/11 ]

Hi, Alex

I'm not sure; from the code, it looks like master should have this problem too.

Comment by Niu Yawei (Inactive) [ 05/Jul/11 ]

Yes, master hits this as well: https://maloo.whamcloud.com/test_sets/9918c09c-a384-11e0-a0cf-52540025f9af

Comment by Jinshan Xiong (Inactive) [ 07/Jul/11 ]

Hi Alex,
If you hit this issue on orion as well, can you please post a Lustre log here?

Maloo didn't catch any log for this case.

Comment by Niu Yawei (Inactive) [ 07/Jul/11 ]

Hi, Xiong

The reason for this ASSERT is explained in my previous comment:

  • cl_page_delete0() is missing in cl_page_alloc();
  • 'transient' pages and cached pages are stored in the same radix tree, so when a DIO has inserted a 'transient' page at a given offset and a concurrent buffered read searches the radix tree for a cached page, it finds the 'transient' page instead and returns an error.

The first is easy to fix; for the second, I think we need the fix in LU-485.
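
The second point can be sketched in userspace C as follows. This is only a toy model under stated assumptions: a fixed array stands in for the per-object radix tree, all toy_* names are hypothetical, and the skip_transient flag models the approach named in the patch title ("Don't store 'transient' page in radix tree"), not the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_page_type { TOY_CACHEABLE, TOY_TRANSIENT };

struct toy_page {
        enum toy_page_type type;
};

#define TOY_TREE_SLOTS 16

/* Stand-in for the per-object radix tree, indexed by page offset. */
struct toy_tree {
        struct toy_page *slot[TOY_TREE_SLOTS];
};

/*
 * Insert a page at idx. When skip_transient is set, transient (DIO)
 * pages are kept out of the tree entirely.
 */
static void toy_insert(struct toy_tree *t, size_t idx,
                       struct toy_page *pg, int skip_transient)
{
        if (idx >= TOY_TREE_SLOTS)
                return;
        if (skip_transient && pg->type == TOY_TRANSIENT)
                return;
        t->slot[idx] = pg;
}

/*
 * Buffered-read lookup. Returns 0 if the slot is empty or holds a
 * cacheable page, and -EBUSY if it holds a transient page, mirroring
 * the branch of cl_page_find0() quoted earlier in this ticket.
 */
static int toy_find_cacheable(const struct toy_tree *t, size_t idx)
{
        const struct toy_page *pg;

        if (idx >= TOY_TREE_SLOTS)
                return -EINVAL;
        pg = t->slot[idx];
        if (pg != NULL && pg->type == TOY_TRANSIENT)
                return -EBUSY;
        return 0;
}
```

With a shared tree, a buffered lookup at an offset where a DIO page was inserted gets -EBUSY; with skip_transient set on insert, the same lookup sees an empty slot and can proceed.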

Comment by Niu Yawei (Inactive) [ 08/Jul/11 ]

The patch is at: http://review.whamcloud.com/1072 (the fix of LU-485 is merged)

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,server,el5,ofa #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,client,el5,ofa #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/obdclass/cl_page.c
  • lustre/lov/lov_page.c

Comment by Build Master (Inactive) [ 13/Jul/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #205
LU-481 Don't store 'transient' page in radix tree

Oleg Drokin : 9e213f7975423b69eae06b1e561516e6b26a2c72
Files :

  • lustre/lov/lov_page.c
  • lustre/obdclass/cl_page.c

Comment by Peter Jones [ 13/Jul/11 ]

Landed for 2.1

Generated at Sat Feb 10 01:07:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.