[LU-130] Kernel crash on lustre 2.0 client (page fault in ll_file_read, NULL pointer dereference) Created: 16/Mar/11  Updated: 10/Sep/12  Resolved: 03/May/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Patrick Valentin (Inactive) Assignee: Johann Lombardi (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:
  # uname -r
    2.6.32-71.14.1.el6.Bull.19.x86_64
  # rpm -qa lustre*
    lustre-modules-2.0.0.1-2.6.32_71.14.1.el6.Bull.18.x86_64_Bull.petaflop.1.052.el6.x86_64
    lustre-2.0.0.1-Bull.petaflop.1.052.el6.x86_64
    lustre_e2fsprogs-1.41.10.sun2-Bull.petaflop.1.052.el6.20110128.x86_64

Severity: 3
Rank (Obsolete): 5057

 Description   

Hello,

As suggested by Johann Lombardi, we are opening a Jira ticket for this issue, which occurs more and more frequently at the CEA site (a Bull customer).

Below is an extract of the dmesg output and a stack trace:

=== from dmesg output:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
IP: [<ffffffffa05f7cd7>] cl_vmpage_page+0x57/0x1e0 [obdclass]
PGD 3f7fc5067 PUD 3b3d2d067 PMD 0
Oops: 0000 [#1] SMP

=== backtrace:

PID: 27785 TASK: ffff8803b3e95240 CPU: 5 COMMAND: "fortcom"
#0 [ffff8803f7e63510] machine_kexec at ffffffff8102e77b
#1 [ffff8803f7e63570] crash_kexec at ffffffff810a6cb8
#2 [ffff8803f7e63640] oops_end at ffffffff8146a770
#3 [ffff8803f7e63670] no_context at ffffffff810378db
#4 [ffff8803f7e636c0] __bad_area_nosemaphore at ffffffff81037b65
#5 [ffff8803f7e63710] bad_area at ffffffff81037c8e
#6 [ffff8803f7e63740] do_page_fault at ffffffff8146c2e8
#7 [ffff8803f7e63790] page_fault at ffffffff81469ae5
[exception RIP: cl_vmpage_page+87]
RIP: ffffffffa05f7cd7 RSP: ffff8803f7e63848 RFLAGS: 00010202
RAX: 0000000000001218 RBX: 0000000000000002 RCX: ffffea001167e190
RDX: 0000000000001218 RSI: ffff8803b339aa48 RDI: ffff8803b339aa08
RBP: ffff8803f7e63898 R8: 0000000000000001 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8803b339aa48
R13: ffff8803b339aa08 R14: ffff8803b3e32c98 R15: ffffea001167e190
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff8803f7e638a0] cl_page_find0 at ffffffffa05fa4ad
#9 [ffff8803f7e63980] cl_page_find at ffffffffa05fabc1
#10 [ffff8803f7e63990] ll_cl_init at ffffffffa096e4d0
#11 [ffff8803f7e63a60] ll_readpage at ffffffffa096e88a
#12 [ffff8803f7e63ad0] generic_file_aio_read at ffffffff810fb470
#13 [ffff8803f7e63bb0] vvp_io_read_start at ffffffffa0999efb
#14 [ffff8803f7e63c60] cl_io_start at ffffffffa0601be8
#15 [ffff8803f7e63cc0] cl_io_loop at ffffffffa0605710
#16 [ffff8803f7e63d30] ll_file_io_generic at ffffffffa0942f72
#17 [ffff8803f7e63dd0] ll_file_aio_read at ffffffffa094322c
#18 [ffff8803f7e63e60] ll_file_read at ffffffffa0949811
#19 [ffff8803f7e63ef0] vfs_read at ffffffff81158a45
#20 [ffff8803f7e63f30] sys_read at ffffffff81158b81
#21 [ffff8803f7e63f80] system_call_fastpath at ffffffff8100c172
RIP: 0000003c15ad41cd RSP: 00007fff46bcd1d8 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffffff8100c172 RCX: 00007fff46bcd2a0
RDX: 0000000000028d8c RSI: 00002b747695d000 RDI: 0000000000000003
RBP: 0000000000428d8c R8: 0000000000428d8c R9: 0000000004f36aa0
R10: 00002b74768defa0 R11: 0000000000000293 R12: 0000000000028d8c
R13: 0000000000400000 R14: 0000000004f369c0 R15: 0000000000400000
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b



 Comments   
Comment by Johann Lombardi (Inactive) [ 16/Mar/11 ]

> IP: [<ffffffffa05f7cd7>] cl_vmpage_page+0x57/0x1e0 [obdclass]

Could you please try to map this address and also to dump the struct page associated with vmpage?
Another customer reports that ->private can sometimes be clobbered on RHEL6, and I wonder if this is related.

Comment by Peter Jones [ 16/Mar/11 ]

Johann is looking into this one

Comment by Johann Lombardi (Inactive) [ 16/Mar/11 ]

From my discussion with Bruno, page->private is equal to 2, as in LU-93.
Bruno is going to provide more information soon.

Potential culprits could be:

  • Huge Pages
  • Page Migration

Those two features seem to use page->private, although it is not clear to me how we can end up in this situation.

In bug LU-93, the customer reported that the bug disappeared after a BIOS upgrade.
Bull, could you please tell us what BIOS version you use on the nodes?

Comment by Bruno Faccini (Inactive) [ 16/Mar/11 ]

I will try to check more occurrences of this crash (there are several per day at T100!) in order to confirm that the page->private value concerned is always 2.

Also, about Huge-Pages, they don't seem to be used as per our "live" check on Client-nodes, but I will confirm it asap from the crash-dumps too.

You speak about Page Migration as a possible "man in the middle" and I will also investigate this possibility.

Comment by Johann Lombardi (Inactive) [ 16/Mar/11 ]

free_hot_cold_page() seems to store the migrate type in page->private. MIGRATE_MOVABLE (= 2) could be a good candidate.
That said, prep_new_page() should initialize ->private to 0 when a new page is allocated, so it is still not clear how we could end up with a non-initialized page even if Page Migration is the "man in the middle".

Comment by Lai Siyao [ 16/Mar/11 ]

Johann, could you explain how you found that page->private equals 2?
I can't get it from the logs.

Comment by Bruno Faccini (Inactive) [ 16/Mar/11 ]

The faulting address is 0xa, which comes from computing RBX (page->private, 0x2) + 0x8.

Comment by Johann Lombardi (Inactive) [ 16/Mar/11 ]

Lai, to be clear, I was on the phone earlier today with Bruno, who analyzed the crash dump and told me that page->private = 2. I am no magician.

Comment by Michael Hebensteit (Inactive) [ 16/Mar/11 ]

We had a similar issue resolved by updating the BIOS

Motherboard:
Base Board Information
        Manufacturer: Supermicro
        Product Name: X8DTN+-B-IN001-O
        Version: 2.0

BIOS Information
        Vendor: American Megatrends Inc.
        Version: 4.6.3
        Release Date: 04/24/2010
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 1024 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
        BIOS Revision: 4.6


CPU: Intel(R) Xeon(R) CPU           E5462  @ 2.80GHz

Comment by Michael Hebensteit (Inactive) [ 16/Mar/11 ]

When trying to debug the LU-93 issue I created a special kernel that went over all the mm code, inserting statements of the form "if (page->private == 2) { printdebug(); }". That gave 2 results:

a) page->private was most likely set in free_hot_cold_page(). I could not find any other location.
b) the page was present in the page cache for some time before it was accessed by Lustre in a ll_readpage() call; a number of times unlock_page() was executed on the page with page->private already set to 2. The very first occurrence of this setting appeared in the __set_page_dirty_nobuffers() function.

Together with the fact that the issue resolved after a BIOS upgrade and was not present on similar systems with a completely different BIOS my gut feeling says it's an issue with a CPU mechanism like TLB or GDT that was fixed via the BIOS update.

Comment by Bruno Faccini (Inactive) [ 17/Mar/11 ]

Just an update to answer Johann's question left in my vmail today, NO the "PG_Private" flag is not set for the concerned "struct page" in our crash-dumps @Tera-100 !!...

Comment by Patrick Valentin (Inactive) [ 17/Mar/11 ]

> IP: [<ffffffffa05f7cd7>] cl_vmpage_page+0x57/0x1e0 [obdclass]
> Could you please try to map this address and also to dump the struct page associated with vmpage?

The cluster is in maintenance (some blades are no longer booting). As soon as it is restarted, I will run
the crash command on the dump to provide the "page struct".

Comment by Sebastien Buisson (Inactive) [ 23/Mar/11 ]

Hi,

Michael, I have a question for you. At Bull we are trying to identify which microcode can solve this issue. For our MESCA machines we are testing a new BIOS that integrates the microcode update codenamed M04206E6_00000008.
Do you think you could retrieve the microcode revision brought by the BIOS update of your SuperMicro machines? That would be very helpful for us!

And now a question for everybody:
People from the Kernel Team here at Bull pointed us to a kernel bug that is fixed in 2.6.32 vanilla and RHEL6.1 beta, dealing with TLB entries:
https://patchwork.kernel.org/patch/564801/
Do you think it could be related to the present bug, and fix this issue?

TIA,
Sebastien.

Comment by Johann Lombardi (Inactive) [ 25/Mar/11 ]

> People from the Kernel Team here at Bull pointed us a kernel bug that is fixed in 2.6.32
> vanilla and RHEL6.1 beta, dealing with TLB entries:
> https://patchwork.kernel.org/patch/564801/
> Do you think it could be related to the present bug, and fix this issue?

Yes, it might be. Do you know if RedHat plans to ship an RHEL6 errata kernel with this patch included?

Comment by Patrick Valentin (Inactive) [ 25/Mar/11 ]

I got a copy of the dump from the "Kay" cluster, and the dump of the page structure is available below.

  • PG_private flag is not set
  • private field is set to 2

struct page pointed to by R15:

crash> rd -64 ffffea0015673008 7
ffffea0015673008:  1800000000000061 ffffffff00000002   a...............
ffffea0015673018:  0000000000000002 ffff8803baf2f990   ................
ffffea0015673028:  000000000000000e ffffea0015822668   ........h&......
ffffea0015673038:  ffffea0015591580                    ..Y.....


crash> struct page ffffea0015673008
struct page {
  flags = 1729382256910270561,
  _count = {
    counter = 2
  },
  {
    _mapcount = {
      counter = -1
    },
    {
      inuse = 65535,
      objects = 65535
    }
  },
  {
    {
      private = 2,
      mapping = 0xffff8803baf2f990
    },
    ptl = {
      raw_lock = {
        slock = 2
      }
    },
    slab = 0x2,
    first_page = 0x2
  },
  {
    index = 14,
    freelist = 0xe
  },
  lru = {
    next = 0xffffea0015822668,
    prev = 0xffffea0015591580
  }
}
Comment by Johann Lombardi (Inactive) [ 31/Mar/11 ]

hm, the address_space_operations has 3 new operations in RHEL6:

/* migrate the contents of a page to the specified target */
int (*migratepage) (struct address_space *, struct page *, struct page *);
int (*launder_page) (struct page *);
int (*error_remove_page)(struct address_space *, struct page *);

It seems that NFS implements all of them ...

Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » server,el6-i686 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Johann Lombardi (Inactive) [ 04/Apr/11 ]

I might have found a code path which could explain this problem.
A patch is available here: http://review.whamcloud.com/399

Bull, could you please give this patch a try?

Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » client,el5-x86_64 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » client,el6-x86_64 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » server,el5-x86_64 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » client,el6-i686 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » client,ubuntu-x86_64 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » server,el5-i686 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 04/Apr/11 ]

Integrated in lustre-reviews » client,el5-i686 #61
LU-130 disable page migration

Johann Lombardi : ae03fcb9e831319e0607040a6ced85d604a9b28a
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » server,el6-i686 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » client,el5-x86_64 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » client,el6-x86_64 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » client,ubuntu-x86_64 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » client,el6-i686 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » server,el5-i686 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » server,el5-x86_64 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » server,el6-x86_64 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 07/Apr/11 ]

Integrated in lustre-reviews » client,el5-i686 #113
LU-130 disable page migration

Johann Lombardi : b03dca93b492dcb14ae24cbb7707fcbf730e27fc
Files :

  • lustre/llite/rw26.c
Comment by Patrick Valentin (Inactive) [ 12/Apr/11 ]

An EFIX containing the patch disabling "page migration" (http://review.whamcloud.com/399) was produced yesterday and should be installed today at CEA site.

Comment by Peter Jones [ 21/Apr/11 ]

Update from CEA: patch rolled out on April 19th and no recurrences since.

Comment by Johann Lombardi (Inactive) [ 03/May/11 ]

Bull, could you please test it with transparent huge pages enabled? IIRC, you used to reproduce it quickly when this feature was turned on. Thanks in advance.

Comment by Sebastien Buisson (Inactive) [ 03/May/11 ]

Johann,

As I explained on the phone, we are not able to run with transparent huge pages activated, because it generates a lot of crashes at various levels, from NFS to MPI.

Sebastien.

Comment by Peter Jones [ 03/May/11 ]

Landed for 2.1. Please reopen if any further instances of this issue occur with this patch in place

Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,server,el5,ofa #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 03/May/11 ]

Integrated in lustre-master » i686,client,el5,ofa #102
LU-130 disable page migration

Oleg Drokin : 31eea565c0f94f23455d6f9d2bb926a4a53f5b6c
Files :

  • lustre/llite/rw26.c
Comment by Sebastien Buisson (Inactive) [ 17/May/11 ]

The customer has been testing a backport of this patch to 2.0.0.1 for several weeks and now considers the problem fixed.

Comment by Johann Lombardi (Inactive) [ 17/May/11 ]

Thanks for the feedback Sébastien.

XXX please note that a proper page migration handler would need to be implemented if someone wants to enable transparent huge pages one day.

Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #94
LU-130 disable page migration

Johann Lombardi : cfaa163b5f52d3a89aecaf257581c35b67771887
Files :

  • lustre/llite/rw26.c
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Build Master (Inactive) [ 28/Jun/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #96
LU-130 Add changelog entry

Johann Lombardi : a73a0cf7d6ada798b08815aa1824dcdb51f1c5a9
Files :

  • lustre/ChangeLog
Comment by Vladimir V. Saveliev [ 10/Sep/12 ]

Johann Lombardi said on 16/Mar/11 4:14 AM:

> Potential culprits could be:
>
>   • Huge Pages
>   • Page Migration
>
> Those two features seem to use page->private, although it is not clear to me how we can end up in this situation.

Johann Lombardi said on 16/Mar/11 5:41 AM:

> That said, prep_new_page() should initialize ->private to 0 when a new page is allocated, so it is still not clear how we could
> end up with a non-initialized page even if Page Migration is the "man in the middle".

Couldn't it be something like the below:

migrate_pages() allocates new pages with a function passed as a
parameter:

int migrate_pages(struct list_head *from,
                  new_page_t get_new_page, unsigned long private,
                  bool offlining, bool sync)

In most cases the allocating function passed to migrate_pages() ends up
in __alloc_pages() and eventually in get_page_from_freelist()
-> ... -> prep_new_page(), which sets page->private to 0.

There is one exception, however. In the case of compact_zone()
(introduced in RHEL6 kernels), the allocating function is
compaction_alloc().

This function seems to avoid the traditional page allocation path: it
takes free pages from isolated free lists, and page->private does not
get set to 0.

Then migrate_page_move_mapping() puts that new page into the mapping's
page tree, where Lustre's ll_read_ahead_page() finds a page without
PG_private set but with page->private != 0, and oopses.

Does that make sense?
