[LU-89] racer: Oom (both i686 and x86_64) Created: 18/Feb/11  Updated: 28/Jun/11  Resolved: 14/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 1.8.6

Type: Bug Priority: Major
Reporter: Zhenyu Xu Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 24,421
Rank (Obsolete): 10625

 Description   

INTERCONNECT=tcp
SERVERIMAGE=/lol/var/cache/cfs/PACKAGE/rpm/lustre/b1_8/18a-012311/sles11sp1/x86_64
CLIENTIMAGE=/lol/var/cache/cfs/PACKAGE/rpm/lustre/b1_8/18a-012311/sles11sp1/i686

racer

sfire7 (client) syslog:
Jan 23 15:43:05 sfire7 kernel: [16012.048214] sort invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Jan 23 15:43:05 sfire7 kernel: [16012.055140] sort cpuset=/ mems_allowed=0
Jan 23 15:43:05 sfire7 kernel: [16012.059162] Pid: 8257, comm: sort Tainted: G N 2.6.32.19-0.2_lustre.20110123155700-default #1
Jan 23 15:43:05 sfire7 kernel: [16012.068726] Call Trace:
Jan 23 15:43:05 sfire7 kernel: [16012.071253] [<c02069e1>] try_stack_unwind+0x1b1/0x1f0
Jan 23 15:43:05 sfire7 kernel: [16012.076528] [<c02059bf>] dump_trace+0x3f/0xe0
Jan 23 15:43:05 sfire7 kernel: [16012.081055] [<c02065eb>] show_trace_log_lvl+0x4b/0x60
Jan 23 15:43:05 sfire7 kernel: [16012.086270] [<c0206618>] show_trace+0x18/0x20
Jan 23 15:43:05 sfire7 kernel: [16012.090778] [<c05216d9>] dump_stack+0x6d/0x74
Jan 23 15:43:05 sfire7 kernel: [16012.095338] [<c0299f7f>] oom_kill_process+0x9f/0x2b0
Jan 23 15:43:05 sfire7 kernel: [16012.100532] [<c029a63e>] __out_of_memory+0x4e/0xb0
Jan 23 15:43:05 sfire7 kernel: [16012.105550] [<c029a6f4>] out_of_memory+0x54/0xb0
Jan 23 15:43:05 sfire7 kernel: [16012.110433] [<c029d567>] __alloc_pages_slowpath+0x3a7/0x470
Jan 23 15:43:05 sfire7 kernel: [16012.116272] [<c029d748>] __alloc_pages_nodemask+0x118/0x120
Jan 23 15:43:05 sfire7 kernel: [16012.122211] [<c029fa2d>] __do_page_cache_readahead+0xdd/0x1f0
Jan 23 15:43:05 sfire7 kernel: [16012.128206] [<c029fb67>] ra_submit+0x27/0x40
Jan 23 15:43:05 sfire7 kernel: [16012.132629] [<c0298242>] filemap_fault+0x382/0x390
Jan 23 15:43:05 sfire7 kernel: [16012.137586] [<c02af7e6>] __do_fault+0x46/0x470
Jan 23 15:43:05 sfire7 kernel: [16012.142181] [<c02b09a4>] handle_mm_fault+0x154/0x3d0
Jan 23 15:43:05 sfire7 kernel: [16012.147295] [<c0526164>] do_page_fault+0x174/0x370
Jan 23 15:43:05 sfire7 kernel: [16012.152241] [<c0524126>] error_code+0x66/0x70
Jan 23 15:43:05 sfire7 kernel: [16012.156754] [<b7791ec5>] 0xb7791ec5
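
When the same OOM shows up across several test runs, it can help to pull the header fields out of the syslog automatically. The following is a minimal sketch, not part of the Lustre test suite: the function name `parse_oom_line` and the regular expression are assumptions written against the single header line shown above, and process names containing spaces are not handled.

```python
import re

# Matches the first line the kernel logs when the OOM killer fires,
# in the format seen in the syslog excerpt above.
OOM_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp_mask>0x[0-9a-fA-F]+), order=(?P<order>\d+)"
)


def parse_oom_line(line):
    """Return the OOM header fields as a dict, or None if the line does not match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),                   # process that triggered the OOM killer
        "uptime": float(m.group("uptime")),        # kernel timestamp in seconds
        "gfp_mask": int(m.group("gfp_mask"), 16),  # allocation flags as an integer
        "order": int(m.group("order")),            # allocation size is 2**order pages
    }


line = (
    "Jan 23 15:43:05 sfire7 kernel: [16012.048214] sort invoked oom-killer: "
    "gfp_mask=0x201da, order=0, oom_adj=0"
)
info = parse_oom_line(line)
print(info["comm"], hex(info["gfp_mask"]), info["order"])
```

Here `order=0` means the failed request was for a single page, which matches the readahead path in the call trace.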

http://lts-head.central.sun.com:8080/display_report.pl?report_id=515817



 Comments   
Comment by Lai Siyao [ 21/Feb/11 ]

According to previous test results, this OOM can be triggered in different places, so 'racer' may not be the cause. I've also run the 'racer' test for several rounds, and it looks okay.

It seems the acceptance test does not always trigger this OOM, so I will run the full acceptance suite and try to narrow down which test triggers it. This will take some time.

Comment by Johann Lombardi (Inactive) [ 22/Feb/11 ]

How much memory do you have on the client side? AFAIK, this problem only shows up in low-memory conditions (1GB of RAM). The problem is that it used to pass, so I would like to understand why it is failing now.

Comment by Lai Siyao [ 22/Feb/11 ]

Hmm, I tested with 1GB of memory, with the client and servers on the same host. I am also wondering how and why it fails. I've asked Bobijam and Niu to let me know if they hit this OOM in their tests.

Comment by Johann Lombardi (Inactive) [ 22/Feb/11 ]

Elena thinks that this is a problem in the racer test. Please check the Bugzilla ticket.

Comment by Lai Siyao [ 24/Feb/11 ]

This is believed to be a test script problem: 'racer' is called recursively and eventually uses up all memory.
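
One generic way to stop a script from re-invoking itself without bound is a nesting counter passed through the environment. The sketch below only illustrates that idea; it is not the fix applied in Bugzilla, and the variable name `RACER_NESTING` and the function `child_env_or_fail` are hypothetical.

```python
GUARD_VAR = "RACER_NESTING"  # hypothetical variable name, not used by the real test scripts


def child_env_or_fail(env, max_depth=1):
    """Return a copy of env for a child run, or raise if nesting is already too deep."""
    depth = int(env.get(GUARD_VAR, "0"))
    if depth >= max_depth:
        raise RuntimeError("refusing nested run at depth %d" % depth)
    child = dict(env)
    child[GUARD_VAR] = str(depth + 1)  # the child sees one more level of nesting
    return child


# A top-level run starts with no counter set and is allowed to spawn one child.
top = child_env_or_fail({})
print(top[GUARD_VAR])
```

In practice the caller would pass `os.environ` and hand the returned mapping to the child process, so a second nested launch fails fast instead of consuming memory.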

Comment by Lai Siyao [ 09/Mar/11 ]

Peter, can we close this?

Comment by Peter Jones [ 14/Mar/11 ]

As per bz, this was fixed by correcting the test in 24451.
