Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.2.0, Lustre 2.1.1
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
4744

Description

Application is OOM-killed during page-fault resolution on its binary over Lustre when there is plenty of memory available

Hi,

At CEA customer site, they experience a lot of OOM-killed application on compute nodes when the application binary in over a Lustre FS.
Below is the detailled description provided by on site support:

#BAS release (file /etc/bull-infos):
#context: 
#consequences: 
#details:

This may be a new Lustre binary mmapp'ing bug or some regression introduced by patches for the earlier problems/BullForges
(314424,314191,312352,313872,...) for the same family of issues ??...

Starting about 2 weeks ago, T100 users started to report stranges+numerous aborts of jobs without any real symptoms.

Having a look to the concerned nodes logs, we found there was OOM-related Console traces/sequences like the following :
=============================================================================
active_anon:36553 inactive_anon:59163 isolated_anon:0
 active_file:6660103 inactive_file:469717 isolated_file:0
 unevictable:0 dirty:7 writeback:0 unstable:0
 free:7118715 slab_reclaimable:69877 slab_unreclaimable:1674904
 mapped:2646 shmem:89519 pagetables:1224 bounce:0
Node 0 DMA free:15656kB min:56kB low:68kB high:84kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15076kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1956 16096 16096
Node 0 DMA32 free:826236kB min:7940kB low:9924kB high:11908kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2003948kB mlocked:0kB dirty:0kB writeback:0kB
mapped:0kB shmem:0kB slab_reclaimable:23392kB slab_unreclaimable:706044kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 14140 14140
Node 0 Normal free:226340kB min:57388kB low:71732kB high:86080kB active_anon:27772kB inactive_anon:61352kB
active_file:10406816kB inactive_file:529712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14479360kB
mlocked:0kB dirty:4kB writeback:0kB mapped:2836kB shmem:85912kB slab_reclaimable:83324kB slab_unreclaimable:2686040kB
kernel_stack:3712kB pagetables:868kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:81972kB min:65584kB low:81980kB high:98376kB active_anon:19088kB inactive_anon:26444kB
active_file:12816744kB inactive_file:677836kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16547840kB
mlocked:0kB dirty:0kB writeback:0kB mapped:2136kB shmem:41824kB slab_reclaimable:92644kB slab_unreclaimable:2641664kB
kernel_stack:384kB pagetables:812kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 2 Normal free:11825828kB min:65584kB low:81980kB high:98376kB active_anon:58216kB inactive_anon:93980kB
active_file:3229572kB inactive_file:561204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16547840kB
mlocked:0kB dirty:20kB writeback:0kB mapped:2076kB shmem:141812kB slab_reclaimable:29836kB slab_unreclaimable:514240kB
kernel_stack:496kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 3 Normal free:15498828kB min:65580kB low:81972kB high:98368kB active_anon:41136kB inactive_anon:54876kB
active_file:187280kB inactive_file:110116kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16546368kB
mlocked:0kB dirty:4kB writeback:0kB mapped:3536kB shmem:88528kB slab_reclaimable:50312kB slab_unreclaimable:151628kB
kernel_stack:800kB pagetables:1144kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15656kB
Node 0 DMA32: 557*4kB 269*8kB 126*16kB 58*32kB 29*64kB 36*128kB 6*256kB 4*512kB 7*1024kB 5*2048kB 193*4096kB = 826236kB
Node 0 Normal: 4917*4kB 3513*8kB 1164*16kB 766*32kB 1064*64kB 432*128kB 32*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB
= 226588kB
Node 1 Normal: 369*4kB 278*8kB 174*16kB 101*32kB 649*64kB 192*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB
= 81972kB
Node 2 Normal: 2407*4kB 7680*8kB 4049*16kB 15661*32kB 11762*64kB 4419*128kB 1433*256kB 1201*512kB 990*1024kB 716*2048kB
1565*4096kB = 11827532kB
Node 3 Normal: 1169*4kB 2643*8kB 2279*16kB 1017*32kB 807*64kB 1362*128kB 624*256kB 406*512kB 304*1024kB 332*2048kB
3374*4096kB = 15499564kB
7219337 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 20480460kB
Total swap = 20480460kB
16777215 pages RAM
305108 pages reserved
6979572 pages shared
2276089 pages non-shared
Out of memory: kill process 36456 (APPN12) score 2008678400 or a child
Killed process 36456 (APPN12) vsz:245200kB, anon-rss:200kB, file-rss:456kB

=============================================================================

where it appears (see per-node/socket "Normal free" stats) that there was no memory contention at the time
of the OOM/contention !!!

BTW, the fact that the user-application was directly choosen to be killed may come from the wheight applicated by CEA
tools to avoid/limit a best as possible system daemons to be choosen/killed (as it is still a missing work even
if reported/requested since "ages" with my BF #311991 !!...).

At this time we decided to set "panic_on_oom" and try to reproduce with an application known to trigger problem
frequently and finally got a crash-dump !!

But unfortunatelly its (still on-going !) analysis does not really help to understand when/where/why the Kernel VM-code
decided to return "VM_FAULT_OOM", but on the other-hand it definitelly helped to determine that the problem
occurs upon page-fault resolution on application's binary when beeing installed over a Lustre filesystem, and with the
following stack-trace :
========================================================================
PID: 48880  TASK: ffff880ad4086080  CPU: 9   COMMAND: "brut"
 #0 [ffff880ad4089cf0] machine_kexec at ffffffff8102e77b
 #1 [ffff880ad4089d50] crash_kexec at ffffffff810a6cd8
 #2 [ffff880ad4089e20] panic at ffffffff81466b1b
 #3 [ffff880ad4089ea0] pagefault_out_of_memory at ffffffff810fda9a
 #4 [ffff880ad4089ec0] mm_fault_error at ffffffff81037d0e
 #5 [ffff880ad4089f00] do_page_fault at ffffffff8146c5da
 #6 [ffff880ad4089f50] page_fault at ffffffff81469e45
    RIP: 00000000006f13e1  RSP: 00007fcda061c5a0  RFLAGS: 00010202
    RAX: 0000000000000008  RBX: 00007fcd8e513c10  RCX: 0000000000000016
    RDX: 0000000000488280  RSI: 00007fcd8978e150  RDI: 00007fcd8e516190
    RBP: 00007fcda061c6f0   R8: 00007fcd8e515a10   R9: 00007fcd8e515290
    R10: 00007fcd8e514b10  R11: 00007fcd8e514390  R12: 00007fffde8959c8
    R13: 0000000000034bc0  R14: 00007fffde895970  R15: 00000000000000ef
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
========================================================================

At this time, the customer decided to switch-back to its "binary-braodcast" work-around already used for earlier
issues with "over-Lustre binary/lib exec()/mmap()" family of problems, and it fixed.

Since they also started suffer the same problem at TGCC, we also forced a crash-dump via "panic_on_oom" which
is available on-line.

Since there are multiple cases in Kernel's VM-code where VM_FAULT_OOM can be returned, we may plan to enable/instrument
some traces in the Kernel ...

Additional information provided by on line support:

Expéditeur: Bruno Faccini

 In my submission text, I reported there are multiple cases where VM_FAULT_OOM can be reported, and I started to try
to list/write them down. During this process I found this code-path which may be of interest :
====================================
Possible cases of VM_FAULT_OOM return-code :

1) in filemap_fault(), page_cache_read() may have returned -ENOMEM within the current stack call-sequence (found for
others application's tasks also in page-fault handling at the time of the OOM/crash ...) :

 #0 [ffff880ad408d898] schedule at ffffffff81467211
 #1 [ffff880ad408d960] io_schedule at ffffffff814679a3
 #2 [ffff880ad408d980] sync_page at ffffffff810fa35d
 #3 [ffff880ad408d990] __wait_on_bit_lock at ffffffff814680ba
 #4 [ffff880ad408d9e0] __lock_page at ffffffff810fa2f7
 #5 [ffff880ad408da40] find_lock_page at ffffffff810fbc23
 #6 [ffff880ad408da70] filemap_fault at ffffffff810fbe69
 #7 [ffff880ad408dae0] vvp_io_fault_start at ffffffffa07e4b1f
 #8 [ffff880ad408db80] cl_io_start at ffffffffa04649c8
 #9 [ffff880ad408dbe0] cl_io_loop at ffffffffa0468ad0
#10 [ffff880ad408dc50] ll_fault0 at ffffffffa07c8a76
#11 [ffff880ad408dce0] ll_fault at ffffffffa07c8e68
#12 [ffff880ad408dd50] __do_fault at ffffffff811237b4
#13 [ffff880ad408dde0] handle_pte_fault at ffffffff81123cfd
#14 [ffff880ad408deb0] handle_mm_fault at ffffffff811248dd
#15 [ffff880ad408df00] do_page_fault at ffffffff8146c3d3
#16 [ffff880ad408df50] page_fault at ffffffff81469e45


        /*
         * An error return from page_cache_read can result if the
         * system is low on memory, or a problem occurs while trying
         * to schedule I/O.
         */


then/thus page_cache_read() may relay -ENOMEM from/when :

       _ if page_cache_alloc_cold() returned NULL
       _ from add_to_page_cache_lru()
       _ from mapping-&gt;a_ops-&gt;readpage()

.....

====================================

To be investigated further ...

Recently, they tried to reproduce and catched traces (syslog and lctl df). Below is the dscription their tests and how the OOM occurs:

 It tooks some time to reproduce the problem. In fact it is not systematic. At our first try (just running a binary with
mpirun), we were not able to break the system. We had to get the job submission script of the user to be able to reproduce
the issue. The script doesn't do much things, but run 2 ldd on the binary, then run the binary with mpirun. Even with
the script, time to time, the jobs works without troubles.

Whatever, I have attached an extract of the syslog (only the latest oom is relevant), and the content of a lctl dk on
the node which produced the oom (echo -1 > .../debug)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

log2
22/Sep/11 10:58 AM
57 kB
Hongchao Zhang

Issue Links

Trackbacks

Changelog 2.1 Changes from version 2.1.0 to version 2.1.1 Server support for kernels: 2.6.18274.12.1.el5 (RHEL5) 2.6.32220.el6 (RHEL6) Client support for unpatched kernels: 2.6.18274.12.1.el5 (RHEL5) 2.6.32220.el6 (RHEL6) 2.6.32.360....

Changelog 2.2 version 2.2.0 Support for networks: o2iblnd OFED 1.5.4 Server support for kernels: 2.6.32220.4.2.el6 (RHEL6) Client support for unpatched kernels: 2.6.18274.18.1.el5 (RHEL5) 2.6.32220.4.2.el6 (RHEL6) 2.6.32.360....

Application is OOM-killed during page-fault resolution on its binary over Lustre when there is plenty of memory available

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates