Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
Description
Application is OOM-killed during page-fault resolution on its binary over Lustre when there is plenty of memory available
Hi,
At the CEA customer site, they experience a lot of OOM-killed applications on compute nodes when the application binary is on a Lustre FS.
Below is the detailed description provided by on-site support:
#BAS release (file /etc/bull-infos):
#context:
#consequences:
#details:
This may be a new Lustre binary-mmap'ing bug, or a regression introduced by the patches for the earlier problems/BullForges (314424, 314191, 312352, 313872, ...) in the same family of issues. Starting about two weeks ago, T100 users started to report strange and numerous job aborts without any real symptoms. Looking at the logs of the nodes concerned, we found OOM-related console traces/sequences like the following:
=============================================================================
active_anon:36553 inactive_anon:59163 isolated_anon:0 active_file:6660103 inactive_file:469717 isolated_file:0 unevictable:0 dirty:7 writeback:0 unstable:0 free:7118715 slab_reclaimable:69877 slab_unreclaimable:1674904 mapped:2646 shmem:89519 pagetables:1224 bounce:0
Node 0 DMA free:15656kB min:56kB low:68kB high:84kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15076kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1956 16096 16096
Node 0 DMA32 free:826236kB min:7940kB low:9924kB high:11908kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2003948kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:23392kB slab_unreclaimable:706044kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 14140 14140
Node 0 Normal free:226340kB min:57388kB low:71732kB high:86080kB active_anon:27772kB inactive_anon:61352kB active_file:10406816kB inactive_file:529712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14479360kB mlocked:0kB dirty:4kB writeback:0kB mapped:2836kB shmem:85912kB slab_reclaimable:83324kB slab_unreclaimable:2686040kB kernel_stack:3712kB pagetables:868kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:81972kB min:65584kB low:81980kB high:98376kB active_anon:19088kB inactive_anon:26444kB active_file:12816744kB inactive_file:677836kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:0kB dirty:0kB writeback:0kB mapped:2136kB shmem:41824kB slab_reclaimable:92644kB slab_unreclaimable:2641664kB kernel_stack:384kB pagetables:812kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 2 Normal free:11825828kB min:65584kB low:81980kB high:98376kB active_anon:58216kB inactive_anon:93980kB active_file:3229572kB inactive_file:561204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:0kB dirty:20kB writeback:0kB mapped:2076kB shmem:141812kB slab_reclaimable:29836kB slab_unreclaimable:514240kB kernel_stack:496kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 3 Normal free:15498828kB min:65580kB low:81972kB high:98368kB active_anon:41136kB inactive_anon:54876kB active_file:187280kB inactive_file:110116kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16546368kB mlocked:0kB dirty:4kB writeback:0kB mapped:3536kB shmem:88528kB slab_reclaimable:50312kB slab_unreclaimable:151628kB kernel_stack:800kB pagetables:1144kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15656kB
Node 0 DMA32: 557*4kB 269*8kB 126*16kB 58*32kB 29*64kB 36*128kB 6*256kB 4*512kB 7*1024kB 5*2048kB 193*4096kB = 826236kB
Node 0 Normal: 4917*4kB 3513*8kB 1164*16kB 766*32kB 1064*64kB 432*128kB 32*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 226588kB
Node 1 Normal: 369*4kB 278*8kB 174*16kB 101*32kB 649*64kB 192*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 81972kB
Node 2 Normal: 2407*4kB 7680*8kB 4049*16kB 15661*32kB 11762*64kB 4419*128kB 1433*256kB 1201*512kB 990*1024kB 716*2048kB 1565*4096kB = 11827532kB
Node 3 Normal: 1169*4kB 2643*8kB 2279*16kB 1017*32kB 807*64kB 1362*128kB 624*256kB 406*512kB 304*1024kB 332*2048kB 3374*4096kB = 15499564kB
7219337 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 20480460kB
Total swap = 20480460kB
16777215 pages RAM
305108 pages reserved
6979572 pages shared
2276089 pages non-shared
Out of memory: kill process 36456 (APPN12) score 2008678400 or a child
Killed process 36456 (APPN12) vsz:245200kB, anon-rss:200kB, file-rss:456kB
=============================================================================
where it appears (see the per-node/socket "Normal free" stats) that there was no memory contention at the time of the OOM! By the way, the fact that the user application was chosen directly to be killed may come from the weighting applied by the CEA tools to avoid/limit, as far as possible, system daemons being chosen/killed (this is still missing work, even though it has been reported/requested for "ages" in my BF #311991). At this point we decided to set "panic_on_oom", tried to reproduce with an application known to trigger the problem frequently, and finally got a crash dump! Unfortunately its (still ongoing!) analysis does not really help to understand when/where/why the kernel VM code decided to return "VM_FAULT_OOM", but on the other hand it definitely helped to determine that the problem occurs upon page-fault resolution on the application's binary when it is installed on a Lustre filesystem, with the following stack trace:
========================================================================
PID: 48880  TASK: ffff880ad4086080  CPU: 9  COMMAND: "brut"
 #0 [ffff880ad4089cf0] machine_kexec at ffffffff8102e77b
 #1 [ffff880ad4089d50] crash_kexec at ffffffff810a6cd8
 #2 [ffff880ad4089e20] panic at ffffffff81466b1b
 #3 [ffff880ad4089ea0] pagefault_out_of_memory at ffffffff810fda9a
 #4 [ffff880ad4089ec0] mm_fault_error at ffffffff81037d0e
 #5 [ffff880ad4089f00] do_page_fault at ffffffff8146c5da
 #6 [ffff880ad4089f50] page_fault at ffffffff81469e45
    RIP: 00000000006f13e1  RSP: 00007fcda061c5a0  RFLAGS: 00010202
    RAX: 0000000000000008  RBX: 00007fcd8e513c10  RCX: 0000000000000016
    RDX: 0000000000488280  RSI: 00007fcd8978e150  RDI: 00007fcd8e516190
    RBP: 00007fcda061c6f0   R8: 00007fcd8e515a10   R9: 00007fcd8e515290
    R10: 00007fcd8e514b10  R11: 00007fcd8e514390  R12: 00007fffde8959c8
    R13: 0000000000034bc0  R14: 00007fffde895970  R15: 00000000000000ef
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
========================================================================
At that point, the customer decided to switch back to the "binary-broadcast" work-around already used for earlier issues in the "over-Lustre binary/lib exec()/mmap()" family of problems, and that fixed it. Since they also started suffering from the same problem at TGCC, we also forced a crash dump via "panic_on_oom", which is available on-line. Since there are multiple places in the kernel's VM code where VM_FAULT_OOM can be returned, we may plan to enable/instrument some traces in the kernel ...
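To make the crash stack above easier to read, here is roughly how a VM_FAULT_OOM return from the fault handler reaches pagefault_out_of_memory() and, with panic_on_oom set, panic(). This is a condensed, paraphrased sketch in the style of 2.6.32-era arch/x86/mm/fault.c and mm/oom_kill.c, not the exact RHEL6 source; the function names match the frames in the trace:

/*
 * Sketch of the path in the crash stack above:
 * do_page_fault() -> mm_fault_error() -> pagefault_out_of_memory()
 * -> panic() (when panic_on_oom is set). Paraphrased for illustration.
 */
static void mm_fault_error(struct pt_regs *regs, unsigned long error_code,
			   unsigned long address, unsigned int fault)
{
	if (fault & VM_FAULT_OOM)
		out_of_memory(regs, error_code, address);	/* see below */
	else if (fault & VM_FAULT_SIGBUS)
		do_sigbus(regs, error_code, address);
	else
		BUG();
}

/* arch-side helper: for a user-mode fault, escalate to the generic OOM path */
static void out_of_memory(struct pt_regs *regs, unsigned long error_code,
			  unsigned long address)
{
	up_read(&current->mm->mmap_sem);
	pagefault_out_of_memory();
}

/*
 * mm/oom_kill.c: with vm.panic_on_oom=1 this panic() is what produced the
 * crash dump; otherwise the OOM killer picks a victim ("APPN12" above).
 * Note that this path runs whenever the fault handler returned
 * VM_FAULT_OOM, whether or not the node is under real memory pressure --
 * which matches the meminfo dump showing gigabytes free.
 */
void pagefault_out_of_memory(void)
{
	if (sysctl_panic_on_oom)
		panic("out of memory from page fault. panic_on_oom is selected.\n");

	__out_of_memory(0, NULL, 0);	/* pick and kill a victim task */
}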
Additional information provided by on-line support:
From: Bruno Faccini
In my submission text, I reported that there are multiple cases where VM_FAULT_OOM can be returned, and I started to try to list/write them down. During this process I found this code path, which may be of interest:
====================================
Possible cases of VM_FAULT_OOM return code:
1) In filemap_fault(), page_cache_read() may have returned -ENOMEM within the current stack call sequence (also found for other application tasks in page-fault handling at the time of the OOM/crash):
 #0 [ffff880ad408d898] schedule at ffffffff81467211
 #1 [ffff880ad408d960] io_schedule at ffffffff814679a3
 #2 [ffff880ad408d980] sync_page at ffffffff810fa35d
 #3 [ffff880ad408d990] __wait_on_bit_lock at ffffffff814680ba
 #4 [ffff880ad408d9e0] __lock_page at ffffffff810fa2f7
 #5 [ffff880ad408da40] find_lock_page at ffffffff810fbc23
 #6 [ffff880ad408da70] filemap_fault at ffffffff810fbe69
 #7 [ffff880ad408dae0] vvp_io_fault_start at ffffffffa07e4b1f
 #8 [ffff880ad408db80] cl_io_start at ffffffffa04649c8
 #9 [ffff880ad408dbe0] cl_io_loop at ffffffffa0468ad0
#10 [ffff880ad408dc50] ll_fault0 at ffffffffa07c8a76
#11 [ffff880ad408dce0] ll_fault at ffffffffa07c8e68
#12 [ffff880ad408dd50] __do_fault at ffffffff811237b4
#13 [ffff880ad408dde0] handle_pte_fault at ffffffff81123cfd
#14 [ffff880ad408deb0] handle_mm_fault at ffffffff811248dd
#15 [ffff880ad408df00] do_page_fault at ffffffff8146c3d3
#16 [ffff880ad408df50] page_fault at ffffffff81469e45
/*
 * An error return from page_cache_read can result if the
 * system is low on memory, or a problem occurs while trying
 * to schedule I/O.
 */
then/thus page_cache_read() may relay -ENOMEM from/when:
_ page_cache_alloc_cold() returned NULL
_ add_to_page_cache_lru()
_ mapping->a_ops->readpage()
.....
====================================
To be investigated further ...
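For reference on case 1), the generic code in 2.6.32-era mm/filemap.c looks roughly as below (a lightly condensed sketch, not necessarily the exact RHEL6 source). It shows the three -ENOMEM relay points listed above, and the spot in filemap_fault() where the quoted comment sits and an -ENOMEM becomes VM_FAULT_OOM:

/* mm/filemap.c (2.6.32-era), lightly condensed for illustration. */
static int page_cache_read(struct file *file, pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;
	struct page *page;
	int ret;

	do {
		page = page_cache_alloc_cold(mapping);
		if (!page)
			return -ENOMEM;		/* relay point 1 */

		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
		if (ret == 0)
			/* relay point 3: the fs readpage method,
			 * i.e. Lustre's on this client */
			ret = mapping->a_ops->readpage(file, page);
		else if (ret == -EEXIST)
			ret = 0;	/* lost the race to add: fine */
		/* else: relay point 2, error from add_to_page_cache_lru() */

		page_cache_release(page);
	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;
}

/* In filemap_fault(), after page_cache_read() fails: */
	/*
	 * An error return from page_cache_read can result if the
	 * system is low on memory, or a problem occurs while trying
	 * to schedule I/O.
	 */
	if (error == -ENOMEM)
		return VM_FAULT_OOM;	/* -> mm_fault_error() -> OOM killer */
	return VM_FAULT_SIGBUS;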
Recently, they tried to reproduce the problem and captured traces (syslog and lctl dk). Below is a description of their tests and how the OOM occurs:
It took some time to reproduce the problem; in fact, it is not systematic. On our first try (just running a binary with mpirun), we were not able to break the system. We had to get the user's job submission script to be able to reproduce the issue. The script doesn't do much: it runs ldd twice on the binary, then runs the binary with mpirun. Even with the script, from time to time the jobs work without trouble. In any case, I have attached an extract of the syslog (only the latest OOM is relevant) and the content of an lctl dk on the node which produced the OOM (with full debugging enabled via echo -1 > .../debug).
Attachments
Issue Links
- Trackbacks:
  - Changelog 2.1: changes from version 2.1.0 to version 2.1.1. Server support for kernels: 2.6.18-274.12.1.el5 (RHEL5), 2.6.32-220.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-274.12.1.el5 (RHEL5), 2.6.32-220.el6 (RHEL6), 2.6.32.360....
  - Changelog 2.2: version 2.2.0. Support for networks: o2iblnd (OFED 1.5.4). Server support for kernels: 2.6.32-220.4.2.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-274.18.1.el5 (RHEL5), 2.6.32-220.4.2.el6 (RHEL6), 2.6.32.360....