[LU-3400] 2.1.5<->2.4.0 interop: performance-sanity test 7: rank 0: unlink(f530412) error: Interrupted system call Created: 27/May/13 Updated: 15/Apr/14 Resolved: 15/Apr/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Yang Sheng |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | yuc2 | ||
| Environment: |
Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/204 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8416 |
| Description |
|
performance-sanity test 7 failed as follows:

0: client-26vm5.lab.whamcloud.com starting at Sat May 25 23:13:16 2013
[client-26vm6.lab.whamcloud.com:27516] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[client-26vm6.lab.whamcloud.com:27516] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
rank 0: unlink(f530412) error: Interrupted system call

MDS (client-26vm3) hit oom issue:

Lustre: DEBUG MARKER: == mdsrate-stat-small mdsrate-stat-small.sh: test complete, cleaning up == 23:13:15 (1369548795)
sendmail invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
sendmail cpuset=/ mems_allowed=0
Pid: 1639, comm: sendmail Not tainted 2.6.32-358.6.2.el6_lustre.g97b2c86.x86_64 #1
Call Trace:
 [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
 [<ffffffff810e92ce>] ? __delayacct_freepages_end+0x2e/0x30
 [<ffffffff8121d3bc>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff8111d16e>] ? select_bad_process+0x9e/0x120
 [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
 [<ffffffff81160a5a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff811547a2>] ? read_swap_cache_async+0xf2/0x150
 [<ffffffff811552b9>] ? valid_swaphandles+0x69/0x150
 [<ffffffff81154887>] ? swapin_readahead+0x87/0xc0
 [<ffffffff81143d7b>] ? handle_pte_fault+0x70b/0xb50
 [<ffffffff8105e203>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff811443fa>] ? handle_mm_fault+0x23a/0x310
 [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
 [<ffffffff81171ed1>] ? __mem_cgroup_uncharge_common+0x81/0x300
 [<ffffffff811721db>] ? mem_cgroup_uncharge_swapcache+0x2b/0x60
 [<ffffffff81277731>] ? cpumask_any_but+0x31/0x50
 [<ffffffff8104c7e8>] ? flush_tlb_page+0x48/0xb0
 [<ffffffff8104b7cd>] ? ptep_set_access_flags+0x6d/0x70
 [<ffffffff811759e3>] ? swap_cgroup_record+0xa3/0xc0
 [<ffffffff815139ce>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81510d85>] ? page_fault+0x25/0x30
 [<ffffffff812824d6>] ? copy_user_generic_unrolled+0x86/0xb0
 [<ffffffff81010bde>] ? copy_user_generic+0xe/0x20
 [<ffffffff811970b9>] ? set_fd_set+0x49/0x60
 [<ffffffff8119856c>] ? core_sys_select+0x1bc/0x2c0
 [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff8103c7b8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103b8ac>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103b8b9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811988c7>] ? sys_select+0x47/0x110
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 30
active_anon:0 inactive_anon:3 isolated_anon:0 active_file:208092 inactive_file:207909 isolated_file:480 unevictable:0 dirty:459 writeback:1 unstable:0 free:13241 slab_reclaimable:13176 slab_unreclaimable:10179 mapped:1 shmem:0 pagetables:787 bounce:0
Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:12kB active_file:3748kB inactive_file:3352kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:244kB slab_unreclaimable:16kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10982 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:44620kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:828620kB inactive_file:828284kB unevictable:0kB isolated(anon):0kB isolated(file):1920kB present:2052064kB mlocked:0kB dirty:1836kB writeback:4kB mapped:4kB shmem:0kB slab_reclaimable:52460kB slab_unreclaimable:40700kB kernel_stack:1496kB pagetables:3148kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1506607 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 50*4kB 32*8kB 25*16kB 20*32kB 15*64kB 6*128kB 2*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8344kB
Node 0 DMA32: 9623*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44620kB
69416 total pagecache pages
3 pages in swap cache
Swap cache stats: add 33445, delete 33442, find 17374/22783
Free swap = 4105544kB
Total swap = 4128760kB
524284 pages RAM
43668 pages reserved
69933 pages shared
392724 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[  472]   0  472   2726 0 0 -17 -1000 udevd
[  787]   0  787   2660 0 0 -17 -1000 udevd
[ 1131]   0 1131  23299 1 0 -17 -1000 auditd
[ 1147]   0 1147  62367 1 0   0     0 rsyslogd
[ 1189]  32 1189   4743 1 0   0     0 rpcbind
[ 1201]   0 1201  49811 1 0   0     0 sssd
[ 1202]   0 1202  57001 1 0   0     0 sssd_be
[ 1203]   0 1203  50720 1 0   0     0 sssd_nss
[ 1204]   0 1204  48273 1 0   0     0 sssd_pam
[ 1221]  29 1221   6355 1 0   0     0 rpc.statd
[ 1347]  81 1347   5869 1 0   0     0 dbus-daemon
[ 1386]   0 1386   1019 0 0   0     0 acpid
[ 1395]  68 1395   6794 1 0   0     0 hald
[ 1396]   0 1396   4525 1 0   0     0 hald-runner
[ 1424]   0 1424   5054 1 0   0     0 hald-addon-inpu
[ 1434]  68 1434   4450 1 0   0     0 hald-addon-acpi
[ 1454]   0 1454 168269 1 0   0     0 automount
[ 1496]   0 1496  26826 0 0   0     0 rpc.rquotad
[ 1500]   0 1500   5413 0 0   0     0 rpc.mountd
[ 1549]   0 1549   6290 1 0   0     0 rpc.idmapd
[ 1592] 498 1592  58372 1 0   0     0 munged
[ 1607]   0 1607  16028 0 0 -17 -1000 sshd
[ 1615]   0 1615   5523 1 0   0     0 xinetd
[ 1623]  38 1623   7005 1 0   0     0 ntpd
[ 1639]   0 1639  22194 0 0   0     0 sendmail
[ 1647]  51 1647  19539 0 0   0     0 sendmail
[ 1669]   0 1669  29303 1 0   0     0 crond
[ 1680]   0 1680   5363 0 0   0     0 atd
[ 1705]   0 1705   1018 1 0   0     0 agetty
[ 1707]   0 1707   1015 1 0   0     0 mingetty
[ 1709]   0 1709   1015 1 0   0     0 mingetty
[ 1711]   0 1711   1015 1 0   0     0 mingetty
[ 1713]   0 1713   1015 1 0   0     0 mingetty
[ 1714]   0 1714   2725 0 0 -17 -1000 udevd
[ 1716]   0 1716   1015 1 0   0     0 mingetty
[ 1718]   0 1718   1015 1 0   0     0 mingetty

Maloo report: https://maloo.whamcloud.com/test_sets/2581003a-c624-11e2-9bf1-52540035b04c 
|
| Comments |
| Comment by Peter Jones [ 27/May/13 ] |
|
Yang Sheng, could you please look into this issue? Thanks, Peter |
| Comment by Yang Sheng [ 29/May/13 ] |
|
Another instance: https://maloo.whamcloud.com/sub_tests/e28be6c2-95dc-11e2-9abb-52540035b04c |
| Comment by Jian Yu [ 29/May/13 ] |
|
This is not a new issue. I found that the oom issue occurred before: b2_1 client with b2_3 server; b2_1 client with b2_2 server. Because console logs on the MDS nodes were not gathered completely by autotest, we thought the failures were |
| Comment by Jian Yu [ 03/Jun/13 ] |
|
Lustre Client: 2.1.6 RC1
Lustre Server: 2.3.0
The oom issue occurred again: |
| Comment by Yang Sheng [ 04/Jun/13 ] |
|
It would be very useful to get /proc/slabinfo while the OOM is present. |
| Comment by Jian Yu [ 04/Jun/13 ] |
I tried making autotest run only performance-sanity, twice, on Lustre b2_1 clients with master servers, but failed to hit the issue. I'll run performance-sanity more times to see whether the issue can be reproduced manually. |
| Comment by Jian Yu [ 25/Jun/13 ] |
|
Unfortunately, I could not reproduce the issue manually. It occurred again in an autotest run:
Lustre Client: 2.1.6 RC2
Lustre Server: 2.3.0
https://maloo.whamcloud.com/test_sets/04eee310-dd0c-11e2-8c97-52540035b04c |
| Comment by Yang Sheng [ 25/Jun/13 ] |
|
OK, I'll commit a debug patch to autotest then, and hopefully we can get slabinfo when the OOM hits again. |
| Comment by Yang Sheng [ 24/Jul/13 ] |
|
Got a hit with my debug patch: https://maloo.whamcloud.com/test_sets/cffedf6e-f3f6-11e2-bc23-52540035b04c

Lustre: DEBUG MARKER: buffer_head 85795 319458 104 37 1 : tunables 120 60 0 : slabdata 8634 8634 0
Lustre: DEBUG MARKER: dentry 40740 40740 192 20 1 : tunables 120 60 0 : slabdata 2037 2037 0
Lustre: DEBUG MARKER: size-32 27245 27440 32 112 1 : tunables 120 60 0 : slabdata 245 245 0
Lustre: DEBUG MARKER: size-64 8637 12567 64 59 1 : tunables 120 60 0 : slabdata 213 213 0
Lustre: DEBUG MARKER: sysfs_dir_cache 8107 8127 144 27 1 : tunables 120 60 0 : slabdata 301 301 0
Lustre: DEBUG MARKER: selinux_inode_security 7163 7208 72 53 1 : tunables 120 60 0 : slabdata 136 136 0
Lustre: DEBUG MARKER: inode_cache 4418 4428 592 6 1 : tunables 54 27 0 : slabdata 738 738 0
Lustre: DEBUG MARKER: size-128 2573 3780 128 30 1 : tunables 120 60 0 : slabdata 126 126 0
Lustre: DEBUG MARKER: vm_area_struct 2809 2869 200 19 1 : tunables 120 60 0 : slabdata 151 151 0
Lustre: DEBUG MARKER: anon_vma_chain 2325 2387 48 77 1 : tunables 120 60 0 : slabdata 31 31 0 |
| Comment by Jian Yu [ 14/Aug/13 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b2_1/215/ (2.1.6)
The large-scale test hit the same oom issue: |
| Comment by Yang Sheng [ 12/Feb/14 ] |
|
Hi Yujian, is this issue still showing up recently? |
| Comment by Jian Yu [ 13/Feb/14 ] |
Hi Yang Sheng, I did not hit the oom issue in Lustre 2.4.2 release testing. For the current b2_5 branch, I hit several oom failures, reported in |
| Comment by Yang Sheng [ 15/Apr/14 ] |
|
As per Yujian's comment, closing this one. |