[LU-3400] 2.1.5<->2.4.0 interop: performance-sanity test 7: rank 0: unlink(f530412) error: Interrupted system call Created: 27/May/13  Updated: 15/Apr/14  Resolved: 15/Apr/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Yang Sheng
Resolution: Cannot Reproduce Votes: 0
Labels: yuc2
Environment:

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/204
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1508
Distro/Arch: RHEL6.4/x86_64


Attachments: Text File client-26vm3.console.log    
Severity: 3
Rank (Obsolete): 8416

 Description   

performance-sanity test 7 failed as follows:

0: client-26vm5.lab.whamcloud.com starting at Sat May 25 23:13:16 2013
[client-26vm6.lab.whamcloud.com:27516] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[client-26vm6.lab.whamcloud.com:27516] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
rank 0: unlink(f530412) error: Interrupted system call

The MDS (client-26vm3) hit an oom issue:

Lustre: DEBUG MARKER: == mdsrate-stat-small mdsrate-stat-small.sh: test complete, cleaning up == 23:13:15 (1369548795)
sendmail invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
sendmail cpuset=/ mems_allowed=0
Pid: 1639, comm: sendmail Not tainted 2.6.32-358.6.2.el6_lustre.g97b2c86.x86_64 #1
Call Trace:
 [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
 [<ffffffff810e92ce>] ? __delayacct_freepages_end+0x2e/0x30
 [<ffffffff8121d3bc>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff8111d16e>] ? select_bad_process+0x9e/0x120
 [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
 [<ffffffff81160a5a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff811547a2>] ? read_swap_cache_async+0xf2/0x150
 [<ffffffff811552b9>] ? valid_swaphandles+0x69/0x150
 [<ffffffff81154887>] ? swapin_readahead+0x87/0xc0
 [<ffffffff81143d7b>] ? handle_pte_fault+0x70b/0xb50
 [<ffffffff8105e203>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff811443fa>] ? handle_mm_fault+0x23a/0x310
 [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
 [<ffffffff81171ed1>] ? __mem_cgroup_uncharge_common+0x81/0x300
 [<ffffffff811721db>] ? mem_cgroup_uncharge_swapcache+0x2b/0x60
 [<ffffffff81277731>] ? cpumask_any_but+0x31/0x50
 [<ffffffff8104c7e8>] ? flush_tlb_page+0x48/0xb0
 [<ffffffff8104b7cd>] ? ptep_set_access_flags+0x6d/0x70
 [<ffffffff811759e3>] ? swap_cgroup_record+0xa3/0xc0
 [<ffffffff815139ce>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81510d85>] ? page_fault+0x25/0x30
 [<ffffffff812824d6>] ? copy_user_generic_unrolled+0x86/0xb0
 [<ffffffff81010bde>] ? copy_user_generic+0xe/0x20
 [<ffffffff811970b9>] ? set_fd_set+0x49/0x60
 [<ffffffff8119856c>] ? core_sys_select+0x1bc/0x2c0
 [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff8103c7b8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103b8ac>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103b8b9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811988c7>] ? sys_select+0x47/0x110
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  30
active_anon:0 inactive_anon:3 isolated_anon:0
 active_file:208092 inactive_file:207909 isolated_file:480
 unevictable:0 dirty:459 writeback:1 unstable:0
 free:13241 slab_reclaimable:13176 slab_unreclaimable:10179
 mapped:1 shmem:0 pagetables:787 bounce:0
Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:12kB active_file:3748kB inactive_file:3352kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:244kB slab_unreclaimable:16kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10982 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:44620kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:828620kB inactive_file:828284kB unevictable:0kB isolated(anon):0kB isolated(file):1920kB present:2052064kB mlocked:0kB dirty:1836kB writeback:4kB mapped:4kB shmem:0kB slab_reclaimable:52460kB slab_unreclaimable:40700kB kernel_stack:1496kB pagetables:3148kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1506607 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 50*4kB 32*8kB 25*16kB 20*32kB 15*64kB 6*128kB 2*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8344kB
Node 0 DMA32: 9623*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44620kB
69416 total pagecache pages
3 pages in swap cache
Swap cache stats: add 33445, delete 33442, find 17374/22783
Free swap  = 4105544kB
Total swap = 4128760kB
524284 pages RAM
43668 pages reserved
69933 pages shared
392724 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  472]     0   472     2726        0   0     -17         -1000 udevd
[  787]     0   787     2660        0   0     -17         -1000 udevd
[ 1131]     0  1131    23299        1   0     -17         -1000 auditd
[ 1147]     0  1147    62367        1   0       0             0 rsyslogd
[ 1189]    32  1189     4743        1   0       0             0 rpcbind
[ 1201]     0  1201    49811        1   0       0             0 sssd
[ 1202]     0  1202    57001        1   0       0             0 sssd_be
[ 1203]     0  1203    50720        1   0       0             0 sssd_nss
[ 1204]     0  1204    48273        1   0       0             0 sssd_pam
[ 1221]    29  1221     6355        1   0       0             0 rpc.statd
[ 1347]    81  1347     5869        1   0       0             0 dbus-daemon
[ 1386]     0  1386     1019        0   0       0             0 acpid
[ 1395]    68  1395     6794        1   0       0             0 hald
[ 1396]     0  1396     4525        1   0       0             0 hald-runner
[ 1424]     0  1424     5054        1   0       0             0 hald-addon-inpu
[ 1434]    68  1434     4450        1   0       0             0 hald-addon-acpi
[ 1454]     0  1454   168269        1   0       0             0 automount
[ 1496]     0  1496    26826        0   0       0             0 rpc.rquotad
[ 1500]     0  1500     5413        0   0       0             0 rpc.mountd
[ 1549]     0  1549     6290        1   0       0             0 rpc.idmapd
[ 1592]   498  1592    58372        1   0       0             0 munged
[ 1607]     0  1607    16028        0   0     -17         -1000 sshd
[ 1615]     0  1615     5523        1   0       0             0 xinetd
[ 1623]    38  1623     7005        1   0       0             0 ntpd
[ 1639]     0  1639    22194        0   0       0             0 sendmail
[ 1647]    51  1647    19539        0   0       0             0 sendmail
[ 1669]     0  1669    29303        1   0       0             0 crond
[ 1680]     0  1680     5363        0   0       0             0 atd
[ 1705]     0  1705     1018        1   0       0             0 agetty
[ 1707]     0  1707     1015        1   0       0             0 mingetty
[ 1709]     0  1709     1015        1   0       0             0 mingetty
[ 1711]     0  1711     1015        1   0       0             0 mingetty
[ 1713]     0  1713     1015        1   0       0             0 mingetty
[ 1714]     0  1714     2725        0   0     -17         -1000 udevd
[ 1716]     0  1716     1015        1   0       0             0 mingetty
[ 1718]     0  1718     1015        1   0       0             0 mingetty

Maloo report: https://maloo.whamcloud.com/test_sets/2581003a-c624-11e2-9bf1-52540035b04c
The console log for client-26vm3 is attached.



 Comments   
Comment by Peter Jones [ 27/May/13 ]

Yang Sheng

Could you please look into this issue?

Thanks

Peter

Comment by Yang Sheng [ 29/May/13 ]

Another instance: https://maloo.whamcloud.com/sub_tests/e28be6c2-95dc-11e2-9abb-52540035b04c

Comment by Jian Yu [ 29/May/13 ]

This is not a new issue. I found that the oom issue occurred before:

b2_1 client with b2_3 server:
https://maloo.whamcloud.com/test_sets/6bc8e158-937e-11e2-89cc-52540035b04c

b2_1 client with b2_2 server:
https://maloo.whamcloud.com/test_sets/3eb99b7c-c7c6-11e2-9f90-52540035b04c

Because console logs on MDS nodes were not gathered completely by autotest, we thought the failures were caused by LU-798.

Comment by Jian Yu [ 03/Jun/13 ]

Lustre Client: 2.1.6 RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/208

Lustre Server: 2.3.0
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/41

The oom issue occurred again:
https://maloo.whamcloud.com/test_sets/7bcb8202-cc0b-11e2-9cc0-52540035b04c

Comment by Yang Sheng [ 04/Jun/13 ]

It would be very useful to get /proc/slabinfo while the OOM is present.

Comment by Jian Yu [ 04/Jun/13 ]

It would be very useful to get /proc/slabinfo while the OOM is present.

I tried to make autotest run only performance-sanity, twice, on Lustre b2_1 clients with master servers, but failed to hit the issue. I'll run performance-sanity more times to see whether the issue can be reproduced manually.

Comment by Jian Yu [ 25/Jun/13 ]

Unfortunately, I could not reproduce the issue manually.

It occurred again in an autotest run:

Lustre Client: 2.1.6 RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/215

Lustre Server: 2.3.0
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/41

https://maloo.whamcloud.com/test_sets/04eee310-dd0c-11e2-8c97-52540035b04c

Comment by Yang Sheng [ 25/Jun/13 ]

Ok, I'll commit a debug patch to autotest then, and hope to get slabinfo when the OOM hits again.
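
For what it's worth, here is a minimal, hypothetical sketch (not the actual debug patch) of how such a slabinfo snapshot could be summarized — ranking caches by estimated footprint from the /proc/slabinfo version 2.x fields:

```python
#!/usr/bin/env python
"""Hypothetical helper sketch (not the actual debug patch): rank slab caches
by estimated memory use, so a periodic /proc/slabinfo snapshot shows the
biggest consumers just before an OOM kill."""

def top_slabs(slabinfo_text, count=10):
    """Return up to `count` (bytes, cache-name) pairs, largest first.

    Footprint is estimated as num_slabs * pagesperslab * 4 KiB, following
    the /proc/slabinfo version 2.x field layout:
      name active_objs num_objs objsize objperslab pagesperslab
      : tunables limit batchcount sharedfactor
      : slabdata active_slabs num_slabs sharedavail
    """
    caches = []
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 16 or not fields[1].isdigit():
            continue  # skip the version and column-header lines
        name = fields[0]
        pages_per_slab = int(fields[5])
        num_slabs = int(fields[14])
        caches.append((num_slabs * pages_per_slab * 4096, name))
    return sorted(caches, reverse=True)[:count]

if __name__ == "__main__":
    # One-shot demo on two data lines copied from the dump in this ticket;
    # on a live node one would read /proc/slabinfo in a loop instead.
    sample = (
        "buffer_head 85795 319458 104 37 1 : tunables 120 60 0 : slabdata 8634 8634 0\n"
        "dentry 40740 40740 192 20 1 : tunables 120 60 0 : slabdata 2037 2037 0\n"
    )
    for nbytes, name in top_slabs(sample):
        print("%8d kB  %s" % (nbytes // 1024, name))
```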

Comment by Yang Sheng [ 24/Jul/13 ]

Got a hit with my debug patch: https://maloo.whamcloud.com/test_sets/cffedf6e-f3f6-11e2-bc23-52540035b04c
The results point out that buffer_head consumes too much memory:

Lustre: DEBUG MARKER: buffer_head 85795 319458 104 37 1 : tunables 120 60 0 : slabdata 8634 8634 0
Lustre: DEBUG MARKER: dentry 40740 40740 192 20 1 : tunables 120 60 0 : slabdata 2037 2037 0
Lustre: DEBUG MARKER: size-32 27245 27440 32 112 1 : tunables 120 60 0 : slabdata 245 245 0
Lustre: DEBUG MARKER: size-64 8637 12567 64 59 1 : tunables 120 60 0 : slabdata 213 213 0
Lustre: DEBUG MARKER: sysfs_dir_cache 8107 8127 144 27 1 : tunables 120 60 0 : slabdata 301 301 0
Lustre: DEBUG MARKER: selinux_inode_security 7163 7208 72 53 1 : tunables 120 60 0 : slabdata 136 136 0
Lustre: DEBUG MARKER: inode_cache 4418 4428 592 6 1 : tunables 54 27 0 : slabdata 738 738 0
Lustre: DEBUG MARKER: size-128 2573 3780 128 30 1 : tunables 120 60 0 : slabdata 126 126 0
Lustre: DEBUG MARKER: vm_area_struct 2809 2869 200 19 1 : tunables 120 60 0 : slabdata 151 151 0
Lustre: DEBUG MARKER: anon_vma_chain 2325 2387 48 77 1 : tunables 120 60 0 : slabdata 31 31 0
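
As a sanity check on the dump above, the buffer_head footprint can be estimated from its slabdata fields (a rough back-of-the-envelope calculation, assuming the slabinfo 2.x layout and 4 KiB pages on x86_64):

```python
# Rough footprint of the buffer_head cache from the dump above.
# Fields used: pagesperslab = 1 (6th column of the buffer_head line) and
# num_slabs = 8634 (second value after "slabdata"); 4096-byte pages.
num_slabs = 8634
pages_per_slab = 1
page_size = 4096

footprint_kb = num_slabs * pages_per_slab * page_size // 1024
print("buffer_head slab pages ~= %d kB" % footprint_kb)

# Cross-check against the object counts: 319458 total objects * 104 bytes.
print("buffer_head objects    ~= %d kB" % (319458 * 104 // 1024))
```
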
Comment by Jian Yu [ 14/Aug/13 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_1/215/ (2.1.6)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/29/

large-scale test hit the same oom issue:
https://maloo.whamcloud.com/test_sets/cb3ccd62-047a-11e3-a8e9-52540035b04c

Comment by Yang Sheng [ 12/Feb/14 ]

Hi Yujian, does this issue still occur in recent testing?

Comment by Jian Yu [ 13/Feb/14 ]

Hi Yujian, does this issue still occur in recent testing?

Hi Yang Sheng, I did not hit the oom issue in Lustre 2.4.2 release testing. For the current b2_5 branch, I hit several oom failures, reported in LU-4432, LU-4357 and LU-3366.

Comment by Yang Sheng [ 15/Apr/14 ]

As per Yujian's comment, closing this one.

Generated at Sat Feb 10 01:33:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.