[LU-850] Test failure on test suite ost-pools, subtest test_23, dd did not fail with EDQUOT Created: 15/Nov/11  Updated: 10/Apr/12  Resolved: 10/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File debug.tar.gz    
Severity: 3
Rank (Obsolete): 6523

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/901ac12c-0f89-11e1-aad7-52540025f9af.

The sub-test test_23 failed with the following error:

dd did not fail with EDQUOT.

Info required for matching: ost-pools 23
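
For context, a rough sketch of what the subtest exercises (pool name, directory, and limits below are illustrative, not taken verbatim from ost-pools.sh):

# create a pool, direct a test directory at it, and set a block quota for quota_usr
lctl pool_new lustre.testpool
lctl pool_add lustre.testpool lustre-OST[0000-0006]
lfs setstripe -p testpool /mnt/lustre/d23
lfs setquota -u quota_usr -B 8192 /mnt/lustre      # 8 MB hard block limit (illustrative)
# write past the limit as quota_usr; the test expects dd to fail with EDQUOT
runas -u quota_usr dd if=/dev/zero of=/mnt/lustre/d23/file bs=1M count=10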



 Comments   
Comment by Peter Jones [ 18/Nov/11 ]

Niu

Would LU-745 have any bearing on this?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 20/Nov/11 ]

The test log shows that the quota limit isn't set on OST6:

Disk quotas for user quota_usr (uid 60000):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/mnt/lustre/d0.ost-pools/d23/dir
                   1024    8192    8192       -       1  5000040 5250042       -
lustre-MDT0000_UUID
                      0       -    1024       -       1       -    5120       -
lustre-OST0000_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0001_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0002_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0003_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0004_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0005_UUID
                      0       -    1024       -       -       -       -       -
lustre-OST0006_UUID
                   1024       -       0       -       -       -       -       -

But from the log I can't tell why it wasn't set. Sarah, could you reproduce it with D_TRACE & D_QUOTA enabled? Thank you.
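
For reference, one way to capture that (a sketch only; the debug mask syntax and buffer size are assumptions, and /tmp/lu850-debug.log is just a placeholder path):

# on the client and the OSS, before rerunning the test
lctl set_param debug="+trace +quota"
lctl set_param debug_mb=512              # enlarge the debug buffer so the log isn't overwritten
# after the failure, dump the kernel debug log
lctl dk /tmp/lu850-debug.log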

Comment by Sarah Liu [ 21/Nov/11 ]

Sure, will keep you updated.

Comment by Sarah Liu [ 21/Nov/11 ]

I reran this test and got an OOM:
------
running: ost-pools ONLY=23
run_suite ost-pools /usr/lib64/lustre/tests/ost-pools.sh
Lustre: DEBUG MARKER: ----============= acceptance-small: ost-pools ============---- Mon Nov 21 16:05:09 PST 2011
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: 5368:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: 5368:0:(debug.c:326:libcfs_debug_str2mask()) Skipped 3 previous similar messages
Lustre: DEBUG MARKER: only running test 23
Lustre: DEBUG MARKER: excepting tests:
Lustre: DEBUG MARKER: == ost-pools test 23: OST pools and quota == 16:05:17 (1321920317)
LustreError: 5998:0:(quota_ctl.c:328:client_quota_ctl()) ptlrpc_queue_wait failed, rc: -114
LustreError: 11-0: an error occurred while communicating with 192.168.4.131@o2ib. The ost_write operation failed with -122
LustreError: 6006:0:(vvp_io.c:1017:vvp_io_commit_write()) Write page 280 of inode e72d2980 failed -122
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks osc stop
Lustre: 2443:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1321920490/real 1321920490] req@c2a2e400 x1386133843884950/t0(0) o400->MGC192.168.4.2@o2ib@192.168.4.2@o2ib:26/25 lens 192/192 e 0 to 1 dl 1321920497 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 166-1: MGC192.168.4.2@o2ib: Connection to service MGS via nid 192.168.4.2@o2ib was lost; in progress operations using this service will fail.
Lustre: lustre-MDT0000-mdc-c33e2400: Connection to service lustre-MDT0000 via nid 192.168.4.2@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 192.168.4.131@o2ib. The obd_ping operation failed with -107
Lustre: lustre-OST0000-osc-c33e2400: Connection to service lustre-OST0000 via nid 192.168.4.131@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 2441:0:(import.c:849:ptlrpc_connect_interpret()) MGS@192.168.4.2@o2ib changed server handle from 0x749fb9a0e2f2648d to 0x749fb9a0e2f26ef8
Lustre: MGC192.168.4.2@o2ib: Reactivating import
Lustre: MGC192.168.4.2@o2ib: Connection restored to service MGS using nid 192.168.4.2@o2ib.
LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
LustreError: 2445:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@e708ec00 x1386133843884962/t0(0) o4->lustre-OST0000-osc-c33e2400@192.168.4.131@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
LustreError: 6047:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@e82aed14[2 3 0 1 1 00000000] W(2):[0, 4294967295]@[0x100000000:0x2:0x0]

{ LustreError: 6047:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@e81425a0: [0 e6d3a848 W(2):[256, 4294967295]@[0x200000400:0xc:0x0]] LustreError: 6047:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@e8bd2ea4: e6d01800 40100001 0xeded2fbddbaf7554 3 e737d098 size: 0 mtime: 18446744071562154368 atime: 18446744071562154368 ctime: 18446744071562154368 blocks: 0 LustreError: 6047:0:(osc_lock.c:809:osc_ldlm_completion_ast()) }

lock@e82aed14
LustreError: 6047:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
LustreError: 6047:0:(ldlm_resource.c:750:ldlm_resource_complain()) Namespace lustre-OST0000-osc-c33e2400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 6047:0:(ldlm_resource.c:756:ldlm_resource_complain()) Resource: e73f0a80 (2/0/0/0) (rc: 1)
LustreError: Skipped 5 previous similar messages
LustreError: 2443:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@e71a3c00 x1386133843884986/t0(0) o4->lustre-OST0003-osc-c33e2400@192.168.4.131@o2ib:6/4 lens 456/416 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
Lustre: lustre-OST0003-osc-c33e2400: Connection restored to service lustre-OST0003 using nid 192.168.4.131@o2ib.
Lustre: Skipped 6 previous similar messages
LustreError: 6030:0:(cl_lock.c:1408:cl_unuse_try()) result = -5, this is unlikely!
LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) lock@e82aed94[1 3 -5 1 0 00000006] W(2):[256, 4294967295]@[0x200000400:0xc:0x0]

{ LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) vvp@e6c54504: LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) lov@e6d3a848: 1 LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) 0 0: --- LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) }

lock@e82aed94
LustreError: 6030:0:(cl_lock.c:1426:cl_unuse_locked()) unuse return -5
irqbalance invoked oom-killer: gfp_mask=0x80d0, order=0, oom_adj=0
irqbalance cpuset=/ mems_allowed=0
Pid: 1486, comm: irqbalance Not tainted 2.6.32-131.6.1.el6.i686 #1
Call Trace:
[<c04df510>] ? oom_kill_process+0xb0/0x2d0
[<c04dfbba>] ? __out_of_memory+0x4a/0x90
[<c04dfc55>] ? out_of_memory+0x55/0xb0
[<c04eda4b>] ? __alloc_pages_nodemask+0x7fb/0x810
[<c04eda74>] ? __get_free_pages+0x14/0x30
[<c058315a>] ? sysfs_read_file+0x10a/0x190
[<c0506723>] ? mmap_region+0x1f3/0x4e0
[<c059d18c>] ? security_file_permission+0xc/0x10
[<c0583050>] ? sysfs_read_file+0x0/0x190
[<c05285fd>] ? vfs_read+0x9d/0x190
[<c04adc5c>] ? audit_syscall_entry+0x21c/0x240
[<c0528731>] ? sys_read+0x41/0x70
[<c0409bdf>] ? sysenter_do_call+0x12/0x28
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 84
CPU 2: hi: 186, btch: 31 usd: 174
CPU 3: hi: 186, btch: 31 usd: 154
HighMem per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 17
CPU 2: hi: 186, btch: 31 usd: 21
CPU 3: hi: 186, btch: 31 usd: 25
active_anon:3944 inactive_anon:448 isolated_anon:0
active_file:2010 inactive_file:818797 isolated_file:0
unevictable:0 dirty:2 writeback:22 unstable:0
free:2083650 slab_reclaimable:3908 slab_unreclaimable:127194
mapped:3522 shmem:46 pagetables:452 bounce:0
DMA free:3476kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15792kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:156kB slab_unreclaimable:4544kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 863 12159 12159
Normal free:4164kB min:3724kB low:4652kB high:5584kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:883912kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:15476kB slab_unreclaimable:504232kB kernel_stack:2168kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 90370 90370
HighMem free:8326960kB min:512kB low:12700kB high:24888kB active_anon:15776kB inactive_anon:1792kB active_file:7956kB inactive_file:3275176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11567412kB mlocked:0kB dirty:8kB writeback:88kB mapped:14084kB shmem:184kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1808kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 3*4kB 1*8kB 0*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3476kB
Normal: 73*4kB 7*8kB 2*16kB 2*32kB 2*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4156kB
HighMem: 70*4kB 31*8kB 9*16kB 7*32kB 9*64kB 11*128kB 8*256kB 6*512kB 6*1024kB 1*2048kB 2029*4096kB = 8326976kB
820864 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 14565368kB
Total swap = 14565368kB
3145712 pages RAM
2918914 pages HighMem
62162 pages reserved
834421 pages shared
169306 pages non-shared
Out of memory: kill process 1875 (qmgr) score 2582 or a child
Killed process 1875 (qmgr) vsz:10328kB, anon-rss:436kB, file-rss:1664kB
irqbalance invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
irqbalance cpuset=/ mems_allowed=0
Pid: 1486, comm: irqbalance Not tainted 2.6.32-131.6.1.el6.i686 #1
Call Trace:
[<c04df510>] ? oom_kill_process+0xb0/0x2d0
[<c04dfbba>] ? __out_of_memory+0x4a/0x90
[<c04dfc55>] ? out_of_memory+0x55/0xb0
[<c04eda4b>] ? __alloc_pages_nodemask+0x7fb/0x810
[<c0519a5c>] ? cache_alloc_refill+0x2bc/0x510
[<c0519df2>] ? __kmalloc+0x142/0x180
[<c057c927>] ? stat_open+0x47/0x90
[<c057c927>] ? stat_open+0x47/0x90
[<c05748b4>] ? proc_reg_open+0x94/0x180
[<c05340eb>] ? __link_path_walk+0x42b/0x6b0
[<c0543ba0>] ? single_release+0x0/0x30
[<c05256bc>] ? __dentry_open+0xdc/0x2a0
[<c0525971>] ? nameidata_to_filp+0x51/0x70
[<c0574820>] ? proc_reg_open+0x0/0x180
[<c0535ab1>] ? do_filp_open+0x5c1/0xb30
[<c0478d42>] ? hrtimer_try_to_cancel+0x32/0xb0
[<c0519982>] ? cache_alloc_refill+0x1e2/0x510
[<c0478dd1>] ? hrtimer_cancel+0x11/0x20
[<c0525448>] ? do_sys_open+0x58/0x130
[<c04adc5c>] ? audit_syscall_entry+0x21c/0x240
[<c052559c>] ? sys_open+0x2c/0x40
[<c0409bdf>] ? sysenter_do_call+0x12/0x28
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 26
CPU 1: hi: 186, btch: 31 usd: 149
CPU 2: hi: 186, btch: 31 usd: 178
CPU 3: hi: 186, btch: 31 usd: 103
HighMem per-cpu:
CPU 0: hi: 186, btch: 31 usd: 51
CPU 1: hi: 186, btch: 31 usd: 17
CPU 2: hi: 186, btch: 31 usd: 3
CPU 3: hi: 186, btch: 31 usd: 21
active_anon:3841 inactive_anon:448 isolated_anon:0
active_file:2012 inactive_file:823715 isolated_file:0
unevictable:0 dirty:1 writeback:9 unstable:0
free:2078702 slab_reclaimable:3910 slab_unreclaimable:127269
mapped:3476 shmem:46 pagetables:441 bounce:0
DMA free:3508kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15792kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:172kB slab_unreclaimable:4496kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 863 12159 12159
Normal free:3668kB min:3724kB low:4652kB high:5584kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:883912kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:15468kB slab_unreclaimable:504580kB kernel_stack:2168kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:359 all_unreclaimable? yes
lowmem_reserve[]: 0 0 90370 90370
HighMem free:8307632kB min:512kB low:12700kB high:24888kB active_anon:15364kB inactive_anon:1792kB active_file:7964kB inactive_file:3294860kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11567412kB mlocked:0kB dirty:4kB writeback:36kB mapped:13900kB shmem:184kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1764kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 3*4kB 7*8kB 1*16kB 1*32kB 1*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3508kB
Normal: 19*4kB 1*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3668kB
HighMem: 36*4kB 30*8kB 9*16kB 7*32kB 9*64kB 11*128kB 9*256kB 6*512kB 5*1024kB 2*2048kB 2024*4096kB = 8307632kB
825784 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 14565368kB
Total swap = 14565368kB
3145712 pages RAM
2918914 pages HighMem
62162 pages reserved
815056 pages shared
169309 pages non-shared
Out of memory: kill process 1874 (pickup) score 2565 or a child
Killed process 1874 (pickup) vsz:10260kB, anon-rss:440kB, file-rss:1632kB
irqbalance invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
irqbalance cpuset=/ mems_allowed=0
Pid: 1486, comm: irqbalance Not tainted 2.6.32-131.6.1.el6.i686 #1

Comment by Niu Yawei (Inactive) [ 23/Nov/11 ]

Hmm, I'm not sure why such a test could cause OOM. Hi Sarah, I take it the OOM is not reproducible, right? Is the "dd did not fail with EDQUOT" failure reproducible? Is it possible to get the D_QUOTA & D_TRACE debug log for it? Thanks.

Comment by Sarah Liu [ 24/Nov/11 ]

The OOM is reproducible, and the debug log is attached. Since I hit the OOM every time in my test, I am not sure whether the "dd error" is reproducible.

Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

LU-851 is another OOM issue.

Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

Ah, the client is i686. Has this test ever passed (w/o OOM) with 2.1?

Comment by Peter Jones [ 24/Nov/11 ]

We are deprecating i686 clients for 2.2 so if this problem is limited to i686 then we should not worry about it

Comment by Niu Yawei (Inactive) [ 24/Nov/11 ]

Ah, the client is i686. Has this test ever passed (w/o OOM) with 2.1?

Hi Johann, any reason why the OOM would only happen on a 32-bit arch?

Comment by Johann Lombardi (Inactive) [ 25/Nov/11 ]

The "Normal" zone (vs HighMem) is rather limited on i686 and i wonder if this combined with async journal commit (which can pin a significant amount of pages, see bugzilla 23529) is causing this problem. That's why i asked Sarah to try to reproduce with async journal commit disabled in LU-851.

Comment by Sarah Liu [ 30/Nov/11 ]

After disabling async journal commit, I got an OOM again.

[root@fat-amd-2 ~]# lctl get_param obdfilter.lustre-OST*.sync_journal
obdfilter.lustre-OST0000.sync_journal=0
obdfilter.lustre-OST0001.sync_journal=0
obdfilter.lustre-OST0002.sync_journal=0
obdfilter.lustre-OST0003.sync_journal=0
obdfilter.lustre-OST0004.sync_journal=0
obdfilter.lustre-OST0005.sync_journal=0

Out of memory: kill process 1758 (sssd) score 310 or a child
Killed process 1785 (sssd_pam) vsz:9436kB, anon-rss:284kB, file-rss:1900kB
sssd invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0
sssd cpuset=/ mems_allowed=0
Pid: 1758, comm: sssd Tainted: G ---------------- T 2.6.32-131.6.1.el6.i686 #1
Call Trace:
[<c04df510>] ? oom_kill_process+0xb0/0x2d0
[<c04dfbba>] ? __out_of_memory+0x4a/0x90
[<c04dfc55>] ? out_of_memory+0x55/0xb0
[<c04eda4b>] ? __alloc_pages_nodemask+0x7fb/0x810
[<c04eda74>] ? __get_free_pages+0x14/0x30
[<c0436e7c>] ? pgd_alloc+0x6c/0x2b0
[<c043a5a7>] ? sched_slice+0x57/0xb0
[<c04502c9>] ? mm_init+0xb9/0xf0
[<c0450893>] ? dup_mm+0x93/0x420
[<c0466f37>] ? lock_task_sighand+0x37/0x70
[<c044ad2b>] ? sched_autogroup_fork+0x4b/0x80
[<c0451655>] ? copy_process+0x9e5/0xf80
[<c0451c6a>] ? do_fork+0x7a/0x3e0
[<c04563a7>] ? sys_wait4+0x97/0xe0
[<c047d886>] ? getnstimeofday+0x46/0xf0
[<c04adc5c>] ? audit_syscall_entry+0x21c/0x240
[<c0408383>] ? sys_clone+0x33/0x40
[<c0409bdf>] ? sysenter_do_call+0x12/0x28

Comment by Johann Lombardi (Inactive) [ 01/Dec/11 ]

To disable async journal commit, obdfilter.lustre-OST*.sync_journal must be set to 1 so that the journal is flushed synchronously; the output above shows it still set to 0, i.e. async commit is still enabled.
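
For example (a sketch; the parameter path matches the get_param output above):

# on the OSS: force synchronous journal commit, i.e. disable async commit
lctl set_param obdfilter.lustre-OST*.sync_journal=1
lctl get_param obdfilter.lustre-OST*.sync_journal    # should now report 1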

Comment by Bryon Neitzel (Inactive) [ 06/Dec/11 ]

i686 is not a valid platform for 2.2. Can this ticket be closed, or could it also affect x86_64?

Comment by Oleg Drokin [ 03/Jan/12 ]

the "OOM with asyn journal" could happen with x86_64, but it should be a separate issue as that's quite a bit of development work to properly overcome that situation. Believed to only happen if there is little memory on the client and a lot of streaming writes, but actually that might happen more often than we suspect too, just in a somewhat hidden way.

Comment by Niu Yawei (Inactive) [ 10/Apr/12 ]

i686 isn't supported anymore.
