[LU-2818] Failure on test suite parallel-scale-nfsv4 test_compilebench: (mdt_lvb.c:126:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed Created: 15/Feb/13  Updated: 17/Mar/21  Resolved: 11/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-6472 OI scrub causes MDS reboot Resolved
Related
is related to LU-2748 OSD uses kmalloc with high order to a... Resolved
Severity: 3
Rank (Obsolete): 6822

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/25d39028-755e-11e2-bf59-52540035b04c.

The sub-test test_compilebench failed with the following error:

compilebench failed: 1

MDS console:

12:44:26:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 12:44:25 (1360701865)
12:44:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2         -r 2 --makej
12:44:27:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
13:06:42:mdt00_003: page allocation failure. order:5, mode:0x50
13:06:42:Pid: 22716, comm: mdt00_003 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
13:06:42:Call Trace:
13:06:42: [<ffffffff811231ff>] ? __alloc_pages_nodemask+0x77f/0x940
13:06:42: [<ffffffff8115d1a2>] ? kmem_getpages+0x62/0x170
13:06:42: [<ffffffff8115ddba>] ? fallback_alloc+0x1ba/0x270
13:06:42: [<ffffffff8115d80f>] ? cache_grow+0x2cf/0x320
13:06:42: [<ffffffff8115db39>] ? ____cache_alloc_node+0x99/0x160
13:06:42: [<ffffffffa04d8b60>] ? cfs_alloc+0x30/0x60 [libcfs]
13:06:42: [<ffffffff8115e909>] ? __kmalloc+0x189/0x220
13:06:42: [<ffffffffa04d8b60>] ? cfs_alloc+0x30/0x60 [libcfs]
13:06:42: [<ffffffffa0d3b69e>] ? osd_key_init+0x1e/0x670 [osd_ldiskfs]
13:06:42: [<ffffffffa066622f>] ? keys_fill+0x6f/0x190 [obdclass]
13:06:42: [<ffffffffa0669e6b>] ? lu_context_init+0xab/0x260 [obdclass]
13:06:42: [<ffffffffa066a03e>] ? lu_env_init+0x1e/0x30 [obdclass]
13:06:42: [<ffffffffa0e775ff>] ? mdt_lvbo_fill+0x1af/0x800 [mdt]
13:06:42: [<ffffffffa0e77450>] ? mdt_lvbo_fill+0x0/0x800 [mdt]
13:06:42: [<ffffffffa07c6591>] ? ldlm_handle_enqueue0+0x641/0x1080 [ptlrpc]
13:06:42: [<ffffffffa0e4e376>] ? mdt_enqueue+0x46/0x110 [mdt]
13:06:42: [<ffffffffa0e42fb8>] ? mdt_handle_common+0x628/0x1620 [mdt]
13:06:42: [<ffffffffa0e7a5c5>] ? mds_regular_handle+0x15/0x20 [mdt]
13:06:42: [<ffffffffa07f800c>] ? ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
13:06:42: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
13:06:42: [<ffffffffa07ef739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
13:06:42: [<ffffffff81052223>] ? __wake_up+0x53/0x70
13:06:42: [<ffffffffa07f9556>] ? ptlrpc_main+0xb76/0x1870 [ptlrpc]
13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:42: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:42: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:42: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
13:06:42:Mem-Info:
13:06:42:Node 0 DMA per-cpu:
13:06:42:CPU    0: hi:    0, btch:   1 usd:   0
13:06:42:Node 0 DMA32 per-cpu:
13:06:42:CPU    0: hi:  186, btch:  31 usd: 152
13:06:42:active_anon:4078 inactive_anon:1377 isolated_anon:0
13:06:42: active_file:109040 inactive_file:162532 isolated_file:0
13:06:42: unevictable:0 dirty:14064 writeback:512 unstable:0
13:06:42: free:14243 slab_reclaimable:32585 slab_unreclaimable:116609
13:06:42: mapped:3106 shmem:41 pagetables:752 bounce:0
13:06:42:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:184kB inactive_file:3376kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:556kB slab_unreclaimable:3268kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
13:06:42:lowmem_reserve[]: 0 2003 2003 2003
13:06:42:Node 0 DMA32 free:48624kB min:44720kB low:55900kB high:67080kB active_anon:16312kB inactive_anon:5508kB active_file:435976kB inactive_file:646752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:56256kB writeback:2048kB mapped:12424kB shmem:164kB slab_reclaimable:129784kB slab_unreclaimable:463168kB kernel_stack:1960kB pagetables:3008kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
13:06:42:lowmem_reserve[]: 0 0 0 0
13:06:42:Node 0 DMA: 33*4kB 5*8kB 3*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8348kB
13:06:42:Node 0 DMA32: 5780*4kB 2228*8kB 4*16kB 28*32kB 3*64kB 3*128kB 2*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48624kB
13:06:42:195649 total pagecache pages
13:06:42:0 pages in swap cache
13:06:42:Swap cache stats: add 0, delete 0, find 0/0
13:06:42:Free swap  = 4128760kB
13:06:42:Total swap = 4128760kB
13:06:42:524284 pages RAM
13:06:42:43608 pages reserved
13:06:42:207968 pages shared
13:06:42:267301 pages non-shared
13:06:42:LustreError: 22716:0:(mdt_lvb.c:126:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed: 
13:06:42:LustreError: 22716:0:(mdt_lvb.c:126:mdt_lvbo_fill()) LBUG
13:06:42:Pid: 22716, comm: mdt00_003
13:06:42:
13:06:42:Call Trace:
13:06:42: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
13:06:42: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
13:06:42: [<ffffffffa0e77bd5>] mdt_lvbo_fill+0x785/0x800 [mdt]
13:06:42: [<ffffffffa0e77450>] ? mdt_lvbo_fill+0x0/0x800 [mdt]
13:06:42: [<ffffffffa07c6591>] ldlm_handle_enqueue0+0x641/0x1080 [ptlrpc]
13:06:42: [<ffffffffa0e4e376>] mdt_enqueue+0x46/0x110 [mdt]
13:06:43: [<ffffffffa0e42fb8>] mdt_handle_common+0x628/0x1620 [mdt]
13:06:43: [<ffffffffa0e7a5c5>] mds_regular_handle+0x15/0x20 [mdt]
13:06:43: [<ffffffffa07f800c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
13:06:43: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
13:06:43: [<ffffffffa07ef739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
13:06:43: [<ffffffff81052223>] ? __wake_up+0x53/0x70
13:06:43: [<ffffffffa07f9556>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:43: [<ffffffff8100c0ca>] child_rip+0xa/0x20
13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:43: [<ffffffffa07f89e0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
13:06:43: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
13:06:43:
13:06:43:Kernel panic - not syncing: LBUG


 Comments   
Comment by Hongchao Zhang [ 18/Feb/13 ]

this bug can be mitigated greatly by the patches(http://review.whamcloud.com/#change,5323, http://review.whamcloud.com/#change,5444) in LU-2748.
and I am also creating a patch to let "mdt_lvbo_fill" to find the corresponding "lu_env" in the current ptlrpc_thread to eliminate the need to
allocate a new "lu_env".

Comment by Hongchao Zhang [ 21/Feb/13 ]

the additional patch is tracked at http://review.whamcloud.com/#change,5497

Comment by Hongchao Zhang [ 18/Mar/13 ]

status update:

the patch which adding "lu_env" to "LDLM" modules is not ready yet for its complexity, and it affects lots of funtions.
it's better to test&merge the patches in LU-2748, then this ticket can be dropped as a blocker.

Comment by Alex Zhuravlev [ 18/Mar/13 ]

we're not allocating large chunks for osd_thread_info anymore, so this isn't a blocker already.

Comment by Hongchao Zhang [ 27/Mar/13 ]

the patch passing the "env" into ldlm module has been pushed to Gerrit.

Comment by Oleg Drokin [ 23/Jul/13 ]

I am now frequently hitting this again while running sanity in a loop, this didi not happen bfore in this sort of a test on same config node, so we probably need to take another look at this?

<4>[13497.371686] mdt00_004: page allocation failure. order:1, mode:0x40
<4>[13497.372009] Pid: 11831, comm: mdt00_004 Not tainted 2.6.32-rhe6.4-debug #2
<4>[13497.372016] Call Trace:
<4>[13497.372016]  [<ffffffff8112a666>] ? __alloc_pages_nodemask+0x7c6/0x980
<4>[13497.372016]  [<ffffffff814fe0ee>] ? _spin_unlock+0xe/0x10
<4>[13497.372016]  [<ffffffff811658f2>] ? kmem_getpages+0x62/0x170
<4>[13497.372016]  [<ffffffff8116834a>] ? fallback_alloc+0x1ba/0x270
<4>[13497.372016]  [<ffffffff81167bf7>] ? cache_grow+0x4d7/0x520
<4>[13497.372016]  [<ffffffff81168038>] ? ____cache_alloc_node+0xa8/0x200
<4>[13497.372016]  [<ffffffff81168943>] ? kmem_cache_alloc_trace+0x1c3/0x250
<4>[13497.372016]  [<ffffffffa0b4f2a5>] ? osd_key_init+0x25/0x4e0 [osd_ldiskfs]
<4>[13497.372016]  [<ffffffffa0b4f2a5>] ? osd_key_init+0x25/0x4e0 [osd_ldiskfs]
<4>[13497.372016]  [<ffffffffa04f8f8f>] ? keys_fill+0x6f/0x190 [obdclass]
<4>[13497.372016]  [<ffffffffa04fd00e>] ? lu_context_init+0x4e/0x240 [obdclass]
<4>[13497.372016]  [<ffffffffa04fd063>] ? lu_context_init+0xa3/0x240 [obdclass]
<4>[13497.372016]  [<ffffffffa04fd21e>] ? lu_env_init+0x1e/0x30 [obdclass]
<4>[13497.372016]  [<ffffffffa0bf131b>] ? mdt_lvbo_fill+0x1ab/0x840 [mdt]
<4>[13497.372016]  [<ffffffffa0bf1170>] ? mdt_lvbo_fill+0x0/0x840 [mdt]
<4>[13497.372016]  [<ffffffffa0664033>] ? ldlm_handle_enqueue0+0x623/0x10c0 [ptlrpc]
<4>[13497.372016]  [<ffffffffa0bb6a96>] ? mdt_enqueue+0x46/0xe0 [mdt]
<4>[13497.372016]  [<ffffffffa0bbcc57>] ? mdt_handle_common+0x647/0x16d0 [mdt]
<4>[13497.372016]  [<ffffffffa0bf54a5>] ? mds_regular_handle+0x15/0x20 [mdt]
<4>[13497.372016]  [<ffffffffa0693898>] ? ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc]
<4>[13497.372016]  [<ffffffffa0a2155e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>[13497.372016]  [<ffffffffa0a32b9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4>[13497.372016]  [<ffffffffa068afb1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
<4>[13497.372016]  [<ffffffff81054613>] ? __wake_up+0x53/0x70
<4>[13497.372016]  [<ffffffffa0694b82>] ? ptlrpc_main+0xa22/0x1650 [ptlrpc]
<4>[13497.372016]  [<ffffffffa0694160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
<4>[13497.372016]  [<ffffffff81094606>] ? kthread+0x96/0xa0
<4>[13497.372016]  [<ffffffff8100c10a>] ? child_rip+0xa/0x20
<4>[13497.372016]  [<ffffffff81094570>] ? kthread+0x0/0xa0
<4>[13497.372016]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
<6>[13497.372016] Mem-Info:
<4>[13497.372016] Node 0 DMA per-cpu:
<4>[13497.372016] CPU    0: hi:    0, btch:   1 usd:   0
<4>[13497.385289] CPU    1: hi:    0, btch:   1 usd:   0
<4>[13497.385289] CPU    2: hi:    0, btch:   1 usd:   0
<4>[13497.385289] CPU    3: hi:    0, btch:   1 usd:   0
<4>[13497.385289] Node 0 DMA32 per-cpu:
<4>[13497.385289] CPU    0: hi:  186, btch:  31 usd:  26
<4>[13497.385289] CPU    1: hi:  186, btch:  31 usd:  75
<4>[13497.385289] CPU    2: hi:  186, btch:  31 usd: 177
<4>[13497.385289] CPU    3: hi:  186, btch:  31 usd: 170
<4>[13497.385289] active_anon:83027 inactive_anon:83006 isolated_anon:0
<4>[13497.385289]  active_file:31104 inactive_file:29174 isolated_file:0
<4>[13497.385289]  unevictable:0 dirty:59 writeback:55 unstable:0
<4>[13497.385289]  free:13805 slab_reclaimable:79792 slab_unreclaimable:363470
<4>[13497.385289]  mapped:669 shmem:163871 pagetables:341 bounce:0
<4>[13497.388538] Node 0 DMA free:9700kB min:136kB low:168kB high:204kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:9312kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
<4>[13497.389526] lowmem_reserve[]: 0 2967 2967 2967
<4>[13497.389526] Node 0 DMA32 free:45520kB min:44916kB low:56144kB high:67372kB active_anon:332000kB inactive_anon:332388kB active_file:124252kB inactive_file:116696kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3039080kB mlocked:0kB dirty:236kB writeback:220kB mapped:2676kB shmem:655484kB slab_reclaimable:319168kB slab_unreclaimable:1453880kB kernel_stack:2640kB pagetables:1364kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:861 all_unreclaimable? no
<4>[13497.389526] lowmem_reserve[]: 0 0 0 0
<4>[13497.389526] Node 0 DMA: 3*4kB 1*8kB 3*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 2*4096kB = 9700kB
<4>[13497.389526] Node 0 DMA32: 10258*4kB 18*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 45272kB
<4>[13497.389526] 194553 total pagecache pages
<4>[13497.389526] 0 pages in swap cache
<4>[13497.389526] Swap cache stats: add 1550, delete 1550, find 434/448
<4>[13497.389526] Free swap  = 2091392kB
<4>[13497.389526] Total swap = 2097144kB
<6>[13497.389526] 774397 pages RAM
<6>[13497.389526] 38457 pages reserved
<6>[13497.389526] 19663 pages shared
<6>[13497.389526] 697781 pages non-shared
<0>[13497.402539] LustreError: 11831:0:(mdt_lvb.c:125:mdt_lvbo_fill()) ASSERTION( rc == 0 ) failed: 
<0>[13497.403365] LustreError: 11831:0:(mdt_lvb.c:125:mdt_lvbo_fill()) LBUG
<4>[13497.403820] Pid: 11831, comm: mdt00_004
<4>[13497.404356] 
<4>[13497.404356] Call Trace:
<4>[13497.405032]  [<ffffffffa0a208a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[13497.405510]  [<ffffffffa0a20ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[13497.405979]  [<ffffffffa0bf18d4>] mdt_lvbo_fill+0x764/0x840 [mdt]
<4>[13497.406426]  [<ffffffffa0bf1170>] ? mdt_lvbo_fill+0x0/0x840 [mdt]
<4>[13497.406735]  [<ffffffffa0664033>] ldlm_handle_enqueue0+0x623/0x10c0 [ptlrpc]
<4>[13497.407043]  [<ffffffffa0bb6a96>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[13497.407317]  [<ffffffffa0bbcc57>] mdt_handle_common+0x647/0x16d0 [mdt]
<4>[13497.407604]  [<ffffffffa0bf54a5>] mds_regular_handle+0x15/0x20 [mdt]
<4>[13497.407908]  [<ffffffffa0693898>] ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc]
<4>[13497.408502]  [<ffffffffa0a2155e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>[13497.408814]  [<ffffffffa0a32b9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4>[13497.409138]  [<ffffffffa068afb1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
<4>[13497.409425]  [<ffffffff81054613>] ? __wake_up+0x53/0x70
<4>[13497.409706]  [<ffffffffa0694b82>] ptlrpc_main+0xa22/0x1650 [ptlrpc]
<4>[13497.410010]  [<ffffffffa0694160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
<4>[13497.410288]  [<ffffffff81094606>] kthread+0x96/0xa0
<4>[13497.410537]  [<ffffffff8100c10a>] child_rip+0xa/0x20
<4>[13497.410790]  [<ffffffff81094570>] ? kthread+0x0/0xa0
<4>[13497.411045]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
<4>[13497.411299] 
<0>[13497.681303] Kernel panic - not syncing: LBUG
Comment by Alex Zhuravlev [ 23/Jul/13 ]

I think this can stop to happen once we start to fetch the layouts with explicit getxattr..

Comment by Oleg Drokin [ 21/Jan/14 ]

Because I am really tired of hittign this issue, here's a patch to at least making mdt not crash when it happens:
http://review.whamcloud.com/8947

Comment by Peter Jones [ 11/Feb/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:28:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.