Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Fix Version/s: Lustre 2.4.0
- Severity: 3
- Rank (Obsolete): 6757
Description
After the recent landing of LU-1431, and the resulting wave of allocation failures due to LU-2748, my sanity test hung like this:
[ 7441.448176] Lustre: DEBUG MARKER: == sanity test 124b: lru resize (performance test) ========================= 02:32:43 (1360481563)
...
[ 7511.366772] ll_ost00_008: page allocation failure. order:5, mode:0x50
[ 7511.367282] Pid: 23333, comm: ll_ost00_008 Not tainted 2.6.32-debug #6
[ 7511.367763] Call Trace:
[ 7511.368120] [<ffffffff81125bd6>] ? __alloc_pages_nodemask+0x976/0x9e0
[ 7511.368632] [<ffffffff81160a62>] ? kmem_getpages+0x62/0x170
[ 7511.369397] [<ffffffff8116349c>] ? fallback_alloc+0x1bc/0x270
[ 7511.389365] [<ffffffff81162db7>] ? cache_grow+0x4d7/0x520
[ 7511.389816] [<ffffffff81163188>] ? ____cache_alloc_node+0xa8/0x200
[ 7511.390278] [<ffffffff81163838>] ? __kmalloc+0x208/0x2a0
[ 7511.390737] [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
[ 7511.391245] [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
[ 7511.391707] [<ffffffffa0ac953e>] ? osd_key_init+0x1e/0x5d0 [osd_ldiskfs]
[ 7511.404245] [<ffffffffa062a3df>] ? keys_fill+0x6f/0x190 [obdclass]
[ 7511.404941] [<ffffffffa062de8b>] ? lu_context_init+0xab/0x260 [obdclass]
[ 7511.405579] [<ffffffffa062e05e>] ? lu_env_init+0x1e/0x30 [obdclass]
[ 7511.406197] [<ffffffffa0d4bf27>] ? ofd_lvbo_init+0x137/0x8e0 [ofd]
[ 7511.419335] [<ffffffffa04fcb12>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
[ 7511.419900] [<ffffffffa0767763>] ? ldlm_resource_get+0x383/0x730 [ptlrpc]
[ 7511.420407] [<ffffffffa07617a5>] ? ldlm_lock_create+0x55/0xa50 [ptlrpc]
[ 7511.420966] [<ffffffffa0786ae6>] ? ldlm_handle_enqueue0+0x156/0x1090 [ptlrpc]
[ 7511.443020] [<ffffffffa0787a86>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
[ 7511.443583] [<ffffffffa0787a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
[ 7511.459402] [<ffffffffa0b2b8b0>] ? ost_blocking_ast+0x0/0xca0 [ost]
[ 7511.459950] [<ffffffffa07843c0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
[ 7511.478957] [<ffffffffa0b33807>] ? ost_handle+0x1be7/0x4590 [ost]
[ 7511.480742] [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 7511.481280] [<ffffffffa07b6e03>] ? ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
[ 7511.486155] [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 7511.486813] [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 7511.498204] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 7511.498685] [<ffffffffa07b98cd>] ? ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[ 7511.499220] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7511.499684] [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 7511.500158] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7511.500689] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7511.501159] [<ffffffff8100c140>] ? child_rip+0x0/0x20
...
[ 7511.543594] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 48519: rc -12
[ 7511.544499] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 92 previous similar messages
[ 7511.546166] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880048126db0/0x3fda232370b59bd7 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a179 expref: -99 pid: 3863 timeout: 0 lvb_type: 1
[ 7511.551229] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880098de4db0/0x3fda232370b5a187 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a18e expref: -99 pid: 3860 timeout: 0 lvb_type: 1
[ 7512.986352] LustreError: 21397:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880091bdddb0/0x3fda232370b5a221 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a228 expref: -99 pid: 3866 timeout: 0 lvb_type: 1
...
[ 7550.196668] Lustre: DEBUG MARKER: == sanity test 132: som avoids glimpse rpc == 02:34:32 (1360481672)
[ 7552.368263] Lustre: DEBUG MARKER: cancel_lru_locks osc start
[ 7552.849425] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
[ 7553.369267] Lustre: Setting parameter lustre-MDT0000.mdt.som in log lustre-MDT0000
[ 7553.370141] Lustre: Skipped 3 previous similar messages
[ 7555.706750] LustreError: 9584:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace lustre-OST0001-osc-ffff8800145e1bf0 resource refcount nonzero (3) after lock cleanup; forcing cleanup.
[ 7555.708100] LustreError: 9584:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 3)
[ 7555.709025] LustreError: 9584:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 4)
[ 7557.280133] LNet: Service thread pid 22395 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 7557.281500] Pid: 22395, comm: ll_ost00_006
[ 7557.281907]
[ 7557.281908] Call Trace:
[ 7557.282616] [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
[ 7557.283088] [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0
[ 7557.283548] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
[ 7557.284065] [<ffffffffa04fc771>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[ 7557.284586] [<ffffffffa077fd0d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc]
[ 7557.285113] [<ffffffffa077b440>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
[ 7557.285846] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 7557.286341] [<ffffffffa077f458>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
[ 7557.286875] [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[ 7557.287604] [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[ 7557.288146] [<ffffffffa0d32610>] ofd_destroy_by_fid+0x160/0x380 [ofd]
[ 7557.288718] [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
[ 7557.289296] [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[ 7557.289829] [<ffffffffa07a77a5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[ 7557.290312] [<ffffffffa0d33b37>] ofd_destroy+0x187/0x670 [ofd]
[ 7557.290751] [<ffffffffa0b35732>] ost_handle+0x3b12/0x4590 [ost]
[ 7557.291301] [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 7557.291849] [<ffffffffa07b6e03>] ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
[ 7557.292785] [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 7557.293433] [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 7557.293961] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 7557.294452] [<ffffffffa07b98cd>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[ 7557.294939] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7557.295390] [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 7557.295928] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7557.296475] [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[ 7557.296963] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 7557.298274]
[ 7557.298613] LustreError: dumping log to /tmp/lustre-log.1360481679.22395
[ 7558.542139] Lustre: e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 is already connected, SOM will be enabled on the next mount
[ 7604.468836] Lustre: lustre-OST0001: haven't heard from client e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 (at 0@lo) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800b3070bf0, cur 1360481727 expire 1360481697 last 1360481675
[ 7604.470498] LustreError: 22092:0:(ofd_lvb.c:145:ofd_lvbo_update()) lustre-OST0001: no lvb when running lvbo_update, res: 48519!
[ 7604.480091] LNet: Service thread pid 22395 completed after 87.19s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 7680.616144] INFO: task ll_cfg_requeue:22093 blocked for more than 120 seconds.
[ 7680.616937] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7680.617711] ll_cfg_requeu D 0000000000000000 3920 22093 2 0x00000000
[ 7680.618207] ffff880064355ad8 0000000000000046 0000000000000000 0000000000000018
[ 7680.619051] 0000000000000002 0000000064355a90 ffff88002266d000 0000000064355a88
[ 7680.619894] ffff8800492c08b8 ffff880064355fd8 000000000000fba8 ffff8800492c08b8
[ 7680.620688] Call Trace:
[ 7680.621034] [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
[ 7680.621497] [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
[ 7680.622011] [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
[ 7680.622495] [<ffffffff814f9ec7>] ? down_read+0x37/0x40
[ 7680.622970] [<ffffffffa07e1e8e>] sptlrpc_conf_client_adapt+0x6e/0x240 [ptlrpc]
[ 7680.623777] [<ffffffffa09d09d8>] osc_set_info_async+0x5d8/0x940 [osc]
[ 7680.625332] [<ffffffffa05e60b0>] class_notify_sptlrpc_conf+0x230/0x4f0 [obdclass]
[ 7680.626146] [<ffffffffa0b73671>] mgc_process_cfg_log+0x2e1/0x14c0 [mgc]
[ 7680.626651] [<ffffffffa05fb217>] ? class_handle2object+0x97/0x170 [obdclass]
[ 7680.627169] [<ffffffffa0b74cb3>] mgc_process_log+0x463/0x1330 [mgc]
[ 7680.627668] [<ffffffffa04fcbce>] ? cfs_free+0xe/0x10 [libcfs]
[ 7680.629329] [<ffffffffa0b6fa40>] ? mgc_blocking_ast+0x0/0x770 [mgc]
[ 7680.629868] [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
[ 7680.630356] [<ffffffffa0b766d8>] mgc_requeue_thread+0x348/0x790 [mgc]
[ 7680.630832] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 7680.631299] [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
[ 7680.631774] [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 7680.632245] [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
[ 7680.632707] [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
[ 7680.633176] [<ffffffff8100c140>] ? child_rip+0x0/0x20
The ll_cfg_requeue thread never recovers; it repeats this trace forever.
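
For reference: the order:5 failure at the top is a request for 2^5 = 32 physically contiguous pages (128 KiB with 4 KiB pages), made from osd_key_init() via cfs_alloc(). Once memory is fragmented, a high-order kmalloc like that fails easily, so lvbo_init returns -ENOMEM (the "rc -12" above) and the rest of the cascade follows. The usual mitigation for this class of failure is a vmalloc fallback for large allocations, along the lines of Lustre's OBD_ALLOC_LARGE()/OBD_FREE_LARGE(). A minimal sketch, with hypothetical helper names and an illustrative 4-page threshold (not the actual osd_key_init fix):

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* Illustrative only: try a physically contiguous allocation first and
     * fall back to virtually contiguous memory when that fails, so a
     * fragmented free list cannot turn into an outright -ENOMEM. */
    static void *alloc_large_nofs(size_t size)
    {
            void *ptr;

            /* Small requests: the slab allocator is the cheap common case. */
            if (size <= 4 * PAGE_SIZE)
                    return kzalloc(size, GFP_NOFS);

            /* __GFP_NOWARN: a failure here is expected and handled below. */
            ptr = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
            if (ptr == NULL)
                    /* vmalloc only needs order-0 pages, so it still
                     * succeeds when 32 contiguous pages cannot be found. */
                    ptr = __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
            return ptr;
    }

    static void free_large_nofs(void *ptr)
    {
            /* is_vmalloc_addr() tells us which allocator owns the buffer. */
            if (is_vmalloc_addr(ptr))
                    vfree(ptr);
            else
                    kfree(ptr);
    }

With that pattern the osd thread context allocation would presumably survive the order:5 failure, and lvbo_init would not return -12 in the first place. The separate question this ticket raises is why a failed lvbo_init leaves the resource with a nonzero refcount that the client then cannot clean up.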