LU-2791

Stuck client on server OOM/lost message


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank (Obsolete): 6757

    Description

      After the recent landing of LU-1431, and the corresponding spate of allocation failures due to LU-2748, my sanity test run hung like this:

      [ 7441.448176] Lustre: DEBUG MARKER: == sanity test 124b: lru resize (performance test) ========================= 02:32:43 (1360481563)
      ...
      [ 7511.366772] ll_ost00_008: page allocation failure. order:5, mode:0x50
      [ 7511.367282] Pid: 23333, comm: ll_ost00_008 Not tainted 2.6.32-debug #6
      [ 7511.367763] Call Trace:
      [ 7511.368120]  [<ffffffff81125bd6>] ? __alloc_pages_nodemask+0x976/0x9e0
      [ 7511.368632]  [<ffffffff81160a62>] ? kmem_getpages+0x62/0x170
      [ 7511.369397]  [<ffffffff8116349c>] ? fallback_alloc+0x1bc/0x270
      [ 7511.389365]  [<ffffffff81162db7>] ? cache_grow+0x4d7/0x520
      [ 7511.389816]  [<ffffffff81163188>] ? ____cache_alloc_node+0xa8/0x200
      [ 7511.390278]  [<ffffffff81163838>] ? __kmalloc+0x208/0x2a0
      [ 7511.390737]  [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
      [ 7511.391245]  [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
      [ 7511.391707]  [<ffffffffa0ac953e>] ? osd_key_init+0x1e/0x5d0 [osd_ldiskfs]
      [ 7511.404245]  [<ffffffffa062a3df>] ? keys_fill+0x6f/0x190 [obdclass]
      [ 7511.404941]  [<ffffffffa062de8b>] ? lu_context_init+0xab/0x260 [obdclass]
      [ 7511.405579]  [<ffffffffa062e05e>] ? lu_env_init+0x1e/0x30 [obdclass]
      [ 7511.406197]  [<ffffffffa0d4bf27>] ? ofd_lvbo_init+0x137/0x8e0 [ofd]
      [ 7511.419335]  [<ffffffffa04fcb12>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
      [ 7511.419900]  [<ffffffffa0767763>] ? ldlm_resource_get+0x383/0x730 [ptlrpc]
      [ 7511.420407]  [<ffffffffa07617a5>] ? ldlm_lock_create+0x55/0xa50 [ptlrpc]
      [ 7511.420966]  [<ffffffffa0786ae6>] ? ldlm_handle_enqueue0+0x156/0x1090 [ptlrpc]
      [ 7511.443020]  [<ffffffffa0787a86>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
      [ 7511.443583]  [<ffffffffa0787a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
      [ 7511.459402]  [<ffffffffa0b2b8b0>] ? ost_blocking_ast+0x0/0xca0 [ost]
      [ 7511.459950]  [<ffffffffa07843c0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
      [ 7511.478957]  [<ffffffffa0b33807>] ? ost_handle+0x1be7/0x4590 [ost]
      [ 7511.480742]  [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 7511.481280]  [<ffffffffa07b6e03>] ? ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
      [ 7511.486155]  [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 7511.486813]  [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 7511.498204]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 7511.498685]  [<ffffffffa07b98cd>] ? ptlrpc_main+0xafd/0x17f0 [ptlrpc]
      [ 7511.499220]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.499684]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [ 7511.500158]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.500689]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.501159]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      ...
      [ 7511.543594] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 48519: rc -12
      [ 7511.544499] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 92 previous similar messages
      [ 7511.546166] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880048126db0/0x3fda232370b59bd7 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a179 expref: -99 pid: 3863 timeout: 0 lvb_type: 1
      [ 7511.551229] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880098de4db0/0x3fda232370b5a187 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a18e expref: -99 pid: 3860 timeout: 0 lvb_type: 1
      [ 7512.986352] LustreError: 21397:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880091bdddb0/0x3fda232370b5a221 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a228 expref: -99 pid: 3866 timeout: 0 lvb_type: 1
      ...
      [ 7550.196668] Lustre: DEBUG MARKER: == sanity test 132: som avoids glimpse rpc == 02:34:32 (1360481672)
      [ 7552.368263] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [ 7552.849425] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [ 7553.369267] Lustre: Setting parameter lustre-MDT0000.mdt.som in log lustre-MDT0000
      [ 7553.370141] Lustre: Skipped 3 previous similar messages
      [ 7555.706750] LustreError: 9584:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace lustre-OST0001-osc-ffff8800145e1bf0 resource refcount nonzero (3) after lock cleanup; forcing cleanup.
      [ 7555.708100] LustreError: 9584:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 3)
      [ 7555.709025] LustreError: 9584:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 4)
      [ 7557.280133] LNet: Service thread pid 22395 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 7557.281500] Pid: 22395, comm: ll_ost00_006
      [ 7557.281907] 
      [ 7557.281908] Call Trace:
      [ 7557.282616]  [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
      [ 7557.283088]  [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0
      [ 7557.283548]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
      [ 7557.284065]  [<ffffffffa04fc771>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [ 7557.284586]  [<ffffffffa077fd0d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc]
      [ 7557.285113]  [<ffffffffa077b440>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      [ 7557.285846]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7557.286341]  [<ffffffffa077f458>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      [ 7557.286875]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7557.287604]  [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      [ 7557.288146]  [<ffffffffa0d32610>] ofd_destroy_by_fid+0x160/0x380 [ofd]
      [ 7557.288718]  [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      [ 7557.289296]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7557.289829]  [<ffffffffa07a77a5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      [ 7557.290312]  [<ffffffffa0d33b37>] ofd_destroy+0x187/0x670 [ofd]
      [ 7557.290751]  [<ffffffffa0b35732>] ost_handle+0x3b12/0x4590 [ost]
      [ 7557.291301]  [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 7557.291849]  [<ffffffffa07b6e03>] ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
      [ 7557.292785]  [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 7557.293433]  [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 7557.293961]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7557.294452]  [<ffffffffa07b98cd>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
      [ 7557.294939]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.295390]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 7557.295928]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.296475]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.296963]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 7557.298274] 
      [ 7557.298613] LustreError: dumping log to /tmp/lustre-log.1360481679.22395
      [ 7558.542139] Lustre: e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 is already connected, SOM will be enabled on the next mount
      [ 7604.468836] Lustre: lustre-OST0001: haven't heard from client e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 (at 0@lo) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800b3070bf0, cur 1360481727 expire 1360481697 last 1360481675
      [ 7604.470498] LustreError: 22092:0:(ofd_lvb.c:145:ofd_lvbo_update()) lustre-OST0001: no lvb when running lvbo_update, res: 48519!
      [ 7604.480091] LNet: Service thread pid 22395 completed after 87.19s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 7680.616144] INFO: task ll_cfg_requeue:22093 blocked for more than 120 seconds.
      [ 7680.616937] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7680.617711] ll_cfg_requeu D 0000000000000000  3920 22093      2 0x00000000
      [ 7680.618207]  ffff880064355ad8 0000000000000046 0000000000000000 0000000000000018
      [ 7680.619051]  0000000000000002 0000000064355a90 ffff88002266d000 0000000064355a88
      [ 7680.619894]  ffff8800492c08b8 ffff880064355fd8 000000000000fba8 ffff8800492c08b8
      [ 7680.620688] Call Trace:
      [ 7680.621034]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
      [ 7680.621497]  [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
      [ 7680.622011]  [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
      [ 7680.622495]  [<ffffffff814f9ec7>] ? down_read+0x37/0x40
      [ 7680.622970]  [<ffffffffa07e1e8e>] sptlrpc_conf_client_adapt+0x6e/0x240 [ptlrpc]
      [ 7680.623777]  [<ffffffffa09d09d8>] osc_set_info_async+0x5d8/0x940 [osc]
      [ 7680.625332]  [<ffffffffa05e60b0>] class_notify_sptlrpc_conf+0x230/0x4f0 [obdclass]
      [ 7680.626146]  [<ffffffffa0b73671>] mgc_process_cfg_log+0x2e1/0x14c0 [mgc]
      [ 7680.626651]  [<ffffffffa05fb217>] ? class_handle2object+0x97/0x170 [obdclass]
      [ 7680.627169]  [<ffffffffa0b74cb3>] mgc_process_log+0x463/0x1330 [mgc]
      [ 7680.627668]  [<ffffffffa04fcbce>] ? cfs_free+0xe/0x10 [libcfs]
      [ 7680.629329]  [<ffffffffa0b6fa40>] ? mgc_blocking_ast+0x0/0x770 [mgc]
      [ 7680.629868]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7680.630356]  [<ffffffffa0b766d8>] mgc_requeue_thread+0x348/0x790 [mgc]
      [ 7680.630832]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7680.631299]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.631774]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 7680.632245]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.632707]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.633176]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      and the ll_cfg_requeue thread never recovers, repeating this trace forever. The apparent chain: the order-5 allocation failure makes ofd_lvbo_init fail with -ENOMEM (the rc -12 above), the server replies with a zero-size LVB, the client-side enqueue never completes, and the client eventually gets evicted while ll_cfg_requeue stays blocked.
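
      For reference, the usual cure for this class of order-5 kmalloc failure is a vmalloc fallback for large buffers, the pattern Lustre's OBD_ALLOC_LARGE/OBD_FREE_LARGE wrappers follow. A minimal sketch with hypothetical helper names (alloc_large/free_large are made up here, not the actual patch):

          #include <linux/slab.h>
          #include <linux/vmalloc.h>
          #include <linux/mm.h>

          /* Try the slab allocator first; fall back to vmalloc when
           * physically contiguous pages are scarce (the order:5
           * failure in osd_key_init above). */
          static void *alloc_large(size_t size)
          {
                  void *buf;

                  /* __GFP_NOWARN: the fallback handles the failure,
                   * so don't dump an allocation trace to the log. */
                  buf = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
                  if (buf == NULL)
                          buf = vmalloc(size);
                  return buf;
          }

          static void free_large(void *buf)
          {
                  if (is_vmalloc_addr(buf))       /* which allocator? */
                          vfree(buf);
                  else
                          kfree(buf);
          }

      (Note vmalloc() internally allocates with GFP_KERNEL, so a real filesystem-context helper needs more care than this sketch.)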
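      The client half of the problem shows up in the "Replied unexpected ost LVB size 0" lines: the server's lvbo_init failed, so the enqueue reply carries no LVB, and the lock is left half-set-up instead of the enqueue failing outright. A hypothetical sketch (made-up name, not the actual ldlm_fill_lvb code) of the defensive check that would let the client bail out cleanly:

          #include <linux/errno.h>
          #include <linux/string.h>

          /* Validate the LVB carried in an enqueue reply before using it.
           * A zero replied size means the server could not set up the LVB
           * (here: lvbo_init hit -ENOMEM), so fail the enqueue instead of
           * keeping a lock that will never be granted. */
          static int fill_lvb_checked(void *dst, const void *src,
                                      unsigned int replied,
                                      unsigned int expected)
          {
                  if (replied == 0)
                          return -EPROTO;  /* server-side LVB setup failed */
                  if (replied < expected)
                          return -EINVAL;  /* truncated or garbled reply */
                  memcpy(dst, src, expected);
                  return 0;
          }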


            People

              Assignee: nasf (Inactive)
              Reporter: Oleg Drokin
