
[LU-2791] Stuck client on server OOM/lost message

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 3
    • 6757

    Description

      After the recent landing of LU-1431 and the corresponding spate of allocation failures due to LU-2748, my sanity test hung like this:

      [ 7441.448176] Lustre: DEBUG MARKER: == sanity test 124b: lru resize (performance test) ========================= 02:32:43 (1360481563)
      ...
      [ 7511.366772] ll_ost00_008: page allocation failure. order:5, mode:0x50
      [ 7511.367282] Pid: 23333, comm: ll_ost00_008 Not tainted 2.6.32-debug #6
      [ 7511.367763] Call Trace:
      [ 7511.368120]  [<ffffffff81125bd6>] ? __alloc_pages_nodemask+0x976/0x9e0
      [ 7511.368632]  [<ffffffff81160a62>] ? kmem_getpages+0x62/0x170
      [ 7511.369397]  [<ffffffff8116349c>] ? fallback_alloc+0x1bc/0x270
      [ 7511.389365]  [<ffffffff81162db7>] ? cache_grow+0x4d7/0x520
      [ 7511.389816]  [<ffffffff81163188>] ? ____cache_alloc_node+0xa8/0x200
      [ 7511.390278]  [<ffffffff81163838>] ? __kmalloc+0x208/0x2a0
      [ 7511.390737]  [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
      [ 7511.391245]  [<ffffffffa04fcc00>] ? cfs_alloc+0x30/0x60 [libcfs]
      [ 7511.391707]  [<ffffffffa0ac953e>] ? osd_key_init+0x1e/0x5d0 [osd_ldiskfs]
      [ 7511.404245]  [<ffffffffa062a3df>] ? keys_fill+0x6f/0x190 [obdclass]
      [ 7511.404941]  [<ffffffffa062de8b>] ? lu_context_init+0xab/0x260 [obdclass]
      [ 7511.405579]  [<ffffffffa062e05e>] ? lu_env_init+0x1e/0x30 [obdclass]
      [ 7511.406197]  [<ffffffffa0d4bf27>] ? ofd_lvbo_init+0x137/0x8e0 [ofd]
      [ 7511.419335]  [<ffffffffa04fcb12>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
      [ 7511.419900]  [<ffffffffa0767763>] ? ldlm_resource_get+0x383/0x730 [ptlrpc]
      [ 7511.420407]  [<ffffffffa07617a5>] ? ldlm_lock_create+0x55/0xa50 [ptlrpc]
      [ 7511.420966]  [<ffffffffa0786ae6>] ? ldlm_handle_enqueue0+0x156/0x1090 [ptlrpc]
      [ 7511.443020]  [<ffffffffa0787a86>] ? ldlm_handle_enqueue+0x66/0x70 [ptlrpc]
      [ 7511.443583]  [<ffffffffa0787a90>] ? ldlm_server_completion_ast+0x0/0x640 [ptlrpc]
      [ 7511.459402]  [<ffffffffa0b2b8b0>] ? ost_blocking_ast+0x0/0xca0 [ost]
      [ 7511.459950]  [<ffffffffa07843c0>] ? ldlm_server_glimpse_ast+0x0/0x3b0 [ptlrpc]
      [ 7511.478957]  [<ffffffffa0b33807>] ? ost_handle+0x1be7/0x4590 [ost]
      [ 7511.480742]  [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 7511.481280]  [<ffffffffa07b6e03>] ? ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
      [ 7511.486155]  [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 7511.486813]  [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 7511.498204]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 7511.498685]  [<ffffffffa07b98cd>] ? ptlrpc_main+0xafd/0x17f0 [ptlrpc]
      [ 7511.499220]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.499684]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [ 7511.500158]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.500689]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7511.501159]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      ...
      [ 7511.543594] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 48519: rc -12
      [ 7511.544499] LustreError: 23333:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 92 previous similar messages
      [ 7511.546166] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880048126db0/0x3fda232370b59bd7 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a179 expref: -99 pid: 3863 timeout: 0 lvb_type: 1
      [ 7511.551229] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880098de4db0/0x3fda232370b5a187 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a18e expref: -99 pid: 3860 timeout: 0 lvb_type: 1
      [ 7512.986352] LustreError: 21397:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880091bdddb0/0x3fda232370b5a221 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a228 expref: -99 pid: 3866 timeout: 0 lvb_type: 1
      ...
      [ 7550.196668] Lustre: DEBUG MARKER: == sanity test 132: som avoids glimpse rpc == 02:34:32 (1360481672)
      [ 7552.368263] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [ 7552.849425] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [ 7553.369267] Lustre: Setting parameter lustre-MDT0000.mdt.som in log lustre-MDT0000
      [ 7553.370141] Lustre: Skipped 3 previous similar messages
      [ 7555.706750] LustreError: 9584:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace lustre-OST0001-osc-ffff8800145e1bf0 resource refcount nonzero (3) after lock cleanup; forcing cleanup.
      [ 7555.708100] LustreError: 9584:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 3)
      [ 7555.709025] LustreError: 9584:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 4)
      [ 7557.280133] LNet: Service thread pid 22395 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 7557.281500] Pid: 22395, comm: ll_ost00_006
      [ 7557.281907] 
      [ 7557.281908] Call Trace:
      [ 7557.282616]  [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
      [ 7557.283088]  [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0
      [ 7557.283548]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
      [ 7557.284065]  [<ffffffffa04fc771>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [ 7557.284586]  [<ffffffffa077fd0d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc]
      [ 7557.285113]  [<ffffffffa077b440>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      [ 7557.285846]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7557.286341]  [<ffffffffa077f458>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      [ 7557.286875]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7557.287604]  [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      [ 7557.288146]  [<ffffffffa0d32610>] ofd_destroy_by_fid+0x160/0x380 [ofd]
      [ 7557.288718]  [<ffffffffa077e1d0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
      [ 7557.289296]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7557.289829]  [<ffffffffa07a77a5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      [ 7557.290312]  [<ffffffffa0d33b37>] ofd_destroy+0x187/0x670 [ofd]
      [ 7557.290751]  [<ffffffffa0b35732>] ost_handle+0x3b12/0x4590 [ost]
      [ 7557.291301]  [<ffffffffa0508204>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 7557.291849]  [<ffffffffa07b6e03>] ptlrpc_server_handle_request+0x453/0xe50 [ptlrpc]
      [ 7557.292785]  [<ffffffffa04fc65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 7557.293433]  [<ffffffffa07afe91>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 7557.293961]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7557.294452]  [<ffffffffa07b98cd>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
      [ 7557.294939]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.295390]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 7557.295928]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.296475]  [<ffffffffa07b8dd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
      [ 7557.296963]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 7557.298274] 
      [ 7557.298613] LustreError: dumping log to /tmp/lustre-log.1360481679.22395
      [ 7558.542139] Lustre: e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 is already connected, SOM will be enabled on the next mount
      [ 7604.468836] Lustre: lustre-OST0001: haven't heard from client e4b81d30-e30c-d5e2-eae1-83e5d43c85e8 (at 0@lo) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800b3070bf0, cur 1360481727 expire 1360481697 last 1360481675
      [ 7604.470498] LustreError: 22092:0:(ofd_lvb.c:145:ofd_lvbo_update()) lustre-OST0001: no lvb when running lvbo_update, res: 48519!
      [ 7604.480091] LNet: Service thread pid 22395 completed after 87.19s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 7680.616144] INFO: task ll_cfg_requeue:22093 blocked for more than 120 seconds.
      [ 7680.616937] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7680.617711] ll_cfg_requeu D 0000000000000000  3920 22093      2 0x00000000
      [ 7680.618207]  ffff880064355ad8 0000000000000046 0000000000000000 0000000000000018
      [ 7680.619051]  0000000000000002 0000000064355a90 ffff88002266d000 0000000064355a88
      [ 7680.619894]  ffff8800492c08b8 ffff880064355fd8 000000000000fba8 ffff8800492c08b8
      [ 7680.620688] Call Trace:
      [ 7680.621034]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
      [ 7680.621497]  [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
      [ 7680.622011]  [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
      [ 7680.622495]  [<ffffffff814f9ec7>] ? down_read+0x37/0x40
      [ 7680.622970]  [<ffffffffa07e1e8e>] sptlrpc_conf_client_adapt+0x6e/0x240 [ptlrpc]
      [ 7680.623777]  [<ffffffffa09d09d8>] osc_set_info_async+0x5d8/0x940 [osc]
      [ 7680.625332]  [<ffffffffa05e60b0>] class_notify_sptlrpc_conf+0x230/0x4f0 [obdclass]
      [ 7680.626146]  [<ffffffffa0b73671>] mgc_process_cfg_log+0x2e1/0x14c0 [mgc]
      [ 7680.626651]  [<ffffffffa05fb217>] ? class_handle2object+0x97/0x170 [obdclass]
      [ 7680.627169]  [<ffffffffa0b74cb3>] mgc_process_log+0x463/0x1330 [mgc]
      [ 7680.627668]  [<ffffffffa04fcbce>] ? cfs_free+0xe/0x10 [libcfs]
      [ 7680.629329]  [<ffffffffa0b6fa40>] ? mgc_blocking_ast+0x0/0x770 [mgc]
      [ 7680.629868]  [<ffffffffa077f830>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
      [ 7680.630356]  [<ffffffffa0b766d8>] mgc_requeue_thread+0x348/0x790 [mgc]
      [ 7680.630832]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [ 7680.631299]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.631774]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 7680.632245]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.632707]  [<ffffffffa0b76390>] ? mgc_requeue_thread+0x0/0x790 [mgc]
      [ 7680.633176]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      and the ll_cfg_requeue thread never recovers, repeating this trace forever.
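
      For reference, the order:5 failure in osd_key_init() above means the slab needed 32 physically contiguous pages, which fragmentation on a busy OSS can easily defeat. Below is a minimal sketch of the usual mitigation, falling back to vmalloc() for large buffers (newer kernels wrap this as kvmalloc()/kvfree()); the helper names are hypothetical and this is not the actual LU-2791 patch:

      /*
       * Illustrative only: try the slab first, quietly, and fall back to a
       * physically non-contiguous vmalloc() allocation if contiguous pages
       * are not available.  The open-coded form also works on older kernels
       * such as the 2.6.32-debug kernel used in this run.
       */
      #include <linux/slab.h>
      #include <linux/vmalloc.h>
      #include <linux/mm.h>

      static void *alloc_large(size_t size)
      {
              /* Do not warn or retry aggressively; a miss here is expected. */
              void *ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);

              if (ptr == NULL)
                      /* Fallback that does not need contiguous pages. */
                      ptr = vmalloc(size);
              return ptr;
      }

      static void free_large(void *ptr)
      {
              if (is_vmalloc_addr(ptr))
                      vfree(ptr);
              else
                      kfree(ptr);
      }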

      Attachments

        Issue Links

          Activity

            [LU-2791] Stuck client on server OOM/lost message
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.4.0 [ 10154 ]
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.4

            yong.fan nasf (Inactive) added a comment -

            This is the patch: http://review.whamcloud.com/#change,5634

            yong.fan nasf (Inactive) added a comment -

            The failure for zero-sized ext lock "lvb" is another instance of LU-2790.

            ====================
            [ 7511.546166] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880048126db0/0x3fda232370b59bd7 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a179 expref: -99 pid: 3863 timeout: 0 lvb_type: 1
            [ 7511.551229] LustreError: 21396:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880098de4db0/0x3fda232370b5a187 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a18e expref: -99 pid: 3860 timeout: 0 lvb_type: 1
            [ 7512.986352] LustreError: 21397:0:(ldlm_lock.c:1542:ldlm_fill_lvb()) ### Replied unexpected ost LVB size 0 ns: lustre-OST0001-osc-ffff8800145e1bf0 lock: ffff880091bdddb0/0x3fda232370b5a221 lrc: 6/2,0 mode: --/PR res: 48519/9663677440 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x3fda232370b5a228 expref: -99 pid: 3866 timeout: 0 lvb_type: 1
            ...
            [ 7555.706750] LustreError: 9584:0:(ldlm_resource.c:805:ldlm_resource_complain()) Namespace lustre-OST0001-osc-ffff8800145e1bf0 resource refcount nonzero (3) after lock cleanup; forcing cleanup.
            [ 7555.708100] LustreError: 9584:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 3)
            [ 7555.709025] LustreError: 9584:0:(ldlm_resource.c:1404:ldlm_resource_dump()) --- Resource: ffff88000f3c6e78 (48519/9663677440/0/0) (rc: 4)
            ...
            ====================

            As shown in the log, three failed locks were left on the client, which blocked the client umount thread in test_132: client_disconnect_export() => ldlm_namespace_free_prior() => __ldlm_namespace_free()

            The blocked umount thread holds the semaphore client_obd::cl_sem via down_write(&cli->cl_sem); the ll_cfg_requeue thread needs to take the same semaphore for read, so it is blocked as well.
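
            To make the blocking relationship concrete, here is a minimal userspace sketch of the same pattern, with a pthread rwlock standing in for client_obd::cl_sem; the thread names and layout are illustrative only, not the actual Lustre code. The "umount" writer takes the lock and then waits on cleanup that never completes, so the "ll_cfg_requeue" reader blocks forever on its read lock:

            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_rwlock_t cl_sem = PTHREAD_RWLOCK_INITIALIZER;

            /* Stand-in for the umount path: takes cl_sem for write, then
             * waits on lock cleanup that never finishes (the lost reply). */
            static void *umount_thread(void *arg)
            {
                    (void)arg;
                    pthread_rwlock_wrlock(&cl_sem);
                    printf("umount: holding cl_sem for write, waiting on cleanup...\n");
                    pause();                        /* never returns */
                    pthread_rwlock_unlock(&cl_sem);
                    return NULL;
            }

            /* Stand-in for ll_cfg_requeue: needs cl_sem for read and hangs. */
            static void *requeue_thread(void *arg)
            {
                    (void)arg;
                    sleep(1);
                    printf("ll_cfg_requeue: trying to take cl_sem for read...\n");
                    pthread_rwlock_rdlock(&cl_sem); /* blocks forever */
                    printf("ll_cfg_requeue: got it (never printed)\n");
                    pthread_rwlock_unlock(&cl_sem);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t a, b;

                    pthread_create(&a, NULL, umount_thread, NULL);
                    pthread_create(&b, NULL, requeue_thread, NULL);
                    sleep(3);
                    printf("main: reader still blocked, mirroring the hung-task trace above\n");
                    return 0;
            }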

            yong.fan nasf (Inactive) made changes -
            Status Original: Open [ 1 ] New: In Progress [ 3 ]
            pjones Peter Jones made changes -
            Assignee Original: Jinshan Xiong [ jay ] New: nasf [ yong.fan ]
            pjones Peter Jones added a comment -

            Fanyong, could you please look into this one? Thanks, Peter

            jlevi Jodi Levi (Inactive) made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Jinshan Xiong [ jay ]
            Labels New: LB
            Priority Original: Critical [ 2 ] New: Blocker [ 1 ]

            Jinshan,
            Could you have a look at this one and assign it to Fan Yong if he needs to take it? He is on holiday this week, so we wanted to start with you.

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-1431 [ LU-1431 ]

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: