Lustre / LU-2823

large-scale test_3a hung: LBUG: ASSERTION(ergo(!obd->obd_recovering, diff >= 0)) failed: lustre-OST0001: 1013476 - 1024134 = -10658


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • None
    • Lustre 1.8.9
    • None
    • 3
    • 6834

    Description

      The large-scale test_3a hung as follows:

      1 : Starting failover on mds
      Failing mds on node client-20-ib
      CMD: client-20-ib grep -c /mnt/mds' ' /proc/mounts
      Stopping /mnt/mds (opts:)
      CMD: client-20-ib umount -d /mnt/mds
      CMD: client-20-ib lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      affected facets: mds
      df pid is 29886
      Failover mds to client-20-ib
      05:15:08 (1360934108) waiting for client-20-ib network 900 secs ...
      05:15:08 (1360934108) network interface is UP
      CMD: client-20-ib hostname
      Starting mds: -o user_xattr,acl  /dev/lvm-MDS/P1 /mnt/mds
      CMD: client-20-ib mkdir -p /mnt/mds; mount -t lustre -o user_xattr,acl  /dev/lvm-MDS/P1 /mnt/mds
      CMD: client-20-ib PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/tests:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib64/lustre/utils:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey::/usr/lib64/lustre/tests/mpi NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \" 0xffb7e3ff\" 32 
      client-20-ib: lnet.debug=-1
      client-20-ib: lnet.subsystem_debug=0xffb7e3ff
      client-20-ib: lnet.debug_mb=32
      CMD: client-20-ib e2label /dev/lvm-MDS/P1
      Started lustre-MDT0000
      affected facets: mds
      CMD: client-20-ib PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/tests:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib64/lustre/utils:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey::/usr/lib64/lustre/tests/mpi NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre-MDT0000.recovery_status 200 
      client-20-ib: *.lustre-MDT0000.recovery_status status: RECOVERING
      client-20-ib: Waiting 195 secs for *.lustre-MDT0000.recovery_status recovery done. status: RECOVERING
      client-20-ib: *.lustre-MDT0000.recovery_status status: COMPLETE
      CMD: client-20-ib lctl get_param -n *.lustre-MDT0000.recovery_status
      RECOVERY TIME: NFILES=50000 number of clients: 1  recovery_duration: 2
      

      Console log on the client (client-22-ib) showed the following:

      05:15:13:Lustre: MGC192.168.4.20@o2ib: Connection restored to service MGS using nid 192.168.4.20@o2ib.
      05:15:13:Lustre: Skipped 5 previous similar messages
      05:18:15:INFO: task mdsrate:9347 blocked for more than 120 seconds.
      05:18:15:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      05:18:15:mdsrate       D 0000000000000002     0  9347   9346 0x00000080
      05:18:15: ffff8802f9f71d38 0000000000000082 ffff8802181a4200 0000000000000001
      05:18:15: ffff88029d9e2018 0000000000640c40 ffffffff8100bb8e ffff8802f9f71d38
      05:18:15: ffff880313dcfab8 ffff8802f9f71fd8 000000000000fb88 ffff880313dcfab8
      05:18:15:Call Trace:
      05:18:15: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
      05:18:15: [<ffffffff8104dfbf>] ? mutex_spin_on_owner+0x9f/0xc0
      05:18:15: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
      05:18:15: [<ffffffff8115eb22>] ? kmem_cache_alloc+0x182/0x190
      05:18:15: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
      05:18:15: [<ffffffff81186e4e>] do_filp_open+0x2be/0xd60
      05:18:15: [<ffffffff8104338c>] ? __do_page_fault+0x1ec/0x480
      05:18:15: [<ffffffff811b9cae>] ? ep_poll+0x12e/0x330
      05:18:15: [<ffffffff81193272>] ? alloc_fd+0x92/0x160
      05:18:15: [<ffffffff81173a39>] do_sys_open+0x69/0x140
      05:18:15: [<ffffffff81173b50>] sys_open+0x20/0x30
      05:18:15: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      Console log on the MDS (client-20-ib) showed the following:

      05:15:03:Lustre: DEBUG MARKER: 1 : Starting failover on mds
      05:15:03:Lustre: DEBUG MARKER: grep -c /mnt/mds' ' /proc/mounts
      05:15:03:Lustre: DEBUG MARKER: umount -d /mnt/mds
      05:15:03:Lustre: Failing over lustre-MDT0000
      05:15:03:Lustre: Skipped 17 previous similar messages
      05:15:03:LustreError: 3780:0:(osc_create.c:595:osc_create()) lustre-OST0000-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3782:0:(osc_create.c:595:osc_create()) lustre-OST0001-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3782:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 1/7: rc = -4
      05:15:03:LustreError: 3781:0:(osc_create.c:595:osc_create()) lustre-OST0002-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3782:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0001_UUID failed at mds_lov_clear_orphans: -4
      05:15:03:LustreError: 3782:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0001_UUID sync failed -4, deactivating
      05:15:03:LustreError: 3785:0:(osc_create.c:595:osc_create()) lustre-OST0005-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3786:0:(osc_create.c:595:osc_create()) lustre-OST0006-osc: oscc recovery failed: -4
      05:15:03:LustreError: 6245:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      05:15:03:LustreError: 6245:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Skipped 1 previous similar message
      05:15:03:LustreError: 6245:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      05:15:03:LustreError: 6245:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) Skipped 1 previous similar message
      05:15:03:LustreError: 3780:0:(lov_obd.c:1153:lov_clear_orphans()) error in orphan recovery on OST idx 0/7: rc = -4
      05:15:03:LustreError: 3784:0:(osc_create.c:595:osc_create()) lustre-OST0004-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3784:0:(mds_lov.c:1057:__mds_lov_synchronize()) lustre-OST0004_UUID failed at mds_lov_clear_orphans: -4
      05:15:03:LustreError: 3784:0:(mds_lov.c:1057:__mds_lov_synchronize()) Skipped 3 previous similar messages
      05:15:03:LustreError: 3784:0:(mds_lov.c:1066:__mds_lov_synchronize()) lustre-OST0004_UUID sync failed -4, deactivating
      05:15:03:LustreError: 3784:0:(mds_lov.c:1066:__mds_lov_synchronize()) Skipped 3 previous similar messages
      05:15:03:LustreError: 3783:0:(osc_create.c:595:osc_create()) lustre-OST0003-osc: oscc recovery failed: -4
      05:15:03:LustreError: 3478:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524309: rc = -5
      05:15:03:LustreError: 3478:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      05:15:03:Lustre: 3478:0:(mds_reint.c:257:mds_finish_transno()) commit transaction for disconnected client 3e5acf3c-82ca-17ee-2609-2dd7736ba638: rc -5
      <~snip~>
      05:31:03:Lustre: lustre-OST0001-osc: Connection to service lustre-OST0001 via nid 192.168.4.21@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      05:31:03:LustreError: 11-0: an error occurred while communicating with 192.168.4.21@o2ib. The ost_connect operation failed with -16
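
      For reference, the negative return codes in these logs are negated Linux errno values: -4 is EINTR (operation interrupted), -5 is EIO, -16 is EBUSY, -107 is ENOTCONN, and -108 is ESHUTDOWN. A minimal stand-alone C snippet (not part of Lustre, assuming standard glibc strerror()) to decode them:

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              /* Return codes copied from the console logs in this ticket. */
              int rcs[] = { -4, -5, -16, -107, -108 };
              size_t i;

              for (i = 0; i < sizeof(rcs) / sizeof(rcs[0]); i++)
                      printf("rc = %d: %s\n", rcs[i], strerror(-rcs[i]));

              return 0;
      }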
      

      Console log on the OSS (client-21-ib) showed the following:

      05:14:54:Lustre: DEBUG MARKER: 1 : Starting failover on mds
      05:15:16:Lustre: Service thread pid 14788 completed after 122.91s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      05:15:16:Lustre: Skipped 1 previous similar message
      05:15:16:Lustre: 14830:0:(ldlm_lib.c:820:target_handle_connect()) lustre-OST0000: received new MDS connection from NID 192.168.4.20@o2ib, removing former export from same NID
      05:15:16:Lustre: 14830:0:(ldlm_lib.c:820:target_handle_connect()) Skipped 16 previous similar messages
      05:15:16:Lustre: lustre-OST0000: received MDS connection from 192.168.4.20@o2ib
      05:15:16:Lustre: 14857:0:(filter.c:3129:filter_destroy_precreated()) lustre-OST0002: deleting orphan objects from 1013476 to 1017190, orphan objids won't be reused any more.
      05:15:16:Lustre: Skipped 11 previous similar messages
      05:15:16:LustreError: 11-0: an error occurred while communicating with 192.168.4.20@o2ib. The obd_ping operation failed with -107
      05:15:16:LustreError: 166-1: MGC192.168.4.20@o2ib: Connection to service MGS via nid 192.168.4.20@o2ib was lost; in progress operations using this service will fail.
      05:15:16:Lustre: 9791:0:(import.c:855:ptlrpc_connect_interpret()) MGS@MGC192.168.4.20@o2ib_0 changed server handle from 0x3e8c8b7064aac07f to 0x3e8c8b7064b57220
      05:15:16:Lustre: MGC192.168.4.20@o2ib: Reactivating import
      05:15:27:LustreError: 14877:0:(filter.c:3120:filter_destroy_precreated()) lustre-OST0003: destroy_in_progress already cleared
      05:15:27:Lustre: Service thread pid 14809 completed after 137.70s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      05:15:27:LustreError: 14872:0:(filter.c:3120:filter_destroy_precreated()) lustre-OST0000: destroy_in_progress already cleared
      05:15:27:Lustre: Service thread pid 14787 completed after 140.41s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      05:15:38:LustreError: 14884:0:(filter.c:3120:filter_destroy_precreated()) lustre-OST0004: destroy_in_progress already cleared
      05:15:38:Lustre: Service thread pid 14810 completed after 148.92s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      05:15:38:LustreError: 14865:0:(filter.c:3120:filter_destroy_precreated()) lustre-OST0005: destroy_in_progress already cleared
      05:15:38:LustreError: 17526:0:(filter.c:3234:filter_handle_precreate()) ASSERTION(ergo(!obd->obd_recovering, diff >= 0)) failed: lustre-OST0001: 1013476 - 1024134 = -10658
      05:15:38:LustreError: 17526:0:(filter.c:3234:filter_handle_precreate()) LBUG
      05:15:38:Pid: 17526, comm: ll_ost_creat_04
      05:15:38:
      05:15:38:Call Trace:
      05:15:38: [<ffffffff8afc26a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      05:15:38: [<ffffffff8afc2bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
      05:15:38: [<ffffffff8b4782c6>] filter_create+0x1186/0x17d0 [obdfilter]
      05:15:38: [<ffffffff8afcacfd>] libcfs_debug_vmsg2+0x70d/0x970 [libcfs]
      05:15:38: [<ffffffff8b15cec9>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
      05:15:38: [<ffffffff8b44d801>] ost_handle+0x1281/0x55c0 [ost]
      05:15:38: [<ffffffff8afc7868>] libcfs_ip_addr2str+0x38/0x40 [libcfs]
      05:15:38: [<ffffffff8b168874>] ptlrpc_server_handle_request+0x984/0xe00 [ptlrpc]
      05:15:38: [<ffffffff8b169f16>] ptlrpc_main+0xf16/0x10e0 [ptlrpc]
      05:15:38: [<ffffffff8005dfc1>] child_rip+0xa/0x11
      05:15:38: [<ffffffff8b169000>] ptlrpc_main+0x0/0x10e0 [ptlrpc]
      05:15:38: [<ffffffff8005dfb7>] child_rip+0x0/0x11
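
      The assertion that fired uses Lustre's ergo() helper, where ergo(a, b) means "a implies b" (i.e. !(a) || (b)). The check therefore reads: unless the OST is still in recovery, the difference between the last object id reported by the MDS (1013476) and the last id the OST has actually created (1024134) must not be negative. Here the difference is -10658 with recovery already complete, so the ll_ost_creat thread hits LBUG. Below is a simplified, self-contained sketch of that check; it is illustrative only (everything except the ergo() definition is a made-up name), not the actual filter_handle_precreate() code:

      /*
       * Simplified sketch of the check behind the LBUG above.
       * Not the actual filter.c code; names are illustrative.
       */
      #include <stdio.h>
      #include <stdbool.h>

      /* Lustre's implication helper: ergo(a, b) == "a implies b". */
      #define ergo(a, b) (!(a) || (b))

      struct obd_device_sketch {
              bool obd_recovering;    /* true while the target is in recovery */
      };

      static int precreate_sanity_check(const struct obd_device_sketch *obd,
                                        long long mds_last_id,
                                        long long ost_last_id)
      {
              long long diff = mds_last_id - ost_last_id;

              /* Outside recovery the MDS must never report a last id below
               * what the OST has already created; in the real code this is
               * an LASSERTF() followed by LBUG(), which panics the thread. */
              if (!ergo(!obd->obd_recovering, diff >= 0)) {
                      fprintf(stderr, "ASSERTION failed: %lld - %lld = %lld\n",
                              mds_last_id, ost_last_id, diff);
                      return -1;
              }
              return 0;
      }

      int main(void)
      {
              struct obd_device_sketch ost = { .obd_recovering = false };

              /* Values from the log: the MDS sent 1013476 while the OST had
               * already created objects up to 1024134, so diff = -10658. */
              return precreate_sanity_check(&ost, 1013476, 1024134) ? 1 : 0;
      }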
      

      Maloo report: https://maloo.whamcloud.com/test_sets/bbb3f1bc-779a-11e2-987d-52540035b04c

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Jian Yu
            Votes: 0
            Watchers: 8
