NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956]

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      After the multi-rail branch merge, mount fails with:

      [   80.154928] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956]
      [   80.157527] Kernel panic - not syncing: softlockup: hung tasks
      [   80.158948] CPU: 0 PID: 11956 Comm: mount.lustre Tainted: G           OEL ------------   3.10.0-862.14.4.el7.x86_64 #22
      [   80.161585] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [   80.163751] Call Trace:
      [   80.164430]  <IRQ>  [<ffffffff92c22b0e>] dump_stack+0x19/0x1b
      [   80.165605]  [<ffffffff92c1d7af>] panic+0xe8/0x21f
      [   80.166601]  [<ffffffff9262c838>] ? show_regs+0x58/0x210
      [   80.167670]  [<ffffffff9273a761>] watchdog_timer_fn+0x231/0x240
      [   80.169127]  [<ffffffff9273a530>] ? watchdog+0x40/0x40
      [   80.170551]  [<ffffffff926bdd93>] __hrtimer_run_queues+0xf3/0x270
      [   80.172526]  [<ffffffff926be31f>] hrtimer_interrupt+0xaf/0x1d0
      [   80.173902]  [<ffffffff926573ab>] local_apic_timer_interrupt+0x3b/0x60
      [   80.175189]  [<ffffffff92c38a13>] smp_apic_timer_interrupt+0x43/0x60
      [   80.176421]  [<ffffffff92c352b2>] apic_timer_interrupt+0x162/0x170
      [   80.178102]  <EOI>  [<ffffffffc074db71>] ? lnet_peer_ni_alloc+0x61/0x390 [lnet]
      [   80.180384]  [<ffffffff92703494>] ? __raw_callee_save___pv_queued_spin_unlock+0x10/0x17
      [   80.182187]  [<ffffffffc06c47b8>] cfs_percpt_unlock+0x38/0xb0 [libcfs]
      [   80.183421]  [<ffffffffc0756757>] lnet_discover_peer_locked+0x77/0x3d0 [lnet]
      [   80.184929]  [<ffffffff926bab40>] ? wake_up_atomic_t+0x30/0x30
      [   80.186831]  [<ffffffffc0756b20>] LNetPrimaryNID+0x70/0x1a0 [lnet]
      [   80.188202]  [<ffffffffc0b295ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
      [   80.190003]  [<ffffffffc0b1d94c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
      [   80.192225]  [<ffffffffc0aefcd2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
      [   80.193890]  [<ffffffffc0af1d49>] client_obd_setup+0xd19/0x1430 [ptlrpc]
      [   80.195201]  [<ffffffffc06b994f>] ? cfs_hash_buckets_realloc+0x1bf/0x690 [libcfs]
      [   80.196932]  [<ffffffffc0e85aae>] mgc_setup+0x3e/0x650 [mgc]
      [   80.198397]  [<ffffffffc084259c>] obd_setup+0x15c/0x280 [obdclass]
      [   80.199982]  [<ffffffffc06ba18c>] ? cfs_hash_create+0x36c/0xa20 [libcfs]
      [   80.201576]  [<ffffffffc0843888>] class_setup+0x2a8/0x840 [obdclass]
      [   80.203390]  [<ffffffffc0846b2e>] class_process_config+0x191e/0x2840 [obdclass]
      [   80.205245]  [<ffffffffc0838e92>] ? class_add_uuid+0x282/0x4c0 [obdclass]
      [   80.206614]  [<ffffffffc084ae78>] do_lcfg+0x258/0x500 [obdclass]
      [   80.207822]  [<ffffffffc084f6a8>] lustre_start_simple+0x88/0x210 [obdclass]
      [   80.209462]  [<ffffffffc08504a5>] lustre_start_mgc+0xc75/0x2420 [obdclass]
      [   80.210917]  [<ffffffffc084f6a8>] ? lustre_start_simple+0x88/0x210 [obdclass]
      [   80.213187]  [<ffffffffc087d2eb>] server_fill_super+0xbfb/0x1890 [obdclass]
      [   80.214570]  [<ffffffffc08526b8>] lustre_fill_super+0x328/0x950 [obdclass]
      [   80.216089]  [<ffffffffc0852390>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [   80.218447]  [<ffffffff928100bf>] mount_nodev+0x4f/0xb0
      [   80.219779]  [<ffffffffc084a888>] lustre_mount+0x38/0x60 [obdclass]
      [   80.221244]  [<ffffffff92810c3e>] mount_fs+0x3e/0x1b0
      [   80.222538]  [<ffffffff9282e177>] vfs_kern_mount+0x67/0x110
      [   80.224375]  [<ffffffff9283079f>] do_mount+0x1ef/0xce0
      [   80.226543]  [<ffffffff927e836c>] ? kmem_cache_alloc_trace+0x3c/0x200
      [   80.228959]  [<ffffffff928315d3>] SyS_mount+0x83/0xd0
      [   80.230329]  [<ffffffff92c3429b>] system_call_fastpath+0x22/0x27
      

      Important: this occurs when the system has only one CPU.

          Activity

            [LU-12416] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s Original: Lustre 2.13.0 [ 14290 ]
            pjones Peter Jones made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            tracked under LU-10931

            green Oleg Drokin made changes -
            Link New: This issue duplicates LU-10931 [ LU-10931 ]
            green Oleg Drokin added a comment -

            There's a patch tracked there that should fix this.

            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.13.0 [ 14290 ]
            arshad512 Arshad Hussain made changes -
            Attachment New: soft_lockup_during_mount.png [ 32769 ]

            arshad512 Arshad Hussain added a comment -

            I am also hitting this issue on a single-CPU VM after pulling the latest changes. It fails starting from commit "f9ad0d1"; every commit before that, up to and including deb31c2, works fine.

            1aae733 LU-11297 lnet: MR Routing Feature - Failed (HEAD)
            ...
            00a2932 LU-11297 lnet: handle router health off - Failed
            f9ad0d1 LU-11641 lnet: handle discovery - Failed
            deb31c2 LU-11470 lnet: drop all rule - Pass
            ...
            4344562 LU-11300 lnet: consider alive_router_check_interval - Pass

            Kernel version & Distribution

            # uname -r
            3.10.0-862.9.1.el7_lustre.x86_64

            # cat /etc/redhat-release
            CentOS Linux release 7.5.1804 (Core)
            vsaveliev Vladimir Saveliev created issue -

            People

              Assignee: wc-triage WC Triage
              Reporter: vsaveliev Vladimir Saveliev