Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6476

conf-sanity: test_53a Error: 'test failed to respond and timed out'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for bfaccini <bruno.faccini@intel.com>
      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b5f17674-e6d2-11e4-83f0-5254006e85c2.

      One of the tests nodes running test_53a()/thread_sanity()/lustre_rmmod got an Oops in _spin_lock_irqsave() with the following stack dumped on the Console :

      13:04:49:BUG: unable to handle kernel NULL pointer dereference at (null)
      13:04:49:IP: [<ffffffff8152cdcf>] _spin_lock_irqsave+0x1f/0x40
      13:04:49:PGD 7985a067 PUD 7bb69067 PMD 0 
      13:04:49:Oops: 0002 [#1] SMP 
      13:04:49:last sysfs file: /sys/devices/system/cpu/possible
      13:04:49:CPU 1 
      13:04:49:Modules linked in: ptlrpc(-)(U) obdclass(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ptlrpc_gss]
      13:04:49:
      13:04:49:Pid: 9166, comm: rmmod Not tainted 2.6.32-504.12.2.el6.x86_64 #1 Red Hat KVM
      13:04:49:RIP: 0010:[<ffffffff8152cdcf>]  [<ffffffff8152cdcf>] _spin_lock_irqsave+0x1f/0x40
      13:04:49:RSP: 0018:ffff88007b1c5db8  EFLAGS: 00010082
      13:04:49:RAX: 0000000000010000 RBX: 0000000000000000 RCX: 0000000000000000
      13:04:49:RDX: 0000000000000282 RSI: 0000000000000003 RDI: 0000000000000000
      13:04:49:RBP: ffff88007b1c5db8 R08: ffff880079865540 R09: 0000000000000001
      13:04:49:R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007b900aa0
      13:04:49:R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      13:04:49:FS:  00007f46301fe700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      13:04:49:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      13:04:49:CR2: 0000000000000000 CR3: 000000007cb24000 CR4: 00000000000006e0
      13:04:49:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      13:04:49:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      13:04:49:Process rmmod (pid: 9166, threadinfo ffff88007b1c4000, task ffff88007b900aa0)
      13:04:49:Stack:
      13:04:49: ffff88007b1c5df8 ffffffff8105bd02 ffff880078f849c0 0000000000000880
      13:04:49:<d> ffff88007b900aa0 ffff88007b1c5e68 0000000000000000 0000000000000001
      13:04:49:<d> ffff88007b1c5e08 ffffffffa0761cdf ffff88007b1c5e18 ffffffffa0743104
      13:04:49:Call Trace:
      13:04:49: [<ffffffff8105bd02>] __wake_up+0x32/0x70
      13:04:49: [<ffffffffa0761cdf>] lnet_acceptor_stop+0x3f/0x50 [lnet]
      13:04:49: [<ffffffffa0743104>] LNetNIFini+0x74/0x100 [lnet]
      13:04:49: [<ffffffffa09ed775>] ptlrpc_ni_fini+0x135/0x1b0 [ptlrpc]
      13:04:49: [<ffffffffa09ed803>] ptlrpc_exit_portals+0x13/0x20 [ptlrpc]
      13:04:49: [<ffffffffa0a2fcaa>] ptlrpc_exit+0x22/0x38 [ptlrpc]
      13:04:49: [<ffffffff810bcf14>] sys_delete_module+0x194/0x260
      13:04:49: [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
      13:04:49: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      13:04:49:Code: c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 b8 00 00 01 00 <f0> 0f c1 07 0f b7 c8 c1 e8 10 39 c1 74 0e f3 90 0f 1f 44 00 00 
      13:04:49:RIP  [<ffffffff8152cdcf>] _spin_lock_irqsave+0x1f/0x40
      13:04:49: RSP <ffff88007b1c5db8>
      13:04:49:CR2: 0000000000000000
      

      Looks like there is a race between lnet_acceptor_stop() execution and concurrent (and unexpected?) acceptor thread exit, where lnet_acceptor_stop() :

      void
      lnet_acceptor_stop(void)
      {
              if (lnet_acceptor_state.pta_shutdown) /* not running */
                      return;
      
              lnet_acceptor_state.pta_shutdown = 1;
              wake_up_all(sk_sleep(lnet_acceptor_state.pta_sock->sk));
      
              /* block until acceptor signals exit */
              wait_for_completion(&lnet_acceptor_state.pta_signal);
      }
      

      tries to wake-up all socket sleepers without protecting itself against acceptor thread concurent exit in lnet_acceptor() routine :

              }
      
              sock_release(lnet_acceptor_state.pta_sock);
              lnet_acceptor_state.pta_sock = NULL;
      
              CDEBUG(D_NET, "Acceptor stopping\n");
      
              /* unblock lnet_acceptor_stop() */
              complete(&lnet_acceptor_state.pta_signal);
              return 0;
      }
      

      where acceptor socket is released and its sk_sleep sleepers wait-queue is cleared under protection of sk_callback_lock RW-Lock in sock_orphan(). So seems that lnet_acceptor_stop() should better use, or implement the same behavior than, sock_def_wakeup() :

      static void sock_def_wakeup(struct sock *sk)
      {
              read_lock(&sk->sk_callback_lock);
              if (sk_has_sleeper(sk))
                      wake_up_interruptible_all(sk->sk_sleep);
              read_unlock(&sk->sk_callback_lock);
      }
      

      instead of simply calling wake_up_all().
      This may have been introduced by recent integration of patches to implement dynamic acceptor start/stop feature.
      Will push a patch to give a try to my previous thought on how this can be fixed.

      Attachments

        Issue Links

          Activity

            [LU-6476] conf-sanity: test_53a Error: 'test failed to respond and timed out'

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14503/
            Subject: LU-6476 lnet: avoid race during acceptor thread termination
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7574e32e64b77851163d0a8d3141b406d03621be

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14503/ Subject: LU-6476 lnet: avoid race during acceptor thread termination Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7574e32e64b77851163d0a8d3141b406d03621be

            Having a better look to the code (but this may need to be confirmed with an in-deep crash-dump analysis !!...), I am now convinced that this can only occur during some concurrent socket activity with the shutdown/lnet_acceptor_stop() process, which requires the sleepers list to be accessed under protection, thus with the preferred sk->sk_state_change() method to do so.

            bfaccini Bruno Faccini (Inactive) added a comment - Having a better look to the code (but this may need to be confirmed with an in-deep crash-dump analysis !!...), I am now convinced that this can only occur during some concurrent socket activity with the shutdown/lnet_acceptor_stop() process, which requires the sleepers list to be accessed under protection, thus with the preferred sk->sk_state_change() method to do so.

            I don't this this is related to LU-6002. The wake_up_all code in lnet_acceptor_stop was introduced by: LU-6245

            ed88907a (nathan          2007-02-10 00:05:05 +0000 505) 
            ed88907a (nathan          2007-02-10 00:05:05 +0000 506) void
            ed88907a (nathan          2007-02-10 00:05:05 +0000 507) lnet_acceptor_stop(void)
            ed88907a (nathan          2007-02-10 00:05:05 +0000 508) {
            91df044a (Amir Shehata    2015-01-12 17:14:33 -0800 509)        if (lnet_acceptor_state.pta_shutdown) /* not running */
            2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 510)                return;
            19205dfe (maxim           2007-09-28 21:46:35 +0000 511) 
            2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 512)        lnet_acceptor_state.pta_shutdown = 1;
            0b868add (James Simmons   2015-03-13 13:11:19 -0400 513)        wake_up_all(sk_sleep(lnet_acceptor_state.pta_sock->sk));
            b7455572 (maxim           2009-02-03 13:43:21 +0000 514) 
            2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 515)        /* block until acceptor signals exit */
            2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 516)        wait_for_completion(&lnet_acceptor_state.pta_signal);
            19205dfe (maxim           2007-09-28 21:46:35 +0000 517) }
            
            ashehata Amir Shehata (Inactive) added a comment - I don't this this is related to LU-6002 . The wake_up_all code in lnet_acceptor_stop was introduced by: LU-6245 ed88907a (nathan 2007-02-10 00:05:05 +0000 505) ed88907a (nathan 2007-02-10 00:05:05 +0000 506) void ed88907a (nathan 2007-02-10 00:05:05 +0000 507) lnet_acceptor_stop(void) ed88907a (nathan 2007-02-10 00:05:05 +0000 508) { 91df044a (Amir Shehata 2015-01-12 17:14:33 -0800 509) if (lnet_acceptor_state.pta_shutdown) /* not running */ 2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 510) return ; 19205dfe (maxim 2007-09-28 21:46:35 +0000 511) 2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 512) lnet_acceptor_state.pta_shutdown = 1; 0b868add (James Simmons 2015-03-13 13:11:19 -0400 513) wake_up_all(sk_sleep(lnet_acceptor_state.pta_sock->sk)); b7455572 (maxim 2009-02-03 13:43:21 +0000 514) 2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 515) /* block until acceptor signals exit */ 2841be33 (John L. Hammond 2013-03-13 14:46:08 -0500 516) wait_for_completion(&lnet_acceptor_state.pta_signal); 19205dfe (maxim 2007-09-28 21:46:35 +0000 517) }

            It looks like this was likely added via "LU-6002 lnet: startup acceptor thread dynamically" patch http://review.whamcloud.com/13010

            adilger Andreas Dilger added a comment - It looks like this was likely added via " LU-6002 lnet: startup acceptor thread dynamically" patch http://review.whamcloud.com/13010

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14503
            Subject: LU-6476 lnet: avoid race during acceptor thread termination
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9f23308ae7b338d0c0f4b43d2e408d50fca4b3f6

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/14503 Subject: LU-6476 lnet: avoid race during acceptor thread termination Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 9f23308ae7b338d0c0f4b43d2e408d50fca4b3f6
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://testing.hpdd.intel.com/test_sets/88b81062-df63-11e4-bf2e-5254006e85c2

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: