Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2822

softlockups, evictions during recovery-scale

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.9
    • None
    • CHAOS4/RHEL5 Old Hyperion system
    • 3
    • 6828

    Description

      running the mds-recovery-scale test. Not really sure what is going on. The test completes one failover, then activity ceases. Appear to be having soft lockups on the clients, pdflush is blocked, followed/prceeded by evictions from OSTs.
      Example, this is during an MDS failover event:

      Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-MDT0000-mdc-ffff8101bf048400: tried all connections, increasing latency to 8s
      Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 16 previous similar messages
      Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426985036308729 sent from lustre-OST0006-osc-ffff8101bf048400 to NID 192.168.120.86@o2ib 12s ago has timed out (12s prior to deadline).
      Feb 15 09:59:11 ehyperion607 req@ffff8101a84e8400 x1426985036308729/t0 o8->lustre-OST0006_UUID@192.168.120.86@o2ib:28/4 lens 368/584 e 0 to 1 dl 1360951115 ref 2 fl Rpc:N/0/0 rc 0/0
      Feb 15 09:59:11 ehyperion607 Lustre: 892:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 30 previous similar messages
      Feb 15 09:59:11 ehyperion607 BUG: soft lockup - CPU#0 stuck for 60s! [events/0:50]
      Feb 15 09:59:11 ehyperion607 CPU 0:
      Feb 15 09:59:11 ehyperion607 Modules linked in: mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm rdma_cm iw_cm ib_addr ib_ucm ib_uverbs ib_umad mlx4_ib mlx4_core ipoib_helper ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev shpchp ahci i7core_edac edac_mc libata i2c_i801 pcspkr i2c_core scsi_mod uhci_hcd tpm_tis tpm tpm_bios nfs nfs_acl lockd sunrpc e1000e
      Feb 15 09:59:11 ehyperion607 Pid: 50, comm: events/0 Tainted: G     ---- 2.6.18-348.1.1.el5 #1
      Feb 15 09:59:11 ehyperion607 RIP: 0010:[<ffffffff80077925>]  [<ffffffff80077925>] __smp_call_function_many+0x96/0xbc
      Feb 15 09:59:11 ehyperion607 RSP: 0018:ffff8101bf953d40  EFLAGS: 00000297
      Feb 15 09:59:11 ehyperion607 RAX: 0000000000000006 RBX: 0000000000000007 RCX: 0000000000000000
      Feb 15 09:59:11 ehyperion607 RDX: 00000000000000ff RSI: 00000000000000ff RDI: 00000000000000c0
      Feb 15 09:59:11 ehyperion607 RBP: ffff8101bf953e10 R08: 0000000000000008 R09: 0000000000000038
      Feb 15 09:59:11 ehyperion607 R10: ffff8101bf953ce0 R11: ffff8101c571c500 R12: 00000002ffffffff
      Feb 15 09:59:11 ehyperion607 R13: ffffffff8008dedf R14: 00000001ffffffff R15: ffffffff80157cde
      Feb 15 09:59:11 ehyperion607 FS:  0000000000000000(0000) GS:ffffffff80435000(0000) knlGS:0000000000000000
      Feb 15 09:59:11 ehyperion607 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      Feb 15 09:59:11 ehyperion607 CR2: 00002aaaabb7c088 CR3: 00000001b73bb000 CR4: 00000000000006a0
      Feb 15 09:59:11 ehyperion607
      Feb 15 09:59:11 ehyperion607 Call Trace:
      Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
      Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
      Feb 15 09:59:11 ehyperion607 [<ffffffff80077a27>] smp_call_function_many+0x38/0x4c
      Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
      Feb 15 09:59:11 ehyperion607 [<ffffffff80077b18>] smp_call_function+0x4e/0x5e
      Feb 15 09:59:11 ehyperion607 [<ffffffff80073de0>] mcheck_check_cpu+0x0/0x30
      Feb 15 09:59:11 ehyperion607 [<ffffffff80073020>] mcheck_timer+0x0/0x6c
      Feb 15 09:59:11 ehyperion607 [<ffffffff80096be9>] on_each_cpu+0x10/0x22
      Feb 15 09:59:11 ehyperion607 [<ffffffff8007303c>] mcheck_timer+0x1c/0x6c
      Feb 15 09:59:11 ehyperion607 [<ffffffff8004d8b9>] run_workqueue+0x9e/0xfb
      Feb 15 09:59:11 ehyperion607 [<ffffffff8004a102>] worker_thread+0x0/0x122
      Feb 15 09:59:11 ehyperion607 [<ffffffff8004a1f2>] worker_thread+0xf0/0x122
      Feb 15 09:59:11 ehyperion607 [<ffffffff8008f38d>] default_wake_function+0x0/0xe
      Feb 15 09:59:11 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
      Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
      Feb 15 09:59:11 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
      Feb 15 09:59:11 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
      Feb 15 09:59:11 ehyperion607
      Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) lustre-OST0006-osc-ffff8101bf048400: tried all connections, increasing latency to 9s
      Feb 15 09:59:11 ehyperion607 Lustre: 893:0:(import.c:517:import_select_connection()) Skipped 25 previous similar messages
      Feb 15 09:59:54 ehyperion607 INFO: task pdflush:543 blocked for more than 120 seconds.
      Feb 15 09:59:54 ehyperion607 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Feb 15 09:59:54 ehyperion607 pdflush       D ffff8101c546a7a0     0   543    171           544   542 (L-TLB)
      Feb 15 09:59:54 ehyperion607 ffff81033f57fb30 0000000000000046 ffff81033f57fa80 0000000000000000
      Feb 15 09:59:54 ehyperion607 ffffffff88718220 000000000000000a ffff8101bf566820 ffff8101c56de7a0
      Feb 15 09:59:54 ehyperion607 00009ea40c4605ce 00000000001a7575 ffff8101bf566a08 0000000705c2b6a0
      Feb 15 09:59:54 ehyperion607 Call Trace:
      Feb 15 09:59:54 ehyperion607 [<ffffffff8006ed48>] do_gettimeofday+0x40/0x90
      Feb 15 09:59:54 ehyperion607 [<ffffffff80029173>] sync_page+0x0/0x43
      Feb 15 09:59:54 ehyperion607 [<ffffffff800637de>] io_schedule+0x3f/0x67
      Feb 15 09:59:54 ehyperion607 [<ffffffff800291b1>] sync_page+0x3e/0x43
      Feb 15 09:59:54 ehyperion607 [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
      Feb 15 09:59:54 ehyperion607 [<ffffffff8003ff85>] __lock_page+0x5e/0x64
      Feb 15 09:59:54 ehyperion607 [<ffffffff800a3c31>] wake_bit_function+0x0/0x23
      Feb 15 09:59:54 ehyperion607 [<ffffffff8001d54e>] mpage_writepages+0x14f/0x37d
      Feb 15 09:59:54 ehyperion607 [<ffffffff88846c30>] :lustre:ll_writepage_26+0x0/0x10
      Feb 15 09:59:54 ehyperion607 [<ffffffff8005aee2>] do_writepages+0x20/0x2f
      Feb 15 09:59:54 ehyperion607 [<ffffffff8002ff0f>] __writeback_single_inode+0x1a2/0x31c
      Feb 15 09:59:54 ehyperion607 [<ffffffff8002159a>] sync_sb_inodes+0x1b7/0x271
      Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
      Feb 15 09:59:54 ehyperion607 [<ffffffff8005130b>] writeback_inodes+0x82/0xd8
      Feb 15 09:59:54 ehyperion607 [<ffffffff800cd802>] wb_kupdate+0xf0/0x16a
      Feb 15 09:59:54 ehyperion607 [<ffffffff800568b2>] pdflush+0x0/0x1fb
      Feb 15 09:59:54 ehyperion607 [<ffffffff80056a03>] pdflush+0x151/0x1fb
      Feb 15 09:59:54 ehyperion607 [<ffffffff800cd712>] wb_kupdate+0x0/0x16a
      Feb 15 09:59:54 ehyperion607 [<ffffffff80032c2c>] kthread+0xfe/0x132
      Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfc1>] child_rip+0xa/0x11
      Feb 15 09:59:54 ehyperion607 [<ffffffff800a39eb>] keventd_create_kthread+0x0/0xc4
      Feb 15 09:59:54 ehyperion607 [<ffffffff80032b2e>] kthread+0x0/0x132
      Feb 15 09:59:54 ehyperion607 [<ffffffff8005dfb7>] child_rip+0x0/0x11
      Feb 15 09:59:54 ehyperion607
      

      Attachments

        Activity

          People

            niu Niu Yawei (Inactive)
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: