LU-1153: Client Unresponsive

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 2.2.0
    • 3
    • 6441

    Description

      It is possible that I am seeing multiple bugs. So, you may want to split this one bug into several bugs.

      Let me emphasize that this problem occurs only on one system, and that is the system built with http://review.whamcloud.com/#change,2170, which is code specifically for the 2.6.38.2 kernel.

      This bug is preventing us from shipping the code to this customer.

      PROBLEM 1
      ---------
      My testing of the client on kernel 2.6.38.2 went well overnight. The performance I was getting was the line rate of the 10G NICs while running IOZone. Then I switched to xdd.linux, which uses direct I/O; xdd.linux also achieved line rate.

      Then, I noticed that I may have hit a minor bug.

      After the iozone tests, I removed all of the files. Then I ran lfs df -h and saw that some OSTs still showed 20G used. After I unmounted all of the clients and then remounted them, the problem went away (that is, all the OSTs showed the same amount of used space). Here is some session output:

      [root@compute-01-32 lustre]# cd /mnt/lustre
      [root@compute-01-32 lustre]# find . -type f

      [root@compute-01-32 lustre]# # Note that there are no files; I removed them a while ago.
      [root@compute-01-32 lustre]# lfs df -h
      UUID                   bytes        Used   Available  Use%  Mounted on
      denver-MDT0000_UUID    73.2M        4.6M       68.6M    6%  /mnt/lustre[MDT:0]
      denver-OST0000_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:0]
      denver-OST0001_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:1]
      denver-OST0002_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:2]
      denver-OST0003_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:3]
      denver-OST0004_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:4]
      denver-OST0005_UUID   190.0G       20.4G      169.6G   11%  /mnt/lustre[OST:5]
      denver-OST0006_UUID   190.0G      459.5M      189.6G    0%  /mnt/lustre[OST:6]
      denver-OST0007_UUID   190.0G       20.4G      169.6G   11%  /mnt/lustre[OST:7]

      filesystem summary:     1.5T       43.6G        1.4T    3%  /mnt/lustre

      [root@compute-01-32 lustre]# # Note that two OSTs show 20.4G used, even though there are no files!

      [root@compute-01-32 lustre]# df
      Filesystem            1K-blocks      Used  Available Use% Mounted on
      /dev/mapper/vg_compute0132-lv_root
                             51606140  12008608   36976092  25% /
      tmpfs                   1924356        88    1924268   1% /dev/shm
      /dev/sda2                495844     59594     410650  13% /boot
      /dev/mapper/vg_compute0132-lv_home
                            413557824  26856608  365693664   7% /home
      172.20.0.1:/depot/shared
                             12095040  12094688          0 100% /shared
      172.20.0.1:/home      413557824  26856608  365693664   7% /home
      10.7.200.111@tcp0:10.7.200.112@tcp0:/denver
                           1594036256  45706984 1548328760   3% /mnt/lustre
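
      If the stale usage shows up again, a couple of standard client-side checks might help narrow it down (these are generic Lustre commands, not something I ran in the session above):

      # Flush dirty client data, then re-read per-OST space from this client
      sync; sleep 5; lfs df -h /mnt/lustre

      # Raw statfs values cached by each OSC on the client, to compare against lfs df
      lctl get_param osc.*.kbytesfree osc.*.kbytesavail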

      PROBLEM 2
      ---------

      I then unmounted Lustre on all clients, rebooted all the clients, then remounted Lustre on all clients. The problem described above seemed to have been resolved.

      I decided to try and reproduce this bug. So, I started up IOZone again. Keep in mind that this is after a client reboot.

      IOZone ran for a few seconds, then hung. I could ping the node with the 2.6.38.8 kernel, but I could not ssh to it. The video console was locked up; pressing Caps Lock and Num Lock on the keyboard did not light up any LEDs. So, I power cycled the node.

      This is all that I saw in /var/log/messages:

      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.7.1.32.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: New relevant interface eth2.IPv4 for mDNS.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Registering new address record for 10.7.1.32 on eth2.IPv4.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Withdrawing address record for 10.7.1.32 on eth2.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Leaving mDNS multicast group on interface eth2.IPv4 with address 10.7.1.32.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Interface eth2.IPv4 no longer relevant for mDNS.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.7.1.32.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: New relevant interface eth2.IPv4 for mDNS.
      Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Registering new address record for 10.7.1.32 on eth2.IPv4.
      Feb 28 09:33:53 compute-01-32 kernel: device eth2 entered promiscuous mode
      Feb 28 09:33:54 compute-01-32 kernel: Lustre: Lustre: Build Version: 2.1.56-g17d2c48-CHANGED-../lustre/scripts
      Feb 28 09:33:55 compute-01-32 kernel: Lustre: Added LNI 10.7.1.32@tcp [8/256/0/180]
      Feb 28 09:33:55 compute-01-32 kernel: Lustre: Accept secure, port 988
      Feb 28 09:33:55 compute-01-32 kernel: Lustre: MGC10.7.200.111@tcp: Reactivating import
      Feb 28 09:33:55 compute-01-32 kernel: Lustre: Client denver-client has started
      Feb 28 09:37:28 compute-01-32 kernel: LustreError: 2567:0:(osc_request.c:797:osc_announce_cached()) dirty 2028 - 2028 > system dirty_max 647168
      Feb 28 09:37:40 compute-01-32 kernel: LustreError: 2568:0:(osc_request.c:797:osc_announce_cached()) dirty 2040 - 2040 > system dirty_max 647168
      Feb 28 09:37:46 compute-01-32 kernel: LustreError: 2565:0:(osc_request.c:797:osc_announce_cached()) dirty 1954 - 1954 > system dirty_max 647168
      Feb 28 09:37:47 compute-01-32 kernel: LustreError: 2561:0:(osc_request.c:797:osc_announce_cached()) dirty 1975 - 1976 > system dirty_max 647168
      Feb 28 09:37:50 compute-01-32 kernel: LustreError: 2566:0:(osc_request.c:797:osc_announce_cached()) dirty 2001 - 2002 > system dirty_max 647168
      Feb 28 09:38:09 compute-01-32 kernel: LustreError: 2567:0:(osc_request.c:797:osc_announce_cached()) dirty 1946 - 1946 > system dirty_max 647168
      Feb 28 09:39:20 compute-01-32 kernel: LustreError: 2615:0:(osc_request.c:797:osc_announce_cached()) dirty 2017 - 2017 > system dirty_max 647168
      Feb 28 09:39:47 compute-01-32 kernel: LustreError: 2615:0:(osc_request.c:797:osc_announce_cached()) dirty 1847 - 1848 > system dirty_max 647168
      Feb 28 09:39:47 compute-01-32 kernel: LustreError: 2615:0:(osc_request.c:797:osc_announce_cached()) Skipped 3 previous similar messages
      Feb 28 09:40:23 compute-01-32 kernel: LustreError: 2609:0:(osc_request.c:797:osc_announce_cached()) dirty 1918 - 1919 > system dirty_max 647168
      Feb 28 09:40:23 compute-01-32 kernel: LustreError: 2609:0:(osc_request.c:797:osc_announce_cached()) Skipped 1 previous similar message

      I did not have a serial cable plugged in to capture the crash dump. I have set this up now, so if the problem occurs again, we should get more data. I will try to reproduce this bug.
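
      For reference, if the "dirty ... > system dirty_max" messages show up again, the client-side dirty cache limits can be checked like this (generic commands; I am not sure which of these maps to the "system dirty_max" value in the message):

      # Per-OSC cap on dirty pages cached by the Lustre client
      lctl get_param osc.*.max_dirty_mb

      # Kernel-wide dirty writeback limits and current dirty page counts
      sysctl vm.dirty_ratio vm.dirty_background_ratio
      grep -E '^(Dirty|Writeback):' /proc/meminfo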

      PROBLEM 3
      ---------

      I set up the serial cable, and I verified that SysRq-T worked and sent its output over the serial cable to another node that was capturing the data.
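
      For completeness, a setup like this can be verified with commands along these lines (generic commands, assuming the kernel was booted with console=ttyS0,115200; not the exact commands from my session):

      # On the client: make sure SysRq is enabled, then trigger a task-state dump
      echo 1 > /proc/sys/kernel/sysrq
      echo t > /proc/sysrq-trigger

      # On the capturing node: log everything arriving on its serial port
      screen -L /dev/ttyS0 115200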

      After many hours of testing, the 2.6.38.2 client became unresponsive again. I saw some out-of-memory messages. I will keep an eye on the slab usage next time; a small monitoring sketch follows the console log below.

      Below is the output captured over ttyS0. After the problem occurred, I plugged in a monitor and keyboard, and SysRq-T did not work. Perhaps I need to have the keyboard plugged in before the problem occurs.

      LustreError: 19819:0:(ldlm_request.c:1171:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      LustreError: 19819:0:(ldlm_request.c:1797:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      LustreError: 19819:0:(ldlm_request.c:1171:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      LustreError: 19819:0:(ldlm_request.c:1797:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Out of memory: Kill process 1421 (rsyslogd) score 1 or sacrifice child
      Killed process 1421 (rsyslogd) total-vm:242652kB, anon-rss:0kB, file-rss:784kB
      Out of memory: Kill process 1452 (irqbalance) score 1 or sacrifice child
      Killed process 1452 (irqbalance) total-vm:9252kB, anon-rss:0kB, file-rss:416kB
      INFO: task irqbalance:1452 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task irqbalance:1452 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task automount:1823 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task ksmtuned:1975 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task irqbalance:1452 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task automount:1823 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task ksmtuned:1975 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task irqbalance:1452 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task automount:1823 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      INFO: task ksmtuned:1975 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Roger Spellman (rspellman) (Inactive)
            Votes: 0
            Watchers: 7
