Lustre / LU-179

lustre client lockup when under memory pressure

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 1.8.6
    • Components: None
    • Labels: None
    • Environment: Client is running 2.6.27.45-lustre-1.8.3.ddn3.3. Connectivity is 10GigE
    • Severity: 3
    • Rank: 10103

    Description

      A customer is seeing a problem on a client where the client loses access to Lustre when the node is subjected to memory pressure from an errant application.

      Lustre starts reporting -113 (No route to host) errors for certain NIDs in the filesystem despite the TCP/IP network being functional. The Lustre errors persist even after the memory pressure is relieved. I am collecting logs currently.

      From the customer report:

      LNet is reporting no-route-to-host for a significant number of OSSes/MDSes (client log attached).

      Mar 29 09:23:27 cgp-bigmem kernel: [589295.826095] LustreError: 4980:0:(events.c:66:request_out_callback()) @@@ type 4, status 113 req@ffff881d2e995400 x1363985318437337/t0 o8->lus03-OST0000_UUID@172.17.128.130@tcp:28/4 lens 368/584 e 0 to 1 dl 1301387122 ref 2 fl Rpc:N/0/0 rc 0/0

      but from user-space on the client, all those nodes are pingable:

      cgp-bigmem:/var/log# ping 172.17.128.130
      PING 172.17.128.130 (172.17.128.130) 56(84) bytes of data.
      64 bytes from 172.17.128.130: icmp_seq=1 ttl=62 time=0.102 ms
      64 bytes from 172.17.128.130: icmp_seq=2 ttl=62 time=0.091 ms
      64 bytes from 172.17.128.130: icmp_seq=3 ttl=62 time=0.091 ms
      64 bytes from 172.17.128.130: icmp_seq=4 ttl=62 time=0.090 ms

      However, an LNet ping hangs:
      cgp-bigmem:~# lctl ping 172.17.128.130@tcp

      From another client, the ping works as expected:

      farm2-head1:# lctl ping 172.17.128.130@tcp
      12345-0@lo
      12345-172.17.128.130@tcp

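      The per-NID comparison above can be automated. The sketch below (not from the original report; the NID list is illustrative and should be replaced with the site's OSS/MDS NIDs) runs lctl ping against each server NID under a timeout, so a single hung NID is reported as a failure instead of wedging the whole sweep:

```shell
#!/bin/sh
# Sketch: LNet-ping each server NID with a timeout so a hung peer
# (as seen on cgp-bigmem) is reported rather than blocking the sweep.
# NIDs below are illustrative; substitute the site's server NIDs.
NIDS="172.17.128.130@tcp 172.17.128.131@tcp"
for nid in $NIDS; do
    if timeout 5 lctl ping "$nid" >/dev/null 2>&1; then
        echo "OK   $nid"
    else
        echo "FAIL $nid"
    fi
done
```

      Run from a healthy client this should print OK for every NID; on the affected client the hung NIDs show up as FAIL after the timeout.
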
      cgp-bigmem:~# lfs check servers | grep -v active
      error: check 'lus01-OST0007-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST0008-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST0009-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST000a-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST000b-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST000c-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST000d-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus01-OST000e-osc-ffff88205bd52000' Resource temporarily unavailable
      error: check 'lus02-MDT0000-mdc-ffff8880735ea000' Resource temporarily unavailable
      error: check 'lus03-OST0000-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0001-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0002-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0003-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0004-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0005-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0006-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0007-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0008-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0009-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST000a-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST000b-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST000c-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST0019-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus03-OST001a-osc-ffff8840730a1400' Resource temporarily unavailable
      error: check 'lus05-OST0010-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0012-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0014-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0016-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0018-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST001a-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST001c-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST000f-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0011-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0013-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0015-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0017-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST0019-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST001b-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus05-OST001d-osc-ffff886070dab800' Resource temporarily unavailable
      error: check 'lus04-OST0001-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST0003-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST0005-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST0007-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST0009-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST000b-osc-ffff88806e9d8c00' Resource temporarily unavailable
      error: check 'lus04-OST000d-osc-ffff88806e9d8c00' Resource temporarily unavailable

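      With dozens of targets reporting errors, it helps to condense the list. The awk filter below is a sketch (not part of lfs; the field splitting assumes the quoted device-name format shown above) that counts inactive connections per filesystem, making it obvious whether failures cluster on particular servers:

```shell
# Sketch: summarize `lfs check servers` failures per filesystem.
# Pulls the quoted device name (e.g. lus03-OST0000-osc-...) out of each
# error line and keys on its first dash-separated component.
lfs check servers 2>&1 | grep -v active | awk -F"'" '
/Resource temporarily unavailable/ {
    split($2, a, "-")        # a[1] is the filesystem name, e.g. lus03
    count[a[1]]++
}
END { for (fs in count) print fs, count[fs] }' | sort
```

      On the output above this would show lus01 through lus05 all affected, pointing at a client-side problem rather than a single bad server.
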
      Attachments

        Activity

          [LU-179] lustre client lockup when under memory pressure
          gmpc@sanger.ac.uk Guy Coates added a comment -

          Was able to get output from top from the last client lockup; pdflush is sitting at 100% CPU.

          top - 09:10:36 up 2 days, 20:08, 2 users, load average: 801.64, 799.78, 796.51
          Tasks: 2891 total, 36 running, 2855 sleeping, 0 stopped, 0 zombie
          Cpu(s): 0.0%us, 25.1%sy, 0.0%ni, 70.8%id, 4.1%wa, 0.0%hi, 0.0%si, 0.0%st
          Mem: 528386840k total, 70774068k used, 457612772k free, 112k buffers
          Swap: 4192924k total, 0k used, 4192924k free, 81176k cached

          PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
          13691 cgppipe 39 19 23.6g 23g 9992 S 201 4.6 7646:35 java
          5640 cgppipe 39 19 3122m 2.7g 744 S 100 0.5 5717:22 bwa
          18662 root 0 -20 4 4 0 R 100 0.0 3756:26 elim.uptime
          153 root 20 0 0 0 0 R 100 0.0 3759:05 pdflush
          5528 root 20 0 13992 1528 900 R 100 0.0 3761:22 pim
          1809 root 20 0 56440 7628 2240 R 3 0.0 0:04.24 top
          4612 root 20 0 8832 532 404 S 0 0.0 2:30.10 irqbalance

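          In the snapshot above the interesting entry is pdflush, a kernel thread (VIRT of 0 in top) spinning at 100% CPU. A one-liner like the following (a sketch; field positions assume the default top column layout shown above) pulls such threads out of a batch-mode top run:

```shell
# Sketch: from `top -b -n 1` output, print kernel threads (VIRT == 0)
# consuming ~90%+ CPU -- on the snapshot above this selects pdflush.
top -b -n 1 | awk 'NF >= 12 && $5 == 0 && $9 + 0 >= 90 { print $1, $12 }'
```

          Once the spinning thread's PID is known, a sysrq task dump (echo t > /proc/sysrq-trigger) would show its kernel call chain for the next occurrence.
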
          gmpc@sanger.ac.uk Guy Coates added a comment - - edited

          We've just had a recurrence of this problem running 1.8.5.56 (as tagged in git).
          The client starts logging problems at Jun 9 14:49:13.

          gmpc@sanger.ac.uk Guy Coates added a comment -

          Client log


          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » i686,server,el5,ofa #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » i686,server,el5,inkernel #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » i686,client,el5,ofa #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » i686,client,el6,inkernel #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » x86_64,server,el5,ofa #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          hudson Build Master (Inactive) added a comment -

          Integrated in lustre-b1_8 » i686,client,el5,inkernel #71
          Remove changelog entry for LU-179

          Johann Lombardi : 08b76cd92b2a4b6854ce3910a07531996449a9fd
          Files :

          • lustre/ChangeLog

          People

            bobijam Zhenyu Xu
            ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: