Lustre / LU-4018

acceptance-small test suite: umount hang with "/lustre/scratch2 is still busy"


Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • None
    • Lustre 2.5.0
    • None
    • lustre-master build #1687 tag 2.4.93
      OpenSFS cluster with combined MGS/MDS, a single OSS with two OSTs, and four clients: one agent (c07), one running robinhood/db (c08), and two plain Lustre clients (c09, c10)
    • 3
    • 10784

    Description

      There is an issue when running acceptance-small where the file system hangs during or right after the racer test; it sometimes returns after several hours and sometimes does not return even after several (~6) hours. Here is the outline of the testing and results:
      1. Run acceptance-small with normal Lustre configuration and HSM not enabled. acceptance-small completes.
      2. Run sanity-hsm.sh with HSM enabled. sanity-hsm completes.
      3. Run acceptance-small with HSM enabled. acceptance-small will not complete.

      The hang is very reproducible; every time I run acceptance-small with HSM enabled, it hangs in some unmount/cleanup call in the tests.
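
      As a point of reference, here is a minimal sketch of those three runs as they might be driven from the test node (c10). The configuration name (NAME=ncli), the archive path and the pdsh targets are assumptions for illustration, not taken from this report; the hsm_control parameter, the lhsmtool_posix copytool and the test scripts under /usr/lib64/lustre/tests are the standard Lustre pieces.

      cd /usr/lib64/lustre/tests

      # 1. HSM stopped (coordinator shut down, no copytool running) - acceptance-small completes:
      pdsh -w mds lctl set_param mdt.scratch-MDT0000.hsm_control=shutdown
      NAME=ncli sh acceptance-small.sh

      # 2. HSM enabled, copytool started and registered on the agent (c07) - sanity-hsm completes:
      pdsh -w mds lctl set_param mdt.scratch-MDT0000.hsm_control=enabled
      pdsh -w c07 "lhsmtool_posix --daemon --archive=1 --hsm-root /archive /lustre/scratch"
      NAME=ncli sh sanity-hsm.sh

      # 3. Same HSM-enabled configuration, full acceptance-small - hangs in umount/cleanup after racer:
      NAME=ncli sh acceptance-small.sh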

      Here are the details:

      On a newly built Lustre file system, I ran the acceptance-small suite without starting HSM (I left hsm_control = stopped, didn't start the copytool, registered no agents, and didn't mount the archive), but had all the agent variables set in my configuration file. acceptance-small completed, and when it came to sanity-hsm, the last test in acceptance-small, HSM ran fine; there were some test failures, but all tests completed. Some of these failures could be related to this mount/unmount problem. For example, lfsck fails with:

      oss: removing objects from /dev/sda3 on ost1: 
      oss: mount: mount point /mnt/ost1 does not exist
      pdsh@c10: oss: ssh exited with exit code 32
      

      Test results are at https://maloo.whamcloud.com/test_sessions/eb9d697c-2658-11e3-8d26-52540035b04c.

      Next, I ran sanity-hsm with HSM enabled (started the agent, started the copytool, registered the agent, enabled HSM, etc.) and the tests completed with some failures. Results are at https://maloo.whamcloud.com/test_sessions/15535878-2639-11e3-8d26-52540035b04c.
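
      Before the third run, the HSM setup can be sanity checked; this is a hypothetical verification I am adding for context, not something from the report. The smoke-test file name is a placeholder; lctl get_param mdt.*.hsm_control, mdt.*.hsm.agents and the lfs hsm_* commands are standard Lustre 2.4+ interfaces, and the mdt.* parameters live on the MDS.

      # Confirm the coordinator is enabled and the copytool on c07 is registered:
      pdsh -w mds lctl get_param mdt.scratch-MDT0000.hsm_control
      pdsh -w mds lctl get_param mdt.scratch-MDT0000.hsm.agents

      # Optional smoke test of the archive path with a throwaway file:
      dd if=/dev/zero of=/lustre/scratch/hsm_smoke bs=1M count=1
      lfs hsm_archive /lustre/scratch/hsm_smoke
      sleep 10
      lfs hsm_state /lustre/scratch/hsm_smoke    # expect "exists archived" once the copy completes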

      Next, I started acceptance-small without changing any parameters, i.e. HSM was still enabled as above. Everything looked like it was running fine until I saw these messages on the console of the client that kicked off the tests (c10):

      ...
      file_create: FILE=/lustre/scratch/racer/1 SIZE=66368
      file_create: FILE=/lustre/scratch/racer/4 SIZE=155632
      file_create: FILE=/lustre/scratch/racer/7 SIZE=36848
      file_create: FILE=/lustre/scratch/racer/4 SIZE=198040
      file_create: FILE=/lustre/scratch/racer/12 SIZE=70920
      file_create: FILE=/lustre/scratch/racer/17 SIZE=163576
      racer cleanup
      racer cleanup
      racer cleanup
      racer cleanup
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      sleeping 5 sec ...
      there should be NO racer processes:
      root      7383  0.0  0.0 103244   880 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      7385  0.0  0.0 103244   880 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      there should be NO racer processes:
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root      3805  0.0  0.0 103244   872 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      3807  0.0  0.0 103244   872 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      root     31318  0.0  0.0 103244   892 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     31320  0.0  0.0 103244   888 ?        R    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root     30362  0.0  0.0 103244   892 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      pid=16205 rc=0
      
      there should be NO racer processes:
      root      3805  0.0  0.0 103244   872 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root      3807  0.0  0.0 103244   872 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch2
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      root     31318  0.0  0.0 103244   892 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     31320  0.0  0.0 103244   896 ?        R    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch2
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      there should be NO racer processes:
      root     30362  0.0  0.0 103244   892 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      root     30364  0.0  0.0 103244   892 ?        S    15:59   0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
      Filesystem           1K-blocks      Used Available Use% Mounted on
      mds@o2ib:/scratch       374928    346576      8576  98% /lustre/scratch2
      We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
      pid=16206 rc=0
      Resetting fail_loc on all nodes...done.
      PASS 1 (306s)
      == racer test complete, duration 312 sec == 15:59:37 (1380149977)
      Stopping clients: c07,c08,c09,c10 /lustre/scratch2 (opts:)
      Stopping client c07 /lustre/scratch2 opts:
      Stopping client c09 /lustre/scratch2 opts:
      Stopping client c08 /lustre/scratch2 opts:
      Stopping client c10 /lustre/scratch2 opts:
      COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      dd      24324 root    1w   REG 3953,876890    31744 144115574404849344 /lustre/scratch2/racer/1
      /lustre/scratch2 is still busy, wait one second
      COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      dd      23480 root    1w   REG 3953,876890  3895296 144115574438378493 /lustre/scratch/racer/13
      COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      dd       9884 root    1w   REG 3953,876890 21794816 144115574438363138 /lustre/scratch/racer/11
      dd      30142 root    1w   REG 3953,876890 13414400 144115574388043790 /lustre/scratch2/racer/3
      /lustre/scratch2 is still busy, wait one second
      COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      dd      11466 root    1w   REG 3953,876890  3895296 144115574438378493 /lustre/scratch2/racer/13 (deleted)
      dd      16589 root    1w   REG 3953,876890 21794816 144115574438363138 /lustre/scratch/racer/11
      dd      24466 root    1w   REG 3953,876890    31744 144115574404849344 /lustre/scratch/racer/1
      /lustre/scratch2 is still busy, wait one second
      /lustre/scratch2 is still busy, wait one second
      /lustre/scratch2 is still busy, wait one second
      

      And the "still busy message" continued for several hours. After approximately 3.5 hours, the tests started running again. So, I let things run and went to bed. When I checked back in the morning, the testing was stuck in conf-sanity with the following on the c10 console:

      oss: Are the lustre modules loaded?
      oss: Check /etc/modprobe.conf and /proc/filesystems
      pdsh@c10: oss: ssh exited with exit code 19
      Start of /dev/sda3 on ost1 failed 19
      pdsh@c10: mds: ssh exited with exit code 3
      can't get osc.scratch-OST0000-osc-MDT0000.ost_server_uuid by list_param in 40 secs
      Go with osc.scratch-OST0000-osc-MDT0000.ost_server_uuid directly
      mds:  rpc : @@@@@@ FAIL: can't put import for osc.scratch-OST0000-osc-MDT0000.ost_server_uuid into FULL state after 40 sec, have  
      mds:   Trace dump:
      mds:   = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
      mds:   = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
      mds:   = /usr/lib64/lustre/tests/test-framework.sh:5374:_wait_import_state()
      mds:   = /usr/lib64/lustre/tests/test-framework.sh:5393:wait_import_state()
      mds:   = rpc.sh:20:main()
      mds: Dumping lctl log to /tmp/test_logs/1380167996/rpc..*.1380168036.log
      mds: Dumping logs only on local client.
      pdsh@c10: mds: ssh exited with exit code 1
      conf-sanity test_54a: @@@@@@ FAIL: import is not in FULL state 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5481:_wait_osc_import_state()
        = /usr/lib64/lustre/tests/test-framework.sh:5496:wait_osc_import_state()
        = /usr/lib64/lustre/tests/conf-sanity.sh:99:gen_config()
        = /usr/lib64/lustre/tests/conf-sanity.sh:109:reformat_and_config()
        = /usr/lib64/lustre/tests/conf-sanity.sh:3321:test_54a()
        = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
        = /usr/lib64/lustre/tests/conf-sanity.sh:3323:main()
      Dumping lctl log to /tmp/test_logs/2013-09-25/143504/conf-sanity.test_54a.*.1380168037.log
      oss: Host key verification failed.
      oss: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      mds: Host key verification failed.
      oss: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
      mds: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      mds: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
      pdsh@c10: oss: ssh exited with exit code 255
      pdsh@c10: mds: ssh exited with exit code 255
      FAIL 54a (100s)
      
      == conf-sanity test 54b: test llverfs and partial verify of filesystem == 21:00:38 (1380168038)
      start mds service on mds
      Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
      mds: mount.lustre: according to /etc/mtab /dev/sda3 is already mounted on /lustre/scratch/mdt0
      pdsh@c10: mds: ssh exited with exit code 17
      Start of /dev/sda3 on mds1 failed 17
      conf-sanity test_54b: @@@@@@ FAIL: MDT start failed 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
        = /usr/lib64/lustre/tests/conf-sanity.sh:231:setup()
        = /usr/lib64/lustre/tests/conf-sanity.sh:3331:test_54b()
        = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
        = /usr/lib64/lustre/tests/conf-sanity.sh:3336:main()
      Dumping lctl log to /tmp/test_logs/2013-09-25/143504/conf-sanity.test_54b.*.1380168039.log
      oss: Host key verification failed.
      oss: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      oss: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
      mds: Host key verification failed.
      pdsh@c10: oss: ssh exited with exit code 255
      mds: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      mds: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
      pdsh@c10: mds: ssh exited with exit code 255
      FAIL 54b (2s)
      
      == conf-sanity test 55: check lov_objid size == 21:00:40 (1380168040)
      Stopping /lustre/scratch/mdt0 (opts:-f) on mds
      pdsh@c10: mds: ssh exited with exit code 1
      
         Permanent disk data:
      Target:     scratch:MDT0000
      Index:      0
      Lustre FS:  scratch
      Mount type: ldiskfs
      Flags:      0x65
                    (MDT MGS first_time update )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: sys.timeout=20 lov.stripesize=1048576 lov.stripecount=0 mdt.identity_upcall=/usr/sbin/l_getidentity
      
      device size = 20481MB
      formatting backing filesystem ldiskfs on /dev/sda3
      	target name  scratch:MDT0000
      	4k blocks     50000
      	options        -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_itable_init,lazy_journal_init -F
      mkfs_cmd = mke2fs -j -b 4096 -L scratch:MDT0000  -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_itable_init,lazy_journal_init -F /dev/sda3 50000
      Writing CONFIGS/mountdata
      pdsh@c10: oss: ssh exited with exit code 1
      pdsh@c10: oss: ssh exited with exit code 1
      
         Permanent disk data:
      Target:     scratch:OST03ff
      Index:      1023
      Lustre FS:  scratch
      Mount type: ldiskfs
      Flags:      0x62
                    (OST first_time update )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=192.168.2.105@o2ib sys.timeout=20
      
      device size = 12292MB
      formatting backing filesystem ldiskfs on /dev/sda3
      	target name  scratch:OST03ff
      	4k blocks     50000
      	options        -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F
      mkfs_cmd = mke2fs -j -b 4096 -L scratch:OST03ff  -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F /dev/sda3 50000
      Writing CONFIGS/mountdata
      start mds service on mds
      Starting mds1:   /dev/sda3 /lustre/scratch/mdt0
      Started scratch-MDT0000
      start ost1 service on oss
      Starting ost1:   /dev/sda3 /lustre/scratch/ost1
      oss: mount.lustre: mount /dev/sda3 at /lustre/scratch/ost1 failed: No such device
      oss: Are the lustre modules loaded?
      oss: Check /etc/modprobe.conf and /proc/filesystems
      pdsh@c10: oss: ssh exited with exit code 19
      Start of /dev/sda3 on ost1 failed 19
      mount scratch on /lustre/scratch.....
      Starting client: c10: -o user_xattr,flock mds@o2ib:/scratch /lustre/scratch
      

      At this point the MDT was mounted, but nothing else was: no OSTs were mounted, and none of the nodes had the Lustre file system mounted.

      I started testing again with a clean system running 2.4.93, meaning I used loadjenkins to load a new OS and reformatted all storage. After everything was up and running, I enabled HSM on the MDS, registered the agent, and started the copytool, but I left the agent (c07) off of the remote client list in case it was the one having problems with the mounts/unmounts. Leaving the agent out didn't help anything, as you can see below. sanity, sanityn, sanity-benchmark and metadata-updates all ran, but while racer was running or just after it completed, the following appeared on the c10 console:

      ....
      == racer test complete, duration 312 sec == 14:47:42 (1380232062)
      Stopping clients: c08,c09,c10 /lustre/scratch2 (opts:)
      Stopping client c10 /lustre/scratch2 opts:
      Stopping client c08 /lustre/scratch2 opts:
      Stopping client c09 /lustre/scratch2 opts:
      COMMAND   PID USER   FD   TYPE      DEVICE  SIZE/OFF               NODE NAME
      dd       1344 root    1w   REG 3953,876890  11075584 144115473691245005 /lustre/scratch/racer/10
      dd      10191 root    1w   REG 3953,876890 171792384 144115473741559103 /lustre/scratch2/racer/19
      dd      13174 root    1w   REG 3953,876890 172275712 144115473691238078 /lustre/scratch/racer/19
      cat     13200 root    1w   REG 3953,876890 342456320 144115473724745995 /lustre/scratch2/racer/6
      cat     13200 root    3r   REG 3953,876890 172275712 144115473691238078 /lustre/scratch2/racer/19
      cat     14311 root    1w   REG 3953,876890   2856960 144115473691246648 /lustre/scratch/racer/4
      cat     14311 root    3r   REG 3953,876890   2857984 144115473775086376 /lustre/scratch/racer/12
      dd      29283 root    1w   REG 3953,876890    373760 144115473758289951 /lustre/scratch/racer/6
      dd      29288 root    1w   REG 3953,876890    373760 144115473758289951 /lustre/scratch2/racer/13
      dd      32281 root    1w   REG 3953,876890   4388864 144115473758324783 /lustre/scratch2/racer/7
      /lustre/scratch2 is still busy, wait one second
      COMMAND   PID USER   FD      TYPE      DEVICE  SIZE/OFF               NODE NAME
      cat      1744 root    1w      REG 3953,876890   1141760 144115473791881355 /lustre/scratch2/racer/12
      cat      1744 root    3r      REG 3953,876890   1142784 144115473741554890 /lustre/scratch2/racer/14
      dd       2731 root    1w      REG 3953,876890  40736768 144115473791888614 /lustre/scratch2/racer/4
      dd       3692 root    1w      REG 3953,876890 159873024 144115473724773982 /lustre/scratch/racer/15
      cat      7919 root    1w      REG 3953,876890   4327424 144115473791855853 /lustre/scratch2/racer/14
      cat      7919 root    3r  unknown 3953,876890                              /lustre/scratch2/racer/8/4
      dd      16792 root    1w      REG 3953,876890    424960 144115473758289949 /lustre/scratch2/racer/16
      dd      16795 root    1w      REG 3953,876890    424960 144115473758289949 /lustre/scratch/racer/16
      dd      16826 root    1w      REG 3953,876890    934912 144115473775067150 /lustre/scratch2/racer/14
      dd      16851 root    1w      REG 3953,876890    934912 144115473775067150 /lustre/scratch/racer/14
      cat     19347 root    3r      REG 3953,876890  11075584 144115473691245005 /lustre/scratch/racer/16
      cat     30055 root    1w      REG 3953,876890 342456320 144115473724745995 /lustre/scratch/racer/1
      cat     30055 root    3r      REG 3953,876890 172275712 144115473691238078 /lustre/scratch/racer/4
      /lustre/scratch2 is still busy, wait one second
      COMMAND   PID USER   FD      TYPE      DEVICE SIZE/OFF               NODE NAME
      cat      9753 root    3r      REG 3953,876890  4389888 144115473758324783 /lustre/scratch/racer/7
      dd      19612 root    1w      REG 3953,876890  2859008 144115473775086376 /lustre/scratch2/racer/12
      dd      20439 root    1w  unknown 3953,876890                             /lustre/scratch/racer/8/1
      dd      20860 root    1w      REG 3953,876890  1143808 144115473741554890 /lustre/scratch/racer/5
      dd      23548 root    1w      REG 3953,876890   374784 144115473758289951 /lustre/scratch/racer/13
      dd      27659 root    1w      REG 3953,876890  4328448 144115473691239547 /lustre/scratch2/racer/6
      /lustre/scratch2 is still busy, wait one second
      /lustre/scratch2 is still busy, wait one second
      /lustre/scratch2 is still busy, wait one second
      ...
      

      The MDS and OSS are all mounted, the agent has scratch mounted and the copytool running, and c08, c09 and c10 all have the Lustre file system mounted on /lustre/scratch and /lustre/scratch2. From the MDS dmesg:

      LustreError: 25757:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 0: rc = -11
      LustreError: 25757:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) Skipped 179 previous similar messages
      

      The agent did mount the Lustre file system, but it is not taking part in the testing, i.e. it is not on the remote client list.
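
      The rc = -11 in the dmesg above is -EAGAIN: the coordinator has a request for archive 0 but, at that moment, cannot find a registered copytool serving it. A minimal sketch of how to inspect that from the MDS (the parameter names are the standard Lustre HSM ones; running them via pdsh from c10 is just an assumption matching the rest of this report):

      pdsh -w mds lctl get_param mdt.scratch-MDT0000.hsm_control   # coordinator state
      pdsh -w mds lctl get_param mdt.scratch-MDT0000.hsm.agents    # copytools the coordinator currently knows about
      pdsh -w mds lctl get_param mdt.scratch-MDT0000.hsm.actions   # queued/active HSM requests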

      On the agent, I see:

      LustreError: 166-1: MGC192.168.2.105@o2ib: Connection to MGS (at 192.168.2.105@o2ib) was lost; in progress operations using this service will fail
      LustreError: 11-0: scratch-OST0001-osc-ffff88082d9d7000: Communicating with 192.168.2.106@o2ib, operation obd_ping failed with -107.
      Lustre: Evicted from MGS (at 192.168.2.105@o2ib) after server handle changed from 0xc3f51c4989c55d76 to 0xc3f51c4989e1bc9e
      Lustre: MGC192.168.2.105@o2ib: Connection restored to MGS (at 192.168.2.105@o2ib)
      

      On all the client nodes, in dmesg I see:

      LustreError: 26096:0:(file.c:3068:ll_inode_revalidate_fini()) scratch: revalidate FID [0x200004283:0x9f09:0x0] error: rc = -116
      

      Only on c10 do I see:

      Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ == 14:14:25 (1380230065)
      Lustre: DEBUG MARKER: min OST has 19147140kB available, using 28720710kB file size
      INFO: task bonnie++:19773 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bonnie++      D 0000000000000004     0 19773  19510 0x00000080
       ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
       0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
       ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
      Call Trace:
       [<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
       [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
       [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8150ec23>] wait_for_common+0x123/0x180
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
       [<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
       [<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
       [<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
       [<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
       [<ffffffff811b2177>] sys_sync+0x17/0x40
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task bonnie++:19773 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bonnie++      D 0000000000000004     0 19773  19510 0x00000080
       ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
       0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
       ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
      Call Trace:
       [<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
       [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
       [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8150ec23>] wait_for_common+0x123/0x180
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
       [<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
       [<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
       [<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
       [<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
       [<ffffffff811b2177>] sys_sync+0x17/0x40
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task bonnie++:19773 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bonnie++      D 0000000000000004     0 19773  19510 0x00000080
       ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
       0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
       ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
      Call Trace:
       [<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
       [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
       [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8150ec23>] wait_for_common+0x123/0x180
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
       [<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
       [<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
       [<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
       [<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
       [<ffffffff811b2177>] sys_sync+0x17/0x40
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task bonnie++:19773 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bonnie++      D 0000000000000004     0 19773  19510 0x00000080
       ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
       0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
       ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
      Call Trace:
       [<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
       [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
       [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8150ec23>] wait_for_common+0x123/0x180
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
       [<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
       [<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
       [<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
       [<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
       [<ffffffff811b2177>] sys_sync+0x17/0x40
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      INFO: task bonnie++:19773 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      bonnie++      D 0000000000000004     0 19773  19510 0x00000080
       ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
       0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
       ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
      Call Trace:
       [<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
       [<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
       [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8150ec23>] wait_for_common+0x123/0x180
       [<ffffffff81063410>] ? default_wake_function+0x0/0x20
       [<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
       [<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
       [<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
       [<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
       [<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
       [<ffffffff811b2177>] sys_sync+0x17/0x40
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LNet: 19792:0:(debug.c:218:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      Lustre: DEBUG MARKER: SKIP: sanity-benchmark test_iozone skipping SLOW test iozone
      ...
      Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10 DURATION=300 == 14:42:36 (1380231756)
      LustreError: 30894:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
      LustreError: 29189:0:(file.c:3068:ll_inode_revalidate_fini()) scratch: revalidate FID [0x200004282:0x3fd:0x0] error: rc = -116
      

      While /lustre/scratch2 is still busy, on the clients I see variations of:

       [root@c09 ~]# lsof /lustre/scratch2
      COMMAND   PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      cat      9753 root    3r   REG 3953,876890 23490560 144115473758324783 /lustre/scratch/racer/7
      dd      19612 root    1w   REG 3953,876890 21051392 144115473775086376 /lustre/scratch2/racer/12
      dd      20860 root    1w   REG 3953,876890 18997248 144115473741554890 /lustre/scratch/racer/5
      dd      23548 root    1w   REG 3953,876890  4968448 144115473758289951 /lustre/scratch/racer/13
      dd      27659 root    1w   REG 3953,876890 20217856 144115473691239547 /lustre/scratch2/racer/6
      

      and

       [root@c09 ~]# ps ax | grep racer
       2328 pts/0    S+     0:00 grep racer
       9753 ?        S      0:10 cat /lustre/scratch/racer/16
      19612 ?        D      0:08 dd if=/dev/zero of=/lustre/scratch2/racer/12 bs=1k count=90648
      20860 ?        S      0:08 dd if=/dev/zero of=/lustre/scratch/racer/5 bs=1k count=141568
      23548 ?        S      0:02 dd if=/dev/zero of=/lustre/scratch/racer/17 bs=1k count=180680
      27659 ?        S      0:08 dd if=/dev/zero of=/lustre/scratch2/racer/0 bs=1k count=62512
      

      After five hours, the state of the nodes had not changed. I was able to upload partial logs for this last run at https://maloo.whamcloud.com/test_sessions/a4370948-2721-11e3-b741-52540035b04c.
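
      Not something I ran as part of these sessions, but for anyone reproducing this, here is a minimal sketch of how to see what is pinning /lustre/scratch2 and where the stuck writers are blocked. The PID is an example taken from the ps output above; lsof/fuser, /proc/<pid>/stack and sysrq-w are generic Linux tools, not Lustre-specific.

      # Which processes still hold the mount point open:
      lsof /lustre/scratch2
      fuser -vm /lustre/scratch2

      # Kernel stack of one of the stuck dd writers (example PID from the ps output above):
      cat /proc/19612/stack

      # Dump stacks of all uninterruptible (D-state) tasks to dmesg:
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 100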


    People

      Assignee: Jinshan Xiong (Inactive)
      Reporter: James Nunez (Inactive)
      Votes: 0
      Watchers: 6
