Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-653

recovery-mds-scale (FLAVOR=OSS): dbench write failed on handle 11108 (Cannot send after transport endpoint shutdown)

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.1.0
    • Lustre 2.1.0
    • None
    • 3
    • 4903

    Description

      While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

      <~snip~>
      Client load failed during failover. Exiting
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-15-ib
      Client load failed on node client-15-ib
      
      client client-15-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib
                    /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug
      2011-08-31 01:45:08 Terminating clients loads ...
      Duration:                43200
      Server failover period: 600 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=1
      

      /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib:

      <~snip~>
         2      5122     0.00 MB/sec  execute  40 sec  latency 144211.791 ms
         2      5122     0.00 MB/sec  execute  41 sec  latency 145211.965 ms
         2      5122     0.00 MB/sec  execute  42 sec  latency 146212.143 ms
      [5748] write failed on handle 11108 (Cannot send after transport endpoint shutdown)
      Child failed with status 1
      

      /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug:

      2011-08-31 01:40:59: dbench run starting
      + mkdir -p /mnt/lustre/d0.dbench-client-15-ib
      + load_pid=3602
      + wait 3602
      + rundbench -D /mnt/lustre/d0.dbench-client-15-ib 2
      touch: missing file operand
      Try `touch --help' for more information.
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2011-08-31 01:43:43: dbench failed'
      + echo '2011-08-31 01:43:43: dbench failed'
      2011-08-31 01:43:43: dbench failed
      

      Syslog on client node client-15-ib showed that:

      Aug 31 01:42:51 client-15 kernel: Lustre: 2534:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
      Aug 31 01:42:51 client-15 kernel:  req@ffff880308397400 x1378646783695240/t908(908) o-1->lustre-OST0000_UUID@192.168.4.135@o2ib:6/4 lens 512/400 e 1 to 0 dl 1314780217 ref 2 fl
       Interpret:R/ffffffff/ffffffff rc -75/-1Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378646783702455 sent from lustre-OST0002-osc-ffff880326e33c00 to NID 
      192.168.4.134@o2ib has failed due to network error: [sent 1314780175] [real_sent 1314780175] [current 1314780175] [deadline 26s] [delay -26s]  req@ffff8803118e7000 x13786467837
      02455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.134@o2ib:28/4 lens 368/512 e 0 to 1 dl 1314780201 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
      Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
      Aug 31 01:43:10 client-15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_connect operation failed with -19Aug 31 01:43:10 client-15 kernel: LustreError: Skipped 6 previous similar messagesAug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) lustre-OST0002-osc-ffff880326e33c00: tried all connections, increasing latency to 21sAug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) Skipped 6 previous similar messages
      Aug 31 01:43:32 client-15 kernel: Lustre: lustre-OST0002-osc-ffff880326e33c00: Connection restored to service lustre-OST0002 using nid 192.168.4.135@o2ib.
      Aug 31 01:43:36 client-15 kernel: INFO: task flush-lustre-1:2690 blocked for more than 120 seconds.
      Aug 31 01:43:36 client-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug 31 01:43:36 client-15 kernel: flush-lustre- D 0000000000000002     0  2690      2 0x00000080Aug 31 01:43:36 client-15 kernel: ffff88030e35d9a0 0000000000000046 0000000000000000 ffffffffa0303424
      Aug 31 01:43:36 client-15 kernel: 0000000000000000 ffff88031f3fdc00 ffff88030e35d930 00000001002932fc
      Aug 31 01:43:36 client-15 kernel: ffff880325e47078 ffff88030e35dfd8 000000000000f598 ffff880325e47078
      Aug 31 01:43:36 client-15 kernel: Call Trace:Aug 31 01:43:36 client-15 kernel: [<ffffffffa0303424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d320>] ? sync_page+0x0/0x50Aug 31 01:43:36 client-15 kernel: [<ffffffff814db3c3>] io_schedule+0x73/0xc0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d35d>] sync_page+0x3d/0x50
      Aug 31 01:43:36 client-15 kernel: [<ffffffff814dbada>] __wait_on_bit_lock+0x5a/0xc0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d2f7>] __lock_page+0x67/0x70
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d17>] ? __writepage+0x17/0x40
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81122272>] write_cache_pages+0x392/0x4a0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d00>] ? __writepage+0x0/0x40
      Aug 31 01:43:36 client-15 kernel: [<ffffffff811223a4>] generic_writepages+0x24/0x30
      Aug 31 01:43:36 client-15 kernel: [<ffffffff811223d1>] do_writepages+0x21/0x40
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bbdd>] writeback_single_inode+0xdd/0x2c0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bfde>] writeback_sb_inodes+0xce/0x180
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c13b>] writeback_inodes_wb+0xab/0x1b0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c4db>] wb_writeback+0x29b/0x3f0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff814dac27>] ? thread_return+0x4e/0x777
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8107a1a2>] ? del_timer_sync+0x22/0x30
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c7c9>] wb_do_writeback+0x199/0x240
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c8d3>] bdi_writeback_task+0x63/0x1b0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dfc7>] ? bit_waitqueue+0x17/0xd0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81130c56>] bdi_start_fn+0x86/0x100
      Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
      Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      Aug 31 01:43:43 client-15 kernel: Lustre: 2534:0:(import.c:1160:completed_replay_interpret()) lustre-OST0000-osc-ffff880326e33c00: version recovery fails, reconnecting
      Aug 31 01:43:43 client-15 kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      

      Maloo report: https://maloo.whamcloud.com/test_sets/e68e0d04-d3b4-11e0-8d02-52540025f9af

      Please refer to the attached recovery-mds-scale.1314780314.log.tar.bz2 for more logs.

      Attachments

        Issue Links

          Activity

            People

              green Oleg Drokin
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: