Lustre / LU-3940

Clients in recovery for a very long time - LustreError: (ldlm_lib.c:941:target_handle_connect())


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0
    • Components: None
    • Severity: 3

    Description

      Dear support,

      We have encountered a serious problem with a production Lustre installation that is negatively affecting our customers' experience.
      More specifically, some OSTs went offline following a cluster node fault, and these volumes remain in recovery indefinitely, denying connections from new clients ("XXX clients in recovery for 18446744073709550863s").
      The only practical workaround we have found is to manually abort the recovery procedure so that the volumes can be mounted properly.
      We need to understand whether this situation is due to a bug or to a configuration problem, since we have not been able to find useful information in the bug tracking system.
      Below we attach a short analysis of the problem and some extracts from the /var/log/messages file.

      Thanks in advance for your support.

      Kernel Version installed on the OSS:

      Linux version 2.6.32-220.4.2.el6_lustre.x86_64 (jenkins@client-31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012

      Lustre version installed on the OSS and MGS/MDS:
      Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64

      Lustre version installed on the clients:
      There is a mix of versions (we can be more precise if needed)
      Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
      Build Version: 2.4.0-RC2--CHANGED-2.6.32-358.6.2.el6.x86_64

      Processors:
      2 x Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
      RAM:
      48 GB

      Infrastructure info:
      2 nodes for MDS/MGS in a pacemaker/corosync cluster
      8 nodes for OSS, arranged as two-node pacemaker/corosync cluster pairs, with one DotHill storage controller per pair
      ~1000 clients

      Problem description:

      The problem described below occurred on the two-node cluster formed by n-oss07 and n-oss08.
      Storage configuration:

      n-oss07:
      nero-OST0006 -> vd01
      nero-OST000e -> vd03
      nero-OST0016 -> vd05
      nero-OST001e -> vd07

      n-oss08:
      nero-OST0007 -> vd02
      nero-OST000f -> vd04
      nero-OST0017 -> vd06
      nero-OST001f -> vd08

      On 27.08 at about 11:24 the node n-oss07 hung and was shut down by the node n-oss08, which successfully took over the n-oss07 volumes and put the OSTs into recovery mode; in particular, the OST nero-OST001e showed a very long recovery time.
      The logs contained plenty of messages like the following extract:

      -------------------------------------------------------------------------
      Aug 27 11:34:25 n-oss08 kernel: LustreError: 7913:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (eb0a36de-8fcf-cfb5-e996-eb6968148594): 329 clients in recovery for 18446744073709551490s
      -------------------------------------------------------------------------
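
      As an aside, the enormous "clients in recovery for ...s" values are consistent with a negative remaining recovery time being printed through an unsigned 64-bit conversion: 2^64 - 18446744073709551490 = 126, so the message above corresponds to "-126 s", and in the later extracts the value keeps decreasing in step with the wall-clock time between messages. The following minimal C sketch illustrates this suspected formatting behaviour (an illustration only, not the actual Lustre code):

      -------------------------------------------------------------------------
      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              /* Assumption for illustration: the remaining recovery time is
               * computed as deadline - now, which goes negative once the
               * recovery deadline has passed. */
              int64_t remaining = -126;  /* 126 s past the recovery deadline */

              /* Printing the signed value through an unsigned 64-bit
               * conversion yields 2^64 - 126. */
              printf("%llu s\n", (unsigned long long)remaining);
              /* Output: 18446744073709551490 s -- the value logged above */
              return 0;
      }
      -------------------------------------------------------------------------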

      When the node n-oss07 came back online, it took back its resources, forcing the unmount and thus hanging the node n-oss08.

      -------------------------------------------------------------------------
      Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Running stop for /dev/mapper/vd07 on /lustre/vd07
      Aug 27 11:41:21 n-oss08 Filesystem[18812]: INFO: Trying to unmount /lustre/vd07
      Aug 27 11:41:21 n-oss08 kernel: Lustre: Failing over nero-OST001e
      Aug 27 11:41:21 n-oss08 kernel: LustreError: 18866:0:(ldlm_lib.c:1978:target_stop_recovery_thread()) nero-OST001e: Aborting recovery
      Aug 27 11:41:40 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)
      Aug 27 11:41:40 n-oss08 kernel: LustreError: 4278:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-19) req@ffff8805a769d800 x1443813942539660/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377596600 ref 1 fl Interpret:/0/ffffffff rc -19/-1
      Aug 27 11:41:40 n-oss08 kernel: LustreError: 4278:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 15 previous similar messages
      Aug 27 11:42:30 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)
      Aug 27 11:42:30 n-oss08 kernel: LustreError: Skipped 709 previous similar messages
      Aug 27 11:43:35 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST001e_UUID' is not available for connect (stopping)

      Aug 27 11:43:35 n-oss08 kernel: LustreError: Skipped 1053 previous similar messages
      Aug 27 11:43:48 n-oss08 kernel: INFO: task umount:18866 blocked for more than 120 seconds.
      Aug 27 11:43:48 n-oss08 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug 27 11:43:48 n-oss08 kernel: umount D 0000000000000008 0 18866 18812 0x00000080
      Aug 27 11:43:48 n-oss08 kernel: ffff880198ce7998 0000000000000082 0000000000000000 ffffffffa06e6ee3
      Aug 27 11:43:48 n-oss08 kernel: ffff880198ce7928 ffffffffa0436105 00000000000007ba ffffffffa06d9670
      Aug 27 11:43:48 n-oss08 kernel: ffff8803f7467b38 ffff880198ce7fd8 000000000000f4e8 ffff8803f7467b38
      Aug 27 11:43:48 n-oss08 kernel: Call Trace:
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0436105>] ? cfs_print_to_console+0x75/0xe0 [libcfs]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814edb95>] schedule_timeout+0x215/0x2e0
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814ed813>] wait_for_common+0x123/0x180
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff814ed92d>] wait_for_completion+0x1d/0x20
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa06664a0>] target_stop_recovery_thread+0x50/0xa0 [ptlrpc]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa066728e>] target_recovery_fini+0x1e/0x30 [ptlrpc]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa07c2266>] filter_precleanup+0xa6/0x470 [obdfilter]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa051dd66>] ? class_disconnect_exports+0x126/0x220 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0531da9>] class_cleanup+0x199/0xa30 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa04404f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0519db6>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0533303>] class_process_config+0xcc3/0x1670 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0437823>] ? cfs_alloc+0x63/0x90 [libcfs]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa052fc4b>] ? lustre_cfg_new+0x31b/0x640 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0533dfc>] class_manual_cleanup+0x14c/0x560 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0519db6>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa053f5ec>] server_put_super+0xaac/0xf40 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff81191376>] ? invalidate_inodes+0xf6/0x190
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811786cb>] generic_shutdown_super+0x5b/0xe0
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811787b6>] kill_anon_super+0x16/0x60
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffffa0535a06>] lustre_kill_super+0x36/0x60 [obdclass]
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff81179740>] deactivate_super+0x70/0x90
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff811956cf>] mntput_no_expire+0xbf/0x110
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8119616b>] sys_umount+0x7b/0x3a0
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff810d4582>] ? audit_syscall_entry+0x272/0x2a0
      Aug 27 11:43:48 n-oss08 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      -------------------------------------------------------------------------

      Both nodes were manually rebooted and recovered their own resources, but n-oss07 still showed a very long recovery time for the OST nero-OST001e:

      -------------------------------------------------------------------------
      Aug 27 12:05:49 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551063s
      Aug 27 12:06:14 n-oss07 kernel: LustreError: 3785:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551038s
      Aug 27 12:06:39 n-oss07 kernel: LustreError: 3762:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709551013s
      Aug 27 12:07:04 n-oss07 kernel: LustreError: 3760:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550988s
      Aug 27 12:07:29 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550963s
      Aug 27 12:07:54 n-oss07 kernel: LustreError: 3773:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550938s
      Aug 27 12:08:19 n-oss07 kernel: LustreError: 3851:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550913s
      Aug 27 12:09:09 n-oss07 kernel: LustreError: 3762:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001e: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 325 clients in recovery for 18446744073709550863s
      -------------------------------------------------------------------------

      The same happened for the OST nero-OST001f on the n-oss08 node:

      -------------------------------------------------------------------------
      Aug 27 12:09:09 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550945s
      Aug 27 12:09:09 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 1 previous similar message
      Aug 27 12:09:59 n-oss08 kernel: LustreError: 3739:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-16) req@ffff88049d05a400 x1444516067280684/t0(0) o8-><?>@<?>:0/0 lens 400/264 e 0 to 0 dl 1377598298 ref 1 fl Interpret:/0/0 rc -16/0
      Aug 27 12:09:59 n-oss08 kernel: LustreError: 3739:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 10 previous similar messages
      Aug 27 12:10:16 n-oss08 cib: [3236]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
      Aug 27 12:10:24 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550870s
      Aug 27 12:10:24 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 2 previous similar messages
      Aug 27 12:12:54 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550720s
      Aug 27 12:12:54 n-oss08 kernel: LustreError: 3821:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 5 previous similar messages
      Aug 27 12:17:29 n-oss08 kernel: LustreError: 3843:0:(ldlm_lib.c:941:target_handle_connect()) nero-OST001f: denying connection for new client 10.201.32.31@o2ib (1439e1a4-2ede-caba-25fd-194822a1ec5a): 360 clients in recovery for 18446744073709550445s
      Aug 27 12:17:29 n-oss08 kernel: LustreError: 3843:0:(ldlm_lib.c:941:target_handle_connect()) Skipped 10 previous similar messages
      Aug 27 12:20:16 n-oss08 cib: [3236]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
      Aug 27 12:20:16 n-oss08 cib: [3236]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
      -------------------------------------------------------------------------

      The only action that allowed the OSTs to be mounted was to manually abort the recovery procedure for nero-OST001e and nero-OST001f (a sketch of the equivalent commands follows the extracts below).

      -------------------------------------------------------------------------
      Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Now serving nero-OST001e on /dev/mapper/vd07 with recovery enabled
      Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Aborting recovery.
      Aug 27 12:50:34 n-oss07 kernel: Lustre: nero-OST001e: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 [...]
      Aug 27 12:54:45 n-oss07 kernel: Lustre: Failing over nero-OST001e
      Aug 27 12:54:46 n-oss07 kernel: LustreError: 3814:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      Aug 27 12:54:46 n-oss07 kernel: Lustre: nero-OST001e: shutting down for failover; client state will be preserved.
      Aug 27 12:54:46 n-oss07 kernel: LustreError: 3814:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Aug 27 12:54:46 n-oss07 kernel: Lustre: OST nero-OST001e has stopped.
      Aug 27 12:54:49 n-oss07 kernel: Lustre: server umount nero-OST001e complete
      [...]
      Aug 27 12:56:40 n-oss07 Filesystem[5002]: INFO: Running start for /dev/mapper/vd07 on /lustre/vd07
      Aug 27 12:56:40 n-oss07 kernel: Lustre: nero-OST0016: Will be in recovery for at least 2:30, or until 746 clients reconnect
      Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
      Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
      Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
      Aug 27 12:56:40 n-oss07 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
      Aug 27 12:56:40 n-oss07 kernel: Lustre: 5111:0:(ldlm_lib.c:2019:target_recovery_init()) RECOVERY: service nero-OST001e, 746 recoverable clients, last_transno 734586500
      [...]
      Aug 27 12:56:41 n-oss07 kernel: Lustre: nero-OST001e: Will be in recovery for at least 2:30, or until 746 clients reconnect
      -------------------------------------------------------------------------

      -------------------------------------------------------------------------
      Aug 27 12:53:12 n-oss08 kernel: Lustre: nero-OST001f: Now serving nero-OST001f on /dev/mapper/vd08 with recovery enabled
      Aug 27 12:53:13 n-oss08 kernel: Lustre: nero-OST001f: Aborting recovery.
      Aug 27 12:53:13 n-oss08 kernel: Lustre: nero-OST001f: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
      Aug 27 12:53:20 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST000e_UUID' is not available for connect (no target)
      Aug 27 12:53:20 n-oss08 kernel: LustreError: 3402:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-19) req@ffff8805ff3f4400 x1439334118204356/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377600899 ref 1 fl Interpret:/0/ffffffff rc -19/-1
      Aug 27 12:53:20 n-oss08 kernel: LustreError: 3402:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 743 previous similar messages
      Aug 27 12:53:20 n-oss08 kernel: LustreError: Skipped 676 previous similar messages
      Aug 27 12:53:21 n-oss08 kernel: Lustre: 3430:0:(filter.c:2691:filter_connect_internal()) nero-OST001f: Received MDS connection for group 0
      Aug 27 12:53:21 n-oss08 kernel: Lustre: nero-OST001f: received MDS connection from 10.201.62.14@o2ib
      Aug 27 12:53:22 n-oss08 kernel: Lustre: 3412:0:(filter.c:2547:filter_llog_connect()) nero-OST001f: Recovery from log 0xf9f0621/0x0:c4b6c135
      Aug 27 12:53:52 n-oss08 kernel: LustreError: 137-5: UUID 'nero-OST0006_UUID' is not available for connect (no target)
      Aug 27 12:53:52 n-oss08 kernel: LustreError: 3458:0:(ldlm_lib.c:2239:target_send_reply_msg()) @@@ processing error (-19) req@ffff8805f7623800 x1443956471336464/t0(0) o8-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1377600931 ref 1 fl Interpret:/0/ffffffff rc -19/-1
      Aug 27 12:53:52 n-oss08 kernel: LustreError: 3458:0:(ldlm_lib.c:2239:target_send_reply_msg()) Skipped 2056 previous similar messages
      Aug 27 12:53:52 n-oss08 kernel: LustreError: Skipped 2120 previous similar messages
      Aug 27 12:53:52 n-oss08 kernel: LustreError: Skipped 2120 previous similar messages
      [...]
      Aug 27 12:54:37 n-oss08 kernel: Lustre: Failing over nero-OST001f
      Aug 27 12:54:39 n-oss08 kernel: LustreError: 3734:0:(ldlm_request.c:1170:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      Aug 27 12:54:39 n-oss08 kernel: Lustre: nero-OST001f: shutting down for failover; client state will be preserved.
      Aug 27 12:54:39 n-oss08 kernel: LustreError: 3734:0:(ldlm_request.c:1796:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Aug 27 12:54:39 n-oss08 kernel: Lustre: OST nero-OST001f has stopped.
      Aug 27 12:54:42 n-oss08 kernel: Lustre: server umount nero-OST001f complete
      [...]
      Aug 27 12:56:40 n-oss08 Filesystem[4734]: INFO: Running start for /dev/mapper/vd08 on /lustre/vd08
      Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
      Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
      Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
      Aug 27 12:56:40 n-oss08 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts:
      Aug 27 12:56:40 n-oss08 kernel: Lustre: 4843:0:(ldlm_lib.c:2019:target_recovery_init()) RECOVERY: service nero-OST001f, 746 recoverable clients, last_transno 730833312
      Aug 27 12:56:40 n-oss08 kernel: Lustre: nero-OST001f: Now serving nero-OST001f on /dev/mapper/vd08 with recovery enabled
      Aug 27 12:56:40 n-oss08 kernel: Lustre: nero-OST001f: temporarily refusing client connection from 10.201.51.36@o2ib
      Aug 27 12:56:40 n-oss08 kernel: Lustre: Skipped 42 previous similar messages
      Aug 27 12:56:41 n-oss08 lrmd: [3762]: info: Managed fs_vd08:start process 4734 exited with return code 0.
      Aug 27 12:56:41 n-oss08 crmd: [3765]: info: process_lrm_event: LRM operation fs_vd08_start_0 (call=17, rc=0, cib-update=24, confirmed=true) ok
      Aug 27 12:56:41 n-oss08 kernel: Lustre: nero-OST001f: Will be in recovery for at least 2:30, or until 746 clients reconnect
      -------------------------------------------------------------------------
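
      For reference, a recovery abort of this kind can be performed either at mount time or on an already-mounted target. The following is a sketch based on the standard Lustre administration interface, with device and target names taken from this report; it is an illustration, not the exact command sequence that was run:

      -------------------------------------------------------------------------
      # Skip recovery entirely when mounting the OST:
      mount -t lustre -o abort_recov /dev/mapper/vd07 /lustre/vd07

      # Or abort recovery on a target that is already mounted and stuck:
      lctl --device nero-OST001e abort_recovery

      # The recovery state of the target can then be checked with:
      cat /proc/fs/lustre/obdfilter/nero-OST001e/recovery_status
      -------------------------------------------------------------------------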


          People

            Assignee: Bruno Faccini (bfaccini) (Inactive)
            Reporter: Matteo Piccinini (matteo.piccinini) (Inactive)
