  1. Lustre
  2. LU-2132

2.1.3 client hangs on 'df'

Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.3
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 5134

    Description

      A front-end node, pfe3, hangs on df. Nagios reports nbp1 as unmounted.

      The Lustre errors logged in /var/log/messages before the hang are shown below. We had to reboot pfe3 to get nbp1 mounted again.

      ...
      Oct 7 15:08:25 pfe3 kernel: [490266.072028] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1349646649/real 1349646649] req@ffff880024adb800 x1414695477195603/t0(0) o6->nbp1-OST003c-osc-ffff880073e9d400@10.151.26.31@o2ib:6/4 lens 512/400 e 2 to 1 dl 1349647705 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      Oct 7 15:08:25 pfe3 kernel: [490266.158936] Lustre: 7366:0:(client.c:1780:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
      Oct 7 15:08:25 pfe3 kernel: [490266.188374] Lustre: nbp1-OST003c-osc-ffff880073e9d400: Connection to nbp1-OST003c (at 10.151.26.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Oct 7 15:08:25 pfe3 kernel: [490266.252863] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
      Oct 7 15:08:25 pfe3 kernel: [490266.289310] LustreError: Skipped 20 previous similar messages
      Oct 7 15:10:05 pfe3 kernel: [490366.252622] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
      Oct 7 15:10:05 pfe3 kernel: [490366.289087] LustreError: Skipped 3 previous similar messages
      Oct 7 15:10:39 pfe3 ntpd[5013]: kernel time sync status change 2001
      Oct 7 15:11:36 pfe3 envmodule: bkup load nas
      Oct 7 15:11:37 pfe3 envmodule: bkup load nas
      Oct 7 15:12:15 pfe3 kernel: [490496.137877] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection to nbp1-OST005f (at 10.151.26.34@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Oct 7 15:12:15 pfe3 kernel: [490496.236273] LustreError: 167-0: This client was evicted by nbp1-OST005f; in progress operations using this service will fail.
      Oct 7 15:12:15 pfe3 kernel: [490496.270510] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lock@ffff88005932a358[2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1005f0000:0x2fa3e62:0x0] {
      Oct 7 15:12:15 pfe3 kernel: [490496.318170] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) lovsub@ffff88001513c820: [0 ffff880070c99c28 R(1):[2304, 18446744073709551615]@[0x44855761059:0x128f3:0x0]] [9 ffff880069152e98 P(0):[0, 18446744073709551615]@[0x44855761059:0x128f3:0x0]]
      Oct 7 15:12:16 pfe3 kernel: [490496.389224] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) osc@ffff880070f789e0: ffff880052ef3480 40160002 0x3bf5f5ecface519f 3 ffff880029da22b8 size: 859986 mtime: 1349647704 atime: 1349647702 ctime: 1349647704 blocks: 1688
      Oct 7 15:12:16 pfe3 kernel: [490496.389229] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) } lock@ffff88005932a358
      Oct 7 15:12:16 pfe3 kernel: [490496.389231] LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) dlmlock returned -5
      Oct 7 15:12:16 pfe3 kernel: [490496.389268] LustreError: 9848:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace nbp1-OST005f-osc-ffff880073e9d400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      Oct 7 15:12:16 pfe3 kernel: [490496.389271] LustreError: 9848:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff8800641f5c00 (49954402/0/0/0) (rc: 1)
      Oct 7 15:12:16 pfe3 kernel: [490496.389299] Lustre: nbp1-OST005f-osc-ffff880073e9d400: Connection restored to nbp1-OST005f (at 10.151.26.34@o2ib)
      Oct 7 15:13:00 pfe3 kernel: [490541.188284] LustreError: 11-0: an error occurred while communicating with 10.151.26.31@o2ib. The ost_connect operation failed with -16
      Oct 7 15:13:00 pfe3 kernel: [490541.224748] LustreError: Skipped 7 previous similar messages
      ...

      The ldlm_resource_complain() messages seem to carry the same signature as ORI-735, but here they occur with 2.1.3 on our production systems.
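
For anyone triaging a longer copy of this log, a minimal Python sketch for tallying the negative return codes by errno name might look like the following. It is not part of the original report: the function name `summarize_errors` and the regular expression are illustrative assumptions, and the pattern only covers the two phrasings seen in the excerpt above ("failed with -N" and "returned -N"). On Linux, -16 is EBUSY and -5 is EIO.

```python
import errno
import re
from collections import Counter

# Assumption: only the two phrasings from the excerpt are matched:
#   "The ost_connect operation failed with -16"
#   "dlmlock returned -5"
ERR_RE = re.compile(r"(?:failed with|returned) -(\d+)")


def summarize_errors(lines):
    """Count negative return codes in kernel log lines, keyed by errno name."""
    counts = Counter()
    for line in lines:
        for code in ERR_RE.findall(line):
            num = int(code)
            # Fall back to the raw number if the platform has no name for it.
            counts[errno.errorcode.get(num, f"errno {num}")] += 1
    return counts


# Message bodies copied from the log excerpt (syslog prefixes dropped).
sample = [
    "LustreError: 11-0: an error occurred while communicating with "
    "10.151.26.31@o2ib. The ost_connect operation failed with -16",
    "LustreError: 9848:0:(osc_lock.c:809:osc_ldlm_completion_ast()) "
    "dlmlock returned -5",
    "LustreError: 11-0: an error occurred while communicating with "
    "10.151.26.31@o2ib. The ost_connect operation failed with -16",
]

if __name__ == "__main__":
    for name, count in summarize_errors(sample).most_common():
        print(name, count)
```

To run it against the real file, pass an open handle on /var/log/messages instead of `sample`. Note that "Skipped N previous similar messages" lines are not counted, so the totals are a lower bound.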

          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jaylan Jay Lan (Inactive)