Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7392

client evicted: namespace resource [0x2b9a7de:0x0:0x0].0x0 (ffff8806d80cfcc0) refcount nonzero (1)

Details

    • Bug
    • Resolution: Won't Fix
    • Critical
    • None
    • Lustre 2.8.0
    • lola
      build: 2.7.62-28-g0754bc8, 0754bc8f2623bea184111af216f7567608db35b6; soakbuild '20151104.1'
    • 3
    • 9223372036854775807

    Description

      Error occurred during soak testing of build '20151104.1' on cluster lola (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151104.1). MDTs are fromated with ldiskfs and OSTs with zfs as storage backend. DNE is enabled. MDSes are configured in HA failover configuration.

      Sequence of events:

      • 2015-11-04 18:47:30 – mds_restart lola-9 completed
      • 2015-11-04 18:50:30 – OSS (lola-5) evict client
        lola-5.log:Nov  4 18:50:30 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.131@o2ib100  ns: filter-soaked-OST0007_UUID lock: ffff880313f841c0/0x15cebc1506e2a9b5 lrc: 3/0,0 mode: PW/PW res: [0x2b9a7de:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 192.168.1.131@o2ib100 remote: 0x175f680569401922 expref: 5 pid: 10807 timeout: 4306845817 lvb_type: 0
        

        similar message exist on all OSS nodes

      • 2015-11-04 18:55:45 – client (lola-31) evicted from ost7
        Nov  4 18:50:45 lola-31 kernel: LustreError: 167-0: soaked-Nov  4 18:50:30 -osc-ffff881071e62400: This client w
        as evicted by soaked-OST0007; in progress operations using this service will fail.
        Nov  4 18:50:45 lola-31 kernel: LustreError: 16677:0:(ldlm_resource.c:887:ldlm_resource_complain()) so
        aked-OST0007-osc-ffff881071e62400: namespace resource [0x2b9a7de:0x0:0x0].0x0 (ffff8806d80cfcc0) refco
        unt nonzero (1) after lock cleanup; forcing cleanup.
        Nov  4 18:50:45 lola-31 kernel: LustreError: 16677:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- R
        esource: [0x2b9a7de:0x0:0x0].0x0 (ffff8806d80cfcc0) refcount = 2
        Nov  4 18:50:45 lola-31 kernel: LustreError: 16677:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Grant
        ed locks (in reverse order):
        Nov  4 18:50:45 lola-31 kernel: LustreError: 16677:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### #
        ## ns: soaked-OST0007-osc-ffff881071e62400 lock: ffff880850f12a80/0x175f680569401922 lrc: 3/0,1 mode: 
        PW/PW res: [0x2b9a7de:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1844674407370955
        1615) flags: 0x526480000000 nid: local remote: 0x15cebc1506e2a9b5 expref: -99 pid: 15402 timeout: 0 lv
        b_type: 1
        
      • till 2015-11-04 18:59:05 ost {8, a, b, c}

        are evicted with same error messages on client and OSSes

      • 2015-11-05 – client (lola-31) osc stay in state DISCONN, EVICTED for the OSTs affected (see
        file 'evicted-client.txt.bz2)
      • client node is unusable and all jobs crashed

      The problem might be related to LU-2067.

      Attached files:

      • OSSes (lola-[2-5]: messages, console log files
      • client lola-31: messages, console log files, 'lctl ..state* - output

      Attachments

        1. console-lola-31.log.bz2
          43 kB
        2. evicted-client.txt.bz2
          0.8 kB
        3. lola-2.log.bz2
          65 kB
        4. lola-3.log.bz2
          75 kB
        5. lola-4.log.bz2
          64 kB
        6. lola-5.log.bz2
          66 kB
        7. messages-lola-2.log.bz2
          456 kB
        8. messages-lola-3.log.bz2
          411 kB
        9. messages-lola-31.log.bz2
          217 kB
        10. messages-lola-4.log.bz2
          226 kB
        11. messages-lola-5.log.bz2
          453 kB

        Issue Links

          Activity

            [LU-7392] client evicted: namespace resource [0x2b9a7de:0x0:0x0].0x0 (ffff8806d80cfcc0) refcount nonzero (1)

            close old tickets

            jay Jinshan Xiong (Inactive) added a comment - close old tickets

            Hi!

            Is there any progress on this problem?

            We are getting hit by a problem that judging from the error messages are a good match, except that there is no zfs involved here.

            We're running 2.8.56 + the fixes for LU-6808 on the client side and 2.5.41-DDN on the servers.

            (The clients also have the tentative fix for LU-6854 installed at the moment since we seemd to suffer from that specific problem too, but the above evicted problem have been showing up long before that patch was applied)

            ake_s Åke Sandgren added a comment - Hi! Is there any progress on this problem? We are getting hit by a problem that judging from the error messages are a good match, except that there is no zfs involved here. We're running 2.8.56 + the fixes for LU-6808 on the client side and 2.5.41-DDN on the servers. (The clients also have the tentative fix for LU-6854 installed at the moment since we seemd to suffer from that specific problem too, but the above evicted problem have been showing up long before that patch was applied)

            This error also occured for build '20160106' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160106) again:
            client:

            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 167-0: soaked-OST000f-osc-ffff8808301ba000: This client was evicted by soaked-OST000f; in progress operations using this service will fail.
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:887:ldlm_resource_complain()) soaked-OST000f-osc-ffff8808301ba000: namespace resource [0x7c0000401:0x18fbb85:0x0].0x0 (ffff88006f2206c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x7c0000401:0x18fbb85:0x0].0x0 (ffff88006f2206c0) refcount = 3
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### ### ns: soaked-OST000f-osc-ffff8808301ba000 lock: ffff880366080940/0xedab12f62583edad lrc: 3/0,1 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x526400000000 nid: local remote: 0x6dd9a1f2125effbd expref: -99 pid: 129953 timeout: 0 lvb_type: 1
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1523:ldlm_resource_dump()) Waiting locks:
            lola-26.log:Jan  7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1525:ldlm_resource_dump()) ### ### ns: soaked-OST000f-osc-ffff8808301ba000 lock: ffff880a0b5668c0/0xedab12f62583edbb lrc: 4/0,1 mode: --/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x106400020000 nid: local remote: 0x6dd9a1f2125effc4 expref: -99 pid: 129954 timeout: 0 lvb_type: 1
            

            OSS

            lola-5.log:Jan  7 19:51:46 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.126@o2ib100  ns: filter-soaked-OST000f_UUID lock: ffff880341ac0300/0x6dd9a1f2125effbd lrc: 3/0,0 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 192.168.1.126@o2ib100 remote: 0xedab12f62583edad expref: 6 pid: 18388 timeout: 4385304734 lvb_type: 0
            lola-5.log:Jan  7 19:51:46 lola-5 kernel: LustreError: 17976:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8802b92190c0 x1522673517776812/t0(0) o105->soaked-OST000f@192.168.1.126@o2ib100:15/16 lens 360/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            lola-5.log:Jan  7 19:51:46 lola-5 kernel: LustreError: 17976:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 192.168.1.126@o2ib100) failed to reply to completion AST (req status 0 rc -5), evict it ns: filter-soaked-OST000f_UUID lock: ffff8803e925c3c0/0x6dd9a1f2125effc4 lrc: 3/0,0 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000020000 nid: 192.168.1.126@o2ib100 remote: 0xedab12f62583edbb expref: 4 pid: 18388 timeout: 0 lvb_type: 0
            lola-5.log:Jan  7 19:51:47 lola-5 kernel: LustreError: 17976:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) Skipped 4 previous similar messages
            
            heckes Frank Heckes (Inactive) added a comment - This error also occured for build '20160106' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160106 ) again: client : lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 167-0: soaked-OST000f-osc-ffff8808301ba000: This client was evicted by soaked-OST000f; in progress operations using this service will fail. lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:887:ldlm_resource_complain()) soaked-OST000f-osc-ffff8808301ba000: namespace resource [0x7c0000401:0x18fbb85:0x0].0x0 (ffff88006f2206c0) refcount nonzero (2) after lock cleanup; forcing cleanup. lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x7c0000401:0x18fbb85:0x0].0x0 (ffff88006f2206c0) refcount = 3 lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### ### ns: soaked-OST000f-osc-ffff8808301ba000 lock: ffff880366080940/0xedab12f62583edad lrc: 3/0,1 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x526400000000 nid: local remote: 0x6dd9a1f2125effbd expref: -99 pid: 129953 timeout: 0 lvb_type: 1 lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1523:ldlm_resource_dump()) Waiting locks: lola-26.log:Jan 7 19:52:06 lola-26 kernel: LustreError: 130892:0:(ldlm_resource.c:1525:ldlm_resource_dump()) ### ### ns: soaked-OST000f-osc-ffff8808301ba000 lock: ffff880a0b5668c0/0xedab12f62583edbb lrc: 4/0,1 mode: --/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x106400020000 nid: local remote: 0x6dd9a1f2125effc4 expref: -99 pid: 129954 timeout: 0 lvb_type: 1 OSS lola-5.log:Jan 7 19:51:46 lola-5 kernel: LustreError: 0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.126@o2ib100 ns: filter-soaked-OST000f_UUID lock: ffff880341ac0300/0x6dd9a1f2125effbd lrc: 3/0,0 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000000020 nid: 192.168.1.126@o2ib100 remote: 0xedab12f62583edad expref: 6 pid: 18388 timeout: 4385304734 lvb_type: 0 lola-5.log:Jan 7 19:51:46 lola-5 kernel: LustreError: 17976:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8802b92190c0 x1522673517776812/t0(0) o105->soaked-OST000f@192.168.1.126@o2ib100:15/16 lens 360/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 lola-5.log:Jan 7 19:51:46 lola-5 kernel: LustreError: 17976:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 192.168.1.126@o2ib100) failed to reply to completion AST (req status 0 rc -5), evict it ns: filter-soaked-OST000f_UUID lock: ffff8803e925c3c0/0x6dd9a1f2125effc4 lrc: 3/0,0 mode: PW/PW res: [0x7c0000401:0x18fbb85:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000020000 nid: 192.168.1.126@o2ib100 remote: 0xedab12f62583edbb expref: 4 pid: 18388 timeout: 0 lvb_type: 0 lola-5.log:Jan 7 19:51:47 lola-5 kernel: LustreError: 17976:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) Skipped 4 previous similar messages

            Sorry, for the delay. I'll take care to include the patch in the next soak build.

            heckes Frank Heckes (Inactive) added a comment - Sorry, for the delay. I'll take care to include the patch in the next soak build.
            adilger Andreas Dilger added a comment - - edited

            Peter thinks this may related to the patch http://review.whamcloud.com/15127 "LU-4865 zfs: grow block size by write pattern" which was shown to be causing problems in 2.7.1 testing and was reverted. I've submitted a reversion patch against master for testing to see if the problem goes away:
            https://review.whamcloud.com/17053

            adilger Andreas Dilger added a comment - - edited Peter thinks this may related to the patch http://review.whamcloud.com/15127 " LU-4865 zfs: grow block size by write pattern" which was shown to be causing problems in 2.7.1 testing and was reverted. I've submitted a reversion patch against master for testing to see if the problem goes away: https://review.whamcloud.com/17053

            Hi Jinshan,
            Can you please have a look at this issue?
            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment - Hi Jinshan, Can you please have a look at this issue? Thanks. Joe

            People

              jay Jinshan Xiong (Inactive)
              heckes Frank Heckes (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: