
LU-2030: LDLM timeout resulting in client eviction and truncate system call failing with EINTR

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Environment: Lustre Client: 2.1.2-4chaos
      Lustre Server: 2.1.2-3chaos
    • 3
    • 4159

    Description

      I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total).

      The reproducer is:

      $ salloc -N2 -n32
      $ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635
      

      Specifically, I see truncate return EINTR when running the test under strace:

      truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)
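
      For reference, one way to capture the per-rank strace output under Slurm is sketched below; the simul path and mount point are copied from the reproducer above, while the sh wrapper and output file names are only illustrative (the exact invocation used is not shown in the ticket):

      $ salloc -N2 -n32
      $ srun -- sh -c 'exec strace -e trace=truncate -o /tmp/simul-trace.$SLURM_PROCID /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635'
      # The trace files land in each compute node's /tmp, so check them there
      # (e.g. via pdsh); the failing rank shows the -1 EINTR return.
      $ grep EINTR /tmp/simul-trace.*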
      

      The console logs from the clients and servers for a run that hit this are below.

      Test:

      hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15
      Simul is running with 32 process(es)
      13:14:38: Set iteration 0
      13:14:38: Running test #14(iter 0): creat, shared mode.
      13:14:38: Running test #15(iter 0): truncate, shared mode.
      13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call
      

      Client console messages:

      # hype355 /root > tail -f /var/log/conman/console.hype21{3,4}
      ==> /var/log/conman/console.hype213 <==
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1)
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      
      ==> /var/log/conman/console.hype214 <==
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      

      Server console messages:

      > $ tail /var/log/conman/console.zwicky49
      > ==> /var/log/conman/console.zwicky49 <==
      > 2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590
      > 2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352
      > 2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0
      > 2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173
      > 2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136
      > 2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0
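
      The "lock callback timer expired after 100s" messages above are the OSS evicting a client that did not cancel a blocking-AST'd extent lock within the allowed time. As a purely illustrative check (the parameter names assume the usual /proc/sys/lustre layout for 2.1.x), the timeout tunables involved can be inspected on the OSS with:

      # On the OSS (e.g. zwicky49):
      $ lctl get_param timeout ldlm_timeout at_min at_max
      $ cat /proc/sys/lustre/at_history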
      

      The client version of Lustre on the nodes:

      hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      hype[213-214]
      ----------------
      lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      The server version of Lustre:

      $ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      zwicky[48-49]
      ----------------
      lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      Attachments

        Activity

          [LU-2030] LDLM timeout resulting in client eviction and truncate system call failing with EINTR

          Yes, the issue was reproducible. I'll try to allocate some time to get on the machine and gather the logs from all 4 OSS nodes for this filesystem. Any debug flags of special interest?

          prakash Prakash Surya (Inactive) added a comment

          Hi Prakash,
          Is the issue reproducible? There is still one more issue in the logs: why did OST0010 take so much time to enqueue this glimpse lock?
          Is the log of the node serving lc2-OST0010 available?

          hongchao.zhang Hongchao Zhang added a comment

          The patch is tracked at http://review.whamcloud.com/#change,4382

          hongchao.zhang Hongchao Zhang added a comment

          Good to hear. Thanks!

          prakash Prakash Surya (Inactive) added a comment

          Status update: the candidate patch is under development and test.

          hongchao.zhang Hongchao Zhang added a comment

          In LU-2030-llogs-1 (+dlmtrace, +rpctrace), "simul_truncate.0" (PID=112500) failed:
          00000002:00010000:9.0:1348694770.130724:0:112500:0:(mdc_locks.c:918:mdc_intent_lock()) (name: simul_truncate.0,[0x20097c3a8:0x16a98:0x0]) in obj [0x20095bc20:0x114c:0x0], intent: getattr flags 00

          It had acquired the LDLM lock "ffff8806636c1b40/0xf19c08e9d42c4459" at OST0013 and then tried to get the LDLM lock
          "ffff8807ff4d3d80/0xf19c08e9d42c4460" at OST0010, which took more than 100s; then the eviction occurred.

          hongchao.zhang Hongchao Zhang added a comment
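
          To follow those two lock handles through the attached dumps, something along these lines works (assuming the tarballs unpack into plain-text lctl debug logs; the glob below is illustrative):

          $ tar xzf LU-2030-llogs-1.tar.gz
          $ grep -E '0xf19c08e9d42c4459|0xf19c08e9d42c4460' */lustre-log.* | sort -t: -k5 -n
          # (sort key 5, not 4, because grep prepends "file:" to each match;
          #  the timestamp is the 4th ':'-field of the raw debug lines, so this
          #  interleaves client and server events in time order)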

          Also, in case it's useful, I reproduced the issue with both '+dlmtrace' and '+rpctrace'. So:

          • LU-2030-llogs.tar.gz : Has '+dlmtrace'
          • LU-2030-llogs-1.tar.gz : Has '+dlmtrace' and '+rpctrace'
          prakash Prakash Surya (Inactive) added a comment
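
          For completeness, a sketch of how such dumps are typically captured on the affected nodes (the buffer size and output path are just examples, not the exact commands used here):

          # Enable the extra debug masks and enlarge the kernel trace buffer.
          $ lctl set_param debug=+dlmtrace
          $ lctl set_param debug=+rpctrace
          $ lctl set_param debug_mb=256
          $ lctl clear
          # ... reproduce the eviction with simul ...
          # Dump the debug buffer to a file for attachment.
          $ lctl dk /tmp/lustre-debug.$(hostname).log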

          Looking at the server logs, the RX errors from the LNet routers are a bit troubling. The clients and the servers have to go through a set of routers, 10.1.1.53@o2ib9 being the NID of one of the routers:

          00010000:00010000:4.0:1348679193.857904:0:20439:0:(ldlm_lockd.c:1069:ldlm_handle_enqueue0()) ### server-side enqueue handler START
          00010000:00010000:4.0:1348679193.857909:0:20439:0:(ldlm_lockd.c:1154:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 2/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->0] (req 0->0) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: -99 pid: 20439 timeout 0
          00010000:00010000:4.0:1348679193.857921:0:20439:0:(ldlm_lock.c:626:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 2/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
          00010000:00010000:4.0:1348679193.857925:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
          00010000:00010000:4.0:1348679193.857926:0:20439:0:(ldlm_extent.c:295:ldlm_check_contention()) contended locks = 1
          00010000:00010000:4.0:1348679193.857931:0:20439:0:(ldlm_lockd.c:809:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 0
          00010000:00010000:4.0:1348679193.857936:0:20439:0:(ldlm_lockd.c:468:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 4/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 14 pid: 20439 timeout 7832370900
          00010000:00010000:4.0:1348679193.858049:0:20439:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: filter-lc2-OST000c_UUID lock: ffff880248b78240/0x8691e72961eb80f5 lrc: 3/0,0 mode: --/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xf19c08e9d42c4364 expref: 18 pid: 20439 timeout 0
          00010000:00010000:4.0:1348679193.858054:0:20439:0:(ldlm_lockd.c:1320:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff880248b78240, rc 0)
          00000800:00000100:0.0:1348679234.951973:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:1.0:1348679234.951992:0:2591:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:2.0:1348679234.951999:0:2592:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:4.0:1348679234.952008:0:2594:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:0.0:1348679234.952008:0:2590:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:1.0:1348679234.952019:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:2.0:1348679234.952024:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:1.0:1348679234.952034:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:1.0:1348679234.952043:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:1.0:1348679234.952052:0:2588:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00000800:00000100:2.0:1348679234.952057:0:2587:0:(o2iblnd_cb.c:472:kiblnd_rx_complete()) Rx from 10.1.1.53@o2ib9 failed: 5
          00010000:00020000:4.1F:1348679293.957439:0:0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff880248b78480/0x8691e72961eb80ee lrc: 3/0,0 mode: PW/PW res: 167630809/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4ac0bfd expref: 4 pid: 20439 timeou
          
          prakash Prakash Surya (Inactive) added a comment
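
          Given the kiblnd_rx_complete() failures above, a quick sanity check (purely illustrative; the NID is taken from the log) is to verify LNet reachability of that router from both a client and the OSS, and to confirm the configured routes:

          # From a client (e.g. hype214) and from the OSS (zwicky49):
          $ lctl ping 10.1.1.53@o2ib9
          $ lctl list_nids
          $ cat /proc/sys/lnet/routes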

          Here are logs from the server and the client with '+dlmtrace'.

          The client also dumped this to its console:

          2012-09-26 10:08:14 LustreError: 11-0: lc2-OST000c-osc-ffff880831ce8800: Communicating with 10.1.1.49@o2ib9, operation ost_punch failed with -107.
          2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
          2012-09-26 10:08:14 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
          2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff88043156ddd8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000c0000:0x9fdd7d9:0x0] {
          2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast())     lovsub@ffff88043156e4e0: [0 ffff880773391748 W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0]] 
          2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast())     osc@ffff88043156fb50: ffff880199befb40 00120002 0xbd300c41b4ac0bfd 3 ffff8803dd5e9068 size: 1024 mtime: 1348679193 atime: 1348679193 ctime: 1348679193 blocks: 8
          2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff88043156ddd8
          2012-09-26 10:08:14 LustreError: 79024:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
          2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880831ce8800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff88082a0540b8[2 4 0 2 0 00000000] W(2):[0, 18446744073709551615]@[0x20097c3a8:0x16a97:0x0] {
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     vvp@ffff880773390420: 
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     lov@ffff880773391748: 2
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     0 0: ---
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try())     1 0: ---
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) 
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff88082a0540b8
          2012-09-26 10:08:14 LustreError: 78995:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
          2012-09-26 10:08:14 LustreError: 79024:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880820256ac0 (167630809/0/0/0) (rc: 0)
          2012-09-26 10:08:14 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
          
          prakash Prakash Surya (Inactive) added a comment

          Here is a Lustre log file with '+dlmtrace' from the server on which the lock expired.

          prakash Prakash Surya (Inactive) added a comment

          Is the debug log available? An extent lock was held by the client for much longer than the server permitted, so the client was evicted. From the debug log
          we could find which lock was the culprit and trace where the real problem is.

          hongchao.zhang Hongchao Zhang added a comment

          People

            Assignee: hongchao.zhang Hongchao Zhang
            Reporter: prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 10
