Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2030

LDLM timeout resulting in client eviction and truncate system call to fail with EINTR

    XMLWordPrintable

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 2.1.2
    • Lustre Client: 2.1.2-4chaos
      Lustre Server: 2.1.2-3chaos
    • 3
    • 4159

    Description

      I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total).

      The reproducer is:

      $ salloc -N2 -n32
      $ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635
      

      Specifically, I see truncate return EINTR when running the test under strace:

      truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)
      

      The console logs from the clients and servers from a case where it hit this is below.

      Test:

      hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15
      Simul is running with 32 process(es)
      13:14:38: Set iteration 0
      13:14:38: Running test #14(iter 0): creat, shared mode.
      13:14:38: Running test #15(iter 0): truncate, shared mode.
      13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call
      

      Client console messages:

      # hype355 /root > tail -f /var/log/conman/console.hype21{3,4}
      ==> /var/log/conman/console.hype213 <==
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1)
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      
      ==> /var/log/conman/console.hype214 <==
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      

      Server console messages:

      > $ tail /var/log/conman/console.zwicky49
      > ==> /var/log/conman/console.zwicky49 <==
      > 2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590
      > 2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352
      > 2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0
      > 2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173
      > 2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136
      > 2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0
      

      The client version of lustre on the nodes:

      hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      hype[213-214]
      ----------------
      lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      The server version of lustre:

      $ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      zwicky[48-49]
      ----------------
      lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      Attachments

        Activity

          People

            hongchao.zhang Hongchao Zhang
            prakash Prakash Surya (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: