Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1637

client evictions while writing data (lock callback timer expired after)

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 1.8.8
    • None
    • RHEL5 and Lustre 1.8.8-wc1 on OSSes, 10GigE connected servers and clients. RHEL5 clients running Lustre 1.8.7-wc1, Ubuntu clients running Lustre 1.8.8-wc1 with special kernel
    • 2
    • 4005

    Description

      Recently some of our Ubuntu based detectors started to suffer from zero-sized files on our Lustre during data collection. This was traced to evictions of these clients by the OSSes during data collections, this is happening at least daily on our production file system with data loss for the affected experiment.

      We have managed to reproduce the evictions on our test file system using dd and parallel to write files on one 10GigE attached client while reading the same files on two other 10GigE attached clients, again using dd and parallel, details below.

      On the OSS we get these messages (full logs for both server client attached)

      Jul 16 14:23:44 cs04r-sc-oss01-08 kernel: LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.142.199@tcp  ns: filter-play01-OST0003_UUID lock: ffff81041b944a00/0x8036ad3b0f3d8168 lrc: 3/0,0 mode: PW/PW res: 22839846/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0xd577a930b0412c96 expref: 306 pid: 8124 timeout 4297209606
      Jul 16 14:23:46 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 93327360 != fo_tot_granted 118464512
      Jul 16 14:23:46 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:23:51 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 93327360 != fo_tot_granted 118464512
      Jul 16 14:23:51 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:23:56 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:23:56 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:01 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:24:01 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:06 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:24:06 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: Lustre: 8025:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
      Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: LustreError: 8025:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff8102dc663800 x1407112324567209/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445147 ref 1 fl Interpret:/0/0 rc -114/0
      Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: LustreError: 8025:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 1 previous similar message
      Jul 16 14:24:11 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:24:11 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) Skipped 1 previous similar message
      Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) Skipped 1 previous similar message
      Jul 16 14:24:23 cs04r-sc-oss01-08 kernel: Lustre: 8112:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
      Jul 16 14:24:23 cs04r-sc-oss01-08 kernel: LustreError: 8112:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff81024e9f2000 x1407112324567270/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445163 ref 1 fl Interpret:/0/0 rc -114/0
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: Lustre: 8113:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 8113:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff810108583400 x1407112324567423/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445181 ref 1 fl Interpret:/0/0 rc -114/0
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) Skipped 4 previous similar messages
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
      Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) Skipped 4 previous similar messages
      Jul 16 14:25:01 cs04r-sc-oss01-08 kernel: Lustre: 8262:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
      Jul 16 14:25:01 cs04r-sc-oss01-08 kernel: LustreError: 8262:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff8103a3aa1400 x1407112324567555/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445201 ref 1 fl Interpret:/0/0 rc -114/0
      Jul 16 14:25:15 cs04r-sc-oss01-08 kernel: Lustre: Service thread pid 8689 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jul 16 14:25:15 cs04r-sc-oss01-08 kernel: Pid: 8689, comm: ll_ost_io_297
      [snip]
      

      At the same time on the client doing the writing

      Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: 7315:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324564365 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.66.21@tcp 84s ago has timed out (84s prior to deadline).
      Jul 16 14:23:53 cs04r-sc-serv-48 kernel:   req@ffff81014027f400 x1407112324564365/t0 o10->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 400/592 e 0 to 1 dl 1342445033 ref 1 fl Rpc:/0/0 rc 0/0
      Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: 7315:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
      Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: play01-OST0003-osc-ffff8101af83dc00: Connection to service play01-OST0003 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: Skipped 2 previous similar messages
      Jul 16 14:24:07 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 2s
      Jul 16 14:24:07 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 2 previous similar messages
      Jul 16 14:24:07 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:24:07 cs04r-sc-serv-48 kernel: LustreError: Skipped 1 previous similar message
      Jul 16 14:24:12 cs04r-sc-serv-48 kernel: Lustre: 7230:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324562867 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.66.21@tcp 117s ago has timed out (117s prior to deadline).
      Jul 16 14:24:12 cs04r-sc-serv-48 kernel:   req@ffff81010cdb7c00 x1407112324562867/t0 o10->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 400/592 e 1 to 1 dl 1342445052 ref 1 fl Rpc:/0/0 rc 0/0
      Jul 16 14:24:12 cs04r-sc-serv-48 kernel: Lustre: 7230:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
      Jul 16 14:24:23 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 3s
      Jul 16 14:24:23 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:24:41 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:25:00 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324567445 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.68.17@tcp 9s ago has timed out (9s prior to deadline).
      Jul 16 14:25:00 cs04r-sc-serv-48 kernel:   req@ffff81029e10e800 x1407112324567445/t0 o8->play01-OST0003_UUID@172.23.68.17@tcp:28/4 lens 368/584 e 0 to 1 dl 1342445100 ref 1 fl Rpc:N/0/0 rc 0/0
      Jul 16 14:25:00 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
      Jul 16 14:25:01 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 5s
      Jul 16 14:25:01 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
      Jul 16 14:25:01 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:25:26 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:25:47 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 7s
      Jul 16 14:25:47 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
      Jul 16 14:25:47 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:26:05 cs04r-sc-serv-48 kernel: Lustre: 4520:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324560521 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.68.17@tcp 248s ago has timed out (248s prior to deadline).
      Jul 16 14:26:05 cs04r-sc-serv-48 kernel:   req@ffff81004c2a9c00 x1407112324560521/t0 o4->play01-OST0003_UUID@172.23.68.17@tcp:6/4 lens 448/608 e 2 to 1 dl 1342445165 ref 2 fl Rpc:/0/0 rc 0/0
      Jul 16 14:26:05 cs04r-sc-serv-48 kernel: Lustre: 4520:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
      Jul 16 14:26:13 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
      Jul 16 14:26:41 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(import.c:855:ptlrpc_connect_interpret()) play01-OST0003_UUID@172.23.66.21@tcp changed server handle from 0x8036ad3b0f3866c2 to 0x8036ad3b0f3dbc39
      Jul 16 14:26:41 cs04r-sc-serv-48 kernel: LustreError: 167-0: This client was evicted by play01-OST0003; in progress operations using this service will fail.
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7373:0:(rw.c:198:ll_file_punch()) obd_truncate fails (-4) ino 6208746
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7377:0:(rw.c:198:ll_file_punch()) obd_truncate fails (-5) ino 6208751
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7377:0:(rw.c:198:ll_file_punch()) Skipped 9 previous similar messages
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 4520:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101a9e26800 x1407112324567962/t0 o4->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 4520:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7490:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace play01-OST0003-osc-ffff8101af83dc00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7490:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff8102df597e00 (22839877/0/0/0) (rc: 1)
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: Lustre: play01-OST0003-osc-ffff8101af83dc00: Connection restored to service play01-OST0003 using nid 172.23.66.21@tcp.
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: Lustre: Skipped 2 previous similar messages
      Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7373:0:(rw.c:198:ll_file_punch()) Skipped 41 previous similar messages
      

      Command used to write the files:

      for i in {1..6000} ; do echo $i ; done | /dls_sw/apps/parallel/20110205/bin/parallel -j 30 " dd if=/dev/zero of=/mnt/play01/tmp/$(hostname -s)/file{} bs=1M count=6 "
      

      Command used to read the files:

      for ((;;)) ; do find /mnt/play01/tmp/cs04r-sc-serv-48 -type f -print ; sleep 5 ; done | /dls_sw/apps/parallel/20110205/bin/parallel -j 30 " dd if={} of=/dev/null bs=1M 
      

      Attachments

        Issue Links

          Activity

            People

              bobijam Zhenyu Xu
              ferner Frederik Ferner (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: