[LU-1637] client evictions while writing data (lock callback timer expired after) Created: 16/Jul/12  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frederik Ferner (Inactive) Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL5 and Lustre 1.8.8-wc1 on the OSSes; servers and clients are 10GigE connected. RHEL5 clients run Lustre 1.8.7-wc1; Ubuntu clients run Lustre 1.8.8-wc1 with a special kernel.


Attachments: File LU-1637-2-logs-and-dumpfiles.tar.bz2     File LU-1637-20120717-logs-and-dumpfiles.tar.bz2     File LUXXX-messages.tar.gz     File OSS-messages-20120717    
Severity: 2
Rank (Obsolete): 4005

 Description   

Recently some of our Ubuntu-based detectors started to suffer from zero-sized files on our Lustre file system during data collection. This was traced to evictions of these clients by the OSSes during data collection; it is happening at least daily on our production file system, with data loss for the affected experiment.

We have managed to reproduce the evictions on our test file system by using dd and parallel to write files on one 10GigE-attached client while reading the same files on two other 10GigE-attached clients, again using dd and parallel; details below.

On the OSS we get these messages (full logs for both server and client are attached):

Jul 16 14:23:44 cs04r-sc-oss01-08 kernel: LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.142.199@tcp  ns: filter-play01-OST0003_UUID lock: ffff81041b944a00/0x8036ad3b0f3d8168 lrc: 3/0,0 mode: PW/PW res: 22839846/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x80000020 remote: 0xd577a930b0412c96 expref: 306 pid: 8124 timeout 4297209606
Jul 16 14:23:46 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 93327360 != fo_tot_granted 118464512
Jul 16 14:23:46 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:23:51 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 93327360 != fo_tot_granted 118464512
Jul 16 14:23:51 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:23:56 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:23:56 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:01 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:24:01 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:06 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:24:06 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: Lustre: 8025:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: LustreError: 8025:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff8102dc663800 x1407112324567209/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445147 ref 1 fl Interpret:/0/0 rc -114/0
Jul 16 14:24:07 cs04r-sc-oss01-08 kernel: LustreError: 8025:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 1 previous similar message
Jul 16 14:24:11 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:24:11 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) Skipped 1 previous similar message
Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:21 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) Skipped 1 previous similar message
Jul 16 14:24:23 cs04r-sc-oss01-08 kernel: Lustre: 8112:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
Jul 16 14:24:23 cs04r-sc-oss01-08 kernel: LustreError: 8112:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff81024e9f2000 x1407112324567270/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445163 ref 1 fl Interpret:/0/0 rc -114/0
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: Lustre: 8113:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 8113:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff810108583400 x1407112324567423/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445181 ref 1 fl Interpret:/0/0 rc -114/0
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) filter_statfs: tot_granted 95424512 != fo_tot_granted 120561664
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2577:filter_grant_sanity_check()) Skipped 4 previous similar messages
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) filter_statfs: tot_pending 0 != fo_tot_pending 25137152
Jul 16 14:24:41 cs04r-sc-oss01-08 kernel: LustreError: 4598:0:(filter.c:2580:filter_grant_sanity_check()) Skipped 4 previous similar messages
Jul 16 14:25:01 cs04r-sc-oss01-08 kernel: Lustre: 8262:0:(ldlm_lib.c:803:target_handle_connect()) play01-OST0003: exp ffff81019b3ac000 already connecting
Jul 16 14:25:01 cs04r-sc-oss01-08 kernel: LustreError: 8262:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114)  req@ffff8103a3aa1400 x1407112324567555/t0 o8-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1342445201 ref 1 fl Interpret:/0/0 rc -114/0
Jul 16 14:25:15 cs04r-sc-oss01-08 kernel: Lustre: Service thread pid 8689 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 16 14:25:15 cs04r-sc-oss01-08 kernel: Pid: 8689, comm: ll_ost_io_297
[snip]

At the same time, on the client doing the writing:

Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: 7315:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324564365 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.66.21@tcp 84s ago has timed out (84s prior to deadline).
Jul 16 14:23:53 cs04r-sc-serv-48 kernel:   req@ffff81014027f400 x1407112324564365/t0 o10->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 400/592 e 0 to 1 dl 1342445033 ref 1 fl Rpc:/0/0 rc 0/0
Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: 7315:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: play01-OST0003-osc-ffff8101af83dc00: Connection to service play01-OST0003 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 16 14:23:53 cs04r-sc-serv-48 kernel: Lustre: Skipped 2 previous similar messages
Jul 16 14:24:07 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 2s
Jul 16 14:24:07 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 2 previous similar messages
Jul 16 14:24:07 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:24:07 cs04r-sc-serv-48 kernel: LustreError: Skipped 1 previous similar message
Jul 16 14:24:12 cs04r-sc-serv-48 kernel: Lustre: 7230:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324562867 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.66.21@tcp 117s ago has timed out (117s prior to deadline).
Jul 16 14:24:12 cs04r-sc-serv-48 kernel:   req@ffff81010cdb7c00 x1407112324562867/t0 o10->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 400/592 e 1 to 1 dl 1342445052 ref 1 fl Rpc:/0/0 rc 0/0
Jul 16 14:24:12 cs04r-sc-serv-48 kernel: Lustre: 7230:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
Jul 16 14:24:23 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 3s
Jul 16 14:24:23 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:24:41 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:25:00 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324567445 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.68.17@tcp 9s ago has timed out (9s prior to deadline).
Jul 16 14:25:00 cs04r-sc-serv-48 kernel:   req@ffff81029e10e800 x1407112324567445/t0 o8->play01-OST0003_UUID@172.23.68.17@tcp:28/4 lens 368/584 e 0 to 1 dl 1342445100 ref 1 fl Rpc:N/0/0 rc 0/0
Jul 16 14:25:00 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Jul 16 14:25:01 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 5s
Jul 16 14:25:01 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
Jul 16 14:25:01 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:25:26 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:25:47 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) play01-OST0003-osc-ffff8101af83dc00: tried all connections, increasing latency to 7s
Jul 16 14:25:47 cs04r-sc-serv-48 kernel: Lustre: 4522:0:(import.c:517:import_select_connection()) Skipped 1 previous similar message
Jul 16 14:25:47 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:26:05 cs04r-sc-serv-48 kernel: Lustre: 4520:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1407112324560521 sent from play01-OST0003-osc-ffff8101af83dc00 to NID 172.23.68.17@tcp 248s ago has timed out (248s prior to deadline).
Jul 16 14:26:05 cs04r-sc-serv-48 kernel:   req@ffff81004c2a9c00 x1407112324560521/t0 o4->play01-OST0003_UUID@172.23.68.17@tcp:6/4 lens 448/608 e 2 to 1 dl 1342445165 ref 2 fl Rpc:/0/0 rc 0/0
Jul 16 14:26:05 cs04r-sc-serv-48 kernel: Lustre: 4520:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
Jul 16 14:26:13 cs04r-sc-serv-48 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -114
Jul 16 14:26:41 cs04r-sc-serv-48 kernel: Lustre: 4521:0:(import.c:855:ptlrpc_connect_interpret()) play01-OST0003_UUID@172.23.66.21@tcp changed server handle from 0x8036ad3b0f3866c2 to 0x8036ad3b0f3dbc39
Jul 16 14:26:41 cs04r-sc-serv-48 kernel: LustreError: 167-0: This client was evicted by play01-OST0003; in progress operations using this service will fail.
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7373:0:(rw.c:198:ll_file_punch()) obd_truncate fails (-4) ino 6208746
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7377:0:(rw.c:198:ll_file_punch()) obd_truncate fails (-5) ino 6208751
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7377:0:(rw.c:198:ll_file_punch()) Skipped 9 previous similar messages
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 4520:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101a9e26800 x1407112324567962/t0 o4->play01-OST0003_UUID@172.23.66.21@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 4520:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7490:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace play01-OST0003-osc-ffff8101af83dc00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7490:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff8102df597e00 (22839877/0/0/0) (rc: 1)
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: Lustre: play01-OST0003-osc-ffff8101af83dc00: Connection restored to service play01-OST0003 using nid 172.23.66.21@tcp.
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: Lustre: Skipped 2 previous similar messages
Jul 16 14:26:42 cs04r-sc-serv-48 kernel: LustreError: 7373:0:(rw.c:198:ll_file_punch()) Skipped 41 previous similar messages

Command used to write the files:

for i in {1..6000} ; do echo $i ; done | /dls_sw/apps/parallel/20110205/bin/parallel -j 30 " dd if=/dev/zero of=/mnt/play01/tmp/$(hostname -s)/file{} bs=1M count=6 "

Command used to read the files:

for ((;;)) ; do find /mnt/play01/tmp/cs04r-sc-serv-48 -type f -print ; sleep 5 ; done | /dls_sw/apps/parallel/20110205/bin/parallel -j 30 " dd if={} of=/dev/null bs=1M "


 Comments   
Comment by Peter Jones [ 16/Jul/12 ]

Frederik

Just to be clear - are you able to reproduce the same behaviour on a supported client or does this only manifest with Ubuntu clients?

Thanks

Peter

Comment by Peter Jones [ 16/Jul/12 ]

Bobijam

Could you please review the data provided?

Thanks

Peter

Comment by Frederik Ferner (Inactive) [ 16/Jul/12 ]

Peter,

we've not managed to check whether we get zero-sized files on our RHEL5 clients, but the evictions are reproducible with only Red Hat machines.

Thanks,

Frederik

Comment by Johann Lombardi (Inactive) [ 16/Jul/12 ]

The filter_grant_sanity_check() error messages are related to another issue that I already fixed on orion. I will file another bug for this.

As for the eviction issue, the client did not manage to flush dirty data in a timely manner. From the backtraces, it seems that OST_WRITE RPCs (and setattr RPCs as well) are stuck waiting for journal commit, e.g.:

Lustre: Service thread pid 8689 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 8689, comm: ll_ost_io_297

Call Trace:
 [<ffffffff88ba4828>] jbd2_log_wait_commit+0xa3/0xf5 [jbd2]
 [<ffffffff800a34a7>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88c5090b>] fsfilt_ldiskfs_commit_wait+0xab/0xd0 [fsfilt_ldiskfs]
 [<ffffffff88c91184>] filter_commitrw_write+0x1e14/0x2dd0 [obdfilter]
 [<ffffffff80063002>] thread_return+0x62/0xfe
 [<ffffffff88c2be1a>] ost_checksum_bulk+0x2aa/0x5a0 [ost]
 [<ffffffff88c32d09>] ost_brw_write+0x1c99/0x2480 [ost]
 [<ffffffff88969ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
 [<ffffffff889348b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff8008ee74>] default_wake_function+0x0/0xe
 [<ffffffff8896e0b8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88c360be>] ost_handle+0x2bce/0x55c0 [ost]
 [<ffffffff888becf0>] class_handle2object+0xe0/0x170 [obdclass]
 [<ffffffff8892819a>] lock_res_and_lock+0xba/0xd0 [ptlrpc]
 [<ffffffff8892d168>] __ldlm_handle2lock+0x2f8/0x360 [ptlrpc]
 [<ffffffff8897d6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
 [<ffffffff8897de35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
 [<ffffffff8008d299>] __wake_up_common+0x3e/0x68
 [<ffffffff8897edc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8897de60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

That said, some threads finally managed to move on:

Jul 16 14:26:29 cs04r-sc-oss01-08 kernel: Lustre: Service thread pid 8427 completed after 273.82s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jul 16 14:26:29 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: slow i_mutex 273s due to heavy IO load
Jul 16 14:26:29 cs04r-sc-oss01-08 kernel: Lustre: play01-OST0003: slow journal start 231s due to heavy IO load
...

Would it be possible to:

  • enable dump_on_eviction (/proc/sys/lustre/dump_on_eviction)
  • enable dlmtrace, quota & inode debug (lctl set_param debug+="dlmtrace quota inode")
  • reproduce the problem and attach the debug logs (the filename will be printed after "dump the log upon eviction" in the logs)? (Consolidated commands are sketched below.)
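A minimal sketch of the above, assuming lctl is available on both OSS and client and that the eviction dump lands in the default debug path (the exact filename is printed in syslog); the final conversion step with lctl debug_file is an assumption about how the binary dump would be turned into text before attaching:

# dump a debug log automatically whenever this node evicts a client (OSS) or is evicted (client)
echo 1 > /proc/sys/lustre/dump_on_eviction
# add dlmtrace, quota and inode to the existing debug mask, then verify it
lctl set_param debug+="dlmtrace quota inode"
lctl get_param debug
# after reproducing, convert the binary dump named in the "dump the log upon eviction" message
lctl debug_file /tmp/lustre-log.<timestamp> /tmp/lustre-log.<timestamp>.txt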

Thanks in advance.

Comment by Frederik Ferner (Inactive) [ 16/Jul/12 ]

new set of /var/log/messages and dumpfiles with additional debugging on both server and client.

Comment by Frederik Ferner (Inactive) [ 16/Jul/12 ]

I've enabled dump_on_eviction on OSS and client and uploaded examples from both.

OSS and client now have these debug flags set:
[bnh65367@cs04r-sc-serv-48 tmp]$ lctl get_param debug
lnet.debug=
inode ioctl neterror warning dlmtrace error emerg ha config console quota
[bnh65367@cs04r-sc-serv-48 tmp]$

Comment by Johann Lombardi (Inactive) [ 16/Jul/12 ]

Thanks. Could you please also dump the jbd statistics (under /proc/fs/jbd2/$dev) when the problem happens?
BTW, is quota enforced for the user owning the file(s)? If so, could you please try with a user that has no quota limit enforced?
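A minimal sketch of how both could be captured, assuming the jbd2 directory name dm-7-8 mentioned later in the thread and a client mount at /mnt/play01 (both placeholders for the actual device and mount point):

# on the OSS: snapshot the jbd2 statistics for the OST's journal device
cat /proc/fs/jbd2/dm-7-8/info    > /tmp/jbd2-info.$(date +%s)
cat /proc/fs/jbd2/dm-7-8/history > /tmp/jbd2-history.$(date +%s)
# on a client: check whether a quota limit is set/enforced for the test user
lfs quota -u <username> /mnt/play01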

Comment by Johann Lombardi (Inactive) [ 16/Jul/12 ]

BTW, do you also see the "slow" messages in the logs during "normal" operations (even when this bug isn't hit)?

Comment by Frederik Ferner (Inactive) [ 17/Jul/12 ]

Quota enforcement is not enabled for any user; I've checked the user running the reproducer to verify this, and quota is not enforced.

We do occasionally see the slow messages on our production file system without evictions.

Note that the best way I have found to reproduce the evictions is to run the reproducer directly after the OST has been mounted, as the OST is slower for a while, which makes reproducing this on our test file system easier. On the production file system the evictions happen even after all OSTs have been up for days and the load does not seem incredibly high (but it seems the evictions are more likely with higher load).

Attached are the details for another eviction today: syslog messages from OSS and client, dump files, and both files under /proc/fs/jbd2/dm-7-8, which is the block device for the OST that was involved in the eviction.

Comment by Zhenyu Xu [ 17/Jul/12 ]

This looks like LU-1496; can you please try this patch: http://review.whamcloud.com/3157 ?

Comment by Frederik Ferner (Inactive) [ 17/Jul/12 ]

I've started testing the patch for LU-1496 on our test file system today. So far I've not managed to reproduce the issue.

It looks like one of our clients crashed early during the test. I'm not sure if this is relevant or if the client has some network connection problems, as during these tests we started to see dropped frames reported on the client (using 'ip -s link show'). This client is reconnecting frequently and completely stopped responding at one point. Could you review the logs for this client (and the server) below and confirm whether this might be related to the new version on the OSS?

Client logs:

Jul 17 14:02:45 cs04r-sc-serv-06 kernel: Lustre: play01-MDT0000-mdc-ffff81006c1cd800: Connection to service play01-MDT0000 via nid 172.23.138.36@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:02:45 cs04r-sc-serv-06 kernel: Lustre: Skipped 2 previous similar messages
Jul 17 14:02:59 cs04r-sc-serv-06 kernel: Lustre: 2369:0:(import.c:517:import_select_connection()) play01-MDT0000-mdc-ffff81006c1cd800: tried all connections, increasing latency to 2s
Jul 17 14:02:59 cs04r-sc-serv-06 kernel: Lustre: 2369:0:(import.c:517:import_select_connection()) Skipped 40 previous similar messages
Jul 17 14:03:03 cs04r-sc-serv-06 kernel: Lustre: play01-OST0002-osc-ffff81006c1cd800: Connection to service play01-OST0002 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:03:03 cs04r-sc-serv-06 kernel: LustreError: 19727:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jul 17 14:03:03 cs04r-sc-serv-06 kernel: LustreError: 19727:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Skipped 17 previous similar messages
Jul 17 14:03:03 cs04r-sc-serv-06 kernel: LustreError: 19727:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jul 17 14:03:03 cs04r-sc-serv-06 kernel: LustreError: 19727:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) Skipped 17 previous similar messages
Jul 17 14:03:19 cs04r-sc-serv-06 kernel: LustreError: 166-1: MGC172.23.138.36@tcp: Connection to service MGS via nid 172.23.138.36@tcp was lost; in progress operations using this service will fail.
Jul 17 14:03:19 cs04r-sc-serv-06 kernel: Lustre: lustre01-OST0000-osc-ffff810067d47000: Connection to service lustre01-OST0000 via nid 172.23.68.14@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: Lustre: play01-OST0004-osc-ffff81006c1cd800: Connection to service play01-OST0004 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: Lustre: Skipped 31 previous similar messages
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: LustreError: 20865:0:(mdc_locks.c:652:mdc_enqueue()) ldlm_cli_enqueue error: -4
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: LustreError: 20838:0:(file.c:116:ll_close_inode_openhandle()) inode 6205016 mdc close failed: rc = -4
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: LustreError: 20866:0:(mdc_locks.c:652:mdc_enqueue()) ldlm_cli_enqueue error: -4
Jul 17 14:03:28 cs04r-sc-serv-06 kernel: LustreError: 20838:0:(file.c:116:ll_close_inode_openhandle()) Skipped 9 previous similar messages
Jul 17 14:03:29 cs04r-sc-serv-06 kernel: LustreError: 2365:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from 12345-172.23.66.21@tcp[2], ip 172.23.66.21:988, with error, wanted: 512056, left: 512056, last alive is 51 secs ago
Jul 17 14:03:29 cs04r-sc-serv-06 kernel: LustreError: 2365:0:(events.c:198:client_bulk_callback()) event type 1, status -5, desc ffff81001af34000
Jul 17 14:03:33 cs04r-sc-serv-06 kernel: Lustre: play01-MDT0000-mdc-ffff81006c1cd800: Connection restored to service play01-MDT0000 using nid 172.23.138.36@tcp.
Jul 17 14:03:33 cs04r-sc-serv-06 kernel: Lustre: Skipped 2 previous similar messages
Jul 17 14:03:34 cs04r-sc-serv-06 kernel: Lustre: MGC172.23.138.36@tcp: Reactivating import
Jul 17 14:03:53 cs04r-sc-serv-06 kernel: Lustre: play01-OST0000-osc-ffff81006c1cd800: Connection to service play01-OST0000 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:03:53 cs04r-sc-serv-06 kernel: Lustre: play01-OST0002-osc-ffff81006c1cd800: Connection restored to service play01-OST0002 using nid 172.23.66.21@tcp.
Jul 17 14:03:53 cs04r-sc-serv-06 kernel: Lustre: Skipped 1 previous similar message
Jul 17 14:03:57 cs04r-sc-serv-06 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -16
Jul 17 14:03:57 cs04r-sc-serv-06 kernel: LustreError: Skipped 2 previous similar messages
Jul 17 14:04:18 cs04r-sc-serv-06 kernel: Lustre: play01-OST0004-osc-ffff81006c1cd800: Connection restored to service play01-OST0004 using nid 172.23.66.21@tcp.
Jul 17 14:04:18 cs04r-sc-serv-06 kernel: Lustre: Skipped 36 previous similar messages
Jul 17 14:04:19 cs04r-sc-serv-06 dbus-daemon: nss_ldap: reconnected to LDAP server ldap://ldap.diamond.ac.uk after 1 attempt
Jul 17 14:12:54 cs04r-sc-serv-06 kernel: Lustre: 2367:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1405110034649039 sent from play01-OST0000-osc-ffff81006c1cd800 to NID 172.23.66.21@tcp 17s ago has timed out (17s prior to deadline).
Jul 17 14:12:54 cs04r-sc-serv-06 kernel:   req@ffff810048417800 x1405110034649039/t0 o400->play01-OST0000_UUID@172.23.66.21@tcp:28/4 lens 192/384 e 0 to 1 dl 1342530774 ref 1 fl Rpc:N/0/0 rc 0/0
Jul 17 14:12:54 cs04r-sc-serv-06 kernel: Lustre: 2367:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 141 previous similar messages
Jul 17 14:12:54 cs04r-sc-serv-06 kernel: Lustre: play01-OST0000-osc-ffff81006c1cd800: Connection to service play01-OST0000 via nid 172.23.66.21@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 17 14:12:54 cs04r-sc-serv-06 kernel: Lustre: Skipped 3 previous similar messages
Jul 17 14:13:08 cs04r-sc-serv-06 kernel: Lustre: 2369:0:(import.c:517:import_select_connection()) play01-OST0000-osc-ffff81006c1cd800: tried all connections, increasing latency to 2s
Jul 17 14:13:08 cs04r-sc-serv-06 kernel: Lustre: 2369:0:(import.c:517:import_select_connection()) Skipped 75 previous similar messages
Jul 17 14:13:24 cs04r-sc-serv-06 kernel: LustreError: 11-0: an error occurred while communicating with 172.23.66.21@tcp. The ost_connect operation failed with -16
Jul 17 14:13:24 cs04r-sc-serv-06 kernel: LustreError: Skipped 2 previous similar messages
Jul 17 14:13:24 cs04r-sc-serv-06 kernel: Lustre: play01-OST0001-osc-ffff81006c1cd800: Connection restored to service play01-OST0001 using nid 172.23.66.21@tcp.
Jul 17 14:13:31 cs04r-sc-serv-06 automount[8965]: key "data" not found in map source(s).
Jul 17 14:13:38 cs04r-sc-serv-06 kernel: LustreError: 19716:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jul 17 14:13:38 cs04r-sc-serv-06 kernel: LustreError: 19716:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
Jul 17 14:13:38 cs04r-sc-serv-06 kernel: LustreError: 19716:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jul 17 14:13:38 cs04r-sc-serv-06 kernel: LustreError: 19716:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 167-0: This client was evicted by play01-OST0004; in progress operations using this service will fail.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 167-0: This client was evicted by play01-OST0002; in progress operations using this service will fail.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 2367:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81004591b800 x1405110034649766/t0 o3->play01-OST0004_UUID@172.23.66.21@tcp:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4651:0:(rw.c:1341:ll_issue_page_read()) page ffff8100021135d8 map ffff81000bc6b330 index 768 flags 68080000000821 count 4 priv ffff810017e31120: read queue failed: rc -5
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace play01-OST0004-osc-ffff81006c1cd800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff810015c8f240 (25725664/0/0/0) (rc: 1)
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Namespace play01-OST0004-osc-ffff81006c1cd800 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:519:ldlm_namespace_cleanup()) Skipped 6 previous similar messages
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Resource: ffff81006273d500 (25725694/0/0/0) (rc: 1)
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: LustreError: 4968:0:(ldlm_resource.c:524:ldlm_namespace_cleanup()) Skipped 6 previous similar messages
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: Lustre: 4437:0:(rw.c:2321:ll_readpage()) ino 6204785 page 769 (3149824) not covered by a lock (mmap?).  check debug logs.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: Lustre: 4434:0:(rw.c:2321:ll_readpage()) ino 6204750 page 257 (1052672) not covered by a lock (mmap?).  check debug logs.
Jul 17 14:13:54 cs04r-sc-serv-06 kernel: Lustre: 4437:0:(rw.c:2321:ll_readpage()) Skipped 4 previous similar messages
Jul 17 15:13:10 cs04r-sc-serv-06 syslogd 1.4.1: restart.

Server logs for the start, full logs attached:

Jul 17 14:03:03 cs04r-sc-oss01-08 kernel: Lustre: 16527:0:(ldlm_lib.c:574:target_handle_reconnect()) play01-OST0002: 69d87395-1fce-0c9e-2505-0acc7fd7164e reconnecting
Jul 17 14:03:03 cs04r-sc-oss01-08 kernel: Lustre: 16527:0:(ldlm_lib.c:874:target_handle_connect()) play01-OST0002: refuse reconnection from 69d87395-1fce-0c9e-2505-0acc7fd7164e@172.23.138.5@tcp to 0xffff81035b5aee00; still busy with 1 active RPCs
Jul 17 14:03:03 cs04r-sc-oss01-08 kernel: LustreError: 16527:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810214705400 x1405110034445604/t0 o8->69d87395-1fce-0c9e-2505-0acc7fd7164e@:0/0 lens 368/264 e 0 to 0 dl 1342530283 ref 1 fl Interpret:/0/0 rc -16/0
Jul 17 14:03:03 cs04r-sc-oss01-08 kernel: LustreError: 16527:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 10 previous similar messages
Jul 17 14:03:04 cs04r-sc-oss01-08 kernel: LustreError: 17102:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  req@ffff8101633a5000 x1405110034445464/t0 o3->69d87395-1fce-0c9e-2505-0acc7fd7164e@:0/0 lens 448/400 e 0 to 0 dl 1342530273 ref 1 fl Interpret:/0/0 rc 0/0
Jul 17 14:03:28 cs04r-sc-oss01-08 kernel: Lustre: 16613:0:(ldlm_lib.c:574:target_handle_reconnect()) play01-OST0004: 69d87395-1fce-0c9e-2505-0acc7fd7164e reconnecting
Jul 17 14:03:28 cs04r-sc-oss01-08 kernel: Lustre: 16613:0:(ldlm_lib.c:874:target_handle_connect()) play01-OST0004: refuse reconnection from 69d87395-1fce-0c9e-2505-0acc7fd7164e@172.23.138.5@tcp to 0xffff81040daaf800; still busy with 14 active RPCs
Jul 17 14:03:28 cs04r-sc-oss01-08 kernel: LustreError: 16613:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81016268f400 x1405110034445741/t0 o8->69d87395-1fce-0c9e-2505-0acc7fd7164e@:0/0 lens 368/264 e 0 to 0 dl 1342530308 ref 1 fl Interpret:/0/0 rc -16/0
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16927:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  req@ffff810118131c00 x1405110034445474/t0 o3->69d87395-1fce-0c9e-2505-0acc7fd7164e@:0/0 lens 448/400 e 0 to 0 dl 1342530273 ref 1 fl Interpret:/0/0 rc 0/0
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16927:0:(ost_handler.c:829:ost_brw_read()) Skipped 7 previous similar messages
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16106:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff81015bd50000
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16106:0:(events.c:381:server_bulk_callback()) event type 4, status -113, desc ffff810230c96000
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16106:0:(events.c:381:server_bulk_callback()) event type 4, status -113, desc ffff810239526000
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: Lustre: 17054:0:(ost_handler.c:887:ost_brw_read()) play01-OST0004: ignoring bulk IO comm error with 69d87395-1fce-0c9e-2505-0acc7fd7164e@ id 12345-172.23.138.5@tcp - client will retry
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16106:0:(events.c:381:server_bulk_callback()) event type 4, status -113, desc ffff8102bab10000
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: LustreError: 16106:0:(events.c:381:server_bulk_callback()) event type 4, status -113, desc ffff8103853e0000
Jul 17 14:03:29 cs04r-sc-oss01-08 kernel: Lustre: 17227:0:(ost_handler.c:887:ost_brw_read()) play01-OST0004: ignoring bulk IO comm error with 69d87395-1fce-0c9e-2505-0acc7fd7164e@ id 12345-172.23.138.5@tcp - client will retry

We will continue testing this patch on our test file system.

Comment by Johann Lombardi (Inactive) [ 18/Jul/12 ]

While I agree that extending the lock timeout at the end of I/O processing is a good idea, I still think that I/Os taking more than 200s to complete is pathological. I really would like to understand what is going on.

Comment by Frederik Ferner (Inactive) [ 18/Jul/12 ]

Johann,

I agree that it would be good to understand what is going on. Note, though, that on our production file system we see only a very few instances where a thread appears to be hung for 200+s and we don't see the eviction. The usual slow XXX messages without evictions are around 50-100s (still not good, but not quite as bad).

On our test file system I found that the best way to reproduce the evictions is to unmount an OST, then mount it again and directly start the reproducer (sketched below). For reasons that I've not fully understood, I/O to any OST is very slow for a while after mounting it on our file systems, which seems to help in reproducing the evictions. On the production system, our users seem to generate enough load (or something else is wrong) to cause these evictions.
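A minimal sketch of that reproduction sequence, assuming placeholder names for the OST block device and its server-side mount point, and with parallel on the PATH; the write reproducer from the description is started on the client straight afterwards:

# on the OSS: cycle the OST so the test runs against a freshly mounted ("cold") target
umount /mnt/ost/play01-OST0003
mount -t lustre /dev/<ost-device> /mnt/ost/play01-OST0003
# on the writing client: start the reproducer immediately
for i in {1..6000} ; do echo $i ; done | parallel -j 30 " dd if=/dev/zero of=/mnt/play01/tmp/$(hostname -s)/file{} bs=1M count=6 "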

Comment by Johann Lombardi (Inactive) [ 18/Jul/12 ]

Frederik, it is probably due to the time it takes to load all the ext4 bitmaps in memory.
How big are your OSTs and how full are they? Do you also hit this problem if you "empty" the OSTs on the test system (assuming that you can afford to do that)?

Comment by Frederik Ferner (Inactive) [ 18/Jul/12 ]

OST size is 8TB on the test file system and on one production file system, and 16TB on the other production file system where we initially noticed the evictions. Depending on the file system, the OSTs are 30% full (test file system, 8TB OSTs), 75% full (production fs, 8TB OSTs) and 55% full (production fs, 16TB OSTs).

I should be able to empty at least some of the OSTs on the test file system somewhat; how low do you think we would need to go?

Comment by Johann Lombardi (Inactive) [ 18/Jul/12 ]

What you can try instead is to run 'debugfs -R stats /dev/<device> > /dev/null' just after the OSTs are mounted and rerun the test. This should load the bitmaps into memory.
To limit memory pressure on the OSTs (and give the bitmaps a chance to stay in cache), you can also try disabling the OSS read cache via the following command: lctl set_param obdfilter.*.*cache_enable=0.
Enabling flex_bg should also improve the bitmap loading time.
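A minimal sketch of those two workarounds, run on the OSS just after mounting the OSTs; the device name is a placeholder and the wildcard parameter is quoted only to keep the shell from globbing it:

# pre-load the ldiskfs block/inode bitmaps for the freshly mounted OST
debugfs -R stats /dev/<ost-device> > /dev/null
# disable the OSS read cache so the bitmaps are less likely to be pushed out of memory
lctl set_param 'obdfilter.*.*cache_enable=0'
# verify the resulting settings
lctl get_param 'obdfilter.*.*cache_enable'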

Relevant bug, LU-15.

Comment by Frederik Ferner (Inactive) [ 19/Jul/12 ]

Johann,

thanks for this, I'll try these as soon as I can, though it may not be until the end of August on our production systems. Is enabling flex_bg possible without reformatting?

Regarding fixing the evictions, I've not been able to reproduce these on our test file system with the patch. Our users are keen to get a fixed version on one of the production file systems; what would the support situation be if we ran this on a production file system?

Thanks,
Frederik

Comment by Johann Lombardi (Inactive) [ 20/Jul/12 ]

Unfortunately, flex_bg requires reformatting the OSTs.

Evictions are actually a consequence of slow I/O processing. The patch extends the lock timeout after BRW completion. We are definitely fine with you using this patch in production. That said, I just want to clarify that it addresses the consequence and not the root cause (slow I/Os).

Comment by Frederik Ferner (Inactive) [ 20/Jul/12 ]

Johann,

I understand this; we still hope that with the patch the evictions will be much less frequent.

I've had a look at the jbd statistics on the most affected production file system earlier today and noticed this in one of the history files:

R/C  tid   wait  run   lock  flush log   hndls  block inlog ctime write drop  close
R    9906318 0     128   0     0     11    35     10    11   
R    9906319 0     1     0     0     0     1      1     2    
R    9906320 0     5000  0     0     0     14     0     0    
R    9906321 0     53    0     0     0     220    9     10   
R    9906322 0     5000  0     0     1     5357   61    62   
R    9906323 0     5000  0     0     3     32     47    48   
R    9906324 0     5000  650   0     1     1      9     10   
R    9906325 0     5000  0     0     0     1      0     0    
R    9906326 0     5000  0     0     0     3      0     0    
R    9906327 0     61    0     0     1     1      9     10   
R    9906328 0     14    0     0     0     1      9     10   
R    9906329 0     21    0     0     0     1      9     10   
R    9906330 0     5000  0     0     0     7      0     0    
R    9906331 0     5000  0     0     0     21     0     0    
R    9906332 0     5000  3920  0     0     1      9     10   
R    9906333 17    22    0     0     0     1      7     8    
R    9906334 0     262   0     0     2     1      7     8    
R    9906335 0     0     0     0     0     1      7     8    
R    9906336 0     1017  0     0     1     1      7     8    
R    9906337 0     0     366844 0     0     2      10    11   
R    9906338 330575 0     53    0     0     5      13    14   

If I understand this correctly, there was a transaction that could not be locked for about 366s and the following one could not be started for 330s?
Unfortunately I did not note the exact time, so I can't correlate this with any specific slow messages, but in the minutes before taking these copies there was a large number of slow XXX messages.

Another example copied at the same time was this...

R    9926682 0     5000  800358 0     0     1      8     9    
R    9926683 795815 67    105   0     0     13     28    29   
R    9926684 104   6     10    0     1     12     11    12   

I assume these are related to the slow messages as well?

Comment by Andreas Dilger [ 17/Apr/17 ]

Closing old issue.
