[LU-3655] Reoccurrence of permanent eviction scenario Created: 29/Jul/13  Updated: 18/Jul/17  Resolved: 18/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File oss.log     File sync_io.log    
Severity: 3
Rank (Obsolete): 9404

 Description   

Hi,

I am afraid we are suffering again from the issue described in LU-2683 and LU-1690. But this time we are running Lustre 2.1.5, which includes the 4 patches from LU-874. We also backported patch http://review.whamcloud.com/5208 from LU-2683 into our sources.

So those 5 patches might not be enough to fix this problem.

Here is the information collected from the crash:

crash> dmesg
...
LustreError: 65257:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65257:0:(cl_io.c:967:cl_io_cancel()) Canceling ongoing page trasmission
...

crash> ps | grep 65257
  65257 2 5 ffff880fe2ac27d0 IN 0.0 0 0 [ldlm_bl_62]
crash> bt 65257
PID: 65257 TASK: ffff880fe2ac27d0 CPU: 5 COMMAND: "ldlm_bl_62"
 #0 [ffff880fe32a7ae0] schedule at ffffffff81484c15
 #1 [ffff880fe32a7ba8] cfs_waitq_wait at ffffffffa055a6de [libcfs]
 #2 [ffff880fe32a7bb8] cl_sync_io_wait at ffffffffa067f3cb [obdclass]
 #3 [ffff880fe32a7c58] cl_io_submit_sync at ffffffffa067f643 [obdclass]
 #4 [ffff880fe32a7cb8] cl_lock_page_out at ffffffffa0676997 [obdclass]
 #5 [ffff880fe32a7d28] osc_lock_flush at ffffffffa0a6abaf [osc]
 #6 [ffff880fe32a7d78] osc_lock_cancel at ffffffffa0a6acbf [osc]
 #7 [ffff880fe32a7dc8] cl_lock_cancel0 at ffffffffa0675575 [obdclass]
 #8 [ffff880fe32a7df8] cl_lock_cancel at ffffffffa067639b [obdclass]
 #9 [ffff880fe32a7e18] osc_ldlm_blocking_ast at ffffffffa0a6bd9a [osc]
#10 [ffff880fe32a7e88] ldlm_handle_bl_callback at ffffffffa07a0293 [ptlrpc]
#11 [ffff880fe32a7eb8] ldlm_bl_thread_main at ffffffffa07a06d1 [ptlrpc]
#12 [ffff880fe32a7f48] kernel_thread at ffffffff8100412a


crash> dmesg | grep 'SYNC IO'
LustreError: 3140:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 63611:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65257:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65316:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65235:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 65277:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
LustreError: 63605:0:(cl_io.c:1702:cl_sync_io_wait()) SYNC IO failed with error: -110, try to cancel 1 remaining pages
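
For reference, error -110 is ETIMEDOUT, so each of these threads gave up waiting for a synchronous page transfer to complete while cancelling a lock. If it is useful, next time we can pull more context out of the same crash dump; a sketch using standard crash(8) commands (not actual output from this incident):

crash> log          # full kernel ring buffer, including earlier Lustre console messages
crash> foreach bt   # back-traces of all tasks, to find every thread stuck in cl_sync_io_wait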

Sebastien.



 Comments   
Comment by Peter Jones [ 29/Jul/13 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 30/Jul/13 ]

Hi, Sebastien

Are there any logs from the OST? Are there any other abnormal messages in the client log besides the "SYNC IO failed with error: -110 ..." ones? Thanks.

Comment by Diego Moreno (Inactive) [ 08/Aug/13 ]

Two files attached:

  • oss.log = logs of all the OSSes over the same time window (17h2*)
  • sync_io.log = log from the client over the same window (17h2*)

There is no more information in the logs, and nothing on the MDS.

Comment by Niu Yawei (Inactive) [ 09/Aug/13 ]

There are only a few lines of messages in the attached logs. It looks like the client was evicted by the OST, so the sync write to the OST failed, but I can't see from the logs why the client was evicted. Maybe there was a network problem between the client and the OST?
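
If the client was indeed evicted, a couple of quick checks might help next time; a sketch assuming the standard lctl parameters are available (nothing here is taken from this incident's data):

# On the client: connection and eviction history of each OST import
lctl get_param osc.*.import
# On both the client and OSS consoles: eviction and lock callback timer messages
dmesg | grep -iE 'evict|lock callback timer'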

Comment by Alexandre Louvet [ 23/Aug/13 ]

Niu,

Unfortunately, we don't have more information in the server logs. The OSS is quiet for hours until we get the lock callback timeout: there is nothing in the OSS syslog before the callback message, and the physical network (InfiniBand) doesn't show any errors.

Note that we are running Lustre 2.1.5 plus some patches that touch the networking area:

  • ORNL-22 general ptlrpcd threads pool support
  • LU-1144 implement a NUMA aware ptlrpcd binding policy
  • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
  • LU-2613 too much unreclaimable slab space
  • LU-2624 ptlrpc fix thread stop
  • LU-2683 client deadlock in cl_lock_mutex_get

Alex.

Comment by Alexandre Louvet [ 18/Apr/14 ]

Back on stage. We are seeing this issue more and more. Any idea of what data we could collect?

Regards,

Comment by Niu Yawei (Inactive) [ 21/Apr/14 ]

I think you should collect logs from both the client and the OSS, and get a full stack trace on the client.
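
A rough sketch of one way to collect that, assuming the standard lctl debug log and sysrq are available (the debug mask and file path are only examples):

# On client and OSS, before reproducing: enable full Lustre debug tracing
lctl set_param debug=-1
# When the problem hits: dump the Lustre debug log to a file
lctl dk /tmp/lustre-debug.log
# On the client: dump stack traces of all tasks to the console/dmesg
echo t > /proc/sysrq-trigger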

Comment by Niu Yawei (Inactive) [ 18/Jul/17 ]

Close old 2.1 issue.
