[LU-7001] osp_sync.c: 1139: osp_sync_thread Created: 13/Aug/15  Updated: 18/Sep/18  Resolved: 13/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Alex Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch
Environment:

https://build.hpdd.intel.com/job/lustre-reviews/34017/arch=x86_64,build_type=server,distro=el6.6,ib_stack=inkernel/
https://build.hpdd.intel.com/job/lustre-master/3137/arch=x86_64,build_type=server,distro=el6.6,ib_stack=inkernel/


Issue Links:
Related
is related to LU-6944 LBUG: (osp_sync.c:1139:osp_sync_threa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Unfortunately, the problem is not solved, even with the patch http://review.whamcloud.com/#/c/15841/
I propose reopening the topic from LU-6944.
The system restarts unexpectedly with the following errors:
Message from syslogd@hard at Aug 13 8:56:38 ...
  kernel: LustreError: 2796:0:(osp_sync.c:1139:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 684 changes, 1137 in progress, 7 in flight
Message from syslogd@hard at Aug 13 8:56:38 ...
  kernel: LustreError: 2796:0:(osp_sync.c:1139:osp_sync_thread()) LBUG



 Comments   
Comment by Andreas Dilger [ 13/Aug/15 ]

Please provide the console logs with stack trace from the failing node. What operations are being done to trigger these errors?

Comment by Alex [ 14/Aug/15 ]

How do I enable the logs so that I can provide them? The operation being performed is deleting files.

Comment by Gerrit Updater [ 09/Sep/15 ]

Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/16335
Subject: LU-7001 osp: remove improper assert of sync thread
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ec59112060a4182b2c9b028ab38cf5f3dc3f8e8d

Comment by Li Xi (Inactive) [ 09/Sep/15 ]

We are hitting this issue repeatedly. I guess it will never recover unless we skip recovery or do something tricky.

Can we just remove the assertion? It does not seem proper, since the running thread has no idea when it will be requested to stop. Also, in osp_init0(), if ptlrpc_init_import() returned a failure (it never does, at least currently), it seems the assertion would fail. So this assertion looks dangerous.
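
The concern above can be illustrated with a minimal sketch, assuming a heavily simplified worker loop. The names `SyncThread`, `process_records`, `sync_thread_main`, and the `Flag` values are hypothetical stand-ins and not the Lustre code: if record processing returns early (for example on an unreadable record) while nobody has asked the thread to stop, a post-loop assertion that the thread is no longer running fails.

```python
from enum import Enum


class Flag(Enum):
    # Hypothetical stand-ins for the service thread states named in the log.
    RUNNING = "SVC_RUNNING"
    STOPPING = "SVC_STOPPING"


class SyncThread:
    def __init__(self):
        self.flags = Flag.RUNNING


def process_records(thread, records):
    """Process records until a stop is requested or a bad record is hit.

    Returns the number of records handled before returning.
    """
    handled = 0
    for rec in records:
        if thread.flags is not Flag.RUNNING:
            break  # normal exit: someone asked the thread to stop
        if rec is None:
            break  # early exit: unreadable record, thread is still RUNNING
        handled += 1
    return handled


def sync_thread_main(thread, records):
    handled = process_records(thread, records)
    # The style of check under discussion: it assumes processing only
    # returns after a stop request, which the early exit above violates.
    assert thread.flags is not Flag.RUNNING, (
        f"{handled} records handled, thread still running"
    )
    return handled
```

In this sketch, calling `sync_thread_main(SyncThread(), [1, 2, None, 4])` trips the assertion, because the loop ended on the bad record rather than on a stop request.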

Comment by Li Xi (Inactive) [ 10/Sep/15 ]

In the end, we worked around this problem by removing the CATALOGS file. I am wondering whether there is any way to check and recover broken llog records...

Comment by Andreas Dilger [ 10/Sep/15 ]

Li Xi, there are a couple of patches in flight that will repair or skip corrupted log records, but more types of corruption may still be found in the future.

Comment by Li Xi (Inactive) [ 11/Sep/15 ]

Thank you Andreas for the information.

Do you think it is possible to write a userspace tool to read and edit the llog files? I know that llog_reader is being changed, so hopefully we will at least be able to dump the llog file. But since the llog files can be read locally from the MDT/OST ldiskfs, maybe we could also use a tool to remove wrong records manually?
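
As a rough sketch of what the consistency check in such a tool might look like, assume a deliberately simplified record layout: a 16-byte little-endian header (length, index, type, id), a payload, and an 8-byte tail repeating the length and index. This layout and the names `pack_record` and `scan_records` are assumptions for illustration only; a real llog file also carries a log header and a bitmap, which this sketch ignores.

```python
import struct

HDR = struct.Struct("<IIII")   # assumed header: len, index, type, id
TAIL = struct.Struct("<II")    # assumed tail: len, index


def pack_record(index, rec_type, payload):
    """Build one record in the assumed layout (for demonstration only)."""
    length = HDR.size + len(payload) + TAIL.size
    return HDR.pack(length, index, rec_type, 0) + payload + TAIL.pack(length, index)


def scan_records(buf):
    """Walk records in buf.

    Returns (indices_of_good_records, offset_of_first_bad_record_or_None).
    """
    good, off = [], 0
    while off < len(buf):
        if off + HDR.size > len(buf):
            return good, off  # truncated header
        length, index, _type, _id = HDR.unpack_from(buf, off)
        # A record must hold at least its header and tail and fit in the buffer.
        if length < HDR.size + TAIL.size or off + length > len(buf):
            return good, off
        t_len, t_index = TAIL.unpack_from(buf, off + length - TAIL.size)
        if (t_len, t_index) != (length, index):
            return good, off  # header and tail disagree
        good.append(index)
        off += length
    return good, None
```

A tool built along these lines could report the offset of the first inconsistent record, so that an administrator can decide whether to truncate or skip from that point; editing the file in place would need far more care than this sketch shows.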

Comment by Gerrit Updater [ 23/Mar/17 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/26132
Subject: LU-7001 osp: fix llog processing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bcebaa04977761773d24fed0821c44fcbd5bef83

Comment by Gerrit Updater [ 05/Apr/17 ]

Alexander Boyko (alexander.boyko@seagate.com) uploaded a new patch: https://review.whamcloud.com/26359
Subject: LU-7001 tests: check osp_sync_thread for wrapped llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8d779dab1025cd46ac44fc80f4dbd5ac85cba8a8

Comment by Gerrit Updater [ 13/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26132/
Subject: LU-7001 osp: fix llog processing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8da9fb0cf14cc79bf1985d144d0a201e136dfe51

Comment by Peter Jones [ 13/Sep/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 20/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/32097
Subject: LU-7001 osp: fix llog processing
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: bf6768549dfa09711daa66ccbf9db766c9f074f6

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32097/
Subject: LU-7001 osp: fix llog processing
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 10cc97e3c1487692b460702bf46220b1acb452ee

Generated at Sat Feb 10 02:05:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.