[LU-5604] Lots of FAIL_ID checking are lost Created: 10/Sep/14  Updated: 25/Nov/19  Resolved: 24/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.13.0

Type: Bug Priority: Critical
Reporter: Niu Yawei (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Blocker
is blocking LU-5579 MDS crashed by "mdt_check_resent_lock... Resolved
Duplicate
is duplicated by LU-5709 "unified request handlers" commits br... Closed
Related
is related to LU-10816 fix usage of drop_ldlm_reply() in tests Closed
Severity: 3
Rank (Obsolete): 15676

 Description   

It looks to me that lots of FAIL_ID checking are lost from time to time, take the replay-single.sh as an example:

  • test_73c() checked OBD_FAIL_TGT_LAST_REPLAY, but this FAIL_ID is never being checked in Lustre code from the day one it was introduced.
  • test_73b() checked OBD_FAIL_LDLM_REPLY, but this FAIL_ID is now only checked in mdt_reint_open(), I think it should be checked for every lock enqueue as well.
  • test_73a() checked OBD_FAIL_LDLM_ENQUEUE_NET, but this FAIL_ID is not being checked in Lustre code anymore.
  • test_80c() checked OBD_FAIL_UPDATE_OBJ_NET_REP, but this FAIL_ID has been removed from Lustre code.
  • test_83a() checked OBD_FAIL_MDS_FAIL_LOV_LOG_ADD, but this FAIL_ID isn't checked in Lustre code.
    ...

To make sure the error injection test working as expected, I think we'd go through all the fail IDs, and add back all the missed fail_id checking. If some FAIL_ID is obsolete already, we'd remove or improve the corresponding test case.



 Comments   
Comment by John Hammond [ 10/Sep/14 ]

Note that OBD_FAIL_LDLM_ENQUEUE_NET is referenced but unfortunately in a nonobvious way:

/*                                                                                                 
 * Unified target generic handers macros and generic functions.                                    
 */
#define TGT_RPC_HANDLER_HP(base, flags, opc, fn, hp, fmt, version)      \
[opc - base] = {                                                        \
        .th_name        = #opc,                                         \
        .th_fail_id     = OBD_FAIL_ ## opc ## _NET,                     \
        .th_opc         = opc,                                          \
	.th_flags       = flags,                                        \
        .th_act         = fn,                                           \
        .th_fmt         = fmt,                                          \
        .th_version     = version,                                      \
	.th_hp          = hp,                                           \
}

I see that the following are referenced somewhere in lustre/tests/ but never used in the lustre source.

OBD_FAIL_MDS_OST_SETATTR
OBD_FAIL_MDS_LLOG_SYNC_TIMEOUT
OBD_FAIL_MDS_BLOCK_QUOTA_REQ
OBD_FAIL_MDS_DROP_QUOTA_REQ
OBD_FAIL_MDS_FAIL_LOV_LOG_ADD
OBD_FAIL_MDS_LOV_PREP_CREATE
OBD_FAIL_MDS_OPEN_WAIT_CREATE
OBD_FAIL_OST_SETATTR_CREDITS
OBD_FAIL_OST_HOLD_WRITE_RPC
OBD_FAIL_OST_LLOG_RECOVERY_TIMEOUT
OBD_FAIL_OST_CANCEL_COOKIE_TIMEOUT
OBD_FAIL_OST_PAUSE_CREATE
OBD_FAIL_OST_NOMEM
OBD_FAIL_OST_BRW_PAUSE_BULK2
OBD_FAIL_OSC_DIO_PAUSE
OBD_FAIL_PTLRPC_DELAY_RECOV
OBD_FAIL_PTLRPC_DELAY_IMP_FULL
OBD_FAIL_OBD_DQACQ
OBD_FAIL_TGT_DELAY_PRECREATE
OBD_FAIL_TGT_LAST_REPLAY
OBD_FAIL_MDS_SYNC_CAPA_SL
Comment by Andreas Dilger [ 12/Sep/14 ]

Mike, since you did most of the changes to the target unification, and will be further changing the client/server for DoM I think it makes sense for you to work on this to ensure we are not adding (more?) regressions in this area.

Comment by Mikhail Pershin [ 17/Sep/14 ]

I can take this, yes, and check all current FAIL_IDs, though I am not sure how to ensure there will not be new regressions.

Comment by Mikhail Pershin [ 06/Oct/14 ]

Comment from duplicated ticket by Vitaly:

afaics, there are a number of "unified commit handlers" commits which broke some of IT tests:
LU-3467 mdt: call MDT handlers via unified request handler
LU-3467 ofd: use unified handler for OST requests
LU-2145 server: use unified request handler for MGS
etc
there were different fail_loc which were set&checked per RPC operation code (e.g. OBD_FAIL_LDLM_REPLY, OBD_FAIL_OST_LDLM_REPLY_NET, OBD_FAIL_OST_LDLM_REPLY_NET, etc) which were removed. also looking at the code, it is not clear that all the OBD/CFS_FAIL macros and defines are properly ported to the new code.
as an example - OBD_FAIL_LDLM_REPLY is not used anymore, but the following tests expect it to be used:
recovery-small 53
recovery-small 113
replay-dual 19
replay-single 52
replay-single 73b
therefore tests do nothing now.

Comment by Rahul Deshmukh (Inactive) [ 07/Oct/14 ]

Inspected commits
----------------------------
LU-3467 target: generic hpreq handler in target
LU-3467 target: use osc_reply_portal for OUT services
LU-3467 ptlrpc: initialize request session early
LU-3467 target: unified transaction callbacks
LU-3467 ofd: use unified handler for OST requests
LU-3467 mdt: call MDT handlers via unified request handler
LU-3467 seq: unified SEQ handler
LU-3467 target: FLD to use unified request handler
LU-3467 target: move OUT to the unified target code
LU-2145 server: use unified request handler for MGS
LU-2145 target: use tgt_ prefix for target function
LU-2145 target: move target code to the separate directory

After going through above mention commits, found that following fail_id macros (changed)

1. Assignment Removed
-----------------------------------
OBD_FAIL_LDLM_REPLY
OBD_FAIL_OST_ALL_REPLY_NET
OBD_FAIL_OST_LDLM_REPLY_NET
OBD_FAIL_MDS_ALL_REPLY_NET
OBD_FAIL_MGS_ALL_REPLY_NET

2. Conditional check removed
-----------------------------------------
OBD_FAIL_OST_EROFS
OBD_FAIL_OST_CONNECT_NET
OBD_FAIL_OST_CONNECT_NET2
OBD_FAIL_OST_DISCONNECT_NET
OBD_FAIL_OST_CREATE_NET
OBD_FAIL_OST_DESTROY_NET
OBD_FAIL_OST_GETATTR_NET
OBD_FAIL_OST_SETATTR_NET
OBD_FAIL_OST_ENOSPC
OBD_FAIL_OST_PUNCH_NET
OBD_FAIL_OST_STATFS_NET
OBD_FAIL_OST_SYNC_NET
OBD_FAIL_OST_QUOTACHECK_NET
OBD_FAIL_OST_QUOTACTL_NET
OBD_FAIL_LDLM_ENQUEUE_NET
OBD_FAIL_LDLM_CONVERT_NET
OBD_FAIL_LDLM_CANCEL_NET
OBD_FAIL_MDS_ALL_REQUEST_NET
OBD_FAIL_MGS_CONNECT_NET

Checked the above macros and found that following marcos need to be ported to repair the tests.

OBD_FAIL_OST_LDLM_REPLY_NET
OBD_FAIL_LDLM_REPLY
OBD_FAIL_OST_CREATE_NET
OBD_FAIL_LDLM_ENQUEUE

Hence created the patch for the same.
Patch link : http://review.whamcloud.com/#/c/12203/

Please review and let me know if any thing is missed or needs correction.

Comment by Mikhail Pershin [ 08/Oct/14 ]

Please note that all fail ids to simulate lost request are not missed but defined in TGT_RPC_HANDLER_HP macro, see John comment above. I know it is done in non-obvious way and you can't find those fail ids doing grep, but that is so since Lustre 2.0 when new MDT stack was introduced. I am going to add comment containing fail id name to each handler, so anyone can find it by grep.

Comment by Mikhail Pershin [ 08/Oct/14 ]

1. Assignment Removed
-----------------------------------
OBD_FAIL_LDLM_REPLY
OBD_FAIL_OST_ALL_REPLY_NET
OBD_FAIL_OST_LDLM_REPLY_NET
OBD_FAIL_MDS_ALL_REPLY_NET
OBD_FAIL_MGS_ALL_REPLY_NET

OBD_FAIL_OST_ALL_REPLY_NET
OBD_FAIL_MDS_ALL_REPLY_NET
OBD_FAIL_MGS_ALL_REPLY_NET
all IDs above are passed to tgt_init() by ofd, mdt and mgs. They are stored in lu_target::lut_reply_fail_id for further use.

2. Conditional check removed
-----------------------------------------
OBD_FAIL_OST_EROFS
OBD_FAIL_OST_CONNECT_NET
OBD_FAIL_OST_CONNECT_NET2
OBD_FAIL_OST_DISCONNECT_NET
OBD_FAIL_OST_CREATE_NET
OBD_FAIL_OST_DESTROY_NET
OBD_FAIL_OST_GETATTR_NET
OBD_FAIL_OST_SETATTR_NET
OBD_FAIL_OST_ENOSPC
OBD_FAIL_OST_PUNCH_NET
OBD_FAIL_OST_STATFS_NET
OBD_FAIL_OST_SYNC_NET
OBD_FAIL_OST_QUOTACHECK_NET
OBD_FAIL_OST_QUOTACTL_NET
OBD_FAIL_LDLM_ENQUEUE_NET
OBD_FAIL_LDLM_CONVERT_NET
OBD_FAIL_LDLM_CANCEL_NET
OBD_FAIL_MDS_ALL_REQUEST_NET
OBD_FAIL_MGS_CONNECT_NET

Most of them are constructed from opcode name and stored on tgt_handler::th_fail_id. They are checked in tgt_handle_request0() properly.

OBD_FAIL_OST_EROFS
OBD_FAIL_OST_ENOSPC
both are checked inside tgt_brw_write()

OBD_FAIL_OST_CONNECT_NET2 is checked in tgt_request_handle().

So for now I see the missed fail_ids are:
OBD_FAIL_LDLM_REPLY
OBD_FAIL_OST_LDLM_REPLY_NET
OBD_FAIL_UPDATE_OBJ_NET_REP
OBD_FAIL_MDS_FAIL_LOV_LOG_ADD

Comment by Mikhail Pershin [ 08/Oct/14 ]

http://review.whamcloud.com/12232 - patch adds missing checks for the following FAIL ids:
OBD_FAIL_LDLM_REPLY - fail id is set after tgt_enqueue() for non-OST requests.
OBD_FAIL_OST_LDLM_REPLY_NET - the same but for OST enqueue only.
OBD_FAIL_UPDATE_OBJ_NET_REP - it was not lost and functional but renamed, rename reference to it tests.
OBD_FAIL_MDS_FAIL_LOV_LOG_ADD - added back in code along with workaround.
OBD_FAIL_TGT_LAST_REPLAY it was never used and useless in fact, patch removes it along with test.

Therefore only 3 FAIL IDs were really lost after all. I am going to inspect all fail ids in obd_support.h and find out which are not used anymore or miss checks too.

Comment by Liang Zhen (Inactive) [ 19/Nov/14 ]

Tappro, I'm very sorry that I didn't realise you already have a patch for OBD_FAIL_LDLM_REPLY, I submitted another one http://review.whamcloud.com/#/c/12780/
could you take a look? If you prefer to choose yours, I'm OK to abandon mine, thanks.

Comment by Mikhail Pershin [ 09/Feb/15 ]

Patch was updated to include the latest master commits which fix replay-dual issues.

Comment by Gerrit Updater [ 18/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12232/
Subject: LU-5604 tgt: return missed fail ids
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4de90170e2573321e7691364d1d527aedfd25ff9

Comment by Mikhail Pershin [ 24/Feb/15 ]

patch was landed, ticket can be closed

Comment by Gerrit Updater [ 16/Oct/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16846
Subject: LU-5604 tests: fix usage of drop_ldlm_reply() in tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d5c616e494e2e46b348162db410dacc3046f2fb7

Comment by Gerrit Updater [ 27/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/16846/
Subject: LU-5604 tests: fix usage of drop_ldlm_reply() in tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7130551d63c155534ab62cb4942b7c88556daad4

Generated at Sat Feb 10 01:52:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.