[LU-5604] Lots of FAIL_ID checking are lost Created: 10/Sep/14 Updated: 25/Nov/19 Resolved: 24/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Niu Yawei (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 15676 |
| Description |
|
It looks to me that lots of FAIL_ID checks have been lost over time; take replay-single.sh as an example:
To make sure the error injection tests work as expected, I think we should go through all the fail IDs and add back all the missing fail_id checks. If some FAIL_ID is already obsolete, we should remove or improve the corresponding test case. |
| Comments |
| Comment by John Hammond [ 10/Sep/14 ] |
|
Note that OBD_FAIL_LDLM_ENQUEUE_NET is referenced but unfortunately in a nonobvious way:

/*
 * Unified target generic handers macros and generic functions.
 */
#define TGT_RPC_HANDLER_HP(base, flags, opc, fn, hp, fmt, version) \
[opc - base] = {                                                   \
        .th_name    = #opc,                                        \
        .th_fail_id = OBD_FAIL_ ## opc ## _NET,                    \
        .th_opc     = opc,                                         \
        .th_flags   = flags,                                       \
        .th_act     = fn,                                          \
        .th_fmt     = fmt,                                         \
        .th_version = version,                                     \
        .th_hp      = hp,                                          \
}

I see that the following are referenced somewhere in lustre/tests/ but never used in the lustre source.

OBD_FAIL_MDS_OST_SETATTR
OBD_FAIL_MDS_LLOG_SYNC_TIMEOUT
OBD_FAIL_MDS_BLOCK_QUOTA_REQ
OBD_FAIL_MDS_DROP_QUOTA_REQ
OBD_FAIL_MDS_FAIL_LOV_LOG_ADD
OBD_FAIL_MDS_LOV_PREP_CREATE
OBD_FAIL_MDS_OPEN_WAIT_CREATE
OBD_FAIL_OST_SETATTR_CREDITS
OBD_FAIL_OST_HOLD_WRITE_RPC
OBD_FAIL_OST_LLOG_RECOVERY_TIMEOUT
OBD_FAIL_OST_CANCEL_COOKIE_TIMEOUT
OBD_FAIL_OST_PAUSE_CREATE
OBD_FAIL_OST_NOMEM
OBD_FAIL_OST_BRW_PAUSE_BULK2
OBD_FAIL_OSC_DIO_PAUSE
OBD_FAIL_PTLRPC_DELAY_RECOV
OBD_FAIL_PTLRPC_DELAY_IMP_FULL
OBD_FAIL_OBD_DQACQ
OBD_FAIL_TGT_DELAY_PRECREATE
OBD_FAIL_TGT_LAST_REPLAY
OBD_FAIL_MDS_SYNC_CAPA_SL |
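For readers unfamiliar with the preprocessor trick in TGT_RPC_HANDLER_HP, here is a minimal standalone sketch of the same token pasting. It is not Lustre code: the opcodes (DEMO_ENQUEUE, DEMO_CANCEL), the fail ID values, the struct and the helper functions are all hypothetical names invented for illustration. It only shows why a grep for the full fail ID name finds the #define but no use site: the name is assembled from the opcode at preprocessing time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes; the real ones live in the Lustre headers. */
enum demo_opc {
	DEMO_FIRST_OPC = 100,
	DEMO_ENQUEUE   = DEMO_FIRST_OPC,
	DEMO_CANCEL,
	DEMO_LAST_OPC
};

/* Hypothetical fail IDs; the values are arbitrary for this sketch. */
#define OBD_FAIL_DEMO_ENQUEUE_NET 0x301
#define OBD_FAIL_DEMO_CANCEL_NET  0x303

struct demo_handler {
	const char *th_name;    /* stringified opcode, from #opc */
	int         th_fail_id; /* pasted from OBD_FAIL_ ## opc ## _NET */
	int         th_opc;
};

/* Reduced version of the quoted macro: same pasting, fewer fields. */
#define DEMO_RPC_HANDLER(base, opc)             \
[opc - base] = {                                \
	.th_name    = #opc,                     \
	.th_fail_id = OBD_FAIL_ ## opc ## _NET, \
	.th_opc     = opc,                      \
}

/* The fail ID names never appear literally in this table. */
static const struct demo_handler demo_handlers[] = {
	DEMO_RPC_HANDLER(DEMO_FIRST_OPC, DEMO_ENQUEUE),
	DEMO_RPC_HANDLER(DEMO_FIRST_OPC, DEMO_CANCEL),
};

/* Return the handler slot for an opcode, or NULL if out of range. */
const struct demo_handler *demo_find_handler(int opc)
{
	if (opc < DEMO_FIRST_OPC || opc >= DEMO_LAST_OPC)
		return NULL;
	return &demo_handlers[opc - DEMO_FIRST_OPC];
}

/* Look a handler up by its stringified opcode name. */
const struct demo_handler *demo_find_by_name(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++)
		if (strcmp(demo_handlers[i].th_name, name) == 0)
			return &demo_handlers[i];
	return NULL;
}
```

In this sketch, searching the file for OBD_FAIL_DEMO_ENQUEUE_NET matches only its definition, even though demo_find_handler(DEMO_ENQUEUE)->th_fail_id carries that value.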
| Comment by Andreas Dilger [ 12/Sep/14 ] |
|
Mike, since you did most of the changes to the target unification, and will be further changing the client/server for DoM I think it makes sense for you to work on this to ensure we are not adding (more?) regressions in this area. |
| Comment by Mikhail Pershin [ 17/Sep/14 ] |
|
I can take this, yes, and check all current FAIL_IDs, though I am not sure how to ensure there will not be new regressions. |
| Comment by Mikhail Pershin [ 06/Oct/14 ] |
|
Comment from duplicated ticket by Vitaly:
|
| Comment by Rahul Deshmukh (Inactive) [ 07/Oct/14 ] |
|
Inspected commits After going through the above-mentioned commits, I found that the following fail_id macros changed: 1. Assignment removed 2. Conditional check removed I checked the above macros and found that the following macros need to be ported to repair the tests: OBD_FAIL_OST_LDLM_REPLY_NET Hence I created a patch for this. Please review and let me know if anything is missed or needs correction. |
| Comment by Mikhail Pershin [ 08/Oct/14 ] |
|
Please note that the fail IDs that simulate a lost request are not missing; they are defined in the TGT_RPC_HANDLER_HP macro (see John's comment above). I know this is done in a non-obvious way and you can't find those fail IDs with grep, but it has been that way since Lustre 2.0, when the new MDT stack was introduced. I am going to add a comment containing the fail ID name to each handler, so anyone can find it with grep. |
| Comment by Mikhail Pershin [ 08/Oct/14 ] |
OBD_FAIL_OST_ALL_REPLY_NET
Most of them are constructed from the opcode name and stored in tgt_handler::th_fail_id. They are checked properly in tgt_handle_request0(). OBD_FAIL_OST_EROFS OBD_FAIL_OST_CONNECT_NET2 is checked in tgt_request_handle(). So for now I see the missed fail_ids are: |
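As a rough illustration of checking a per-handler fail ID in one generic place, the following sketch consults th_fail_id before the handler action runs. It makes no claim about how tgt_handle_request0() is actually written: demo_fail_loc, demo_disp_handler, demo_act and demo_handle_request are hypothetical names, and the one-shot clearing of the fail location is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fail location set by a test script. */
unsigned int demo_fail_loc;

struct demo_disp_handler {
	int th_fail_id;              /* fail ID attached to this opcode */
	int (*th_act)(int *counter); /* the real request handler */
};

/* Example action: count how many requests were actually processed. */
int demo_act(int *counter)
{
	(*counter)++;
	return 0;
}

/*
 * Generic dispatch: consult the per-handler fail ID once, before the
 * action runs. A match simulates a lost request by returning early.
 * Returns 1 if the request was dropped, 0 if it was handled.
 */
int demo_handle_request(const struct demo_disp_handler *h, int *counter)
{
	assert(h != NULL && h->th_act != NULL);
	if (h->th_fail_id != 0 &&
	    demo_fail_loc == (unsigned int)h->th_fail_id) {
		demo_fail_loc = 0; /* assumed one-shot: clear after firing */
		return 1;
	}
	h->th_act(counter);
	return 0;
}
```

Because the check lives in the dispatcher rather than in each handler, a fail ID stored in the table is honoured without any per-handler code, which is also why it does not show up when grepping individual handlers.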
| Comment by Mikhail Pershin [ 08/Oct/14 ] |
|
http://review.whamcloud.com/12232 - the patch adds missing checks for the following FAIL IDs: Therefore only 3 FAIL IDs were really lost after all. I am going to inspect all fail IDs in obd_support.h and find out which are no longer used or are also missing checks. |
| Comment by Liang Zhen (Inactive) [ 19/Nov/14 ] |
|
Tappro, I'm very sorry that I didn't realise you already have a patch for OBD_FAIL_LDLM_REPLY, I submitted another one http://review.whamcloud.com/#/c/12780/ |
| Comment by Mikhail Pershin [ 09/Feb/15 ] |
|
Patch was updated to include the latest master commits which fix replay-dual issues. |
| Comment by Gerrit Updater [ 18/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12232/ |
| Comment by Mikhail Pershin [ 24/Feb/15 ] |
|
patch was landed, ticket can be closed |
| Comment by Gerrit Updater [ 16/Oct/15 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16846 |
| Comment by Gerrit Updater [ 27/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/16846/ |