[LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed Created: 15/May/17  Updated: 19/Jun/17  Resolved: 19/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 1
Labels: soak
Environment:

Soak stress cluster


Attachments: Text File lu-9504-6-07.soak-11.log.txt     File lustre-log.1496836992.5097.txt.gz     File lustre-log.1496843934.5221.txt.gz     File lustre-log.1496861499.5292.txt.gz     File lustre-log.1496861511.5224.txt.gz     File lustre-log.1496861518.5296.txt.gz     File lustre-log.1497161984.5008.preLBUG.txt.gz     File lustre-log.1497165943.4581.txt.gz     File lustre-log.1497166143.4807.postLBUG.txt.gz     Text File vmcore-dmesg.txt    
Issue Links:
Duplicate
is duplicated by LU-9563 LBUG ldlm_lock_downgrade()) ASSERTION... Resolved
Related
is related to LU-8650 DNE disabled REP-ACK Resolved
is related to LU-9547 LBUG osp_dev.c:755:osp_statfs()) ASSE... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Loaded the latest master build and started soak. The only fault induced was router drop.
soak-9 is the second MDS (MDT0001).
soak-9 had a hard crash in normal operation. Soak is started, mount completes:

[  893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16
[  893.801270] LustreError: Skipped 139 previous similar messages
[  894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
[  894.110919] Lustre: Skipped 7 previous similar messages
[  894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports
[  894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients
[  894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.
[ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting
[ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)
[ 1465.579478] Lustre: Skipped 30 previous similar messages
[ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting
[ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)
[ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
[ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
[ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting
[ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)
[ 2436.493810] Lustre: Skipped 2 previous similar messages
[ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting
[ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
[ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)
[ 3117.430856] Lustre: Skipped 1 previous similar message
[ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
[ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
[ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
[ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)
[ 3951.368647] Lustre: Skipped 2 previous similar messages
[ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
[ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743]  req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed:
[ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG
[ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004
[ 4632.557537]
[ 4632.557537] Call Trace:
[ 4632.566532]  [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4632.575923]  [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4632.585006]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
[ 4632.594107]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
[ 4632.602416]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 4632.611086]  [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]
[ 4632.620009]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 4632.627027]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 4632.634106]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
[ 4632.641571]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 4632.648510]
[ 4632.651470] Kernel panic - not syncing: LBUG
[ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
[ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 4632.686895]  ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f
[ 4632.696571]  ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0
[ 4632.706233]  ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838
[ 4632.715875] Call Trace:
[ 4632.719901]  [<ffffffff81686d1f>] dump_stack+0x19/0x1b
[ 4632.726949]  [<ffffffff8168014a>] panic+0xe3/0x1f2
[ 4632.733611]  [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 4632.741856]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
[ 4632.750326]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
[ 4632.757970]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 4632.765508]  [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]
[ 4632.775290]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 4632.781973]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 4632.790485]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
[ 4632.797662]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

The system then crashed; a crash dump is available on the node.
vmcore-dmesg attached.



 Comments   
Comment by John Hammond [ 16/May/17 ]

This assertion was added by https://review.whamcloud.com/#/c/22807/ for LU-8650.

Comment by Peter Jones [ 16/May/17 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by Cliff White (Inactive) [ 17/May/17 ]

Hit again on build 3581 - soak fails within one hour due to this bug.

Comment by Cliff White (Inactive) [ 17/May/17 ]

This issue was not seen on builds 3577 or 3578; it was likely introduced in build 3579.

Comment by Peter Jones [ 17/May/17 ]

To translate that into git commits

LU-8943 lnd: Enable Multiple OPA Endpoints between Nodes — oleg.drokin / gitweb
LU-5788 test: fix the cmd to getfree space — oleg.drokin / gitweb
LU-9324 lfs: output stripe info in YAML format — oleg.drokin / gitweb
LU-9357 pfl: should inherit pool from previous layout comp — oleg.drokin / gitweb
LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects — oleg.drokin / gitweb
LU-9324 lnet: move cyaml.h under lnet/include/ — oleg.drokin / gitweb
LU-7108 test: Remove sanityn tests from ALWAYS_EXCEPT list — oleg.drokin / gitweb
LU-7567 utils: fix timestamp printing in lfs_changelog() — oleg.drokin / gitweb
LU-8346 obdclass: guarantee all keys filled — oleg.drokin / gitweb
LU-9439 scripts: Change behavior of lustre_rmmod — oleg.drokin / gitweb
LU-9456 lnd: Change sock_create() to sock_create_kern() — oleg.drokin / gitweb
LU-8294 gss: quiet cache_check return ENOENT warning — oleg.drokin / gitweb
LU-9458 ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES — oleg.drokin / gitweb
LU-9096 test: add sanity 253 to ALWAYS_EXCEPT — oleg.drokin / gitweb

Comment by Lai Siyao [ 18/May/17 ]

This assert is from https://review.whamcloud.com/22807 for LU-8650, which assumes that all handling of a given reply happens in one specific hr thread and therefore does not need to consider races. That assumption now looks to be untrue; I'll review the code to verify.

Comment by Gerrit Updater [ 19/May/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/27207
Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fcd7bdd08ec2c23d1e7557788a42e04b97590e02

Comment by Cliff White (Inactive) [ 19/May/17 ]

Patch has survived for three hours: one mds_restart, one oss_restart, and one router_delay. We'll see how it does over the weekend, but so far so good.

Comment by Cliff White (Inactive) [ 22/May/17 ]

Patch ran > 48 hours (weekend) with no hard crashes, surviving multiple failovers
(14 OSS, 13 MDS). Unfortunately, detailed examination of the results
indicated a problem with the soak load engine: the failovers occurred while the clients were idle.

After fixing the load issue, restarted soak. Had one client lockup, then a hard server LBUG after the second OST failover. That bug is LU-9547 and may not be related to this issue; we successfully performed two MDS failovers.

Comment by Cliff White (Inactive) [ 24/May/17 ]

Hit this on soak with latest master. Patch should land to master.

Comment by Cliff White (Inactive) [ 31/May/17 ]

Tested with latest version of patch
https://review.whamcloud.com/#/c/27207/
Hit LBUG immediately

May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG
May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003
May 31 17:11:18 soak-9 kernel: Call Trace:
May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc]
May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc]
May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
May 31 17:11:19 soak-9 kernel:
May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG
Comment by Lai Siyao [ 02/Jun/17 ]

https://review.whamcloud.com/#/c/27207/ has been updated and some debug messages have been added. Cliff, will you test again?

Comment by Cliff White (Inactive) [ 05/Jun/17 ]

I have loaded soak, will start tests today

Comment by Cliff White (Inactive) [ 05/Jun/17 ]

Hit LU-9563 after first failover

Comment by Lai Siyao [ 06/Jun/17 ]

Cliff, could you post more of the messages from the crash? I added some debug messages that should be printed upon a crash.

Comment by Cliff White (Inactive) [ 06/Jun/17 ]

I am still waiting for the failure to repeat.

Comment by Cliff White (Inactive) [ 07/Jun/17 ]

Here is the log from prior to the LBUG.

Jun  5 16:44:31 soak-11 kernel: Lustre: Skipped 1 previous similar message
Jun  5 16:44:40 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.124@o2ib100 (stopping)
Jun  5 16:44:40 soak-11 kernel: Lustre: Skipped 5 previous similar messages
Jun  5 16:44:44 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
Jun  5 16:45:02 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.136@o2ib100 (stopping)
Jun  5 16:45:02 soak-11 kernel: Lustre: Skipped 20 previous similar messages
Jun  5 16:45:07 soak-11 kernel: Lustre: 4229:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1496681101/real 0]  req@ffff8807e7265700 x1569380820308784/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.110@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1496681107 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jun  5 16:45:09 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
Jun  5 16:45:34 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
Jun  5 16:45:41 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.102@o2ib10 (stopping)
Jun  5 16:45:41 soak-11 kernel: Lustre: Skipped 11 previous similar messages
Jun  5 16:45:51 soak-11 kernel: LustreError: 5001:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0002-mdtlov: statfs: rc = -108
Jun  5 16:45:51 soak-11 kernel: Lustre: 5001:0:(service.c:2117:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (31:69s); client may timeout.  req@ffff8803f1b0ef00 x1569380804405664/t0(0) o101->bd5dff2c-b3c7-e6f9-ab0e-ec6b3d83954a@192.168.1.117@o2ib100:37/0 lens 1728/544 e 1 to 0 dl 1496681082 ref 1 fl Complete:/0/0 rc -19/-19
Jun  5 16:45:54 soak-11 kernel: Lustre: server umount soaked-MDT0002 complete
Jun  5 16:45:54 soak-11 sshd[5502]: Received disconnect from 10.10.1.135: 11: disconnected by user
Jun  5 16:45:54 soak-11 sshd[5502]: pam_unix(sshd:session): session closed for user root
Jun  5 16:45:54 soak-11 systemd-logind: Removed session 15.
Jun  5 16:45:54 soak-11 systemd: Removed slice user-0.slice.
Jun  5 16:45:54 soak-11 systemd: Stopping user-0.slice.
Jun  5 16:45:56 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10)
Jun  5 16:45:56 soak-11 kernel: Lustre: Skipped 26 previous similar messages
Jun  5 16:51:12 soak-11 kernel: LustreError: 4204:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
Comment by Cliff White (Inactive) [ 07/Jun/17 ]

Hit LBUG again, syslog and lustre-logs from dead server attached

Comment by Lai Siyao [ 08/Jun/17 ]

It's strange that I don't see the debug messages before the LBUG(). Cliff, which update of https://review.whamcloud.com/#/c/27207/ did you use in this test? The debug messages were added in update 5 (Jun 3). If possible, can you retest with update 7, which adds more debug messages right before the LBUG?

Comment by Cliff White (Inactive) [ 08/Jun/17 ]

I used build 48048, which should have had the debug patches.
I will retry with the latest

Comment by Cliff White (Inactive) [ 08/Jun/17 ]

I see patch set 6, build 48063; I see no newer builds.

Comment by Lai Siyao [ 08/Jun/17 ]

Strange. Can you access https://review.whamcloud.com/#/c/27207/7? That is patch set 7, and its build is https://build.hpdd.intel.com/job/lustre-reviews/48061/.

I just checked build 48063, which is for LU-9049.

Comment by Cliff White (Inactive) [ 08/Jun/17 ]

Okay, loading that build now.

Comment by Cliff White (Inactive) [ 12/Jun/17 ]

Failed again:

Jun 11 07:25:40 soak-11 kernel: perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2549:ldlm_lock_downgrade()) ### weird lock mode ns: mdt-soaked-MDT0003_UUID lock: ffff8806b57f9800/0x3e1187b8839d4a55 lrc: 1/0,0 mode: --/PW res: [0x2c006071b:0x169d:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x44a19401000000 nid: local remote: 0x0 expref: -99 pid: 7178 timeout: 0 lvb_type: 0
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) LBUG
Jun 11 07:25:43 soak-11 kernel: Pid: 4581, comm: ptlrpc_hr01_007
Jun 11 07:25:43 soak-11 kernel: Call Trace:
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0eed7b1>] ldlm_lock_downgrade+0x111/0x210 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44818>] ptlrpc_handle_rs+0x5c8/0x700 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44a35>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20  
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44950>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 11 07:25:43 soak-11 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 11 07:25:43 soak-11 kernel:
Jun 11 07:25:43 soak-11 kernel: LustreError: dumping log to /tmp/lustre-log.1497165943.4581 
Comment by Cliff White (Inactive) [ 12/Jun/17 ]

Attached three lustre-logs, including one from immediately before the LBUG and one from after.

Comment by Lai Siyao [ 13/Jun/17 ]

Cliff, thanks! This helps a lot. What happened is as below:
1. ptlrpc_handle_rs() handles the REP-ACK rs and is about to convert PW lock A to COS mode.
2. The transaction commits, and ptlrpc_handle_rs() is called again, which decrefs lock A.
3. A conflicting lock B arrives; since lock A now has no readers or writers, it is cancelled, which sets its l_granted_mode to 0.
4. Step 1 continues, and ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) is triggered.

I'll update the patch later.

Comment by Cliff White (Inactive) [ 16/Jun/17 ]

I have run the latest patch for 48 hours on soak. No LBUGs. LFSCK does not complete, but it does abort successfully.

Comment by Gerrit Updater [ 19/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27207/
Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ca4659c9d1f010c1046b634cf5a592a620ac3935

Comment by Peter Jones [ 19/Jun/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:26:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.