[LU-8399] MDT hung at lu_object_find_at during umount Created: 14/Jul/16  Updated: 06/Aug/16  Resolved: 06/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

autotest review-dne


Issue Links:
Related
is related to LU-7782 sanity-scrub test_2: NULL pointer der... Resolved
is related to LU-7513 sanity-scrub test_4a: test failed to ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-scrub test_1c and test_4a are timing out with a hang in unmount of a MDT. All failures so far are in review-dne.

From the test_log, we see the test is umounting MDTs:

…
stop mds1
CMD: onyx-33vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
CMD: onyx-33vm7 umount -d /mnt/lustre-mds1
CMD: onyx-33vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
stop mds2
CMD: onyx-33vm3 grep -c /mnt/lustre-mds2' ' /proc/mounts
CMD: onyx-33vm3 umount -d /mnt/lustre-mds2
CMD: onyx-33vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
stop mds3
CMD: onyx-33vm7 grep -c /mnt/lustre-mds3' ' /proc/mounts
CMD: onyx-33vm7 umount -d /mnt/lustre-mds3

Looking at the MDS that is in the process of unmounting an MDT, in the console logs, we see errors when trying to unmount and the stack trace:

06:49:24:[12121.531608] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds3
06:49:24:[12141.673117] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12141.681956] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12141.688142] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12161.692141] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12161.701736] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12161.708145] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12181.712143] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12181.722112] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12181.729016] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12201.733130] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12201.742520] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12201.749824] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12221.754157] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12221.763268] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12221.770918] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12241.776150] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:49:24:[12241.784285] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:49:24:[12241.792494] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:49:24:[12261.798128] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:54:55:[12261.805202] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:54:55:[12261.813777] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:54:55:[12301.819116] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:54:55:[12301.825881] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 1 previous similar message
06:54:55:[12301.831998] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:54:55:[12301.840643] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 1 previous similar message
06:54:55:[12301.843920] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
06:54:55:[12301.849581] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) Skipped 1 previous similar message
06:54:55:[12360.587118] INFO: task osp_up0-2:22450 blocked for more than 120 seconds.
06:54:55:[12360.592257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
06:54:55:[12360.595199] osp_up0-2       D ffffc90004019000     0 22450      2 0x00000080
06:54:55:[12360.597925]  ffff88007a277a98 0000000000000046 ffff8800441b8000 ffff88007a277fd8
06:54:55:[12360.600858]  ffff88007a277fd8 ffff88007a277fd8 ffff8800441b8000 ffff88004435c240
06:54:55:[12360.603645]  ffff8800435be000 ffff880042fd6020 0000000000000000 ffffc90004019000
06:54:55:[12360.606403] Call Trace:
06:54:55:[12360.608738]  [<ffffffff8163ba29>] schedule+0x29/0x70
06:54:55:[12360.611237]  [<ffffffffa07f797d>] lu_object_find_at+0x4d/0xe0 [obdclass]
06:54:55:[12360.613840]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
06:54:55:[12360.616305]  [<ffffffffa07f7e3f>] lu_object_find_slice+0x1f/0x90 [obdclass]
06:54:55:[12360.618842]  [<ffffffffa0f7c5f4>] osp_trans_stop_cb+0x1b4/0x2c0 [osp]
06:54:55:[12360.621315]  [<ffffffffa0f7eeeb>] osp_update_interpret+0x21b/0x4b0 [osp]
06:54:55:[12360.623814]  [<ffffffffa0a0d725>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
06:54:55:[12360.626273]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
06:54:55:[12360.628649]  [<ffffffffa0a0f12b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
06:54:55:[12360.631038]  [<ffffffffa0a0f6a1>] ptlrpc_set_wait+0x4f1/0x900 [ptlrpc]
06:54:55:[12360.633299]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
06:54:55:[12360.635543]  [<ffffffffa0a0fb2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
06:54:55:[12360.637787]  [<ffffffffa0f7f6b2>] osp_send_update_req+0x1c2/0x830 [osp]
06:54:55:[12360.640042]  [<ffffffffa0f80573>] osp_send_update_thread+0x233/0x5e0 [osp]
06:54:55:[12360.642227]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
06:54:55:[12360.644346]  [<ffffffffa0f80340>] ? osp_invalidate_request+0x370/0x370 [osp]
06:54:55:[12360.646486]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
06:54:55:[12360.648492]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
06:54:55:[12360.650536]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
06:54:55:[12360.652551]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
06:54:55:[12372.704108] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468417841/real 1468417841]  req@ffff880054be9500 x1539734990553872/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1468417866 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
06:54:55:[12372.714842] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 46 previous similar messages
06:54:55:[12381.852134] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
06:54:55:[12381.861581] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
06:54:55:[12381.864156] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
06:54:55:[12381.870152] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
06:54:55:[12381.872522] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
…

sanity-scrub test_1c started failing with this error on July 13, 2016. Logs for some of these failures are at:
https://testing.hpdd.intel.com/test_sets/c22c1f38-48c5-11e6-bf87-5254006e85c2
https://testing.hpdd.intel.com/test_sets/1bf26ff2-48eb-11e6-bf87-5254006e85c2
https://testing.hpdd.intel.com/test_sets/055ea44e-4900-11e6-bf87-5254006e85c2

sanity-scrub test_4a started failing with this error on July 13, 2016. Logs for some of these failures are at:
https://testing.hpdd.intel.com/test_sets/a3238bf4-48d8-11e6-9f8e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/61c003cc-492d-11e6-8968-5254006e85c2
https://testing.hpdd.intel.com/test_sets/cdb52f20-492e-11e6-8968-5254006e85c2



 Comments   
Comment by Sebastien Buisson (Inactive) [ 15/Jul/16 ]

Another occurence of failure on sanity-scrub test 1c on master:
https://testing.hpdd.intel.com/test_sets/3178d7a6-4a0a-11e6-8968-5254006e85c2

Comment by Jian Yu [ 15/Jul/16 ]

This is blocking patch review testing on master branch:
https://testing.hpdd.intel.com/test_sets/1d28e912-49ec-11e6-9f8e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d519d1f0-49ef-11e6-bf87-5254006e85c2
https://testing.hpdd.intel.com/test_sets/50cc4684-49eb-11e6-bf87-5254006e85c2

Comment by nasf (Inactive) [ 15/Jul/16 ]

According to the logs, it is not the OI scrub logic blocked the lu_object_find(), instead, it happened before the OI scrub. To test OI scrub logic, we need to generate some files and directories. For DNE case, we will:
1) create some striped directories via scrub_prep();
2) then stop all the MDTs;
3) and then mount the MDTs as "ldiskfs" mode to backup/restore.

The hung happened at the second step. Because MDT1 umount before the MDT2 (or MDT3) flushing the async update OUT RPC, as to the MDT2 (or MDT3) get failure, then osp_trans_stop_cb() will try to purge related OSP object attribute cache, it needs to locate related object firstly. But we are umounting MDT2 (or MDT4) at that time, related object are marked as to be purged from RAM. On the other hand, someone may be referencing the object and waiting for the transaction callback. But the transaction callback is blocked, then deadlock.

In fact, if the object is purging out of RAM, the osp_invalidate() needs to do nothing. So we can use non-blocked method to locate related object for osp_invalidate() to avoid deadlock.

Comment by Gerrit Updater [ 15/Jul/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21330
Subject: LU-8399 osp: non-blocked osp_invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2cc1f711f29f5c3f82829b414561e083adaf8c12

Comment by Andreas Dilger [ 20/Jul/16 ]

It appears that this problem is related to the patch http://review.whamcloud.com/18801 "LU-7782 scrub: handle slave obj of striped directory".

Comment by Oleg Drokin [ 20/Jul/16 ]

I reopened LU-7782 and reverted that patch. so we probably should close this as adup of LU-7782.
The patch from here + whatever necessary followon fixes need to be rolled into new LU-7782 patch in one form or another, I guess.

Comment by Peter Jones [ 20/Jul/16 ]

makes sense to me

Comment by nasf (Inactive) [ 21/Jul/16 ]

We hit again after the reversion of LU-7782 patch:
https://testing.hpdd.intel.com/test_sets/47596800-4f5b-11e6-bf87-5254006e85c2

It seems that LU-7782 patch makes the issue to be reproduced relatively easily, but not the root reason.

Comment by Jian Yu [ 22/Jul/16 ]

One more instance on master branch: https://testing.hpdd.intel.com/test_sets/55571e7a-4fa6-11e6-bf87-5254006e85c2

Comment by nasf (Inactive) [ 24/Jul/16 ]

We still need the patch:
http://review.whamcloud.com/21330

Comment by Bob Glossman (Inactive) [ 26/Jul/16 ]

another on master, in replay-single test_100b:
https://testing.hpdd.intel.com/test_sets/b2da040a-5306-11e6-8968-5254006e85c2

Comment by Jian Yu [ 03/Aug/16 ]

+1 on master, in replay-single test_100b:
https://testing.hpdd.intel.com/test_sets/613c47b2-594f-11e6-b5b1-5254006e85c2

Comment by Gerrit Updater [ 06/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21330/
Subject: LU-8399 osp: direct reference object to be invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63ef1b3bb8a0598ea2b2c70b5ad0550680723ab8

Comment by nasf (Inactive) [ 06/Aug/16 ]

The patch has been landed to master.

Generated at Sat Feb 10 02:17:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.