[LU-337] Processes stuck in sync_page on lustre client Created: 17/May/11 Updated: 28/Jun/11 Resolved: 06/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-1.8.5.0-3chaos, RHEL5.5ish (CHAOS4.4-2) |
||
| Severity: | 3 |
| Rank (Obsolete): | 4997 |
| Description |
|
In production we fairly often see the task pdflush reported as "blocked for more than 120 seconds" in the client console logs. These reports are often followed by console messages about timeouts and evictions. On some nodes, this appears to be non-fatal; recovery takes place and all is well. On others, the node gets into a state where many threads appear to be stuck in sync_page(), apparently deadlocked. pdflush usually has this backtrace regardless of whether the hang is fatal: 2011-05-13 14:52:42 INFO: task pdflush:590 blocked for more than 120 seconds. |
| Comments |
| Comment by Christopher Morrone [ 17/May/11 ] |
|
When processes hang, they all appear to be stuck in sync_page. Here's a list of processes and their general state when they appear to be deadlocked. pdflush (same as the trace in the bug description): sync_page. Then there is the ll_imp_inval thread: sync_page. And then the "sync" process: sync_page. |
| Comment by Peter Jones [ 18/May/11 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 18/May/11 ] |
|
Christopher Morrone, I just want to confirm whether your lustre-1.8.5.0-3chaos codebase contains the bug 21873 patch. The two issues look similar: the memory collector is waiting for some pages that are under writeback. |
| Comment by Christopher Morrone [ 19/May/11 ] |
|
Bug 21873 is "Have clusterDB generated pxeconfig files for each node". I think you meant something else. If the patch went into 1.8.5, we have it. We keep all of our branches publicly available here: https://github.com/chaos/lustre And that tag in particular is here: |
| Comment by Zhenyu Xu [ 19/May/11 ] |
|
Sorry, I meant 21983. I've checked the source you provided (https://github.com/chaos/lustre/tree/1.8.5.0-3chaos), and the patch for 21983 is there. |
| Comment by Christopher Morrone [ 19/May/11 ] |
|
Ah, bug 21983 was a bug that dealt with readahead, not writeback. Also, in that bug a single thread was getting stuck on a lock that it held itself. I suspect (but don't know for certain) that is not the case with this one. |
| Comment by Christopher Morrone [ 19/May/11 ] |
|
The admins alerted me to another stuck node. This time only three processes are stuck: pdflush, sync, and ptlrpcd. Note that the sync command was issued as part of the job cleanup script (slurm's job "epilog" script). Our "lflush" script echoes "clean" into all of the lru_size files in /proc, and then calls /bin/sync once. In both of the cases I have seen thus far, ptlrpcd is stuck under ldlm_pools_cli_shrink(), originating from an osc_quota_setdq() call that calls cfs_mem_cache_alloc(), which kicks off cache_alloc_refill, etc. |
| Comment by Christopher Morrone [ 19/May/11 ] |
|
So perhaps this is indeed another case of using the wrong alloc mask. I see that alloc_qinfo() is doing: OBD_SLAB_ALLOC(oqi, qinfo_cachep, CFS_ALLOC_STD, sizeof(*oqi)); I am guessing that it needs to be CFS_ALLOC_IO, just like I needed to do in bug 21983 for llap_from_page_with_lockh(). |
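The reasoning in this comment can be illustrated with a toy userspace model. This is only a sketch under stated assumptions, not Lustre or kernel code: the `toy_` names and flag values are hypothetical, and the model assumes that a mask permitting filesystem re-entry lets memory reclaim call a filesystem shrinker, which then waits on work that only the allocating thread can complete.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for allocation-mask bits. These are NOT the
 * real libcfs or kernel values; they only model the distinction. */
enum {
    TOY_ALLOC_IO  = 1 << 0,                      /* reclaim may start block I/O */
    TOY_ALLOC_FS  = 1 << 1,                      /* reclaim may re-enter the filesystem */
    TOY_ALLOC_STD = TOY_ALLOC_IO | TOY_ALLOC_FS  /* both allowed */
};

struct toy_client {
    bool memory_low;          /* allocation cannot succeed without reclaim */
    bool caller_needed_by_fs; /* the caller must make progress before the
                                 filesystem can finish its writeback */
    int  shrinker_calls;      /* how often the filesystem shrinker was entered */
};

/* Model of a filesystem shrinker. Returns true if it would block forever
 * because it waits on work that only the calling thread can complete. */
static bool toy_fs_shrinker_blocks(struct toy_client *c)
{
    c->shrinker_calls++;
    return c->caller_needed_by_fs;
}

/* Returns true if the allocation completes, false if it would deadlock. */
bool toy_alloc(struct toy_client *c, int mask)
{
    if (!c->memory_low)
        return true;              /* no reclaim needed */
    if (!(mask & TOY_ALLOC_FS))
        return true;              /* reclaim skips filesystem shrinkers */
    return !toy_fs_shrinker_blocks(c);
}
```

In this model, a caller that the filesystem depends on deadlocks with `TOY_ALLOC_STD` under memory pressure but completes with `TOY_ALLOC_IO`, which mirrors the proposed one-line change to the mask.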
| Comment by Christopher Morrone [ 19/May/11 ] |
|
I pushed http://review.whamcloud.com/582 with that one line change, in case you think that is the correct fix. |
| Comment by Build Master (Inactive) [ 20/May/11 ] |
|
Integrated in Johann Lombardi : 74ec6f5c8d1d73108c6de24a82f6384c98f2bac1
|
| Comment by Peter Jones [ 20/May/11 ] |
|
Does this patch need porting to master? |
| Comment by Zhenyu Xu [ 20/May/11 ] |
|
I think so. Christopher Morrone, would you mind pushing it to the master branch for review? Thanks. |
| Comment by Christopher Morrone [ 31/May/11 ] |
|
Pushed for master here: |
| Comment by Zhenyu Xu [ 06/Jun/11 ] |
|
Patch landed on b1_8 for 1.8.6 and on master for 2.1.0. |
| Comment by Build Master (Inactive) [ 06/Jun/11 ] |
|
Integrated in Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
|
| Comment by Build Master (Inactive) [ 07/Jun/11 ] |
|
Integrated in Oleg Drokin : d8506f4b3a03b5605fc927409ce16f55ad5bffd5
|