Friday, November 2, 2018

Storing BB order in llvm::Instruction

[Issue] Quadratic behavior in DSE and memcpyopt from use of callCapturesBefore without an OrderedBasicBlock

by Reid Kleckner
2018-09-04
  • Compiled on Windows with optimizations and debug info
  • Compiling clang's SemaChecking.cpp file took ~121s total
    • memcpyopt : 21s
    • DSE: 21s
    • Induction variable simplification: 14s
    • Time is spent in SCEV, not AA.
      • The majority of the time in memcpyopt and DSE was in callCapturesBefore
      • Most of the samples occur in OrderedBlock::comesBefore
        • iterates the linked list of instructions in a BB and numbers them to compute intra-BB dominance.
    • Propose:
      • Maintain accurate position numbers for instructions in the basic block
      • A algorithm with amortized O(1) insertion and removal of instructions mid-block.
      • Hack MemoryDependenceResults
        • These passes already rely on it
    • Patch D51664:
      • https://reviews.llvm.org/D51664
      • Essentially, fold OrderedBasicBlock into BasicBlock, and make it auto-invalidate the instruction ordering when new instructions are added.
      • We don't need to invalidate it when removing instructions
      • Compiling SemaChecking.cpp: 121s -> 74s
        Instruction: 56 bytes -> 64 bytes
      • (In LLVM) MemorySSA and DSE, which maintain their own instruction orderings today.
    Discussion in mailing lists:

    • The inner loop of the local dominance check in DominatorTree::dominates:
      • A better algorithm would be to use two pointers - one at the user and def.  Each time through the loop, move the user iterator “up” the block, and the def iterator “down” the block.  Either the iterators meet each other (in which case return true) or you fine the beginning/end of the block.
      • it will be efficient when the user and def are close to each other, as well as being efficient when the value is at the end of the block.
      • Also, my bet is that most local dom queries return true.
    • It doesn't change the fact, however, that local dominance queries are O(n)
    • We've ended up using OrderedBasicBlock in an increasing number of places, but there are a number of places where this is hard because of the plumbing required, or more importantly, the ambiguity around who owns the state of the cache at any given time.
    • We have significant headroom.
    • I don't see any really feasible way around these issues except moving the ownership of that state into the BB itself.
    • in llvm::Instruction we have 6 bits to play with so should be able to create a more efficient scheme than the one used in llvm::Use (well, we have 6 bits even in llvm::Use but efficiency is probably not as important there because Use lives in an array, not a linked list).
    Some discussion of using the waymarking algorithm:
    • https://lists.llvm.org/pipermail/llvm-dev/2018-September/126418.html
    • https://lists.llvm.org/pipermail/llvm-dev/2018-September/126419.html
    Back to the topic:

    • The typical distribution of local dominance queries
      • Experiment on the SemaChecking.cpp
      • Patch for logging this
      • Result:
        distancefreq
        1-23707
        2-75622
        7-2025764
        20-541726
        54-1482052
        148-4034462
        403-10968423
        1096-298020487
        2980-810334930
        8103-2202663759
        22026-5987441729
      • local dominance queries are, at least for SemaChecking.cpp, not clustered around small distances. The average query is somewhere out around 11,000 instructions apart.
    • Re-run the max RSS during full LTO of llc experiment that I did when I removed the vptr from Value.
      • max RSS before the patch: 4816464 kb
      • after: 4867836 kb
      • an increase of 1.06%, ~50MB of memory spent on instruction positions during full LTO of llc.
    • I think we first need to establish that we need such a cache in llvm::Instruction/llvm::BasicBlock at all.
    • Do you have a sense of where these expensive domtree queries are coming from?  
    • DSE/memcpyopt -> AAResults::callCapturesBefore -> llvm::PointerMayBeCapturedBefore
    • Neither pass (DSE/memcpyopt) in an OrderedBasicBlock, so they rebuild the OrderedBasicBlock in linear time on every query
    • These passes insert instructions, so it's not correct to simply create and reuse an OrderedBasicBlock at this level.
    • These key call sites aside, dominance seems like a pretty common query. We have several somewhat overlapping mechanisms for checking dominance:
      • DominatorTree::dominates(Instruction*,Instruction*), the original interface, widely known as a bottleneck
      • OrderedBasicBlock, as used by memdep, my motivating example
      • OrderedInstructions, uses OrderedBasicBlock, used by GVN and others
      • MemorySSA has a similar ordering of memory values that is also invalidated when inserting memory defs. I may have misunderstood how this works, though.
    • it isn’t fixing N^2 behavior, it is merely changing one N^2 situation for another.  AFAICT there are one of two possible cases in these sorts of transformations:
      • These transformations are iteratively removing or inserting instructions, which invalidate the ordering with your approach, causing subsequent queries to rebuild the equivalent of OrderedBasicBlock.
      • These transformations are not doing anything, in which case this is all wasted work.
    • [Experiment] Instrument the calls to calls from DSE and/or memcpyopt and see how many of the ones for the huge basic blocks actually turn into a transformation in the code?
      • zero => disable these optimizations for large blocks.
      • a few improvements
        • See how to characterize them and what the right solution is based on the sorts of information they need.
    • LLVM is occasionally maligned for the magic numbers like “6” in various optimizations that bound the work in various analyses (e.g. when computing masked bits) but they exist for very real reasons:
      • the cases in which a large search will actually lead to benefits are marginal to non-existent, and
      • the possibility for unbounded queries for core APIs cause all kinds of problems.
      • (CY) The two points were why Chris opposed to this patch.
    • I see the dependence analysis queries as the same sort of situation: until something like MemorySSA is able to define away these dense queries, we should be using bounded searches (in this case larger than 6 but less than 11,000 :-).
    • Impl:
      • Change llvm::BasicBlock to keep track of the number of instruction’s in it
        • It would not change any ilist algorithmic behavior
      • Risk: performance regressions
        • We control that risk by doing some analysis of the payoff of these transformations on large blocks
        • Whether eliminating that store actually leads to a measurable performance improvement
          • Given huge blocks, I suspect the answer is “definitely not”.

    • Removing instructions doesn't invalidate the ordering, and these passes mostly delete instructions
    • If profiling shows that "insert;query;insert;query" is a bottleneck
      • Implement fancy renumbering algorithm
      • OrderedBasicBlock doesn't implement it
        => invalidating on modification is good enough.
    • It sounds like we have two options:
      • 1. Add cutoffs to poorly understood slow passes that are misbehaving
        • It adds code complexity, we'll have to test it, and tomorrow someone will add new quadratic dominance queries.
        • I don't see how heuristically limiting misbehaving optimizations builds a better foundation for LLVM tomorrow.
      • 2. Make a core data structure change to make dominance calculations faster and simpler for all transforms
    • Dominance has been and will always be an important analysis in LLVM. This patch is basically pushing an analysis cache down into IR.
    • If I added more infrastructure to invert the dependency, store these instruction numbers in DominatorTree, and make BasicBlock notify DominatorTree of instruction insertion and deletion, would that be preferable?
      • That's very similar to the code we're writing today, where the transform takes responsibility for marrying its modifications to its invalidations.
      • The question is, do we want to keep this stuff up to date automatically, like we do for use lists, or is this something we want to maintain on the side?
      • I think dominance is something we might want to start pushing down into IR.

    • The long term path is to move these passes to MemorySSA, which doesn’t have these problems.  At which point, the core IR change you are proposing becomes really the wrong thing.
    • I have several objections to caches like this, and we have been through many failed attempts at making analyses “autoadapt” to loosely coupled transformations (e.g. the CallbackVH stuff, sigh).  My objections are things like:
      • 1) Caching and invalidation are very difficult to get right.
      • 2) Invalidation logic slows down everyone even if they aren’t using the cache (your “check to see if I need to invalidate when permuting the ilist).
      • 3) We try very hard not to put analysis/xform specific gunk into core IR types, because it punishes everyone but benefits only a few passes.
      • 4) LLVM is used for a LOT of widely varying use cases - clang is just one client, and many of them don’t run these passes.  This change pessimizes all those clients, particularly the most sensitive 32-bit systems.
      • 5) If done wrong, these caches can lead break invariants that LLVM optimization passes are supposed to maintain.  For example, if you do “sparse numbering” then the actual numbering will depend on the series of transformations that happen, and if someone accidentally uses the numbers in the wrong way, you could make “opt -pass1 -pass2” behave differently than “opt -pass1 | opt -pass2”.
    • Putting this stuff into DominatorTree could make sense, but then clients of dominator tree would have to invalidate this when doing transformations.  If you put the invalidation logic into ilist, then many of the problems above reoccur.  I don’t think (but don’t know) that it is unreasonable to make all clients of DT invalidate this manually.

    • MemorySSA certainly collects the memory references in a basic block into a set of use/def chains, but determining dominance still requires walking those chains.
      • This might help reduce the constant factor on the O(N), because it skips the non-memory-access-instructions, but the underlying complexity problem remains.
      • Maybe MemorySSA should cache a local numbering, but...
    • MemorySSA only helps if you're fine with skipping non-aliasing accesses.
      • It's not clear to me that this is always the case. 
      • e.g. SLP vectorization:
        • we follow the use-def chain of an address calculation and find two adjacent, but non-aliasing, loads in the same basic block.
        • We want to convert them into a vector load, so we need to place the new vector load at the position of the first (dominating) load.
        • MemorySSA will help determine legality of the transformation, but MemorySSA won't help determine the domainance because it will have the MemorySSA nodes for both loads tied, potentially, to the same MemorySSA definition node (i.e., you can't walk from one to the other, by design, which is why it helps with the legality determination).

    • memcpyopt isn’t/shouldn’t be querying dominance of random instructions within a block, it is finding memcpy’s by scanning scalar def-use chains and trying to back patch relationships between them
      • You’d use a fundamentally different approach with MemorySSA, where  just walk the code, find memcpy’s, and ask what they depend on.
      • This is all sparse, and doesn’t require dominance calls.  InstCombine doing conceptually similar things, and because it is based on use/def chains it doesn’t need to do any of this: the dominance relationship is implicit.
    • This (patch) is proposing a redundant encoding of information, and we’re all just discussing the SW engineering tradeoffs of various approaches.
    • Clang uses memcpyopt and DSE, but (e.g.) a graphics shader compiler would not.

    No comments:

    Post a Comment