
Comparing changes

base repository: git/git
base: 6d81fe64dd646654bdb84fa5832c717954bf1696
head repository: git/git
compare: 863f2459a2047406c93758e8c3352cd5d2836f1e
  • 11 commits
  • 16 files changed
  • 1 contributor

Commits on Oct 25, 2024

  1. midx: avoid duplicate packed_git entries

    When we scan a pack directory to load the idx entries we find into the
    packed_git list, we skip any of them that are contained in a midx. We
    then load them later lazily if we actually need to access the
    corresponding pack, referencing them both from the midx struct and the
    packed_git list.
    
    The lazy-load in the midx code checks to see if the midx already
    mentions the pack, but doesn't otherwise check the packed_git list. This
    makes sense, since we should have added any pack to both lists.
    
    But there's a loophole! If we call close_object_store(), that frees the
    midx entirely, but _not_ the packed_git structs, which we must keep
    around for Reasons[1]. If we then try to look up more objects, we'll
    auto-load the midx again, which won't realize that we've already loaded
    those packs, and will create duplicate entries in the packed_git list.
    
    This is possibly inefficient, because it means we may open and map the
    pack redundantly. But it can also lead to weird user-visible behavior.
    The case I found is in "git repack", which closes and reopens the midx
    after repacking and then calls update_server_info(). We end up writing
    the duplicate entries into objects/info/packs.
    
    We could obviously de-dup them while writing that file, but it seems
    like a violation of more core assumptions that we end up with these
    duplicate structs at all.
    
    We can avoid the duplicates reasonably efficiently by checking their
    names in the pack_map hash. This annoyingly does require a little more
    than a straight hash lookup due to the naming conventions, but it should
    only happen when we are about to actually open a pack. I don't think one
    extra malloc will be noticeable there.
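
    One plausible reading of the naming wrinkle: the midx records "pack-$hash.idx" names while the pack map is keyed by ".pack" paths, so the lookup first needs a converted copy of the name. A simplified, illustrative sketch of that conversion (not git's actual code; the helper name is made up):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Convert "pack-$hash.idx" into "pack-$hash.pack". The caller owns
 * the returned string -- this allocation is the "one extra malloc"
 * mentioned above. Returns NULL if the name does not end in ".idx".
 */
char *idx_to_pack_name(const char *idx_name)
{
	size_t len = strlen(idx_name);
	size_t slen = strlen(".idx");
	char *out;

	if (len < slen || strcmp(idx_name + len - slen, ".idx"))
		return NULL;

	out = malloc(len - slen + strlen(".pack") + 1);
	if (!out)
		return NULL;
	memcpy(out, idx_name, len - slen);
	strcpy(out + (len - slen), ".pack");
	return out;
}
```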
    
    [1] I'm not entirely sure of all the details, except that we generally
        assume the packed_git structs never go away. We noted this
        restriction in the comment added by 6f1e939 (object: fix leaking
        packfiles when closing object store, 2024-08-08), but it's somewhat
        vague. At any rate, if you try freeing the structs in
        close_object_store(), you can observe segfaults all over the test
        suite. So it might be fixable, but it's not trivial.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 8b5763e
  2. t5550: count fetches in "previously-fetched .idx" test

    We have a test in t5550 that looks at index fetching over dumb http. It
    creates two branches, each of which is completely stored in its own
    pack, then fetches the branches independently. What should (and does)
    happen is that the first fetch grabs both .idx files and one .pack file,
    and then the fetch of the second branch re-uses the previously
    downloaded .idx files (fetching none) and grabs the now-required .pack
    file.
    
    Since the next few patches will be touching this area of the code, let's
    beef up the test a little by checking that we're downloading the
    expected items at each step.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 019b21d
  3. dumb-http: store downloaded pack idx as tempfile

    This patch fixes a regression in b1b8dfd (finalize_object_file():
    implement collision check, 2024-09-26) where fetching a v1 pack idx file
    over the dumb-http protocol would cause the fetch to fail.
    
    The core of the issue is that dumb-http stores the idx we fetch from the
    remote at the same path that will eventually hold the idx we generate
    from "index-pack --stdin". The sequence is something like this:
    
      0. We realize we need some object X, which we don't have locally, and
         nor does the other side have it as a loose object.
    
      1. We download the list of remote packs from objects/info/packs.
    
      2. For each entry in that file, we download each pack index and store
         it locally in .git/objects/pack/pack-$hash.idx (the $hash is not
         something we can verify yet and is given to us by the remote).
    
      3. We check each pack index we got to see if it has object X. When we
         find a match, we download the matching .pack file from the remote
         to a tempfile. We feed that to "index-pack --stdin", which
         reindexes the pack, rather than trusting that it has what the other
         side claims it does. In most cases, this will end up generating the
         exact same (byte-for-byte) pack index which we'll store at the same
         pack-$hash.idx path, because the index generation and $hash id are
         computed based on what's in the packfile. But:
    
           a. The other side might have used other options to generate the
              index. For instance we use index v2 by default, but long ago
              it was v1 (and you can still ask for v1 explicitly).
    
           b. The other side might even use a different mechanism to
              determine $hash. E.g., long ago it was based on the sorted
              list of objects in the packfile, but we switched to using the
              pack checksum in 1190a1a (pack-objects: name pack files
              after trailer hash, 2013-12-05).
    
    The regression we saw in the real world was (3a). A recent client
    fetching from a server with a v1 index downloaded that index, then
    complained about trying to overwrite it with its own v2 index. This
    collision is otherwise harmless; we know we want to replace the remote
    version with our local one, but the collision check doesn't realize
    that.
    
    There are a few options to fix it:
    
      - we could teach index-pack a command-line option to ignore only pack
        idx collisions, and use it when the dumb-http code invokes
        index-pack. This would be an awkward thing to expose users to and
        would involve a lot of boilerplate to get the option down to the
        collision code.
    
      - we could delete the remote .idx file right before running
        index-pack. It should be redundant at that point (since we've just
        downloaded the matching pack). But it feels risky to delete
        something from our own .git/objects based on what the other side has
        said. I'm not entirely positive that a malicious server couldn't lie
        about which pack-$hash.idx it has and get us to delete something
        precious.
    
      - we can stop co-mingling the downloaded idx files in our local
        objects directory. This is a slightly bigger change but I think
        fixes the root of the problem more directly.
    
    This patch implements the third option. The big design questions are:
    where do we store the downloaded files, and how do we manage their
    lifetimes?
    
    There are some additional quirks to the dumb-http system we should
    consider. Remember that in step 2 we downloaded every pack index, but in
    step 3 we may only download some of the matching packs. What happens to
    those other idx files now? They sit in the .git/objects/pack directory,
    possibly waiting to be used at a later date. That may save bandwidth for
    a subsequent fetch, but it also creates a lot of weird corner cases:
    
      - our local object directory now has semi-untrusted .idx files sitting
        around, without their matching .pack
    
      - in case 3b, we noted that we might not generate the same hash as the
        other side. In that case even if we download the matching pack,
        our index-pack invocation will store it in a different
        pack-$hash.idx file. And the unmatched .idx will sit there forever.
    
      - if the server repacks, it may delete the old packs. Now we have
        these orphaned .idx files sitting around locally that will never be
        used (nor deleted).
    
      - if we repack locally we may delete our local version of the server's
        pack index and not realize we have it. So we'll download it again,
        even though we have all of the objects it mentions.
    
    I think the right solution here is probably some more complex cache
    management system: download the remote .idx files to their own storage
    directory, mark them as "seen" when we get their matching pack (to avoid
    re-downloading even if we repack), and then delete them when the
    server's objects/info/refs no longer mentions them.
    
    But since the dumb http protocol is so ancient and so inferior to the
    smart http protocol, I don't think it's worth spending a lot of time
    creating such a system. For this patch I'm just downloading the idx
    files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
    deleted when we exit (and due to the name, any we miss due to a crash,
    etc, should eventually be removed by "git gc" runs based on timestamps).
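
    As a rough illustration of the suffix-preserving tempfile idea (git uses its own tempfile API internally; this POSIX sketch with mkstemps() is only a stand-in, and the paths are simplified):

```c
#define _DEFAULT_SOURCE /* for mkstemps() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a temporary file named tmp_pack_XXXXXX.idx inside `dir`.
 * Keeping a real ".idx" extension matters because the pack-reading
 * code expects it, and the "tmp_pack_" prefix lets later "git gc"
 * runs expire leftovers by timestamp. On success, fills `path` (of
 * size `len`) with the chosen name and returns an open descriptor;
 * returns -1 on error.
 */
int create_tmp_idx(const char *dir, char *path, size_t len)
{
	if ((size_t)snprintf(path, len, "%s/tmp_pack_XXXXXX.idx", dir) >= len)
		return -1;
	return mkstemps(path, 4); /* 4 == strlen(".idx"), kept as suffix */
}
```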
    
    That is slightly worse for one case: if we download an idx but not the
    matching pack, we won't retain that idx for subsequent runs. But the
    flip side is that we're making other cases better (we never hold on to
    useless idx files forever). I suspect that worse case does not even come
    up often, since it implies that the packs are generated to match
    distinct parts of history (i.e., in practice even in a repo with many
    packs you're going to end up grabbing all of those packs to do a clone).
    If somebody really cares about that, I think the right path forward is a
    managed cache directory as above, and this patch is providing the first
    step in that direction anyway (by moving things out of the objects/pack/
    directory).
    
    There are two test changes. One demonstrates the broken v1 index case
    (it double-checks the resulting clone with fsck to be careful, but prior
    to this patch it actually fails at the clone step). The other tweaks the
    expectation for a test that covers the "slightly worse" case to
    accommodate the extra index download.
    
    The code changes are fairly simple. We stop using finalize_object_file()
    to copy the remote's index file into place, and leave it as a tempfile.
    We give the tempfile a real ".idx" name, since the packfile code expects
    that, and thus we make sure it is out of the usual packs/ directory (so
    we'd never mistake it for a real local .idx).
    
    We also have to change parse_pack_index(), which creates a temporary
    packed_git to access our index (we need this because all of the pack idx
    code assumes we have that struct). It reads the index data from the
    tempfile, but prior to this patch would speculatively write the
    finalized name into the packed_git struct using the pack-$hash we expect
    to use.
    
    I was mildly surprised that this worked at all, since we call
    verify_pack_index() on the packed_git which mentions the final name
    before moving the file into place! But it works because
    parse_pack_index() leaves the mmap-ed data in the struct, so the
    lazy-open in verify_pack_index() never triggers, and we read from the
    tempfile, ignoring the filename in the struct completely. Hacky, but it
    works.
    
    After this patch, parse_pack_index() now uses the index filename we pass
    in to derive a matching .pack name. This is OK to change because there
    are only two callers, both in the dumb http code (and the other passes
    in an existing pack-$hash.idx name, so the derived name is going to be
    pack-$hash.pack, which is what we were using anyway).
    
    I'll follow up with some more cleanups in that area, but this patch is
    sufficient to fix the regression.
    
    Reported-by: fox <fox.gbr@townlong-yak.com>
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 63aca3f
  4. packfile: drop has_pack_index()

    The has_pack_index() function has several oddities that may make it
    surprising if you are trying to find out if we have a pack with some
    $hash:
    
      - it is not looking for a valid pack that we found while searching
        object directories. It just looks for any pack-$hash.idx file in the
        pack directory.
    
      - it only looks in the local directory, not any alternates
    
      - it takes a bare "unsigned char" hash, which we try to avoid these
        days
    
    The only caller it has is in the dumb http code; it wants to know if we
    already have the pack idx in question. This can happen if we downloaded
    the pack (and generated its index) during a previous fetch.
    
    Before the previous patch ("dumb-http: store downloaded pack idx as
    tempfile"), it could also happen if we downloaded the .idx from the
    remote but didn't get the matching .pack. But since that patch, we don't
    hold on to those .idx files. So there's no need to look for the .idx
    file in the filesystem; we can just scan through the packed_git list to
    see if we have it.
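
    A minimal sketch of the in-memory check, with a stand-in for git's packed_git struct (the real list hangs off the repository's object store):

```c
#include <string.h>

/* Simplified stand-in for git's packed_git list entry. */
struct packed_git {
	struct packed_git *next;
	char pack_name[256]; /* e.g. "objects/pack/pack-$hash.pack" */
};

/*
 * Rather than probing the filesystem for pack-$hash.idx, walk the
 * loaded pack list: if an entry's name ends with the wanted basename,
 * we already have both the .idx and its matching .pack.
 */
int have_pack(const struct packed_git *packs, const char *wanted)
{
	const struct packed_git *p;
	size_t wlen = strlen(wanted);

	for (p = packs; p; p = p->next) {
		size_t plen = strlen(p->pack_name);
		if (plen >= wlen && !strcmp(p->pack_name + plen - wlen, wanted))
			return 1;
	}
	return 0;
}
```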
    
    That lets us simplify the dumb http code a bit, as we know that if we
    have the .idx we have the matching .pack already. And it lets us get rid
    of this odd function that is unlikely to be needed again.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 03b8eed
  5. packfile: drop sha1_pack_name()

    The sha1_pack_name() function has a few ugly bits:
    
      - it writes into a static strbuf (and not even a ring buffer of them),
        which can lead to subtle invalidation problems
    
      - it uses the term "sha1", but it's really using the_hash_algo, which
        could be sha256
    
    There's only one caller of it left. And in fact that caller is better
    off using the underlying odb_pack_name() function itself, since it's
    just copying the result into its own strbuf anyway.
    
    Converting that caller lets us get rid of this now-obsolete function.
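
    For reference, a simplified model of the path odb_pack_name() produces, using a plain buffer in place of git's strbuf (the "<objdir>/pack/pack-<hash>.<ext>" shape is the point; details are illustrative):

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "<objdir>/pack/pack-<hash>.<ext>" into the caller's buffer.
 * Like the real function (which resets its strbuf first), this
 * overwrites any previous contents, so one buffer can be reused
 * safely in a loop.
 */
const char *pack_path(char *buf, size_t len, const char *objdir,
		      const char *hex_hash, const char *ext)
{
	snprintf(buf, len, "%s/pack/pack-%s.%s", objdir, hex_hash, ext);
	return buf;
}
```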
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit c2dc4c9
  6. packfile: drop sha1_pack_index_name()

    Like sha1_pack_name() that we dropped in the previous commit, this
    function uses an error-prone static strbuf and the somewhat misleading
    name "sha1".
    
    The only caller left is in pack-redundant.c. While this command is
    marked for potential removal in our BreakingChanges document, we still
    have it for now. But it's simple enough to convert it to use its own
    strbuf with the underlying odb_pack_name() function, letting us drop the
    otherwise obsolete function.
    
    Note that odb_pack_name() does its own strbuf_reset(), so it's safe to
    use directly within a loop like this.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 4390fea
  7. packfile: warn people away from parse_packed_git()

    With a name like parse_packed_git(), you might think it's the right way
    to access a local pack index and its associated objects. But not so!
    It's a one-off used by the dumb-http code to access pack idx files we've
    downloaded from the remote, but whose packs we might not have.
    
    There's only one caller left for this function, and ideally we'd drop it
    completely and just inline it there. But that would require exposing
    other internals from packfile.[ch], like alloc_packed_git() and
    check_packed_git_idx().
    
    So let's leave it be for now, and just warn people that it's probably
    not what they're looking for. Perhaps in the long run if we eventually
    drop dumb-http support, we can remove the function entirely then.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 6b2fc22
  8. http-walker: use object_id instead of bare hash

    We long ago switched most code to using object_id structs instead of
    bare "unsigned char *" hashes. This gives us more type safety from the
    compiler, and generally makes it easier to understand what we expect in
    each parameter.
    
    But the dumb-http code has lagged behind. And indeed, the whole "walker"
    subsystem interface has the same problem, though http-walker is the only
    user left.
    
    So let's update the walker interface to pass object_id structs (which we
    already have anyway at all call sites!), and likewise use those within
    the http-walker methods that it calls.
    
    This cleans up the dumb-http code a bit, but will also let us fix a few
    more commonly used helper functions.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 0af861e
  9. packfile: convert find_sha1_pack() to use object_id

    The find_sha1_pack() function has a few problems:
    
      - it's badly named, since it works with any object hash
    
      - it takes the hash as a bare pointer rather than an object_id struct
    
    We can fix both of these easily, as all callers actually have a real
    object_id anyway.
    
    I also found the existence of this function somewhat confusing, as it is
    about looking in an arbitrary set of linked packed_git structs. It's
    good for things like dumb-http which are looking in downloaded remote
    packs, and not our local packs. But despite the name, it is not a good
    way to find the pack which contains a local object (it skips the use of
    the midx, the pack mru list, and so on).
    
    So let's also add an explanatory comment above the declaration that may
    point people in the right direction.
    
    I suspect the calls in fast-import.c, which use the packed_git list from
    the repository struct, could actually just be using find_pack_entry().
    But since we'd need to keep it anyway for dumb-http, I didn't dig
    further there. If we eventually drop dumb-http support, then it might be
    worth examining them to see if we can get rid of the function entirely.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 4d99559
  10. packfile: use object_id in find_pack_entry_one()

    The main function we use to search a pack index for an object is
    find_pack_entry_one(). That function still takes a bare pointer to the
    hash, despite the fact that its underlying bsearch_pack() function needs
    an object_id struct. And so we end up making an extra copy of the hash
    into the struct just to do a lookup.
    
    As it turns out, all callers but one already have such an object_id. So
    we can just take a pointer to that struct and use it directly. This
    avoids the extra copy and provides a more type-safe interface.
    
    The one exception is get_delta_base() in packfile.c, when we are chasing
    a REF_DELTA from inside the pack (and thus we have a pointer directly to
    the mmap'd pack memory, not a struct). We can just bump the hashcpy()
    from inside find_pack_entry_one() to this one caller that needs it.
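
    The shape of the change, sketched with stand-in types (the search stub plays the role of the struct-based bsearch_pack(); all names here are illustrative):

```c
#include <string.h>

#define MAX_RAWSZ 32
struct object_id { unsigned char hash[MAX_RAWSZ]; };

/* Stand-in for the struct-based search underneath. */
static int search(const struct object_id *oid)
{
	static const unsigned char wanted[MAX_RAWSZ] = { 0x42 };
	return !memcmp(oid->hash, wanted, MAX_RAWSZ);
}

/* Before: a bare pointer forces a copy into a struct just to search. */
int find_entry_raw(const unsigned char *hash)
{
	struct object_id oid;
	memcpy(oid.hash, hash, MAX_RAWSZ); /* the extra copy */
	return search(&oid);
}

/* After: callers that already hold an object_id pass it directly. */
int find_entry(const struct object_id *oid)
{
	return search(oid);
}
```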
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 479ab76
  11. packfile: use oidread() instead of hashcpy() to fill object_id

    When chasing a REF_DELTA, we need to pull the raw hash bytes out of the
    mmap'd packfile into an object_id struct. We do that with a raw
    hashcpy() of the appropriate length (that happens directly now, though
    before the previous commit it happened inside find_pack_entry_one(),
    also using a hashcpy).
    
    But I think this creates a potentially dangerous situation due to
    d4d364b (hash: convert `oidcmp()` and `oideq()` to compare whole
    hash, 2024-06-14). When using sha1, we'll have uninitialized bytes in
    the latter part of the object_id.hash buffer, which could fool oideq(),
    etc.
    
    We should use oidread() instead, which correctly zero-pads the extra
    bytes, as of c98d762 (global: ensure that object IDs are always
    padded, 2024-06-14).
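
    The padding difference can be seen with a fixed-size buffer (sizes mirror git's GIT_MAX_RAWSZ and SHA-1's 20 bytes; the helper names are illustrative, not git's):

```c
#include <string.h>

#define MAX_RAWSZ 32  /* mirrors git's GIT_MAX_RAWSZ */
#define SHA1_RAWSZ 20

struct object_id { unsigned char hash[MAX_RAWSZ]; };

/* hashcpy()-style: copy only the algorithm's bytes, leave the tail. */
void copy_raw(struct object_id *oid, const unsigned char *hash)
{
	memcpy(oid->hash, hash, SHA1_RAWSZ);
}

/* oidread()-style: copy, then zero-pad out to the maximum size. */
void copy_padded(struct object_id *oid, const unsigned char *hash)
{
	memcpy(oid->hash, hash, SHA1_RAWSZ);
	memset(oid->hash + SHA1_RAWSZ, 0, MAX_RAWSZ - SHA1_RAWSZ);
}

/* A whole-buffer comparison, as oideq() does after d4d364b. */
int oid_eq(const struct object_id *a, const struct object_id *b)
{
	return !memcmp(a->hash, b->hash, MAX_RAWSZ);
}
```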
    
    As far as I can see, this has not been a problem in practice because the
    object_id we feed to find_pack_entry_one() is never used with oideq(),
    etc. It is being compared to the bytes mmap'd from a pack idx file,
    which of course do not have the extra padding bytes themselves. So
    there's no bug here, but this just puzzled me while looking at the code.
    We should do the more obviously safe thing, both for future-proofing and
    to avoid confusing readers.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Taylor Blau <me@ttaylorr.com>
    peff authored and ttaylorr committed Oct 25, 2024
    Commit 863f245