David's Blog

Jun 03, 2018

Chunking lists in Python

A few weeks ago I needed to chunk a potentially large list of items into smaller lists for batch processing in a Python codebase. I remembered that itertools has no such function built in, but its documentation does include a recipe for it, called grouper. The recipe worked fine for my use case, so I used it, but it got me wondering why Python doesn't provide what seems to be a fairly common operation (in another codebase I'd had occasion to do this a lot, and we had a helper function for it). This led me down two paths.
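
For reference, the grouper recipe in the itertools documentation looked roughly like the sketch below at the time. Treat it as an approximation of the recipe rather than an exact copy; the point is that it pads the final chunk with fillvalue and yields tuples rather than lists.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    # n references to the SAME iterator: zip_longest pulls one item from
    # each in turn, so every output tuple holds n consecutive items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


print(list(grouper(range(5), 2)))
# [(0, 1), (2, 3), (4, None)]
```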

One was to gut check that I wasn't crazy for thinking this belonged in a standard library. I spot-checked a few "batteries included" languages that I have some familiarity with (Ruby and Rust), and each has at least one built-in solution for this. Even C++ is considering including a solution in the standard via the Ranges-V3 proposal, though that isn't a surprise given the breadth of language features and library functions (particularly around containers) that C++ has.

The other path was to check out the history of why this hadn't been included in Python. That led me to a thread on the Python mailing lists and a Python issue. While it would be easy to summarize my (outside) view of the reason for not including a standard function as "bikeshedding", the discussions did bring up a number of tradeoffs regarding the interface and implementation.

Some of these tradeoffs:

- All chunks the same size (with a possible sentinel/filler), or allow a smaller final chunk?
- List vs. iterable?
- Using common Python primitives vs. modules implemented in C for speed?
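
To make the list vs. iterable tradeoff concrete, here is a minimal sketch. Both helper names (chunk_by_slice and chunk_lazily) are made up for illustration and do not come from the mailing list thread. Slicing needs len() and indexing, so it only works on sequences, while an islice-based version accepts any iterable at the cost of consuming it.

```python
import itertools


def chunk_by_slice(seq, n):
    # Hypothetical helper: relies on len() and slicing, so sequences only.
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def chunk_lazily(iterable, n):
    # Hypothetical helper: two-argument iter() keeps calling the lambda
    # until it returns the sentinel [], i.e. until the input is exhausted.
    it = iter(iterable)
    return iter(lambda: list(itertools.islice(it, n)), [])


squares = (x * x for x in range(5))  # a generator: no len(), no slicing

try:
    chunk_by_slice(squares, 2)
except TypeError:
    print("slicing needs a real sequence")

print(list(chunk_lazily(squares, 2)))
# [[0, 1], [4, 9], [16]]
```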

At the end of the day, I filed this under one of the oddities of Python and moved on with the task at hand, but I thought it would be interesting to share. There simply isn't consensus over what constitutes the right chunking helper function for the Python standard library, nor does there appear to be any interest in resolving the sticking points. FWIW, one of the biggest sticking points seems to be what to do with an off-size chunk at the end, but I find myself in agreement with the original mailing list poster that this isn't a huge deal and that callers could handle it quite easily if it were a concern.
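
As a sketch of how a caller might deal with an off-size final chunk, assume the chunker yields lists and leaves the last one short. The two helpers below (pad_last and drop_short, both hypothetical names) either pad that chunk or discard it.

```python
def pad_last(chunks, n, fillvalue=None):
    # Hypothetical helper: pad any chunk shorter than n (in practice only
    # the final one) up to length n with fillvalue.
    for chunk in chunks:
        if len(chunk) < n:
            chunk = chunk + [fillvalue] * (n - len(chunk))
        yield chunk


def drop_short(chunks, n):
    # Hypothetical helper: keep only the full-size chunks.
    return (chunk for chunk in chunks if len(chunk) == n)


print(list(pad_last([[0, 1, 2, 3], [4, 5]], 4)))
# [[0, 1, 2, 3], [4, 5, None, None]]
print(list(drop_short([[0, 1, 2, 3], [4, 5]], 4)))
# [[0, 1, 2, 3]]
```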

For kicks, I threw together a few additional chunking functions, each with a slightly different implementation:

import itertools

def chunk(iterable, n=1):
    """
    Returns an iterable of lists of at most length n. Does not add filler
    values if not enough input values for a given chunk.

    >>> list(chunk(range(6), n=2))
    [[0, 1], [2, 3], [4, 5]]
    >>> list(chunk(range(6), n=4))
    [[0, 1, 2, 3], [4, 5]]
    """
    it = iter(iterable)
    current = 0
    next_chunk = []
    while True:
        while current < n:
            try:
                next_chunk.append(next(it))
                current += 1
            except StopIteration:
                if next_chunk:
                    yield next_chunk
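                # PEP 479: raising StopIteration inside a generator becomes
                # a RuntimeError on Python 3.7+, so end with a plain return.
                return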
        yield next_chunk
        current = 0
        next_chunk = []


def chunk2(lst, n=1):
    """
    Chunks an input list by size n using slices.

    >>> list(chunk2(list(range(6)), n=2))
    [[0, 1], [2, 3], [4, 5]]
    >>> list(chunk2(list(range(6)), n=4))
    [[0, 1, 2, 3], [4, 5]]
    >>> list(chunk2(list(range(4)), n=5))
    [[0, 1, 2, 3]]
    """
    last_used = 0
    # Compare against len(lst), not len(lst) - 1, so that a final
    # one-element chunk is not dropped.
    while last_used < len(lst):
        chunk = lst[last_used:last_used + n]
        last_used += n
        yield chunk


def chunk3(iterable, n=1):
    """
    Chunks any iterable by size n using islice.

    >>> list(chunk3(range(6), n=2))
    [[0, 1], [2, 3], [4, 5]]
    >>> list(chunk3(range(6), n=4))
    [[0, 1, 2, 3], [4, 5]]
    >>> list(chunk3(range(4), n=5))
    [[0, 1, 2, 3]]
    """
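    # Consume a single shared iterator so this also works for one-shot
    # iterables such as generators.
    it = iter(iterable)
    while True:
        # islice returns a lazy iterator (always truthy, no len()), so
        # materialize each chunk as a list before testing it.
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk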