Sunday, November 15, 2009

Fun with Linq Aggregate

Say we’ve got a CSV file:

private const string csv = 
@"1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20";

We want to parse it into nested List<int>, and then print out all the numbers on a single line. We might do it like this:

public void ParseCsvAndOutputAsString()
{
    var data = new List<List<int>>();

    foreach (var line in csv.Split('\n'))
    {
        var innerList = new List<int>();
        foreach (var item in line.Split(','))
        {
            innerList.Add(int.Parse(item));
        }
        data.Add(innerList);
    }

    string output = "";
    foreach (var innerList in data)
    {
        foreach (var item in innerList)
        {
            output = string.Format("{0} {1}", output, item);
        }
    }

    Console.WriteLine(output);
}

Yes, I know I should use a StringBuilder and AddRange, but ignore that for the moment, I’m trying to make a point here.

Taking a collection of values and reducing them down to a single value is a very common task in programming. Here we’re doing it twice; first we’re taking a string, splitting it apart and then reducing it down to a single reference to a List<List<int>>; then we’re taking the may items of data and reducing them to a string.

This is so common in fact, that many programming languages have some kind of ‘reduce’ functionality built in. It’s especially common with functional languages. Did you know that C# also has a reduce function? It’s the Aggregate extension method. Here’s the same method written in two statements with it:

[Test]
public void ParseCsvAndOutputAsStringUsingAgregate()
{
    var data = csv
        .Split('\n')
        .Aggregate(
            new List<List<int>>(),
            (list, line) => list.Append(line
                .Split(',')
                .Select(str => int.Parse(str))
                .Aggregate(
                    new List<int>(),
                    (innerList, item) => innerList.Append(item))));

    Console.WriteLine(data
        .SelectMany(innerList => innerList)
        .Aggregate("", (output, item) => string.Format("{0} {1}", output, item)));
}

Aggregate takes two parameters; the first sets up the initial value, in our case we create new instances of List<List<int>>, List<int> and an empty string, this is known as the ‘accumulator’; the second is the function that does the accumulating.

Aggregate works really well with fluent interfaces, where methods return their instance. I’ve added a fluent ‘Append’ extension method to List<int> to help me here:

public static List<T> Append<T>(this List<T> list, T item)
{
    list.Add(item);
    return list;
}

So any time you’ve got a collection of stuff that you want to ‘reduce’ to a single item, remember Aggregate.

13 comments:

Matt said...

Aggregate is one of those helper functions that's overused when you first learn it because it's so cool and allows you to be so clever. I'd rather maintain the first code example, even with the pedestrian foreach statements.

Mahmoud Ghoz said...

DO you think that this way of writing the code like this will be readable ... it is not readable at all. what do you thik?

Anonymous said...

The first code is much more readable !
Your 2 statements are awful, we just can not imagine what is going down.

Unknown said...

Pretty cool - even though it is just for fun :-) (A point which some of your other readers seem to miss..)

Mike Hadlow said...

I guess readability has something to do with familiarity. If you'd only ever had Aggregate methods and suddenly foreach was introduced, you'd all be complaining about that :)

I do agree though that the first statement is uglier than the comparable nested foreach loops, but I think that's more to do with the clunky C# syntax than with nature of the thing. I should have added an F# (or Haskell or Erlang) example to show much cleaner it looks when your language is functional from the ground up.

Unknown said...

It's clever and I like it - but I probably won't do it.
I worry about recovering from formatting errors in the input file - easy to handle in old fashioned code, less so in fancy linq syntax?

Anonymous said...

As a newcomer to both C# and Python, I had to take up the challenge of doing this in Python. There might be a better way, but this worked for me (I apologize for the slightly broken formatting...):

import operator

csv = """1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20"""

def list_append(list, elem):
. . . .list.append(elem)
. . . .return list

data = reduce(list_append, (line.split(",") for line in csv.split("\n")), [])

output = " ".join(reduce(operator.concat, data))

print(output)

(The ". . . ." construct means "indent 4 spaces", which the comment won't let me do :))

If list.append() returned the list or of lambdas could include two statments, the list_append helper wouldn't be needed. But, maybe there's a guru who knows more than I.

Of course, this is still sub optimal. Loops would probably be better. But for the first reduce, that's just still:

data = []
for line in csv.split("\n"):
. . . .data.append(line.split(","))

Thanks for the challenge!

Mike Hadlow said...

Redbeard, cool thanks for that, it's interesting to see how it would be done in Python.

I have to point out though, that the data structure you parse your csv into is different from my example. It's a single list rather than a list of lists in my case.

Anonymous said...

Actually, it is a list of lists. csv.split("\n") creates a list of strings. Call it los.

Then, (line.split(",") for line in los) creates a generator that spits out a series of lists.

Finally, the reduce appends each element of the list to the source list. In Python, append adds the actual object to the list. Add will split up the list and add individual elements. So the resulting value of data is:

[['1', '2', '3', '4', '5'], ['6', '7', '8', '9', '10'], ['11', '12', '13', '14', '15'], ['16', '17',
'18', '19', '20']]

Incidentally, in the process of researching this I discovered that the reduce line can be changed to:

data = [line.split(",") for line in csv.split("\n")]

That removes the necessity for the list_append function. It also reminds me of the power of list comprehensions :) Makes the whole thing, not counting the csv definition, three lines.

Also, it should be noted that split() could be used instead of split("\n"), but that splits on all white space, and I wanted to be sure :)

Mike Hadlow said...

Redbeard, I stand corrected! That's really cool and shows how Python can do this in a far more concise style than C#.

Thanks again for the example and for taking the time to put me right :p

Anonymous said...

Mike: not a problem. I enjoyed the challenge (and learning more about C# and Python :)

BTW, I'd like to say thanks for all the great WCF/Windsor stuff. It's helped me enormously as we've just recently switched to .NET. Now if only we can figure out how to get it all working on Rackspace Cloud.

Richard OD said...

Interesting set of comments. I'd do it a totally different way if I wanted to do it with LINQ (ignoring the debate about code readability)(apologies for the formatting):

string csv =
@"1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20";

var res = csv.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.Select(line => line.Split(',')
.Select(number => int.Parse(number)).ToList()).ToList();

Anonymous said...

I'm late to the party, but I'd like to point out that any call to Aggregate() with a List as accumulator can be replaced by a simple Select() followed by ToList(), which is much more readable and doesn't require an Append() extension method to be defined.