Django 2.0 Window expressions tutorial

By : Shabda Raaj

Django 2.0 was released recently and among the most exciting things for me is support for Window expressions, which allows adding an OVER clause to querysets. We will use Window expressions to analyze the commits data to the Django repo.

So what is an over clause?

An over clause is of this format

SELECT depname, empno, salary,
  avg(salary)
    OVER (PARTITION BY depname)
FROM empsalary;

Compare this to a similar GROUP BY statement

SELECT depname, avg(salary)
FROM empsalary
GROUP BY depname;

The difference is a GROUP BY has as many rows as grouping elements, here number of depname. An over clause adds the the aggregated result to each row of the select.

Postgres documentation says, "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is true for all other DB implementation as well.

What are real world uses of over expressions?

We will use the Django ORM with the Window expression to to some analysis on the most prolific committers to Django. To do this we will export the commiter names and time of commit to a csv.

git log  --no-merges --date=iso --pretty=format:'%h|%an|%aI' > commits.iso.csv

This is not ranking of Django developers, just of their number of commits, which allows us an interestig dataset. I am grateful to everyone who has contributed to Django - they have made my life immesureably better.

With some light data wrangling using Pandas, we transform this to a per author, per year data and import to Postgres. Our table structure looks like this.

experiments=# \d commits_by_year;
   Table "public.commits_by_year"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 id            | bigint  |
 author        | text    |
 commit_year   | integer |
 commits_count | integer |

We define a model to interact with this table.

from django.db import models


class Committer(models.Model):
    author = models.CharField(max_length=100)
    commit_year = models.PositiveIntegerField()
    commits_count = models.PositiveIntegerField()

    class Meta:
        db_table = 'commits_by_year'

Lets quickly test if our data is imported. You can get a csv from here, and import to Postgres to follow along.

In [2]: Committer.objects.all().count()
Out[2]: 2318

Let us setup our environment and get the imports we need.

## Some ORM imports which we are going to need

from django.db.models import Avg, F, Window
from django.db.models.functions import  Rank, DenseRank, CumeDist
from django_commits.models import Committer

# We will use pandas to display the queryset in tanular format
import pandas
pandas.options.display.max_rows=20

# An utility function to display querysets
def as_table(values_queryset):
    return pandas.DataFrame(list(values_queryset))

Lets quickly look at the data we have.

as_table(Committer.objects.all().values(
  "author", "commit_year", "commits_count"
))
author commit_year commits_count
0 Tim Graham 2017 373
1 Sergey Fedoseev 2017 158
2 Mariusz Felisiak 2017 113
3 Claude Paroz 2017 102
4 Mads Jensen 2017 55
5 Simon Charette 2017 40
6 Jon Dufresne 2017 33
7 Anton Samarchyan 2017 27
8 François Freitag 2017 17
9 Srinivas Reddy Thatiparthy 2017 14
... ... ... ...
2308 Malcolm Tredinnick 2006 175
2309 Georg Bauer 2006 90
2310 Russell Keith-Magee 2006 86
2311 Jacob Kaplan-Moss 2006 83
2312 Luke Plant 2006 20
2313 Wilson Miner 2006 12
2314 Adrian Holovaty 2005 1015
2315 Jacob Kaplan-Moss 2005 130
2316 Georg Bauer 2005 112
2317 Wilson Miner 2005 20

2318 rows × 3 columns

We will now use the Window expression to get the contributors ranked by number of commits, within each year. We will go over the code in detail, but lets look at the queryset and results.

# Find out who have been the most prolific contributors
# in the years 2010-2017

dense_rank_by_year = Window(
    expression=DenseRank(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)
author commit_year commits_count the_rank
0 Tim Graham 2017 373 1
1 Sergey Fedoseev 2017 158 2
2 Mariusz Felisiak 2017 113 3
3 Claude Paroz 2017 102 4
4 Mads Jensen 2017 55 5
5 Simon Charette 2017 40 6
6 Jon Dufresne 2017 33 7
7 Anton Samarchyan 2017 27 8
8 François Freitag 2017 17 9
9 Srinivas Reddy Thatiparthy 2017 14 10
... ... ... ... ...
171 Joseph Kocherhans 2010 53 11
172 Ramiro Morales 2010 53 11
173 Jacob Kaplan-Moss 2010 42 12
174 Chris Beaven 2010 29 13
175 Malcolm Tredinnick 2010 26 14
176 Honza Král 2010 20 15
177 Carl Meyer 2010 17 16
178 Ian Kelly 2010 17 16
179 Simon Meers 2010 11 17
180 Gary Wilson Jr 2010 10 18

181 rows × 4 columns

Lets look a the the ORM code in more detail here.

# We are creating the Window function part of our SQL query here
dense_rank_by_year = Window(
    # We want to get the Rank with no gaps
    expression=DenseRank(),
    # We want to partition the queryset on commit_year
    # Each distinct commit_year is a different partition
    partition_by=F("commit_year"),
    # This decides the ordering within each partition
    order_by=F("commits_count").desc()
)


commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    # Standard filter oprtation, limit rows to 2010-2017
    ).annotate(
    # For each commiter, we are annotating its rank
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)

Now lets try getting the average commits per commiter for each year along with the other data.

avg_commits_per_year = Window(
    # We want the average of commits per committer, with each partition
    expression=Avg("commits_count"),
    # Each individual year is a partition.
    partition_by=F("commit_year")
)

commiters_with_yearly_average = Committer.objects.filter().annotate(
      avg_commit_per_year=avg_commits_per_year
    ).values(
        "author", "commit_year", "commits_count", "avg_commit_per_year"
    )
# We could have done further operation with avg_commit_per_year
# Eg: F(commits_count) - F(avg_commit_per_year),
# would tell us committers who commit more than average
as_table(commiters_with_yearly_average)

This gives us

author avg_commit_per_year commit_year commits_count
0 Wilson Miner 319.250000 2005 20
1 Adrian Holovaty 319.250000 2005 1015
2 Jacob Kaplan-Moss 319.250000 2005 130
3 Georg Bauer 319.250000 2005 112
4 Russell Keith-Magee 188.571429 2006 86
5 Jacob Kaplan-Moss 188.571429 2006 83
6 Luke Plant 188.571429 2006 20
7 Wilson Miner 188.571429 2006 12
8 Adrian Holovaty 188.571429 2006 854
9 Malcolm Tredinnick 188.571429 2006 175
... ... ... ... ...
2308 Adam Johnson 4.916084 2017 13
2309 Tom 4.916084 2017 13
2310 Srinivas Reddy Thatiparthy 4.916084 2017 14
2311 François Freitag 4.916084 2017 17
2312 Anton Samarchyan 4.916084 2017 27
2313 Jon Dufresne 4.916084 2017 33
2314 Simon Charette 4.916084 2017 40
2315 Mads Jensen 4.916084 2017 55
2316 Claude Paroz 4.916084 2017 102
2317 Mariusz Felisiak 4.916084 2017 113

2318 rows × 4 columns

You could try other Window functions such as CumeDist, Rank or Ntile.

from django.db.models.functions import CumeDist
cumedist_by_year = Window(
    expression=CumeDist(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        cumedist_by_year=cumedist_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "cumedist_by_year"
    )
as_table(commiters_with_rank)

Until now, we have partitioned on commit_year. We can partition on other fields too. We will partition on author to find out how their contributions have changed over the years using the Lag window expression.

from django.db.models.functions import Lag
from django.db.models import Value
commits_in_previous_year = Window(
    expression=Lag("commits_count", default=Value(0)),
    partition_by=F("author"),
    order_by=F("commit_year").asc(),
)

commiters_with_pervious_year_commit = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        commits_in_previous_year=commits_in_previous_year
    ).order_by(
        "author", "-commit_year"
    ).values(
        "author", "commit_year", "commits_count", "commits_in_previous_year"
    )
as_table(commiters_with_pervious_year_commit)
author commit_year commits_count commits_in_previous_year
0 Adam Chainz 2016 42 12
1 Adam Chainz 2015 12 0
2 Adam Johnson 2017 13 0
3 Adrian Holovaty 2012 40 98
4 Adrian Holovaty 2011 98 72
5 Adrian Holovaty 2010 72 0
6 Akshesh 2016 31 0
7 Alasdair Nicol 2016 13 19
8 Alasdair Nicol 2015 19 17
9 Alasdair Nicol 2013 17 0
... ... ... ... ...
171 Timo Graham 2012 13 70
172 Timo Graham 2011 70 60
173 Timo Graham 2010 60 0
174 Tom 2017 13 0
175 Unai Zalakain 2013 17 0
176 Vajrasky Kok 2013 14 0
177 areski 2014 15 0
178 eltronix 2016 10 0
179 wrwrwr 2014 21 0
180 Łukasz Langa 2013 15 0

181 rows × 4 columns

I hope this tutorial has been helpful in understanding the window expressions. While still not as felxible as SqlAlchemy, Django ORM has become extremely powerful with recent Django releases. Stay tuned for more advanced ORM tutorials.


Related Posts


Can we help you build amazing apps? Contact us today.

Topics : django
© Agiliq, 2009-2012