Django 2.0 Window expressions tutorial

Django 2.0 was released recently and among the most exciting things for me is support for Window expressions, which allows adding an OVER clause to querysets. We will use Window expressions to analyze the commits data to the Django repo.

So what is an over clause?

An over clause is of this format

SELECT depname, empno, salary,
  avg(salary)
    OVER (PARTITION BY depname)
FROM empsalary;

Compare this to a similar GROUP BY statement

SELECT depname, avg(salary)
FROM empsalary
GROUP BY depname;

The difference is a GROUP BY has as many rows as grouping elements, here number of depname. An over clause adds the the aggregated result to each row of the select.

Postgres documentation says, "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is true for all other DB implementation as well.

What are real world uses of over expressions?

We will use the Django ORM with the Window expression to to some analysis on the most prolific committers to Django. To do this we will export the commiter names and time of commit to a csv.

git log  --no-merges --date=iso --pretty=format:'%h|%an|%aI' > commits.iso.csv

This is not ranking of Django developers, just of their number of commits, which allows us an interestig dataset. I am grateful to everyone who has contributed to Django - they have made my life immesureably better.

With some light data wrangling using Pandas, we transform this to a per author, per year data and import to Postgres. Our table structure looks like this.

experiments=# \d commits_by_year;
   Table "public.commits_by_year"
    Column     |  Type   | Modifiers
---------------+---------+-----------
 id            | bigint  |
 author        | text    |
 commit_year   | integer |
 commits_count | integer |

We define a model to interact with this table.

from django.db import models


class Committer(models.Model):
    author = models.CharField(max_length=100)
    commit_year = models.PositiveIntegerField()
    commits_count = models.PositiveIntegerField()

    class Meta:
        db_table = 'commits_by_year'

Lets quickly test if our data is imported. You can get a csv from here, and import to Postgres to follow along.

In [2]: Committer.objects.all().count()
Out[2]: 2318

Let us setup our environment and get the imports we need.

## Some ORM imports which we are going to need

from django.db.models import Avg, F, Window
from django.db.models.functions import  Rank, DenseRank, CumeDist
from django_commits.models import Committer

# We will use pandas to display the queryset in tanular format
import pandas
pandas.options.display.max_rows=20

# An utility function to display querysets
def as_table(values_queryset):
    return pandas.DataFrame(list(values_queryset))

Lets quickly look at the data we have.

as_table(Committer.objects.all().values(
  "author", "commit_year", "commits_count"
))

	author	commit_year	commits_count
0	Tim Graham	2017	373
1	Sergey Fedoseev	2017	158
2	Mariusz Felisiak	2017	113
3	Claude Paroz	2017	102
4	Mads Jensen	2017	55
5	Simon Charette	2017	40
6	Jon Dufresne	2017	33
7	Anton Samarchyan	2017	27
8	François Freitag	2017	17
9	Srinivas Reddy Thatiparthy	2017	14
...	...	...	...
2308	Malcolm Tredinnick	2006	175
2309	Georg Bauer	2006	90
2310	Russell Keith-Magee	2006	86
2311	Jacob Kaplan-Moss	2006	83
2312	Luke Plant	2006	20
2313	Wilson Miner	2006	12
2314	Adrian Holovaty	2005	1015
2315	Jacob Kaplan-Moss	2005	130
2316	Georg Bauer	2005	112
2317	Wilson Miner	2005	20

2318 rows × 3 columns

We will now use the Window expression to get the contributors ranked by number of commits, within each year. We will go over the code in detail, but lets look at the queryset and results.

# Find out who have been the most prolific contributors
# in the years 2010-2017

dense_rank_by_year = Window(
    expression=DenseRank(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)

	author	commit_year	commits_count	the_rank
0	Tim Graham	2017	373	1
1	Sergey Fedoseev	2017	158	2
2	Mariusz Felisiak	2017	113	3
3	Claude Paroz	2017	102	4
4	Mads Jensen	2017	55	5
5	Simon Charette	2017	40	6
6	Jon Dufresne	2017	33	7
7	Anton Samarchyan	2017	27	8
8	François Freitag	2017	17	9
9	Srinivas Reddy Thatiparthy	2017	14	10
...	...	...	...	...
171	Joseph Kocherhans	2010	53	11
172	Ramiro Morales	2010	53	11
173	Jacob Kaplan-Moss	2010	42	12
174	Chris Beaven	2010	29	13
175	Malcolm Tredinnick	2010	26	14
176	Honza Král	2010	20	15
177	Carl Meyer	2010	17	16
178	Ian Kelly	2010	17	16
179	Simon Meers	2010	11	17
180	Gary Wilson Jr	2010	10	18

181 rows × 4 columns

Lets look a the the ORM code in more detail here.

# We are creating the Window function part of our SQL query here
dense_rank_by_year = Window(
    # We want to get the Rank with no gaps
    expression=DenseRank(),
    # We want to partition the queryset on commit_year
    # Each distinct commit_year is a different partition
    partition_by=F("commit_year"),
    # This decides the ordering within each partition
    order_by=F("commits_count").desc()
)


commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    # Standard filter oprtation, limit rows to 2010-2017
    ).annotate(
    # For each commiter, we are annotating its rank
        the_rank=dense_rank_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "the_rank"
    )
as_table(commiters_with_rank)

Now lets try getting the average commits per commiter for each year along with the other data.

avg_commits_per_year = Window(
    # We want the average of commits per committer, with each partition
    expression=Avg("commits_count"),
    # Each individual year is a partition.
    partition_by=F("commit_year")
)

commiters_with_yearly_average = Committer.objects.filter().annotate(
      avg_commit_per_year=avg_commits_per_year
    ).values(
        "author", "commit_year", "commits_count", "avg_commit_per_year"
    )
# We could have done further operation with avg_commit_per_year
# Eg: F(commits_count) - F(avg_commit_per_year),
# would tell us committers who commit more than average
as_table(commiters_with_yearly_average)

This gives us

	author	avg_commit_per_year	commit_year	commits_count
0	Wilson Miner	319.250000	2005	20
1	Adrian Holovaty	319.250000	2005	1015
2	Jacob Kaplan-Moss	319.250000	2005	130
3	Georg Bauer	319.250000	2005	112
4	Russell Keith-Magee	188.571429	2006	86
5	Jacob Kaplan-Moss	188.571429	2006	83
6	Luke Plant	188.571429	2006	20
7	Wilson Miner	188.571429	2006	12
8	Adrian Holovaty	188.571429	2006	854
9	Malcolm Tredinnick	188.571429	2006	175
...	...	...	...	...
2308	Adam Johnson	4.916084	2017	13
2309	Tom	4.916084	2017	13
2310	Srinivas Reddy Thatiparthy	4.916084	2017	14
2311	François Freitag	4.916084	2017	17
2312	Anton Samarchyan	4.916084	2017	27
2313	Jon Dufresne	4.916084	2017	33
2314	Simon Charette	4.916084	2017	40
2315	Mads Jensen	4.916084	2017	55
2316	Claude Paroz	4.916084	2017	102
2317	Mariusz Felisiak	4.916084	2017	113

2318 rows × 4 columns

You could try other Window functions such as CumeDist, Rank or Ntile.

from django.db.models.functions import CumeDist
cumedist_by_year = Window(
    expression=CumeDist(),
    partition_by=F("commit_year"),
    order_by=F("commits_count").desc()
)

commiters_with_rank = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        cumedist_by_year=cumedist_by_year
    ).order_by(
        "-commit_year", "the_rank"
    ).values(
        "author", "commit_year", "commits_count", "cumedist_by_year"
    )
as_table(commiters_with_rank)

Until now, we have partitioned on commit_year. We can partition on other fields too. We will partition on author to find out how their contributions have changed over the years using the Lag window expression.

from django.db.models.functions import Lag
from django.db.models import Value
commits_in_previous_year = Window(
    expression=Lag("commits_count", default=Value(0)),
    partition_by=F("author"),
    order_by=F("commit_year").asc(),
)

commiters_with_pervious_year_commit = Committer.objects.filter(
        commit_year__gte=2010, commits_count__gte=10
    ).annotate(
        commits_in_previous_year=commits_in_previous_year
    ).order_by(
        "author", "-commit_year"
    ).values(
        "author", "commit_year", "commits_count", "commits_in_previous_year"
    )
as_table(commiters_with_pervious_year_commit)

	author	commit_year	commits_count	commits_in_previous_year
0	Adam Chainz	2016	42	12
1	Adam Chainz	2015	12	0
2	Adam Johnson	2017	13	0
3	Adrian Holovaty	2012	40	98
4	Adrian Holovaty	2011	98	72
5	Adrian Holovaty	2010	72	0
6	Akshesh	2016	31	0
7	Alasdair Nicol	2016	13	19
8	Alasdair Nicol	2015	19	17
9	Alasdair Nicol	2013	17	0
...	...	...	...	...
171	Timo Graham	2012	13	70
172	Timo Graham	2011	70	60
173	Timo Graham	2010	60	0
174	Tom	2017	13	0
175	Unai Zalakain	2013	17	0
176	Vajrasky Kok	2013	14	0
177	areski	2014	15	0
178	eltronix	2016	10	0
179	wrwrwr	2014	21	0
180	Łukasz Langa	2013	15	0

181 rows × 4 columns

I hope this tutorial has been helpful in understanding the window expressions. While still not as felxible as SqlAlchemy, Django ORM has become extremely powerful with recent Django releases. Stay tuned for more advanced ORM tutorials.

Buy Django ORM Cookbook

Thank you for reading the Agiliq blog. This article was written by shabda on Dec 6, 2017 in django .

You can subscribe ⚛ to our blog.

We love building amazing apps for web and mobile for our clients. If you are looking for development help, contact us today ✉.

Would you like to download 10+ free Django and Python books? Get them here