Django 2.0 was released recently and among the most exciting things for me is support for Window expressions, which allows adding an OVER clause to querysets. We will use Window expressions to analyze the commits data to the Django repo.
So what is an over clause?
An over clause is of this format
SELECT depname, empno, salary,
avg(salary)
OVER (PARTITION BY depname)
FROM empsalary;
Compare this to a similar GROUP BY statement
SELECT depname, avg(salary)
FROM empsalary
GROUP BY depname;
The difference is a GROUP BY has as many rows as grouping elements, here number of depname. An over clause adds the the aggregated result to each row of the select.
Postgres documentation says, "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is true for all other DB implementation as well.
What are real world uses of over expressions?
We will use the Django ORM with the Window expression to to some analysis on the most prolific committers to Django. To do this we will export the commiter names and time of commit to a csv.
git log --no-merges --date=iso --pretty=format:'%h|%an|%aI' > commits.iso.csv
This is not ranking of Django developers, just of their number of commits, which allows us an interestig dataset. I am grateful to everyone who has contributed to Django - they have made my life immesureably better.
With some light data wrangling using Pandas, we transform this to a per author, per year data and import to Postgres. Our table structure looks like this.
experiments=# \d commits_by_year;
Table "public.commits_by_year"
Column | Type | Modifiers
---------------+---------+-----------
id | bigint |
author | text |
commit_year | integer |
commits_count | integer |
We define a model to interact with this table.
from django.db import models
class Committer(models.Model):
author = models.CharField(max_length=100)
commit_year = models.PositiveIntegerField()
commits_count = models.PositiveIntegerField()
class Meta:
db_table = 'commits_by_year'
Lets quickly test if our data is imported. You can get a csv from here, and import to Postgres to follow along.
In [2]: Committer.objects.all().count()
Out[2]: 2318
Let us setup our environment and get the imports we need.
## Some ORM imports which we are going to need
from django.db.models import Avg, F, Window
from django.db.models.functions import Rank, DenseRank, CumeDist
from django_commits.models import Committer
# We will use pandas to display the queryset in tanular format
import pandas
pandas.options.display.max_rows=20
# An utility function to display querysets
def as_table(values_queryset):
return pandas.DataFrame(list(values_queryset))
Lets quickly look at the data we have.
as_table(Committer.objects.all().values(
"author", "commit_year", "commits_count"
))
author | commit_year | commits_count | |
---|---|---|---|
0 | Tim Graham | 2017 | 373 |
1 | Sergey Fedoseev | 2017 | 158 |
2 | Mariusz Felisiak | 2017 | 113 |
3 | Claude Paroz | 2017 | 102 |
4 | Mads Jensen | 2017 | 55 |
5 | Simon Charette | 2017 | 40 |
6 | Jon Dufresne | 2017 | 33 |
7 | Anton Samarchyan | 2017 | 27 |
8 | François Freitag | 2017 | 17 |
9 | Srinivas Reddy Thatiparthy | 2017 | 14 |
... | ... | ... | ... |
2308 | Malcolm Tredinnick | 2006 | 175 |
2309 | Georg Bauer | 2006 | 90 |
2310 | Russell Keith-Magee | 2006 | 86 |
2311 | Jacob Kaplan-Moss | 2006 | 83 |
2312 | Luke Plant | 2006 | 20 |
2313 | Wilson Miner | 2006 | 12 |
2314 | Adrian Holovaty | 2005 | 1015 |
2315 | Jacob Kaplan-Moss | 2005 | 130 |
2316 | Georg Bauer | 2005 | 112 |
2317 | Wilson Miner | 2005 | 20 |
2318 rows × 3 columns
We will now use the Window expression to get the contributors ranked by number of commits, within each year. We will go over the code in detail, but lets look at the queryset and results.
# Find out who have been the most prolific contributors
# in the years 2010-2017
dense_rank_by_year = Window(
expression=DenseRank(),
partition_by=F("commit_year"),
order_by=F("commits_count").desc()
)
commiters_with_rank = Committer.objects.filter(
commit_year__gte=2010, commits_count__gte=10
).annotate(
the_rank=dense_rank_by_year
).order_by(
"-commit_year", "the_rank"
).values(
"author", "commit_year", "commits_count", "the_rank"
)
as_table(commiters_with_rank)
author | commit_year | commits_count | the_rank | |
---|---|---|---|---|
0 | Tim Graham | 2017 | 373 | 1 |
1 | Sergey Fedoseev | 2017 | 158 | 2 |
2 | Mariusz Felisiak | 2017 | 113 | 3 |
3 | Claude Paroz | 2017 | 102 | 4 |
4 | Mads Jensen | 2017 | 55 | 5 |
5 | Simon Charette | 2017 | 40 | 6 |
6 | Jon Dufresne | 2017 | 33 | 7 |
7 | Anton Samarchyan | 2017 | 27 | 8 |
8 | François Freitag | 2017 | 17 | 9 |
9 | Srinivas Reddy Thatiparthy | 2017 | 14 | 10 |
... | ... | ... | ... | ... |
171 | Joseph Kocherhans | 2010 | 53 | 11 |
172 | Ramiro Morales | 2010 | 53 | 11 |
173 | Jacob Kaplan-Moss | 2010 | 42 | 12 |
174 | Chris Beaven | 2010 | 29 | 13 |
175 | Malcolm Tredinnick | 2010 | 26 | 14 |
176 | Honza Král | 2010 | 20 | 15 |
177 | Carl Meyer | 2010 | 17 | 16 |
178 | Ian Kelly | 2010 | 17 | 16 |
179 | Simon Meers | 2010 | 11 | 17 |
180 | Gary Wilson Jr | 2010 | 10 | 18 |
181 rows × 4 columns
Lets look a the the ORM code in more detail here.
# We are creating the Window function part of our SQL query here
dense_rank_by_year = Window(
# We want to get the Rank with no gaps
expression=DenseRank(),
# We want to partition the queryset on commit_year
# Each distinct commit_year is a different partition
partition_by=F("commit_year"),
# This decides the ordering within each partition
order_by=F("commits_count").desc()
)
commiters_with_rank = Committer.objects.filter(
commit_year__gte=2010, commits_count__gte=10
# Standard filter oprtation, limit rows to 2010-2017
).annotate(
# For each commiter, we are annotating its rank
the_rank=dense_rank_by_year
).order_by(
"-commit_year", "the_rank"
).values(
"author", "commit_year", "commits_count", "the_rank"
)
as_table(commiters_with_rank)
Now lets try getting the average commits per commiter for each year along with the other data.
avg_commits_per_year = Window(
# We want the average of commits per committer, with each partition
expression=Avg("commits_count"),
# Each individual year is a partition.
partition_by=F("commit_year")
)
commiters_with_yearly_average = Committer.objects.filter().annotate(
avg_commit_per_year=avg_commits_per_year
).values(
"author", "commit_year", "commits_count", "avg_commit_per_year"
)
# We could have done further operation with avg_commit_per_year
# Eg: F(commits_count) - F(avg_commit_per_year),
# would tell us committers who commit more than average
as_table(commiters_with_yearly_average)
This gives us
author | avg_commit_per_year | commit_year | commits_count | |
---|---|---|---|---|
0 | Wilson Miner | 319.250000 | 2005 | 20 |
1 | Adrian Holovaty | 319.250000 | 2005 | 1015 |
2 | Jacob Kaplan-Moss | 319.250000 | 2005 | 130 |
3 | Georg Bauer | 319.250000 | 2005 | 112 |
4 | Russell Keith-Magee | 188.571429 | 2006 | 86 |
5 | Jacob Kaplan-Moss | 188.571429 | 2006 | 83 |
6 | Luke Plant | 188.571429 | 2006 | 20 |
7 | Wilson Miner | 188.571429 | 2006 | 12 |
8 | Adrian Holovaty | 188.571429 | 2006 | 854 |
9 | Malcolm Tredinnick | 188.571429 | 2006 | 175 |
... | ... | ... | ... | ... |
2308 | Adam Johnson | 4.916084 | 2017 | 13 |
2309 | Tom | 4.916084 | 2017 | 13 |
2310 | Srinivas Reddy Thatiparthy | 4.916084 | 2017 | 14 |
2311 | François Freitag | 4.916084 | 2017 | 17 |
2312 | Anton Samarchyan | 4.916084 | 2017 | 27 |
2313 | Jon Dufresne | 4.916084 | 2017 | 33 |
2314 | Simon Charette | 4.916084 | 2017 | 40 |
2315 | Mads Jensen | 4.916084 | 2017 | 55 |
2316 | Claude Paroz | 4.916084 | 2017 | 102 |
2317 | Mariusz Felisiak | 4.916084 | 2017 | 113 |
2318 rows × 4 columns
You could try other Window functions such as CumeDist, Rank or Ntile.
from django.db.models.functions import CumeDist
cumedist_by_year = Window(
expression=CumeDist(),
partition_by=F("commit_year"),
order_by=F("commits_count").desc()
)
commiters_with_rank = Committer.objects.filter(
commit_year__gte=2010, commits_count__gte=10
).annotate(
cumedist_by_year=cumedist_by_year
).order_by(
"-commit_year", "the_rank"
).values(
"author", "commit_year", "commits_count", "cumedist_by_year"
)
as_table(commiters_with_rank)
Until now, we have partitioned on commit_year. We can partition on other fields too. We will partition on author to find out how their contributions have changed over the years using the Lag window expression.
from django.db.models.functions import Lag
from django.db.models import Value
commits_in_previous_year = Window(
expression=Lag("commits_count", default=Value(0)),
partition_by=F("author"),
order_by=F("commit_year").asc(),
)
commiters_with_pervious_year_commit = Committer.objects.filter(
commit_year__gte=2010, commits_count__gte=10
).annotate(
commits_in_previous_year=commits_in_previous_year
).order_by(
"author", "-commit_year"
).values(
"author", "commit_year", "commits_count", "commits_in_previous_year"
)
as_table(commiters_with_pervious_year_commit)
author | commit_year | commits_count | commits_in_previous_year | |
---|---|---|---|---|
0 | Adam Chainz | 2016 | 42 | 12 |
1 | Adam Chainz | 2015 | 12 | 0 |
2 | Adam Johnson | 2017 | 13 | 0 |
3 | Adrian Holovaty | 2012 | 40 | 98 |
4 | Adrian Holovaty | 2011 | 98 | 72 |
5 | Adrian Holovaty | 2010 | 72 | 0 |
6 | Akshesh | 2016 | 31 | 0 |
7 | Alasdair Nicol | 2016 | 13 | 19 |
8 | Alasdair Nicol | 2015 | 19 | 17 |
9 | Alasdair Nicol | 2013 | 17 | 0 |
... | ... | ... | ... | ... |
171 | Timo Graham | 2012 | 13 | 70 |
172 | Timo Graham | 2011 | 70 | 60 |
173 | Timo Graham | 2010 | 60 | 0 |
174 | Tom | 2017 | 13 | 0 |
175 | Unai Zalakain | 2013 | 17 | 0 |
176 | Vajrasky Kok | 2013 | 14 | 0 |
177 | areski | 2014 | 15 | 0 |
178 | eltronix | 2016 | 10 | 0 |
179 | wrwrwr | 2014 | 21 | 0 |
180 | Łukasz Langa | 2013 | 15 | 0 |
181 rows × 4 columns
I hope this tutorial has been helpful in understanding the window expressions. While still not as felxible as SqlAlchemy, Django ORM has become extremely powerful with recent Django releases. Stay tuned for more advanced ORM tutorials.
Thank you for reading the Agiliq blog. This article was written by shabda on Dec 6, 2017 in django .
You can subscribe ⚛ to our blog.
We love building amazing apps for web and mobile for our clients. If you are looking for development help, contact us today ✉.
Would you like to download 10+ free Django and Python books? Get them here