Agenda

This post is targeted towards Django developers.

Pandas is an extremely powerful data analysis library. It can significantly boost the efficiency of your Django application when data analysis is involved.

Pandas can be used in following Django scenarios:

  • Visualizing the tabular data to ensure ORM queries are correct.
  • Gaining speed improvements on reporting dashboards.
  • Answering stakeholder's queries quickly and effortlessly.

We will discuss these 3 scenarios in detail.

Pandas primer

Before we proceed with the three scenarios, we need to load a database table into pandas. Hang on, it's a short section.

Let's use the following Django models throughout this post:

class Question(models.Model):
    question_text = models.CharField(max_length=200)

    def __str__(self):
        return self.question_text


class Choice(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    choice_text = models.CharField(max_length=200)
    votes = models.IntegerField(default=0)

    def __str__(self):
        return "Choice: {}, Question:{}".format(self.choice_text, self.question.question_text)

We can load all questions in a Pandas object in the following way:

In [14]: import pandas as pd

In [15]: questions_orm = Question.objects.values()

In [16]: questions_pandas = pd.DataFrame(questions_orm)

Let's see what questions_pandas look like:

In [19]: questions_pandas
Out[19]:
   id                          question_text
0  56               Is the color of sky blue
1  57  Is Samsung more reliable than iPhone?
2  59    Who is your favorite GoT character?
3  60    Do you love charles dickens novels?
4  61               Is the color of sky blue

pandas has a data-structure called DataFrame. questions_pandas is a pandas DataFrame instance.

Let's load all choices in a Pandas dataframe intance.

In [20]: choices_pandas = pd.DataFrame(Choice.objects.values())

In [21]: choices_pandas
Out[21]:
   id  question_id choice_text  votes
0  57           56       maybe      0
1  55           56         yes      4
2  58           59         jon      2
3  59           60         yes      1
4  54           59         ned      3
5  56           56          no      2
6  60           61        sure      0

Visualizing the tabular data

Simple ORM queries are straightforward. We are certain that Question.objects.all() or Question.objects.filter(question_text='something') will do the correct thing.

Advanced ORM queries which use annotate spanned over relationships or those which use Q objects aren't so straigtforward. While writing them, I want to confirm that they are giving the intended output. I guess every Django developer finds himself/herself in this situation.

Seeing a tabular structure and running things manually can give additional confidence that our query is correct (especially with the awesomeness of shell_plus). We can see a tabular structure with SQL too but SQL queries are longer to write.

Pandas can show us tabular data. Pandas statements are shorter to write. Also pandas dataframes can be sliced, filtered, grouped, etc, and can be stored in variables which make further statements even shorter and data manipulation easy.

Suppose we are building a reporting dashboard and want the number of votes cast on each question. We also want to order this list with question having highest number of votes at top. Let's write the ORM query for this scenario.

In [29]: questions_orm = Question.objects.annotate(num_votes=Sum('choice__votes')).exclude(num_votes=0).order_by('-num_votes')

This query groups questions and find the number of votes cast on each question and order the queryset by highest number of votes.

In [30]: questions_orm[0].num_votes
Out[30]: 6

In [31]: questions_orm[0].question_text
Out[31]: 'Is the color of sky blue'

Let's verify our result with pandas. We can merge the question_pandas and choice_pandas and can easily visualize the data.

In [41]: question_and_choices = questions_pandas.merge(choices_pandas, left_on='id', right_on='question_id')

In [31]: question_and_choices
Out[42]:
   id_x                        question_text  id_y  question_id choice_text  votes
0    56             Is the color of sky blue    57           56       maybe      0
1    56             Is the color of sky blue    55           56         yes      4
2    56             Is the color of sky blue    56           56          no      2
3    59  Who is your favorite GoT character?    58           59         jon      2
4    59  Who is your favorite GoT character?    54           59         ned      3
5    60  Do you love charles dickens novels?    59           60         yes      1
6    61             Is the color of sky blue    60           61        sure      0

The pandas merge operation is same as a database join but pandas merge is shorter than SQL join statement.

The top three rows of the tabular data shown above are for question_id 56 as is evident from question_id column. We can add the values of votes column and can see that it is 6.

That's how the pandas result gave us additional confidence that our queryset is behaving as intended.

The equivalent SQL join would be:

bombardill=# select * from polls_question inner join polls_choice on polls_question.id=polls_choice.question_id;
 id |            question_text            |         pub_date          | author  | id | choice_text | votes | question_id
----+-------------------------------------+---------------------------+---------+----+-------------+-------+-------------
 56 | Is the color of sky blue            | 2019-05-15 05:30:00+05:30 | roald   | 57 | maybe       |     0 |          56
 56 | Is the color of sky blue            | 2019-05-15 05:30:00+05:30 | roald   | 55 | yes         |     4 |          56
 59 | Who is your favorite GoT character? |                           |         | 58 | jon         |     2 |          59
 60 | Do you love charles dickens novels? |                           | tolkien | 59 | yes         |     1 |          60
 59 | Who is your favorite GoT character? |                           |         | 54 | ned         |     3 |          59
 56 | Is the color of sky blue            | 2019-05-15 05:30:00+05:30 | roald   | 56 | no          |     2 |          56
 61 | Is the color of sky blue            |                           |         | 60 | sure        |     0 |          61
(7 rows)

As you would have noticed, the SQL statement was longer than the equivalent pandas statement.

Gaining speed improvements

Suppose we are building a reporting dashboard for Question and Choices. We want to allow filtering and ordering in the dashboard.

In an ordinary view, we would run a ORM query everytime ordering on a column is requested. Say first request is order by num_votes ascending.

Question.objects.annotate(num_votes=Sum('choice__votes')).exclude(num_votes=0).order_by('num_votes')

Next request is order by num_votes descending. We would again execute an ORM query which would hit the database.

Question.objects.annotate(num_votes=Sum('choice__votes')).exclude(num_votes=0).order_by('-num_votes')

Instead of this, we can cache the question_and_choices dataframe. And just use the dataframe to order differently based on ordering parameter. This would avoid database calls.

This would make our views respond faster. This assumes that we update the cache everytime the database tables are updated or rows are inserted in the database table.

This can be done with a caching solution like Redis too but Pandas is a possibility too.

Answering stakeholder's queries

Very often my stakeholders ask me to run some ad-hoc queries against the production database.

This would involve writing long ORM queries. Or short pandas statements once I load the data into a dataframe :)

eg: Find out the question with maximum number of votes. The ORM way would be:

In [58]: Question.objects.annotate(num_votes=Sum('choice__votes')).exclude(num_votes=0).order_by('-num_votes')[0].question_text
Out[58]: 'Is the color of sky blue'

Next question would be 'Find out the question with minimum number of votes.'

In [58]: Question.objects.annotate(num_votes=Sum('choice__votes')).exclude(num_votes=0).order_by('num_votes')[0].question_text
Out[58]: 'Do you love charles dickens novels?'

Next, 'Find out the question with no choices'

In [62]: Question.objects.annotate(num_choices=Count('choice')).filter(num_choices=0).values_list('question_text', flat=True)
Out[62]: <QuerySet ['Is Samsung more reliable than iPhone?']>

With a Pandas approach, I could load the data into a dataframe, do a join and can quickly answer the successive questions.

In [63]: questions_pandas = pd.DataFrame(Question.objects.values())

In [64]: choice_pandas = pd.DataFrame(Choice.objects.values())

In [66]: question_and_choices = questions_pandas.merge(choices_pandas, left_on='id', right_on='question_id')

eg: Find out the question with maximum number of votes. The pandas way would be:

In [79]: group_and_sum = question_and_choices.groupby(['id_x', 'question_text']).sum()

In [81]: group_and_sum.sort_values('votes', ascending=False)
Out[81]:
                                          id_y  question_id  votes
id_x question_text
56   Is the color of sky blue              168          168      6
59   Who is your favorite GoT character?   112          118      5
60   Do you love charles dickens novels?    59           60      1
61   Is the color of sky blue               60           61      0

Find out the question with minimum number of votes.

In [81]: group_and_sum.sort_values('votes', ascending=True)

Pandas could be a powerful weapon in your arsenal if you are a Python web developer. Learning the basics of Pandas to do grouping and aggregation shouldn't take more than a few hours.

The Essential Plug

UrbanPiper is a growing team and we are looking for smart engineers to work with us on problems in the sphere of data, scale and high-throughput integrations.
If you are interested to be a part of our team, feel free to write in to us at [email protected]