Some experiences optimizing database queries in Django

An ORM saves us a lot of work: we barely need to write SQL, most CRUD operations come for free, and foreign key associations are handled automatically, which makes development very efficient. Among ORMs I think Django's is excellent, especially the built-in Django admin, which saves a lot of work and is better than many companies' internally developed back-office interfaces.

But an ORM also brings problems. The most serious is that foreign key associations are fetched automatically, which makes it very easy to write N + 1 queries; and if you use django-rest-framework, whose Serializers happily render associated foreign keys for you, it becomes even easier. For this reason, when I worked at Ant, the company essentially forbade ORMs with automatic foreign key fetching such as Hibernate (some small internal projects could still use them): you had to write the SQL mappings by hand and do the joins yourself. That does solve the problem thoroughly. But the problem is also easy to solve another way: just profile a request to see which SQL queries were issued to serve it, and if you find an N + 1 pattern, fix it.

This article describes some ways to debug and optimize database queries in Django; similar tools exist for other languages and frameworks.

Let's start from the beginning and think about how to improve the performance of a website. The part closest to the user is the front-end code, so we can start profiling there: which page takes longest to render, whether requests are blocking, whether some front-end components are written poorly. For an important user-facing page, you can systematically use monitoring tools to find performance problems. For an internal system, simply clicking through each page a few times is usually enough: if nothing feels laggy, it is basically fine.

For internal systems, unless the front end is written surprisingly badly, performance problems are generally caused by slow APIs. So this article will not say much about front-end performance optimization.

Slow API

Discovering time-consuming APIs

Django has a Prometheus exporter library, django-prometheus, which you can use to expose metrics and then plot the P99/P95 latency of each view on Grafana. This reveals the slow APIs, which you can then target and debug to find out what exactly is slow.
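Wiring up django-prometheus is roughly the following (a minimal sketch; check the library's documentation for your version):

# settings.py -- expose request/latency metrics via django-prometheus
INSTALLED_APPS += ["django_prometheus"]
MIDDLEWARE = (
    ["django_prometheus.middleware.PrometheusBeforeMiddleware"]
    + MIDDLEWARE
    + ["django_prometheus.middleware.PrometheusAfterMiddleware"]
)

# urls.py -- Prometheus scrapes the /metrics endpoint exposed here
from django.urls import include, path

urlpatterns += [path("", include("django_prometheus.urls"))]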

Debug

Django's django-debug-toolbar is a very useful tool. After installing it, configure the IPs allowed to see the toolbar; when you access the site from one of those IPs, it automatically profiles the whole request, including the cache, templates, signals, and so on. The most useful panel is the SQL profile.
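The debug IPs are configured via INTERNAL_IPS; a minimal sketch of the setup (see the django-debug-toolbar documentation for the full installation steps):

# settings.py -- the toolbar is only shown to requests coming from these IPs
INSTALLED_APPS += ["debug_toolbar"]
MIDDLEWARE += ["debug_toolbar.middleware.DebugToolbarMiddleware"]
INTERNAL_IPS = ["127.0.0.1"]  # add your own IP to see the toolbar

# urls.py
from django.urls import include, path

urlpatterns += [path("__debug__/", include("debug_toolbar.urls"))]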

The SQL panel lists all the SQL statements involved in a request, including:

  • How much time each SQL statement took
  • Which line of code triggered the SQL, similar to a traceback
  • How many similar and how many duplicate SQL statements there are (duplicates usually indicate a problem: the same SQL was executed multiple times)
  • Clicking Expl shows the EXPLAIN output of the SQL
  • Clicking Sel shows the details of the SQL statement

As shown below:

[Screenshot: django-debug-toolbar SQL panel]

Then you can optimize for the slow API.

Solutions

N + 1 Query

N + 1 queries are the most common cause of slow APIs. For example, if you use django-rest-framework's nested relationships, the actual queries will be:

  1. first query the list of items to be displayed on the current list page
  2. for each item, look up its tags

This is an N + 1 query.
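For concreteness, here is a hypothetical Item/Tag setup (these models and field names are illustrative, not from the original scenario) that produces exactly this pattern when django-rest-framework renders the nested tags:

from django.db import models
from rest_framework import serializers, viewsets

class Tag(models.Model):
    name = models.CharField(max_length=30)

class Item(models.Model):
    name = models.CharField(max_length=50)
    tags = models.ManyToManyField(Tag)

class TagSerializer(serializers.ModelSerializer):
    class Meta:
        model = Tag
        fields = ["id", "name"]

class ItemSerializer(serializers.ModelSerializer):
    tags = TagSerializer(many=True, read_only=True)  # nested relationship

    class Meta:
        model = Item
        fields = ["id", "name", "tags"]

class ItemViewSet(viewsets.ReadOnlyModelViewSet):
    # one query for the item list, then one query per item when the
    # serializer renders its tags -> N + 1
    queryset = Item.objects.all()
    serializer_class = ItemSerializer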

The solution is simple: use Django's prefetch_related(), which queries all of the foreign-key-related data at once using an IN clause and then does the "join" in memory in Python, so that only two queries are generated:

  1. first query the list of items to be displayed on the list page
  2. query all the related tags at once, using an IN clause over the ids collected in step 1
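With the hypothetical ItemViewSet above, the fix is just to prefetch the tags on the ViewSet's queryset:

class ItemViewSet(viewsets.ReadOnlyModelViewSet):
    # the tags for all items on the page are fetched up front with one IN
    # query, instead of one query per item
    queryset = Item.objects.prefetch_related("tags")
    serializer_class = ItemSerializer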

Refer to this example from the official documentation: each Pizza has different Toppings.

 
from django.db import models

class Topping(models.Model):
    name = models.CharField(max_length=30)

class Pizza(models.Model):
    name = models.CharField(max_length=50)
    toppings = models.ManyToManyField(Topping)

    def __str__(self):
        return "%s (%s)" % (
            self.name,
            ", ".join(topping.name for topping in self.toppings.all()),
        )

The following is an N + 1 query: it looks up all the Pizzas, and then for each Pizza looks up its Toppings.

 
>>> Pizza.objects.all()
["Hawaiian (ham, pineapple)", "Seafood (prawns, smoked salmon)"...

If you use prefetch, there will be only 3 queries (because it is a many-to-many relationship, so one of them will be for the intermediate table).

 
>>> Pizza.objects.all().prefetch_related('toppings')

Note that only .all() can be used

prefetch_related() actually caches a QuerySet, so if you change the query, Django has to issue new queries and the prefetch cache is not used. In the following usage the cache is bypassed:

 
>>> pizzas = Pizza.objects.prefetch_related('toppings')
>>> [list(pizza.toppings.filter(spicy=True)) for pizza in pizzas]

Since the prefetch has already loaded all the data into memory, we should filter in Python instead, so that no new queries are triggered:

 
>>> pizzas = Pizza.objects.prefetch_related('toppings')
>>> [pizza for pizza in pizzas
...     if any(topping.spicy for topping in pizza.toppings.all())]

Prefetch()

prefetch_related() accepts either strings or Prefetch() objects as arguments; a Prefetch() object lets you control the cached queryset more precisely. For example, to sort the prefetched toppings:

 
>>> Restaurant.objects.prefetch_related(
...     Prefetch('pizzas__toppings', queryset=Topping.objects.order_by('name')))

It is also possible to prefetch multiple relations at once (the order matters, see the documentation), and Prefetch() objects and strings can be mixed:

 
>>> Restaurant.objects.prefetch_related(
...     Prefetch('pizzas__toppings', queryset=Topping.objects.order_by('name')),
...     'address')

Many-to-many and nested foreign keys

For nested foreign keys, the Model attribute names can be concatenated with __, like this:

 
>>> Restaurant.objects.prefetch_related('pizzas__toppings')

This way both pizzas and toppings will be prefetched.

select_related()

select_related() is a function with a similar effect, but it differs from prefetch_related():

  • prefetch_related(): issues a separate query for the related objects and does the "joining" in Python; it also works for many-to-many and reverse foreign key relations
  • select_related(): follows the relation with an SQL JOIN, so the related objects come back in the same query; it only works for single-valued relations (foreign key and one-to-one)

Obviously, select_related() triggers fewer queries and solves the problem in a single query, but it is more limited in the kinds of relations it supports.
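A minimal sketch, assuming for illustration that Pizza also had a restaurant foreign key (which is not part of the example above):

# select_related(): one query with a JOIN; the related Restaurant rows come
# back together with the pizzas.
pizzas = Pizza.objects.select_related("restaurant")
for pizza in pizzas:
    print(pizza.restaurant.name)  # no extra query, the JOIN already loaded it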

prefetch_related_objects()

The two methods above work on a QuerySet; if you already have objects in hand, you can use this function instead.

For example, if we want to fetch the data associated with the most recent order, we can do this:

 
# `models` here is django.db.models, which provides prefetch_related_objects()
latest_order = self.orders.order_by("-id").first()
models.prefetch_related_objects(
    [latest_order],
    "item",
    "user_address",
)

Cached Property

Django provides a useful decorator, @cached_property, which replaces @property so that the property is computed only once per object; subsequent reads on the same object return the cached value.

It's similar to Python's @functools.lru_cache.
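A minimal sketch of how it is used (the Order/OrderItem models here are hypothetical):

from django.db import models
from django.utils.functional import cached_property

class Order(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)

    @cached_property
    def total_price(self):
        # The aggregation query runs only on the first access; later reads on
        # the same Order instance return the cached value.
        return self.items.aggregate(total=models.Sum("price"))["total"]

class OrderItem(models.Model):
    order = models.ForeignKey(Order, related_name="items", on_delete=models.CASCADE)
    price = models.DecimalField(max_digits=10, decimal_places=2)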

Reducing unnecessary display fields

Both caching and prefetching add some complexity. If the front end only uses some of the fields, there is no need to return everything at once.

When I first started writing Serializers with DRF, I tended to create one Serializer per Model and have these Serializers reference each other. In the end, querying a list page rendered every item's related data, and the data related to that data, all at once. Even after optimization this is hard to maintain.

A better practice I settled on later is to give each Model two Serializers:

  • ListSerializer: expands each foreign key only one level; does not expand the foreign keys of foreign keys
    • for rendering list pages
  • DetailSerializer: shows all foreign keys on demand
    • for rendering detail pages
    • for each row associated via a foreign key you may have to do another query to expand all of its associated foreign keys for display; since there is only one object this is not particularly slow, but you still need to watch out for N + 1, and if the nesting is too deep, consider not displaying everything at once and providing a separate API for it

The advantage is that we can prefetch on demand: the list API only needs to prefetch directly associated foreign keys, and the detail API can cascade prefetches as needed. The general principle is to avoid prefetching across multiple levels of foreign keys as much as possible.
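A sketch of what the two Serializers might look like (the Experiment, Run, and User models are hypothetical; the names match the get_serializer_class() example below):

from rest_framework import serializers

class OwnerSerializer(serializers.ModelSerializer):
    class Meta:
        model = User  # hypothetical related model
        fields = ["id", "username"]

class RunSerializer(serializers.ModelSerializer):
    class Meta:
        model = Run  # hypothetical related model
        fields = ["id", "status"]

class ExperimentListSerializer(serializers.ModelSerializer):
    # expand foreign keys one level only; do not expand their own foreign keys
    owner = OwnerSerializer(read_only=True)

    class Meta:
        model = Experiment
        fields = ["id", "name", "owner"]

class ExperimentDetailSerializer(ExperimentListSerializer):
    # the detail page additionally expands the runs relation
    runs = RunSerializer(many=True, read_only=True)

    class Meta(ExperimentListSerializer.Meta):
        fields = ExperimentListSerializer.Meta.fields + ["runs"]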

It is worth mentioning that in django-rest-framework, it is possible to use different Serializers for different actions in the same ModelViewSet:

 
def get_serializer_class(self):
    if self.action == "list":
        return ExperimentListSerializer
    return super().get_serializer_class()

Using redundant fields

Storage is cheap nowadays, so in the right scenario you can consider storing redundant copies of fields to save queries.

For example, in one of my scenarios a group contains Executions that run in parallel, and the group is considered finished once all of its Executions are finished.

The previous implementation defined an is_running property on the group that returned group.execution_set.filter(is_running=True).exists(), which requires a foreign key query every time.

Instead, you can store an is_running field on the group and update group.is_running when an Execution finishes. (A Signal could do this, but Signals are not very maintainable; I prefer to update it explicitly.)
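A minimal sketch of this approach (the Group/Execution models and field names are illustrative, based on the scenario above):

from django.db import models

class Group(models.Model):
    # redundant field, kept in sync whenever an Execution finishes
    is_running = models.BooleanField(default=True)

class Execution(models.Model):
    group = models.ForeignKey(Group, on_delete=models.CASCADE)
    is_running = models.BooleanField(default=True)

    def finish(self):
        self.is_running = False
        self.save(update_fields=["is_running"])
        # update the group explicitly instead of via a Signal
        if not self.group.execution_set.filter(is_running=True).exists():
            self.group.is_running = False
            self.group.save(update_fields=["is_running"])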

The advantages of this are:

  1. Easy to query, business logic becomes simple

Disadvantages:

  1. Somewhere else the logic gets more complicated because the field has to be kept in sync
  2. Potential data inconsistency

Slow SQL

As data grows, an API that issues only a few fast queries in the development environment may still run into problems in production.

So it is better to observe the SQL in production as well. The method is simple: log any SQL whose query time exceeds a threshold (1 second, or whatever fits; the example below uses 50 ms). This can be done with a logging configuration for the Django ORM.

Add a filter and logger similar to the following settings:

 
 "filters": {
   "slow_sql_above_50ms": {
       "()": "django.utils.log.CallbackFilter",
       "callback": lambda record: not hasattr(record, "duration" )
       or record.duration > 0.05, # output slow queries only 
    },
},

Then only the SQL log records with query durations above 50 ms will make it through the filter into the log.
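For completeness, a sketch of how the filter might be wired into a handler and the django.db.backends logger (the handler name is illustrative; each django.db.backends record carries a duration attribute, and these per-query records are emitted at DEBUG level, by default only when the debug cursor is enabled, e.g. DEBUG = True):

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "filters": {
        "slow_sql_above_50ms": {
            "()": "django.utils.log.CallbackFilter",
            "callback": lambda record: not hasattr(record, "duration")
            or record.duration > 0.05,  # keep only queries slower than 50 ms
        },
    },
    "handlers": {
        "slow_sql_console": {
            "class": "logging.StreamHandler",
            "filters": ["slow_sql_above_50ms"],
        },
    },
    "loggers": {
        # django.db.backends attaches a `duration` attribute to each
        # per-query log record, which the filter above checks.
        "django.db.backends": {
            "handlers": ["slow_sql_console"],
            "level": "DEBUG",
        },
    },
}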

Finally, to end with a quote from the Django documentation:

 
Always profile for your use case!