When youâre starting a data project, you want to play around, test your assumptions, and explore. Itâs not the time to optimize. In fact, youâre almost certain not to know yet what to optimize.
This means that your queries that should run in milliseconds could take minutes. You know the feeling of triggering the query and instantly realizing that you made a mistake. You want to stop the query and fix it.
Until recently, Tinybird didnât allow you to cancel queries. Why not? It's such an obvious action. Well, we could argue that we wanted to encourage users to optimize their queries from the beginning. Weâre all about speed, and that means queries that run in milliseconds.
But we recognize speed isn't just about queries. It's also about how fast you develop. And if you're worried about optimizing too soon, you go slower.Â
Still, the truth is that cancelling queries presents some technical challenges. Letâs learn about them.
How to kill a query on the server
If you were directly connected to your database and wanted to cancel a query, it would be easy.
Tinybird uses ClickHouse as its analytical database. The process to cancel a query would be:
- Find the query ID. You need to query the
system.processes
table, usingclusterAllReplicas
if youâre operating a cluster. Here's how we might find the query based on a part of the SQL you just triggered, but you could use your own criteria to get the desired query ID.
- Call the
KILL QUERY
command with the givenquery_id
.
The important takeaway: you need the query_id
to kill the query.
When the query comes from the client
The scenario is not that straightforward when the query comes from another client, for example, the Tinybird web app or CLI.
When an app triggers a query, it typically follows this path:
- Client calls an API endpoint via an HTTP request
- API server forwards the request to the database
- Database processes the query
- Database passes result to API server
- API server sends response to the client
This is, of course, oversimplified. In the real world, youâll hit a load balancer in front of the API server and another one in front of the database cluster.
You may see the problem already. The UI doesnât know anything about whatâs happening under the hood.
You make a request, but the API and the database server deal with the query IDs. In Tinybird's case, itâs assigned at the API layer. The UI wonât know the assigned ID until the request is fulfilled. By that time, the query is already finished, so having that ID to kill an already finished query is useless.
The solution is simple: Instead of leaving it to the database to assign a query ID, delegate the task to the UI and pass it along to the API server to assign to the ClickHouse client operation. We use the HTTP interface to connect to ClickHouse, passing the query_id
as a parameter.
Now, the UI already knows the query ID (it created it), so we can ask for the query cancellation using this ID.
KILL QUERY
request.Ensuring safe query cancellation
Of course, it's not so simple. Nothing ever is. Letting users kill queries is risky. You cannot let users kill arbitrary queries, because the database server is not only running the queries coming from the UI, but also materializations, serving the queries of other endpoints, writes, etc. To reduce the surface area of this operation, we put in place several measures to keep things safe:
Unique query ID strategy
To avoid collisions with other queries, the âkillableâ queries use a unique ID format. We add a specific prefix to a ULID identifier. This eliminates the risk of accidentally killing critical background processes.
Endpoint restriction
We limit the query cancellation to a specific endpoint: our Query API, and only for requests coming from the UI app. We enforce this using the unique query ID format via parameter validation. This prevents unauthorized query termination coming from other endpoints.
Targeted cancellation
A special cautionary lesson: direct the KILL QUERY
to the correct server. In the past, we killed queries programmatically using the `ON CLUSTER` clause, and our approach caused some issues.
Why? ZooKeeper, the service we use to handle replication, has a single queue to handle all DDL operations. We overloaded that queue with KILL
queries.
We were determined not to let this happen again. So, instead of just throwing the KILL to the cluster, we find the particular server running the query and direct the cancellation there. This way, we avoid having to deal with replication problems. Since we are always cancelling READ queries, this is ok.
To achieve this, we ask for the hostname where the query is being run:
With that information, we run the KILL
query against that particular server.
As always, a simple feature in appearance has a lot more nuance than youâd expect. At Tinybird, we took this approach to safely allow cancelling queries from the UI client. I hope this is useful if youâre thinking of adding something similar to your product.