I used to think running large-scale analytics workloads on Postgres was like trying to use a screwdriver as a hammer - technically possible but probably not a great idea. But after years of running both OLTP and OLAP workloads on Postgres at scale, I've learned it's much more nuanced than that. Postgres can actually handle analytics workloads quite well, if you know its limits and how to work within them.
In my previous articles, we looked at handling growing data volumes and increasing user concurrency. Analytics workloads add another dimension of complexity - instead of simple lookups and CRUD operations, you're now dealing with complex aggregations across millions or billions of rows, joins across multiple large tables, and queries that need to scan huge chunks of your data.
The good news? With the right approach, Postgres actually can handle many analytics use cases, from basic reporting to complex business intelligence queries. I've seen it work well for datasets up to several terabytes, providing query response times from seconds to minutes. The bad news? You'll need to make some specific optimizations and architectural choices to get there, and it does require a lot of oversight, planning, and investment.
Understanding OLAP vs OLTP query patterns
Before we get into specific optimizations, let's understand what makes analytics queries fundamentally different from the transactional queries we've discussed in previous articles. Having run both types at scale, I can tell you the differences go way beyond just query complexity.
Here's what we're dealing with:

| | OLTP (transactional) | OLAP (analytics) |
| --- | --- | --- |
| Typical operations | Simple lookups and CRUD | Complex aggregations, large joins, window functions |
| Rows touched per query | A handful | Millions to billions |
| Query runtime | Milliseconds | Seconds to minutes |
| Access pattern | Random, index-driven reads and writes | Large sequential scans over big chunks of data |
| Concurrency | Many short transactions | Fewer, long-running queries |
| What you tune for | Latency and write throughput | Scan speed, memory for sorts and joins |
I think this table neatly summarizes why we need different strategies for OLAP workloads in Postgres. When your analytics queries start competing with your OLTP workloads for resources, you'll need to get creative with your approach - which is exactly what we'll cover next.
Postgres OLAP optimization strategies
Separate your workloads
The first rule of running analytics on Postgres: don't run analytics queries on your primary OLTP database. This isn't just best practice - it's survival. One complex analytics query scanning millions of rows can bring your entire application to a halt. I've seen it happen, and the incident reviews aren't fun.
Instead, set up a dedicated analytics replica. But not just any replica - one specifically configured for analytical workloads. Here's what this gets you:
- Complete isolation from production OLTP traffic
- Freedom to tune and optimize specifically for analytics
- Ability to scale your analytics infrastructure independently
- Protection against runaway queries impacting production
- Option to add more read replicas as analytics needs grow
Here's a basic setup using streaming replication:
- Configure the primary and replicas by updating postgresql.conf (see the sketch after this list).
- Update your application to point to the analytics replicas where appropriate.
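As a rough sketch of the first step, here's the kind of configuration involved. The role name, hostname, and values are illustrative, not a drop-in config - adjust them for your environment:

```sql
-- On the primary: allow replicas to stream WAL (the wal_level change needs a restart)
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET wal_keep_size = '1GB';   -- extra WAL retained for replicas that fall behind
SELECT pg_reload_conf();

-- A role for the replica to connect as
CREATE ROLE analytics_replicator WITH REPLICATION LOGIN PASSWORD 'change-me';

-- On the analytics replica (seeded with pg_basebackup), postgresql.conf gets:
--   hot_standby = on                        -- serve read-only analytics queries during WAL replay
--   primary_conninfo = 'host=primary.internal user=analytics_replicator password=...'
--   max_standby_streaming_delay = 15min     -- let long analytics queries finish before replay cancels them
```

The max_standby_streaming_delay trade-off is the one to watch: the longer you let analytics queries block WAL replay, the further the replica falls behind the primary.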
These settings differ from the ones suggested in the "Outgrowing Postgres: Handling increased user concurrency" article: here, you'll want to optimize the replicas specifically for analytics workloads. Skip to the Performance tuning for analytics section below for more fine-tuned settings for your replicas.
Key things you'll want to watch out for:
- Replication lag: Analytics queries can cause the replica to fall behind
- Resource contention: Even on a replica, poorly optimized queries can cause problems
- Data freshness: Your analytics will be slightly behind production due to replication
A common question I get is "How much replication lag is acceptable?" For most analytics workloads, a few seconds or even minutes of lag is fine. But if you need real-time analytics, you'll need to either accept higher costs from a more powerful replica setup or start looking at specialized analytics solutions.
Design your schema for analytics
Your schema design can make or break analytics performance. Let's look at several techniques that can dramatically improve query speed, starting with the basics and moving to more advanced approaches.
Materialized views for pre-aggregation
If you've read my first article in this series on handling growing data volumes, you're already familiar with materialized views. They're particularly valuable for analytics workloads because they let you pre-calculate expensive aggregations:
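For example - using a hypothetical orders table - you can pre-aggregate daily revenue once and refresh it on a schedule, instead of recomputing it on every dashboard load:

```sql
-- Pre-aggregate daily revenue per product (orders is a hypothetical table)
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
    date_trunc('day', created_at) AS day,
    product_id,
    count(*)                      AS order_count,
    sum(amount)                   AS revenue
FROM orders
GROUP BY 1, 2;

-- A unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON daily_revenue (day, product_id);

-- Refresh without blocking readers (e.g. on a schedule from cron or pg_cron)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```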
For a more thorough treatment of materialized views, including concurrent refreshes and maintenance strategies, check out the materialized views section of that article.
Table partitioning
Partitioning is another fundamental technique we covered in the data volumes article. For analytics workloads, time-based partitioning is particularly effective:
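A minimal sketch of declarative range partitioning by month, again using a hypothetical events table:

```sql
-- Parent table is partitioned by event time
CREATE TABLE events (
    id          bigserial,
    user_id     bigint NOT NULL,
    event_type  text   NOT NULL,
    payload     jsonb,
    created_at  timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per month; queries that filter on created_at
-- only touch the partitions they need (partition pruning)
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```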
Take a look at the table partitioning section of that article for detailed partitioning strategies and maintenance approaches.
Vertical partitioning for analytics
Here's where things get interesting for analytics workloads. Vertical partitioning - splitting tables by columns rather than rows - can significantly improve analytics performance. While we touched on this in that same data volumes article, it deserves special attention for analytics.
Consider a wide events table:
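Something like this hypothetical table - dozens of columns, most of which any single query ignores:

```sql
-- A wide events table: most analytics queries only touch a few of these columns
CREATE TABLE events (
    id              bigserial PRIMARY KEY,
    created_at      timestamptz NOT NULL,
    user_id         bigint      NOT NULL,
    event_type      text        NOT NULL,
    -- frequently aggregated measures
    duration_ms     integer,
    revenue         numeric(12,2),
    -- rarely queried request context
    user_agent      text,
    ip_address      inet,
    referrer        text,
    utm_source      text,
    utm_campaign    text,
    -- large, occasionally inspected payloads
    properties      jsonb,
    raw_payload     jsonb
);
```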
You can split this into separate tables based on query patterns:
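One possible split, keeping the hot measures in a narrow core table and pushing the wide, rarely read columns into side tables keyed by the same id (the grouping here is illustrative - yours should follow your actual query patterns):

```sql
-- Narrow core table: what most aggregations actually scan
CREATE TABLE events_core (
    id          bigserial PRIMARY KEY,
    created_at  timestamptz NOT NULL,
    user_id     bigint      NOT NULL,
    event_type  text        NOT NULL,
    duration_ms integer,
    revenue     numeric(12,2)
);

-- Attribution columns, used by a different set of reports
CREATE TABLE events_attribution (
    event_id     bigint PRIMARY KEY REFERENCES events_core (id),
    user_agent   text,
    ip_address   inet,
    referrer     text,
    utm_source   text,
    utm_campaign text
);

-- Heavy payloads, read only when drilling into individual events
CREATE TABLE events_payload (
    event_id    bigint PRIMARY KEY REFERENCES events_core (id),
    properties  jsonb,
    raw_payload jsonb
);
```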
This approach has several benefits for analytics:
- Queries only need to scan the columns they need
- Better cache utilization for frequently accessed columns
- More efficient compression for similar data types
- Ability to use different storage strategies per table
But - and this is a big but - vertical partitioning isn't free. Here's what you're signing up for:
Write complexity
- Every insert now requires multiple queries
- Transactions need to span multiple tables
- More complex error handling and rollback scenarios
- Higher chance of constraint violations
- Need for careful transaction management to maintain data consistency
Read complexity
- If you need to read columns across the partitions, simple queries become joins
- Need to maintain indexes across multiple tables
- More complex query planning
- Potential for performance cliffs if joins aren't optimized
Maintenance headaches
- More tables to manage and monitor
- Complex cleanup for orphaned records
- Harder to modify schema (changes need to be coordinated)
- More complex backup and restore procedures
- Additional foreign key constraints to maintain
When should you actually use vertical partitioning? In my experience, it works best when:
- Your analytical queries have clear patterns (some columns always used together)
- You have very wide tables where most queries use a small subset of columns
- You have different access patterns for different column groups
- Your write volume is moderate (and you can handle the extra complexity)
- You have columns with very different storage needs (like text vs numeric data)
If you're just starting out with analytics on Postgres, I'd recommend exhausting simpler options first. Materialized views and regular partitioning can take you quite far with less complexity. But when you're hitting scan performance issues and you see clear patterns in how columns are accessed, vertical partitioning might be worth the trade-offs.
Columnar storage with Postgres plugins
Postgres has a powerful extension system that lets you add new capabilities to your database. Think of extensions like browser plugins - they extend functionality but run with the same privileges as Postgres itself. That's both powerful and potentially risky.
Understanding Postgres extensions
Extensions are shared libraries that can:
- Add new data types and functions
- Create new table access methods
- Implement new index types
- Add foreign data wrappers
They're typically written in C and need to be compiled against your specific Postgres version. While the PostgreSQL Extension Network (PGXN) has some quality controls, you should still evaluate extensions carefully - they run with full database privileges.
Columnar storage options
For analytics workloads, the following columnar storage extensions stand out:
cstore_fdw
The battle-tested option, developed by Citus Data (now part of Microsoft):
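A rough sketch of what using it looks like - cstore_fdw exposes columnar storage through a foreign data wrapper, and the table and option values here are illustrative:

```sql
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Columnar copy of the events data, compressed with pglz
CREATE FOREIGN TABLE events_columnar (
    created_at  timestamptz,
    user_id     bigint,
    event_type  text,
    revenue     numeric(12,2)
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- Bulk-load it (COPY and INSERT ... SELECT are supported; per-row updates are not)
INSERT INTO events_columnar
SELECT created_at, user_id, event_type, revenue FROM events;
```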
Pros:
- Mature and stable
- Used in production by many companies
- Good documentation and community support
- Solid compression ratios
- Supports both read and write operations
- Works with standard Postgres backup tools
Cons:
- No support for indexes
- Limited to PostgreSQL 12 and older in the open-source version
- Writes are slower than regular tables
- No concurrent write support
- Limited query optimization capabilities
pg_analytics (formerly pg_lakehouse)
A DuckDB-powered option by ParadeDB focused on data lake integration:
Pros:
- Native integration with data lakes and object stores
- Supports Delta Lake and Apache Iceberg formats
- Query pushdown to DuckDB for performance
- Good compatibility with Postgres types
Cons:
- Read-only access currently
- Requires shared_preload_libraries configuration
- Limited write capabilities
- Early in development cycle
pg_duckdb
A relative newcomer from Hydra and MotherDuck that also embeds DuckDB in Postgres:
Pros:
- Leverages DuckDB's high-performance analytics engine
- Can directly query data in S3/object storage
- Supports Parquet, CSV, JSON file formats
- Integrates with MotherDuck cloud service
- Active development from DuckDB team
Cons:
- Currently read-only for object storage
- Some Postgres data types not yet supported
- Relatively new (v0.2.0 as of early 2025)
- Limited production deployments
pg_mooncake
The newest kid on the block, from Mooncake Labs, combines DuckDB execution with Iceberg/Delta Lake storage and is available on Neon:
Pros:
- Significantly faster analytics queries due to DuckDB's vectorized execution
- Better compression ratios
- Native integration with data lakes (can query Parquet files)
- Modern columnar storage format
- Active development
Cons:
- Relatively new and less battle-tested
- Limited write capabilities (no UPDATE/DELETE)
- Some Postgres features not supported yet
- More complex setup if using cloud storage
- Smaller community and fewer production deployments
When to use a columnar storage plugin
There are definitely some risks you need to consider before going down this route:
- Version Lock-in: Extensions need to match your Postgres version exactly
- Operational Complexity: Need to maintain extension updates separately
- Backup Considerations: Some backup tools may not handle extension data properly
- Performance Overhead: Extensions can add CPU and memory overhead
- Support Challenges: Limited commercial support options
Despite these challenges, columnar storage can be worth it when:
- You have large analytics tables (>100GB)
- Queries typically access a subset of columns
- You need better compression ratios
- Read performance is more important than write performance
- You can manage your own Postgres infrastructure
If you're on a managed platform that doesn't support these extensions, you might need to:
- Use regular Postgres tables with good indexing strategies
- Consider a separate analytics database
- Look at managed analytics solutions
- Evaluate moving to self-hosted Postgres
Remember: adding extensions is a significant architectural decision. Start with standard Postgres features first, and only add extensions when you have clear evidence they'll solve specific problems in your workload.
While these advanced techniques can give you a huge boost in analytics performance, they also add complexity to your system. Start with the basics (materialized views, partitioning, and analytics replicas) and only move to columnar storage when you have clear evidence it will help your specific workload.
Analytics query optimization techniques
Up to this point I've focused on infrastructure and schema changes - but in many situations, smart query optimization beats brute force.
Understanding analytics query patterns
Analytics queries are fundamentally different from OLTP queries. They often:
- Scan large portions of tables
- Perform complex aggregations
- Join multiple large tables
- Use window functions for trend analysis
- Need to handle significant intermediate results
Now I'll walk you through how to optimize each of these patterns.
Window functions
Window functions are incredibly powerful for analytics. They let you perform calculations across sets of rows without the overhead of self-joins or complex subqueries:
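For instance, ranking and running totals over a hypothetical daily_revenue table (per-day, per-product revenue), with no self-joins in sight:

```sql
-- Running total, daily rank, and 7-day moving average per product
SELECT
    day,
    product_id,
    revenue,
    sum(revenue) OVER (PARTITION BY product_id ORDER BY day)      AS running_revenue,
    rank()       OVER (PARTITION BY day ORDER BY revenue DESC)    AS daily_rank,
    avg(revenue) OVER (PARTITION BY product_id ORDER BY day
                       ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)  AS revenue_7d_avg
FROM daily_revenue;
```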
Pro tip: Window functions are processed after aggregations. This means you can combine them with GROUP BY:
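A sketch of that combination, using the same hypothetical events table - the GROUP BY aggregates per day first, then the window functions operate on the aggregated rows:

```sql
-- Aggregate per day, then rank days by volume and compute each day's share of the total
SELECT
    date_trunc('day', created_at) AS day,
    count(*)                      AS events,
    rank() OVER (ORDER BY count(*) DESC)       AS busiest_day_rank,
    count(*)::numeric / sum(count(*)) OVER ()  AS share_of_total
FROM events
GROUP BY 1
ORDER BY 1;
```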
Common table expressions (CTEs)
Common Table Expressions (CTEs) are often seen as just a way to make queries more readable. But because of how Postgres executes them, they're also a powerful optimization tool - and that same behavior makes them a double-edged sword:
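Here's a sketch against a hypothetical events table - the CTE computes an expensive aggregate once, and both references in the main query reuse that single result:

```sql
WITH daily_counts AS MATERIALIZED (
    -- computed once, then reused by both references below
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM events
    GROUP BY 1
)
SELECT
    d.day,
    d.events,
    d.events - lag(d.events) OVER (ORDER BY d.day) AS change_vs_previous_day
FROM daily_counts d
WHERE d.events > (SELECT avg(events) FROM daily_counts);
```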
The other side of the sword: Postgres materializes a CTE - calculating it once and reusing the result - when it's referenced more than once or marked MATERIALIZED (and before Postgres 12, CTEs were always materialized). This is usually great for analytics but can backfire if your CTE returns a large result set that's only used for a small subset of rows, because the materialization fence stops the planner from pushing filters down into the CTE.
Parallel query execution
Postgres can parallelize many analytics operations, but you need to tune it right:
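The relevant knobs look something like this - the right values depend entirely on your core count and workload, so treat these as illustrative:

```sql
-- How many workers a single query can use (per Gather node)
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
-- Total parallel workers across the whole instance
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET max_worker_processes = 16;          -- must be >= max_parallel_workers (restart required)
-- Lower these to make the planner more willing to go parallel
ALTER SYSTEM SET parallel_setup_cost = 100;
ALTER SYSTEM SET parallel_tuple_cost = 0.01;
-- Minimum table size before a parallel scan is considered
ALTER SYSTEM SET min_parallel_table_scan_size = '8MB';
SELECT pg_reload_conf();
```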
Some operations that benefit most from parallelism:
- Sequential scans on large tables
- Parallel joins (with appropriate settings)
- Large aggregations
- Window functions
Keep in mind: While parallel queries use more total CPU, they can finish much faster. Monitor your system resources carefully when enabling parallelism.
Essential EXPLAIN analysis
When optimizing analytics queries, EXPLAIN ANALYZE is your best friend. These are some of the specific things to watch for:
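A typical invocation, with comments on what to look for in the output (the query and plan details are illustrative):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('day', created_at) AS day, count(*)
FROM events
WHERE created_at >= now() - interval '90 days'
GROUP BY 1;

-- Things to look for in the output:
--   * estimated vs actual row counts - large gaps usually mean stale statistics (run ANALYZE)
--   * "Sort Method: external merge  Disk: ..." - the sort spilled to disk; consider raising work_mem
--   * "Workers Launched: 0" on a big scan - parallelism was planned but no workers were available
--   * Buffers: shared read vs hit - how much of the query came from disk rather than cache
```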
Common analytics query pitfalls
- Unneeded ORDER BY in subqueries
- Missing DISTINCT ON for latest values (see the sketch below)
- Not using date truncation indexes
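For instance, the "latest value per key" pattern is often written as a correlated subquery or self-join; DISTINCT ON handles it in one pass (events is a hypothetical table again):

```sql
-- Latest event per user, without a correlated subquery or self-join
SELECT DISTINCT ON (user_id)
    user_id,
    event_type,
    created_at
FROM events
ORDER BY user_id, created_at DESC;
```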
Query optimization is iterative. Start with the basics, measure, then optimize further based on real usage patterns. And always test with realistic data volumes - what works for 1M rows might fall apart at 100M. Lastly, it's amazing how far you can get by simply following "The 5 rules of writing faster SQL queries."
Indexing for analytics
BRIN indexes are Postgres' secret weapon for analytical workloads. Unlike B-tree indexes that track every single row, BRIN (Block Range INdex) creates a tiny index that stores metadata about blocks of data. Think of it like chapter summaries in a book instead of a detailed index of every word.
For time-series data or any naturally ordered columns (like IDs that only go up), BRIN shines. It's perfect when physical storage order and the indexed column are highly correlated for your data.
In Crunchy Data's testing, on a 42MB table, the B-tree indexes were 21MB while the BRIN indexes were just 24KB! And for large result sets (100K+ rows), BRIN consistently outperformed B-tree indexes.
Here's how to create one:
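A minimal example on a hypothetical events table:

```sql
-- BRIN index on the time column of an append-mostly events table
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at)
    WITH (pages_per_range = 128);
```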
The pages_per_range parameter is key - it defaults to 128 but tuning it can really boost performance. For narrow queries (returning ~100 rows), smaller values like 8 or 16 work better. For broader analytical queries, stick with larger values.
But BRIN isn't magic. Skip it when:
- Your data is randomly distributed
- You need exact row lookups
- Your queries are highly selective (returning <1% of rows)
- Your table gets lots of UPDATEs in the middle
The sweet spot? Analytical queries that scan large chunks of naturally ordered data. Bonus points if you're tight on disk space - BRIN indexes are tiny.
Want to check if BRIN makes sense? Look at the correlation between storage order and your column:
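Something like this, using the pg_stats view (the table and column names are placeholders):

```sql
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events'
  AND attname = 'created_at';
```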
The closer to 1.0, the better BRIN will perform.
Performance tuning for analytics
No matter how well you write your queries, your Postgres instance needs proper tuning to handle analytics workloads efficiently. While I covered basic configuration in previous articles, analytics workloads have specific needs that require different trade-offs.
Memory settings
Memory configuration is where analytics workloads differ most from OLTP. You'll need larger memory allocations for complex operations, but you'll also need to be careful not to overallocate:
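As a sketch - the actual values depend on your RAM and concurrency, so treat these as starting points rather than recommendations:

```sql
-- Roughly 25% of RAM for shared_buffers is a common starting point (restart required)
ALTER SYSTEM SET shared_buffers = '16GB';
-- Tell the planner how much of the OS page cache it can assume
ALTER SYSTEM SET effective_cache_size = '48GB';
-- Per sort/hash operation - a single analytics query can run several at once
ALTER SYSTEM SET work_mem = '256MB';
-- For index builds, VACUUM, and materialized view refreshes
ALTER SYSTEM SET maintenance_work_mem = '2GB';
-- Allow hash-based operations to use a multiple of work_mem (Postgres 13+)
ALTER SYSTEM SET hash_mem_multiplier = 2.0;
```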
I/O configuration
Analytics queries often read large amounts of data. These settings help optimize I/O:
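Illustrative settings, assuming SSD or NVMe storage:

```sql
-- SSDs make random reads nearly as cheap as sequential ones
ALTER SYSTEM SET random_page_cost = 1.1;
-- How many concurrent prefetch requests to issue (helps bitmap heap scans)
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET maintenance_io_concurrency = 100;
-- Spread checkpoints out so large writes don't arrive in bursts
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET max_wal_size = '8GB';
```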
Query planning
The query planner needs different settings for analytics:
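For example - again, the values are illustrative:

```sql
-- Better statistics help the planner with skewed analytics data (re-run ANALYZE afterwards)
ALTER SYSTEM SET default_statistics_target = 500;
-- Let the planner push joins and aggregates down into individual partitions
ALTER SYSTEM SET enable_partitionwise_join = on;
ALTER SYSTEM SET enable_partitionwise_aggregate = on;
-- JIT compilation usually pays off for long-running analytical queries
ALTER SYSTEM SET jit = on;
```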
Vacuum settings
Large analytics tables need different vacuum settings:
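A sketch of per-table overrides for a large, append-mostly analytics table (the thresholds are illustrative):

```sql
-- Percentage-based thresholds barely ever trigger on huge tables,
-- so tighten the scale factors and raise the cost limit for this table
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_vacuum_threshold     = 50000,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_limit    = 2000
);
```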
Monitoring configuration
Monitoring becomes even more critical with analytics workloads:
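At a minimum, something like this - note that pg_stat_statements has to be in shared_preload_libraries, which requires a restart:

```sql
-- Track per-query statistics (append to any existing shared_preload_libraries entries)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- Include I/O timings in EXPLAIN (ANALYZE, BUFFERS) and pg_stat_statements
ALTER SYSTEM SET track_io_timing = on;
-- Log anything slower than 10 seconds so runaway analytics queries are visible
ALTER SYSTEM SET log_min_duration_statement = '10s';

-- After the restart:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```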
Real-world example
Here's what this might look like in practice. Let's say you have:
- 32 CPU cores
- 128GB RAM
- NVMe storage
- Mixed analytics workload
Your configuration might look like:
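Something along these lines - treat it as a starting point to benchmark against, not a recipe:

```sql
-- 32 cores, 128GB RAM, NVMe, mixed analytics workload (illustrative values)
ALTER SYSTEM SET shared_buffers = '32GB';                 -- ~25% of RAM (restart required)
ALTER SYSTEM SET effective_cache_size = '96GB';           -- ~75% of RAM
ALTER SYSTEM SET work_mem = '256MB';                      -- watch concurrent query count
ALTER SYSTEM SET maintenance_work_mem = '2GB';
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;
ALTER SYSTEM SET max_parallel_workers = 24;
ALTER SYSTEM SET max_worker_processes = 32;
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET default_statistics_target = 500;
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
```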
Don't just copy these numbers and YOLO it. Monitor your actual memory usage using tools like pg_stat_statements and adjust based on your workload. Below I'll call out some gotchas and guidance for testing your settings.
Configuration gotchas
Watch out for these common issues:
- Memory overallocation
  - Too high work_mem can cause OOM kills
  - Remember it's per operation, not per connection
- Excessive parallelism
  - Too many parallel workers can cause thrashing
  - Monitor CPU usage and adjust
- Aggressive vacuum
  - Too aggressive settings can impact query performance
  - Monitor vacuum timing and adjust cost limits
- Checkpoint tuning
  - Too large max_wal_size can cause long recovery times
  - Too frequent checkpoints can impact performance
Testing your configuration
Always test configuration changes:
- Start with a test environment
- Benchmark before and after
- Monitor system resources
- Test with realistic data volumes
- Test failure scenarios
Analytics workloads often reveal configuration problems that OLTP workloads miss. What works for millions of small transactions might fail spectacularly for one large analytical query.
Know your limits
Even with all these optimizations, you'll eventually hit limits running analytics on Postgres. Watch for:
- Queries taking minutes or longer
- High I/O wait times
- Memory pressure from large sorts/joins
- Vacuum struggling to keep up
- Replication lag on analytics replicas
When you see these signs, it's time to start thinking about dedicated analytics solutions. But that's a topic for an upcoming article!
Takeaways
Postgres can handle many analytics workloads if you:
- Isolate analytics traffic to dedicated replicas
- Optimize your schema and configuration
- Use materialized views strategically
- Use techniques for optimal analytics querying
- Leverage BRIN indexes where they make sense
- Monitor and maintain carefully
- Know when to quit and move to specialized tools
It's not a perfect analytics database. But with the right approach, it can take you surprisingly far. Just don't forget to keep an eye on those limits - they'll sneak up on you faster than an angry DBA during a production incident.
In the next article, I'll walk you through when and how to move your analytics workloads off Postgres. Because sometimes, you really do need a hammer instead of that screwdriver.
Need to move your analytics off Postgres?
Tinybird is data infrastructure for software teams. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more.
You can try Tinybird for free with no time limit. Sign up here.