The Data Engineering Show

AI for Data and Data for AI: The Dual Frontier of Modern Data Engineering with Pranav Motarwar

The Firebolt Data Bros — Tue, 16 Jun 2026 12:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Pranav Motarwar, a data engineer who has worked across major tech companies at the intersection of AI and data infrastructure, to explore how artificial intelligence is fundamentally reshaping the data engineering landscape: not by eliminating roles, but by splitting the field into two distinct, equally critical domains.

What You'll Learn:

- Why the "data engineering is dying" narrative is clickbait: Data engineers remain essential because 60% of use cases by 2027 will involve providing data to AI agents, while simultaneously human-facing analytics demands continue growing, meaning more work, not less.

- How to future-proof your career by mastering "AI for Data" AND "Data for AI": Modern AI Data Engineer roles now require both using AI agents to accelerate traditional ETL/DBT workflows AND building entirely new data pipelines (chunking, embedding, vector storage) designed specifically for agent consumption.

- The transformation framework breaking down how data pipelines for humans differ from pipelines for agents: Human-facing pipelines traditionally handled structured data; agent pipelines now require handling unstructured multimodal inputs (videos, audio, images), demanding completely different architectural approaches.

- Why individual contributors now own end-to-end pipelines that previously required 7-8 engineers: AI-assisted coding and low-code tools such as Snowflake Cortex and Databricks' GenAI features cut traditional pipeline development from about one month to roughly 30% of the time it took three to four years ago, freeing engineers to focus on product strategy, governance, and business impact.

- How the next-gen data stack will evolve: traditional tools (DBT, BI platforms) stay relevant, but new specialized systems emerge: Companies like Vespa handle multimodal retrieval serving, while emerging startups build data warehouses purpose-built for video and complex unstructured data - eventual consolidation will come once larger players (Databricks, Snowflake) evolve their offerings.

- The exponential data explosion argument that guarantees ongoing demand: Data generated by all humanity through 2008 is now created daily; even single engineers replacing five-person teams will find more work arriving as use cases expand across AI agents, real-time recommendations, robotics, and physical AI systems.
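
The "Data for AI" steps named above (chunking, embedding, vector storage) can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the episode: `chunk_text`, `toy_embed`, and `InMemoryVectorStore` are hypothetical names, and the hashed bag-of-words embedding is a stand-in for a real embedding model.

```python
import hashlib
import math


def chunk_text(text, max_words=8, overlap=2):
    """Split text into overlapping word windows (the chunking step)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=16):
    """Assumption: a hashed bag-of-words vector standing in for a real model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # L2-normalized, so dot product = cosine


class InMemoryVectorStore:
    """Hypothetical vector store: keeps (chunk, vector) pairs in a list."""

    def __init__(self):
        self.rows = []

    def add(self, chunk):
        self.rows.append((chunk, toy_embed(chunk)))

    def search(self, query, k=1):
        q = toy_embed(query)
        ranked = sorted(
            self.rows,
            key=lambda row: -sum(a * b for a, b in zip(q, row[1])),
        )
        return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    doc = "Agents consume unstructured inputs such as transcripts of audio and video files"
    for piece in chunk_text(doc, max_words=6, overlap=2):
        store.add(piece)
    print(store.search("audio and video files", k=1))
```

In a production pipeline the chunk size, the overlap, and the choice of vector store are the tuning knobs the guest refers to when he talks about "optimizing the entire process".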

About the Guest(s)

Pranav Motarwar is a data engineer with extensive experience across leading tech companies, where he has worked in risk, product, privacy, and core data engineering roles. With a background spanning from traditional data engineering to cloud infrastructure and AI-driven systems, Pranav brings a unique perspective on the industry's rapid evolution. In this episode, he explores how AI is fundamentally transforming data engineering workflows, discussing the emergence of dual pipeline architectures for both human and AI consumption, and the critical skills data engineers need to remain relevant in 2025 and beyond. His insights on the shift from structured data pipelines to multimodal, AI-optimized infrastructure provide actionable guidance for engineers navigating the next generation of data stack technology.

Quotes

"I've worked across different product-based companies in different domains like risk and product, as well as privacy, and the core data engineering teams as well." - Pranav Motarwar

"Data engineering is completely segmented into two different categories: one where the end consumer is human or product, and another where you are building data engineering flow, pipelines, and design for agents to consume." - Pranav Motarwar

"What used to take one month to create an entire flow with DBT has now been reduced to almost 30% of the time we usually spent three to four years ago." - Pranav Motarwar

"Data engineers need to be aware of the process of chunking, embedding, and how you are planning the vector store and optimizing the entire process." - Pranav Motarwar

"The data which was generated by humans from humanity till the year 2008 is currently generated in a day—that's how the volume is exploding." - Pranav Motarwar

"There are two main aspects to data engineering right now: AI for data and data for AI, and both things are essential for an engineer to plan their future." - Pranav Motarwar

"You can't say that you should focus on AI for data rather than data for AI because both are going to be very much important for the next couple of years." - Pranav Motarwar

"Companies like Apple and Tyro are raising relevant job applications in the market known as AI data engineer, with requirements around creating data pipelines for agents and using AI agents in your data engineering flow." - Pranav Motarwar

"Traditionally, we were consuming and processing data in a very structured format, but now that is getting transformed for agents, where it will be unstructured files, audios, videos—it can be pretty much anything." - Pranav Motarwar

"If you want to cope with market dynamics, you need to understand the requirements in the market and gauge your skills according to the market dynamics." - Pranav Motarwar

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

Resources

LinkedIn Profiles:

Pranav Motarwar's LinkedIn: https://www.linkedin.com/in/pranav-motarwar-648a55169
Benjamin's LinkedIn: https://www.linkedin.com/in/wagjamin

Company Websites:

Firebolt: firebolt.io

Tools & Platforms:

DBT – Data transformation and modeling tool for building analytics engineering workflows
Fivetran – Data integration platform for automating data pipeline ingestion
Snowflake – Cloud-based data warehouse for structured and unstructured data processing
Databricks – Unified data analytics platform supporting ETL, data science, and AI workloads
BigQuery – Google Cloud's data warehouse for analytics and machine learning
Looker – Business intelligence and visualization platform
Cortex – Snowflake's AI-powered tool for data pipeline automation
LangChain – Framework for building applications with language models and data processing layers
Vespa – Retrieval engine for fast vector search and multimodal data serving
AdaptDB – Analytical database system for building software products

Articles & Research Papers:

"MIT Technology Review Report on Data Engineering and AI" – Co-published with Snowflake (2023-2025 projections on AI use cases in data engineering)

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:

AI Won't Replace Engineers, But This Framework Will Change How They Build with Rohit Girme

The Firebolt Data Bros — Thu, 07 May 2026 12:00:00 +0000

Scaling AI from proof-of-concept to production requires more than just deploying models; it demands robust evaluation frameworks, human oversight, and a fundamental shift in how engineering teams approach development.

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Rohit Girme, Staff Software Engineer at Airbnb, to explore how Airbnb built a Gen AI evaluation platform to assess LLM outputs across product surfaces, from customer support bots to search and booking experiences. Rohit shares insights into Airbnb's infrastructure choices, evaluation workflows, and lessons learned about leveraging AI tools while maintaining human orchestration.

What You'll Learn:

- How to architect a multi-layer Gen AI evaluation platform using Python, vLLM, Kubernetes, and DAG-based workflows to systematically test LLM outputs in production

- Why splitting monolithic "virtual judges" into specialized LLM-powered metrics (content relevance, hallucination detection, policy adherence) dramatically improves evaluation accuracy and debugging

- The critical distinction between real-time evaluation (lightweight, sub-second latency) and offline evaluation (comprehensive, human-in-the-loop) and how to route outputs accordingly

- How to shift from traditional software engineering (deterministic, rule-based testing) to probabilistic AI evaluation where you validate outputs against golden datasets and human judgment benchmarks

- The framework for breaking down problems into smaller chunks and using AI tools as collaborators rather than end-to-end problem solvers—critical when working with codebases at massive scale

- Why documentation becomes infrastructure in an AI-driven workflow: LLMs need comprehensive, well-formatted docs to scale tribal knowledge across entire organizations

- The hard truth about AI and scaling: zero-to-one innovation is now commoditized, but one-to-n execution (the scaling part) still demands human judgment, orchestration, and product sense

- How to measure AI tool adoption beyond token usage: instrument your development workflow to capture whether LLM suggestions actually made it into shipped code and added real value
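
As a rough illustration of two ideas from the list above (splitting one "virtual judge" into specialized metrics, and routing real-time versus offline evaluation), here is a minimal Python sketch. It is not Airbnb's platform: the `Metric` class, the two scoring functions, and the `evaluate` router are hypothetical, and the scorers are simple string heuristics standing in for LLM-powered judges.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Metric:
    name: str
    score_fn: Callable[[str, str], float]  # (prompt, output) -> score in [0, 1]
    realtime_ok: bool  # cheap enough to run on the live request path


def relevance(prompt: str, output: str) -> float:
    """Placeholder judge: share of prompt words that reappear in the output."""
    prompt_words = set(prompt.lower().split())
    output_words = set(output.lower().split())
    if not prompt_words:
        return 0.0
    return len(prompt_words & output_words) / len(prompt_words)


def policy_adherence(prompt: str, output: str) -> float:
    """Placeholder judge: fail outputs containing a banned phrase (assumed list)."""
    banned_phrases = {"refund guaranteed"}
    text = output.lower()
    return 0.0 if any(phrase in text for phrase in banned_phrases) else 1.0


METRICS = [
    Metric("relevance", relevance, realtime_ok=True),
    Metric("policy_adherence", policy_adherence, realtime_ok=False),
]


def evaluate(prompt: str, output: str, mode: str) -> Dict[str, float]:
    """Run only lightweight metrics in 'realtime' mode, every metric in 'offline'."""
    if mode not in ("realtime", "offline"):
        raise ValueError(f"unknown mode: {mode}")
    selected = [m for m in METRICS if mode == "offline" or m.realtime_ok]
    return {m.name: m.score_fn(prompt, output) for m in selected}


if __name__ == "__main__":
    print(evaluate("cancel my booking", "I can cancel the booking", mode="realtime"))
    print(evaluate("cancel my booking", "Refund guaranteed today", mode="offline"))
```

Because each metric returns its own named score, a failing output points at the specific dimension that broke, which is the debugging benefit the episode attributes to specialized judges.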

About the Guest(s)

Rohit Girme is a Staff Software Engineer at Airbnb, where he has spent the last seven and a half years building infrastructure and platforms at scale. With deep expertise in search and machine learning infrastructure, Rohit leads efforts in GenAI evaluation and has pioneered Airbnb's approach to ensuring AI-powered features work reliably in production. In this episode, Rohit shares practical insights on building evaluation platforms for large language models, orchestrating AI in product workflows, and leveraging AI tools effectively in software development. His work on integrating LLMs into customer-facing products while maintaining quality and performance provides actionable strategies for engineering teams navigating the rapid adoption of AI, making this conversation essential for data engineers and platform builders looking to scale AI responsibly.

Quotes

"Zero to one is easy now, but the one to n, which is a scaling part, I think we still haven't figured that out. You still need humans for that." - Rohit

"With AI, it's a black box to us as well. We don't know how it's working underneath, so we have to figure out another way to evaluate the surface." - Rohit Girme

"Humans should be the orchestrators of these tools and not just hand off everything to these tools." - Rohit Girme

"If we hand off everything to the LLM, it will make a lot of assumptions because context is limited, and it doesn't know the code enough." - Rohit Girme

"Documentation has become even more relevant because now LLMs need to know everything so everyone can scale up." - Rohit Girme

"Measuring productivity in LLMs is not just about how many tokens people are using—you need to figure out if they're actually building something on top." - Rohit Girme

"Internet democratized information, and I think with LLMs, it's capability that would be democratized. If you have a good idea, you can build it very quickly." - Rohit Girme

"There's always going to be blind spots for every person, but with AI, it'll become even faster because you have this very short cycle of talking to the AI instead of talking to five humans." - Rohit Girme

"Shipping products or shipping features would become even faster—where earlier it took weeks or months, now it will be days." - Rohit Girme

"I have supercharged my workflow day to day either at work or at home with access to information that's so easy to get." - Rohit Girme

Resources

LinkedIn Profiles:

Rohit Girme's LinkedIn: https://www.linkedin.com/in/rohitgirme/
Benjamin's LinkedIn: https://www.linkedin.com/in/wagjamin

Company Websites:

Airbnb: airbnb.com
Firebolt: firebolt.io

Tools & Platforms:

vLLM – Open-source inference framework for hosting and serving LLMs
Kubernetes – Container orchestration platform used for serving infrastructure
Apache Airflow – DAG-based workflow orchestration tool (originated at Airbnb)
GitHub Copilot – AI-powered code completion tool for software development
Claude – LLM tool referenced for code generation and development assistance

Cloud Services:

Azure – Hosted LLM services used at Airbnb
AWS – Hosted LLM services used at Airbnb

The Framework Canva Uses for 200M+ Designers with Paul Tune

The Firebolt Data Bros — Tue, 28 Apr 2026 11:27:00 +0000

AI agents are moving beyond simple automation into collaborative design workflows requiring fundamentally different approaches to user experience, model training, and infrastructure than traditional ML systems.

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Paul Tune, Staff Research Scientist at Canva, to explore how the design platform is building agentic workflows, managing multimodal data pipelines, and tackling the unique challenge of teaching machines to understand aesthetic taste alongside functional design.

What You'll Learn:

How to architect user experiences that match intent across expertise levels, from seventh graders to professional designers, by constraining uncertainty through progressive disclosure rather than forcing upfront specification
Why reinforcement learning infrastructure for creative tasks demands different optimization priorities than supervised fine-tuning, with network latency to external API services often dominating compute efficiency
The shift in modern ML workflows from "source data → train → deploy" to a verification and evaluation-first paradigm, especially for generative models where training cycles are measured in weeks, not hours
How to split ML team responsibilities across data sourcing, supervised fine-tuning, distributed systems tuning, and evaluation, with evaluation becoming the critical path as model capabilities scale
The difference between LLM inference bottlenecks (token throughput is rarely limiting) versus image-based ML pipelines (where data movement and GPU saturation drive entirely different optimization equations)
Why aesthetic evaluation remains harder than mathematical verification, and how 2024 will likely see meaningful progress in applying LLMs beyond verifiable domains like coding into subjective areas like design taste

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)

Paul is a Staff Research Scientist at Canva, bringing nine years of experience building machine learning systems that empower millions of users worldwide. With deep expertise in large language models, reinforcement learning, and generative AI applications, Paul leads Canva's post-training efforts on LLMs designed for agentic design workflows. In this episode, Paul shares insights into how modern ML teams balance competing priorities - from data efficiency and GPU optimization to evaluation frameworks for subjective tasks like design aesthetics. His work bridging the gap between casual users and professional designers offers valuable lessons for data engineers and ML practitioners looking to scale AI systems across diverse user bases and complex product surfaces.

Quotes

"What Canva is is that online graphic design platform for you to be able to design and kinda have this whole end-to-end process from the designing, the brainstorming, and using all sorts of tools in order to create a graphic design." - Paul Tune

"The whole vision really is to empower the world to design, and what that entails is to then have this entire end-to-end experience of designing on the platform." - Paul Tune

"I think a lot of it has to do with matching intent, so even for yourself, if you're using cloud code, some folks go down the side of, I want to plan very specific things about my design." - Paul Tune

"Whereas for a more casual user, they probably do come in without really having an idea of what they actually do want in the first place, and I think having a few options to kind of show, okay, these are kind of like a few designs that you might like as part of that, so that sort of helps to then eventually narrow down the intent." - Paul Tune

"I think there's a lot of very strong momentum around generative tools right now, and as part of that, Canva is also experimenting with adding generative tools within the product." - Paul Tune

"I think one of the bigger trends this year is there's been quite a bit of buzz around agents in particular, and Canva is no different in that aspect—we are working towards agentic workflows." - Paul Tune

"I think for us, the biggest challenge is that every time we do a rollout by an RL algorithm where we do have a sample that needs to be scored and then some level of feedback goes back to the model to then update its weights, we have to heat up specific APIs within different services at Canva." - Paul Tune

"I think the change has definitely shifted from when we started work on machine learning, where you kind of source the data and then train, to really like how do you evaluate because large language models have so many capabilities." - Paul Tune

"I think I try to keep focus because I don't think it's very feasible for me to cover every paper out there, even though there are lots and lots of exciting things that happen every day." - Paul Tune

"I think what I'm particularly excited about is applying these sorts of models into domains that are beyond what is very strongly verifiable, like mathematics and coding, because progress outside these domains has been a bit slower, but I do see at least some progress over time." - Paul Tune

Resources

Connect on LinkedIn:

Paul Tune - https://www.linkedin.com/in/paul-tune-0ba18116
Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:

Canva: https://www.canva.com
Canva Engineering Blog: https://www.canva.dev/blog/

Tools & Platforms:

Ray – Distributed training framework for machine learning on Kubernetes clusters
Argo – Workflow orchestration tool for managing data pipelines and model training
Snowflake – Data warehouse for structured data storage and event management
AWS S3 – Object storage for media files and unstructured data
Kubernetes – Container orchestration platform for managing distributed training clusters
RDS with MySQL – Relational database service for backing services
Canva Magic Studio – Generative AI tools suite within Canva, including image generation and LLM-powered writing assistance

Llama 2 & 3 Safety: Soumya Batra on Agentic AI Training

The Firebolt Data Bros — Wed, 08 Apr 2026 09:59:00 +0000

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Soumya Batra, founder and CEO of WisePort AI and former tech lead at Meta where she led safety efforts for Llama 2 and Llama 3, to explore the evolution of NLP, the complete lifecycle of foundation model training, and why the next AI frontier lies in natively agentic systems rather than simply scaling larger transformers.

What You'll Learn:

Why historical NLP work becomes obsolete with each paradigm shift: Understand how Bayesian networks, RNNs, and LSTMs each dominated until replaced - and why current transformer-scaling dogma will likely face the same fate
How to structure the foundation model training lifecycle for safety: Learn the three critical phases - pretraining (data mix optimization), supervised fine-tuning (instruction alignment), and reinforcement learning (human preference integration)—and where safety interventions deliver maximum leverage
The counterintuitive data strategy for pretraining safety: Discover why removing all toxic content actually weakens model robustness, and how maintaining a precise balance preserves the model's ability to classify and refuse harmful requests
How dual reward models maximize both helpfulness and safety: See why combining helpfulness and safety objectives (as done in Llama 3) ensures every training sample reinforces both capabilities simultaneously rather than creating trade-offs
What "natively agentic" means and why it matters more than LLM-powered agents: Learn how foundational agentic models dynamically explore action spaces at inference time instead of relying on fixed developer-defined scaffolding, unlocking domain-agnostic workflows
How to build a foundational AI startup without massive training datasets: Understand why synthetic data generation, deterministic task validation, and deep domain expertise can substitute for Internet-scale language corpora in the agentic space

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)

Soumya Batra is the Founder and CEO of WisePort AI, a foundational AI company specializing in agentic AI systems. With over twelve years of expertise in NLP and machine learning, she previously served as a Tech Lead and Applied Research Scientist at Meta, where she led safety and controllability efforts for both Llama 2 and Llama 3. Her career spans foundational work at Carnegie Mellon University, Microsoft, and Meta, establishing her as a pioneering voice in conversational AI and foundation model development. In this episode, Soumya demystifies the journey from traditional NLP to large language models, revealing how safety and controllability are embedded across the entire model lifecycle—from pretraining through reinforcement learning. Her insights on the future of agentic AI and the limitations of current scaling-only approaches provide essential perspective for data engineers and ML practitioners navigating the rapidly evolving AI landscape.

Quotes

"I did not know then that this would become my career for the next decade." - Soumya

"Whatever work that I've done in the past becomes irrelevant all of a sudden." - Soumya

"There is always a notion of, yes, this is the big thing, and then no, it's not anymore." - Soumya

"I really think that we are going to be proven wrong once again about scaling transformers being the only way to achieve general intelligence." - Soumya

"Safety was an issue even back then, even though we were training in such controlled settings." - Soumya

"If you don't put some toxic content there, then it will lose the ability to classify it and it'll be much easier to break the safety later on." - Soumya

"In the post training phase, we are giving it that ability to be able to answer users' questions." - Soumya

"The next unlock will now come from foundational agent models that are natively agentic, which will unlock use cases that look unimaginable to us right now." - Soumya

"Natively agentic means the foundational model itself needs to dynamically explore the action space, rather than scaffolding around existing LLMs." - Soumya

"The real unlock comes from creating your own use cases, creating your own synthetic data, and going deep into a few workflows." - Soumya

Resources

Connect on LinkedIn:

Soumya Batra - https://in.linkedin.com/in/soumyabatra
Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:

WisePort AI – https://www.wiseport.ai
Firebolt - https://www.firebolt.io

Articles & Research Papers:

LLaMA: Open and Efficient Foundation Language Models – Meta AI Research
LIMA: Less Is More for Alignment – Meta AI Research, Carnegie Mellon University, and collaborators

Educational Institutions:

Carnegie Mellon University - Language Technologies Institute (LTI)

The Data Fusion Secret & Why Custom Query Engines Fail with Nikita Lapkov

The Firebolt Data Bros — Tue, 24 Mar 2026 11:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Nikita Lapkov, Senior Software Engineer at Cloudflare, to explore the architecture, design decisions, and future roadmap of R2 SQL, Cloudflare's new R2-based distributed query engine launched in September 2025.

What You'll Learn:

How to leverage existing query engines strategically: Why Cloudflare chose Apache DataFusion for single-node query processing rather than building an analytical engine from scratch, freeing engineering resources for distributed orchestration challenges.
The stateless architecture pattern for global infrastructure: How to design compute nodes that hold zero persistent state by storing all metadata in a distributed catalog (Iceberg), enabling per-query worker provisioning across 300+ geographically dispersed data centers.
Why filter pushdown and metadata-driven pruning are non-negotiable optimizations: How to reduce data scanned from object storage before query execution begins by leveraging catalog statistics and range filtering - the foundation of R2 SQL's performance gains.
How to solve version compatibility at infrastructure scale: Why backward compatibility matters more than cross-version support when you can't control individual node upgrade timing, and how this constraint drives architectural decisions.
The shuffle strategy for point-to-point distributed joins: How to implement in-memory and disk-based shuffles within ephemeral worker clusters using network-addressable worker IDs, allowing stateless workers to discard all query state once the query completes.
Why adaptive query execution is the next frontier for petabyte-scale analytics: How collecting runtime data distribution statistics mid-query execution enables mid-flight plan reconfiguration - a technique worth the overhead investment when queries run for minutes or hours rather than milliseconds.
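
The metadata-driven pruning described above can be illustrated with a small Python sketch. It assumes the catalog exposes per-file minimum and maximum values for a filter column, which Iceberg-style table metadata can provide; `FileStats`, `prune_files`, and the `ts` column are hypothetical names chosen for the example, not R2 SQL internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Per-file column bounds, as a catalog might record them (assumed shape)."""
    path: str
    min_ts: int
    max_ts: int


def prune_files(files: List[FileStats], lo: int, hi: int) -> List[FileStats]:
    """Keep only files whose [min_ts, max_ts] range overlaps the filter [lo, hi].

    Files outside the range cannot contain matching rows, so they are never
    fetched from object storage.
    """
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


if __name__ == "__main__":
    catalog = [
        FileStats("a.parquet", 0, 99),
        FileStats("b.parquet", 100, 199),
        FileStats("c.parquet", 200, 299),
    ]
    # Equivalent of: SELECT ... WHERE ts BETWEEN 150 AND 210
    to_scan = prune_files(catalog, lo=150, hi=210)
    print([f.path for f in to_scan])
```

The row-level filter still has to run on the files that survive; pruning only shrinks the set of objects read before execution begins.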

About the Guest(s)

Nikita is a Senior Software Engineer at Cloudflare, specializing in distributed query engines and data platform architecture. With extensive experience in database internals gained through roles at ClickHouse, Yandex, and MongoDB, Nikita has developed deep expertise in query optimization and system design at scale. At Cloudflare, he leads the development of R2 SQL, a distributed analytical query engine built on Apache DataFusion, serving as a critical component of Cloudflare's data platform. In this episode, Nikita discusses the architecture, design decisions, and technical challenges of building a stateless, distributed SQL engine across Cloudflare's unique 300-location infrastructure, offering valuable insights for engineers working on large-scale data systems. His work demonstrates how thoughtful architectural choices and infrastructure constraints drive innovation in distributed database systems.

Quotes

"It was my crash course into OS engineering. We encouraged every possible bug in this project. It was very painful and very hard." - Nikita Lapkov

"Collecting a stack trace is very hidden, especially if you're not writing in C or C++. It is actually a very complicated and involved process." - Nikita Lapkov

"What excites me is that it has free egress. Usually, you would pay per gigabyte to load your data. You don't have that with R2." - Nikita Lapkov

"What we explicitly wanted to avoid when building R2 SQL is building an analytical query engine again. We would much rather use something off the shelf and work on the interesting distributed parts." - Nikita Lapkov

"No matter how complex the query is, you can make a case that, with extreme cases, the throughput for a single load operation is relatively constant, no matter how complex the query is." - Nikita Lapkov

"We try to be as stateless as possible. All our state lives in the catalog itself, so we only need what's in the catalog and the query that comes from the request." - Nikita Lapkov

"The shuffles cannot really be reused unless you do some very fancy heuristics. Once we have picked the workers for a particular query, we can think of them as our little cluster." - Nikita Lapkov

"Joins consume your entire roadmap, and this is pretty much what will be happening with us at some point. We need to make sure that distributed joins work really well, no matter what your data distribution is like." - Nikita Lapkov

"We have potentially minutes to spare, and optimizing some even subparts of the query is worthy investigation because it could shave hours or something like that." - Nikita Lapkov

"Finding the safe points for replanning and doing this distributed coordination while we have 50 different workers working on different parts of the query is definitely the area we want to look at in the coming year." - Nikita Lapkov

Resources

Connect on LinkedIn:

Nikita Lapkov - https://www.linkedin.com/in/nikitalapkov
Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:

Firebolt – firebolt.io
Cloudflare – cloudflare.com
Apache Arrow DataFusion – datafusion.apache.org

Tools & Platforms:

R2 SQL – Cloudflare's R2-based query engine for analytical queries
Apache Arrow DataFusion – Analytical query engine used for single-node number crunching
Arroyo – Rust-based streaming solution built on DataFusion
R2 – S3-compatible object storage with free egress
Apache Iceberg – Open table format whose catalog holds the state R2 SQL relies on

How Zipline AI Turns Weeks of Engineering Into Minutes of SQL Queries ft. Nikhil Simha

The Firebolt Data Bros — Tue, 10 Mar 2026 11:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Nikhil Simha, CTO of Zipline AI and co-author of Chronon, to explore how a declarative feature platform solves the speed-vs-scale paradox in modern ML infrastructure, from fraud detection at Airbnb to powering OpenAI's recommendation systems.

What You'll Learn:

How to eliminate the data scientist-to-ML engineer bottleneck by generating Spark, Flink, and orchestration pipelines automatically from simple SQL queries, enabling data scientists to ship features independently without waiting for engineering resources

Why fraud detection demands real-time feature iteration: The adversarial nature of fraud requires companies to build and deploy new detection models in days, not months, a timeline impossible with manual pipeline engineering

The "precompute everything" optimization principle for serving latency: Chronon minimizes query response time by batching feature computation upstream through stream and batch processing, then delivering pre-aggregated signals to models in milliseconds

How to safely ship feature versions in production using dual-write strategies that keep old and new feature versions running simultaneously, enabling A/B testing and instant rollbacks without service disruption

Why context engineering, not just RAG, powers modern LLM applications: ML model predictions (fraud risk scores, user signals, embeddings) feed directly into LLM prompts as structured context, improving decision quality for both human and AI agents

The critical gap in open-source data infrastructure: Modern systems need query engines that scale seamlessly from single-machine to distributed clusters - today's choice between lightweight tools (DuckDB) and heavyweight platforms (Spark) leaves mid-scale and product-embedded analytics underserved
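
The dual-write idea from the list above can be sketched in Python as follows. This is an illustrative toy, not Chronon's API: `DualWriteFeatureStore` and its methods are hypothetical, and the two feature versions are simple aggregations over a list of numbers.

```python
from typing import Callable, Dict, List, Optional


class DualWriteFeatureStore:
    """Hypothetical store that computes every feature version on each write.

    Reads come from the active version, so switching versions (or rolling
    back) is a pointer change rather than a backfill.
    """

    def __init__(self, versions: Dict[str, Callable[[List[float]], float]], active: str):
        if active not in versions:
            raise KeyError(active)
        self.versions = versions
        self.tables: Dict[str, Dict[str, float]] = {name: {} for name in versions}
        self.active = active

    def write(self, key: str, events: List[float]) -> None:
        # Precompute and store the value for old and new versions side by side.
        for name, compute in self.versions.items():
            self.tables[name][key] = compute(events)

    def read(self, key: str, version: Optional[str] = None) -> float:
        return self.tables[version or self.active][key]

    def switch(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(version)
        self.active = version


if __name__ == "__main__":
    store = DualWriteFeatureStore(
        {"v1": sum, "v2": lambda events: sum(events[-3:])},  # lifetime vs last-3 total
        active="v1",
    )
    store.write("user_1", [5, 1, 2, 3])
    print(store.read("user_1"), store.read("user_1", version="v2"))
    store.switch("v2")  # promote the new version
    store.switch("v1")  # instant rollback: v1 values were kept up to date
```

Keeping both versions written at all times costs extra compute and storage, which is the trade made for A/B comparisons and rollbacks that need no recomputation.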

About the Guest(s)

Nikhil Simha is the CTO at Zipline AI, bringing extensive experience from leadership roles at Airbnb and Facebook. He is a co-author of Chronon, an open-source feature engineering platform that automates the generation of ML infrastructure from declarative queries. With deep expertise in real-time data systems, fraud detection, and feature engineering at scale, Nikhil has architected solutions powering recommendation systems and risk detection across billions of user interactions. In this episode, he shares insights on building scalable ML infrastructure, integrating LLMs with real-time feature contexts, and the evolving data engineering landscape. His work has directly impacted how organizations from early-stage startups to Fortune 500 companies approach feature engineering and real-time ML serving, making this conversation essential for engineers building production AI systems.

Quotes

"Fraud is adversarial. Right? Like, someone comes up with a new way to do fraud somewhere around the world, and people at Airbnb need to react to it very quickly." - Nikhil

"Chronon, at its core, generates these systems from queries. So users write queries on Chronon, and we generate all of these under the hood." - Nikhil

"Chronon allows data scientists to operate independently." - Nikhil

"The main problem there was that the traditional model of data scientists writing some logic and ML engineers going and building the system out for that logic, that was too slow for fraud detection." - Nikhil

"They have to come up with a new model in a matter of days. They don't have, like, this three to five month period where they can sit and create the new model, build all of these pipelines." - Nikhil

"There is a real gap in the industry for an engine that goes all the way from single machine scale to thousands of machine scale seamlessly." - Nikhil

"Most people, for ninety-five percent of their queries, don't need Spark in RPA. Right? But there is that 5% usually, like, a lot of ML falls into that." - Nikhil

"We are handling query fragments. Right? We take query fragments, generate very specialized logic for that, and run that through Spark's distributed processing topologies." - Nikhil

"The new trend in the industry would be, like, towards these engines that can work at any scale and be useful for interactive and large processing workloads." - Nikhil

"I think Iceberg is great that way because you're not fragmenting to different proprietary data formats, different proprietary engines." - Nikhil

Resources

Connect on LinkedIn:

Nikhil Simha - https://www.linkedin.com/in/nikhilsimha
Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:

Zipline AI – zipline.ai
Firebolt – firebolt.io

Tools & Platforms:

Chronon – Feature engineering and real-time ML infrastructure platform for generating data pipelines from queries
Apache Spark – Distributed data processing engine for batch and large-scale processing workloads
Apache Flink – Stream processing engine for real-time data transformations
Redis – In-memory key-value store for feature serving
Apache Iceberg – Open table format for data lake storage
Airflow – Workflow orchestration platform for pipeline scheduling
DuckDB – Open-source analytical database for single-machine to moderate-scale processing
BigQuery – Google Cloud data warehouse
Snowflake – Cloud-based data warehouse platform
Kubernetes – Container orchestration platform

The Geo-Data Problem Nobody Talks About And How Voi Solved It ft. Magnus Dahlbäck

The Firebolt Data Bros — Thu, 19 Feb 2026 11:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin sits down with Magnus Dahlbäck, Senior Director of Data and Platform at Voi, to explore how a rapidly scaling European e-scooter company transformed its data infrastructure, adopted a metrics-first approach to analytics, and is now leveraging AI to solve real-time operational challenges across 150 cities and 150,000 vehicles.

What You'll Learn:

How to escape the "dashboard chaos" trap by adopting a metrics-first architecture with a semantic layer, reducing confusion from hundreds of conflicting dashboards to a single source of truth across the organization

Why replacing Tableau with Steep (a metrics-centric BI tool) unlocked self-service analytics for non-technical users, empowering teams to answer their own data questions without waiting months for custom dashboard builds

The real-world cost optimization challenge of managing Snowflake expenses that scale 1:1 with ride volume—and why data leaders must constantly rethink architecture to control FinOps in high-growth environments

How to architect for IoT at scale: processing billions of daily events from connected vehicles using micro-batch pipelines (5-minute intervals) while keeping real-time machine learning inference separate through cross-functional product teams

The decision framework for choosing traditional ML vs. LLMs: use traditional methods for accuracy-critical workloads (supply-demand forecasting for vehicle positioning) and LLMs for pattern discovery where 100% precision isn't required (analyzing rider feedback)

How to build proactive customer support powered by data and AI: leverage sensor data and ride telemetry to detect poor user experiences and reach out before customers complain, rather than waiting for refund requests

About the Guest(s)

Magnus Dahlbäck is Senior Director of Data and Platform at Voi, a leading European micro-mobility company, where he oversees the data analytics team, platform infrastructure, and AI initiatives. With over four years at Voi, Magnus has scaled the data organization from three people to a comprehensive team of platform engineers, data analysts, and data scientists while architecting a modern data stack centered on metrics-first analytics and semantic layers. In this episode, Magnus shares insights on building scalable data platforms for IoT-heavy, real-world products, including strategies for managing billions of daily events, implementing self-service analytics, and balancing traditional machine learning with large language models. His work at Voi—where the data platform powers both internal analytics and customer-facing product features—demonstrates how thoughtful data architecture drives measurable business impact, making this conversation essential for data leaders navigating AI integration and data democratization.

Quotes

"There are hundreds of dashboards, and I'm looking for some data, some metrics, and there are 10 dashboards that contain that, and they all show different numbers." - Magnus

"Metrics is a very natural way of interacting with data rather than dashboards that are named something randomly." - Magnus

"We're basically throwing man hours on slicing and dicing data, trying to find patterns, anomalies that we often miss, right, because it just takes too much time." - Magnus

"The way we work with data hasn't really changed that much in the last ten, twenty years to be completely fair, but now we're seeing new technologies, new approaches to it." - Magnus

"It comes down to the use case. What's the accuracy we need?" - Magnus

"We can see from the sensor data, from the IoT, from other data points during your ride if it was a good or bad experience, so why don't we reach out to you?" - Magnus

"Building software around physical objects is really cool when you're a techie guy like me, working at a company where it's a combination of software, B to C, hardware, IoT." - Magnus

"The biggest dataset that we process is IoT data—billions of events every day, basically, that we process." - Magnus

"We have cross functional teams where all the product teams have everything from back end to front end to data people, designers, and so on." - Magnus

"Metrics is kind of the business language that we use—we talk about rides, average ride charge, active vehicles—so metrics is a very natural way of interacting with data." - Magnus

Resources

Connect on LinkedIn:

Magnus Dahlbäck - https://www.linkedin.com/in/magnusdahlback/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/

Websites:

Guest's Company: Voi Technologies Website (voi.com)
Host's Company: Firebolt Website (firebolt.io)

Tools & Platforms:

Snowflake – Data warehouse for analytics and machine learning workloads
DBT (Data Build Tool) – Data transformation and modeling
Apache Airflow – Workflow orchestration
Steep – Metrics-first BI tool with semantic layer (Swedish startup)
GCP Vertex AI – Machine learning platform for model training and deployment

Why 99% of Data Teams Give Up on Real-Time And How Artie Changes That

The Firebolt Data Bros — Tue, 03 Feb 2026 02:46:00 +0000

In this episode of The Data Engineering Show, Benjamin sits down with Artie CTO and co-founder Robin Tang to explore the complexities of high-performance data movement. Robin shares his journey from building Maxwell at Zendesk to scaling data systems at Opendoor, highlighting the gap between business-oriented SaaS connectors and the rigorous demands of production database replication.

Robin dives deep into Artie’s architecture, explaining how they leverage a split-plane model (Control Plane and Data Plane) to provide a "Bring Your Own Cloud" (BYOC) experience that engineering teams actually trust. You’ll hear about the technical nuances of CDC, from handling Postgres TOAST columns to the "economy of scale" challenges of processing billions of rows for Substack, Artie’s first customer. Whether you're struggling with real-time ingestion costs or curious about the future of platform-agnostic partitioning, this conversation provides a masterclass in modern data movement.

What You'll Learn:

Why the data movement market is bifurcating: Managed vendors like Fivetran excel at SaaS integrations (hundreds of connectors), while specialized vendors like Artie focus on production databases at high volume - a fundamentally different job to be done requiring expertise in failure recovery, observability, and advanced use cases.
How to design CDC architecture that doesn't break production databases: Use online backfill strategies (the DBLog framework) instead of long-running transactions that hold write locks; implement table-level parallelism so a single table error doesn't halt the entire pipeline.
The split-plane architecture pattern for flexible deployment models: Build control plane and data plane separation from day one, allowing customers to choose between fully managed cloud deployments or bring-your-own-cloud (BYOC) without compromising UX or architecture.
Why database-specific expertise matters more than breadth: SQL Server CDC requires reverse engineering undocumented code; Postgres has TOAST columns; MongoDB allows invalid timestamp values - each data source has hidden complexity that justifies deep specialization over connector sprawl.
How to build trust with early-stage customers on mission-critical workloads: Walk prospects through architecture and failure modes before implementation; encourage them to stress-test with real data volumes; establish deep engineering partnerships where both teams debug problems together (not sales-driven relationships).
The platform-specific optimization trap and how to solve it: Instead of requiring customers to understand nuances of BigQuery time partitioning vs. Snowflake's lack thereof, build platform-agnostic features (like soft partitioning) that work consistently across destinations while handling platform-specific optimizations under the hood.
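
The table-level parallelism point above can be sketched as follows. This is a minimal illustration, not Artie's implementation: `replicate_tables` and `fake_sync` are hypothetical names, and the per-table sync is a stand-in for real CDC work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def replicate_tables(
    tables: Iterable[str],
    sync_table: Callable[[str], int],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run one sync task per table and report a status for each table."""
    status: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sync_table, t): t for t in tables}
        for fut in as_completed(futures):
            table = futures[fut]
            try:
                rows = fut.result()
                status[table] = f"ok ({rows} rows)"
            except Exception as exc:
                # The failure is recorded for this table only; the others keep going.
                status[table] = f"failed: {exc}"
    return status


def fake_sync(table: str) -> int:
    # Hypothetical stand-in for a real per-table sync; "orders" always fails.
    if table == "orders":
        raise ValueError("unsupported column type")
    return 100


result = replicate_tables(["users", "orders", "payments"], fake_sync)
```

Here the error on `orders` is captured in its own status entry while `users` and `payments` still complete, which is the isolation property the bullet describes.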

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

About the Guest(s)

Robin is the CTO and cofounder of Artie, a data movement platform built for high-volume, low-latency production database replication. With over a decade of experience building large-scale data systems, including early work on Maxwell (an open-source CDC framework at Zendesk) and database architecture at venture-backed startups, Robin identified a critical gap: existing tools optimize for SaaS integrations, not production databases at scale. In this episode, Robin shares hard-won lessons from building mission-critical infrastructure, including architectural innovations that prevent data loss and failure modes that only surface under real-world production load. His work at Artie has powered reliable data replication for companies like Substack, making this conversation essential for engineering teams building or evaluating real-time data movement solutions.

Quotes

"Artie helps companies make data streaming accessible." - Robin

"I didn't want to make any sort of compromises and it just turned out to be a really hard problem, so then we started a company around this." - Robin

"The complexity is not just at the destination level, the complexity is also at the source level." - Robin

"Every pipeline that we touch is mission critical for customers, or else they would just use either their existing pipeline or a managed vendor that's out there." - Robin

"We handle the whole thing, whereas other vendors more or less provide a component and expect engineers to either build or attach additional pieces." - Robin

"I think the biggest bottleneck for real time right now is accessibility. When people think about real time, they immediately think it's not worth it because they implicitly have a cost associated with it." - Robin

"We use Kafka transactions, so we do not commit offsets until the destination tells us the data has actually been flushed." - Robin

"There's so much nuance with every single data source that it becomes a whack-a-mole problem." - Robin

"When there's sufficient pain on the other side and they buy into your vision, it's easier to overcome obstacles during technical implementation." - Robin

"We're spending more time developing platform-agnostic solutions so customers don't have to understand platform nuances." - Robin

Resources

Connect on LinkedIn:

Robin Tang - https://www.linkedin.com/in/tang8330/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/

Tools & Platforms:

Maxwell – Open source CDC framework for MySQL to read binlog into Kafka
Kafka – Distributed event streaming platform for data movement
WarpStream – Cost-optimized Kafka alternative using object storage
Strimzi – Kubernetes-native Kafka deployment tool
Apache Iceberg – Open table format for data lakehouse architecture
Delta Live Tables – Databricks' data movement and transformation tool
ClickPipes – ClickHouse's native data ingestion platform
Snowpipe Streaming – Snowflake's real-time data ingestion service
Google Datastream – Google Cloud's CDC and data movement service
AWS MSK Tiered Storage – Amazon managed Kafka with tiered storage capabilities

The $100M Problem: How Lyft's Data Platform Prevents ML Failures with Ritesh Varyani at Lyft

The Firebolt Data Bros — Tue, 16 Dec 2025 11:00:00 +0000

In this episode of the Data Engineering Show, host Benjamin Wagner sits down with Ritesh Varyani, Staff Software Engineer at Lyft, to explore how the company manages a sophisticated multi-engine data stack serving thousands of engineers, while simultaneously integrating AI across infrastructure and user-facing analytics.

What You'll Learn:

How to architect a polyglot data platform that serves fundamentally different workloads (Spark for ML training and massive parallel processing, Trino for dashboarding and medium-scale ETL, and ClickHouse for sub-second OLAP queries) without creating operational chaos
Why unification matters more than expansion: Lyft's 2026 strategy prioritizes consolidating and simplifying the data stack rather than adding new tools, reducing maintenance burden and improving reliability for end users
The dual-layer AI strategy that simultaneously enhances user analytics (semantic layer v2 with AI-native support) while automating platform operations (intelligent job failure diagnosis, adaptive resource allocation, and agentic workflow optimization)
How to fund innovation from the bottom-up: Lyft's model encourages individual engineers to experiment with AI on their own time, prove business value through POCs, and secure leadership buy-in through demonstrated alignment with company strategy
Why vendor selection now includes AI explainability and debuggability as standard RFP requirements, even when AI isn't the primary driver of a purchasing decision
The framework for deciding open-source investment vs. managed services: Prioritize business-critical goals first, then determine whether in-house ownership or vendor solutions accelerate that mission; AI becomes the accelerant, not the decision driver

About the Guest(s)

Ritesh is a Staff Software Engineer at Lyft, bringing six years of experience architecting and scaling the company's data platform. With a background spanning Microsoft's data and cloud infrastructure, including work on Hadoop, Azure, and SaaS products, Ritesh leads Lyft's critical data systems including Trino, Spark, and ClickHouse. In this episode, Ritesh shares insights on building scalable, AI-native data platforms that serve diverse organizational needs, from batch processing and analytics to real-time marketplace operations. His strategic approach to unifying complex data stacks while integrating AI-driven reliability and user experience improvements provides actionable guidance for data engineers and platform leaders navigating infrastructure modernization at scale.

Quotes

"The goal of our platform is to give our users access to the data as fast as possible so that they can drive the meaning from the data that they are getting and take better data driven decisions." - Ritesh

"We are a Hive format shop. We are going to be moving to other open table formats in the future, but at this point, we are a Hive table format." - Ritesh

"Our main goal at this point is primarily understanding how we see the data platform running five years from now, three years from now, and how we are able to future proof it." - Ritesh

"In this world of AI, we should not be falling behind in any way, and bringing AI in the right places within our platform." - Ritesh

"We want to make our semantic layer ready for the AI native side of things so that our teams are able to drive the best meaning possible from the data that they see." - Ritesh

"Big data systems are distributed systems by nature, and where AI can help you is very clearly understand how the patterns are changing and what is a good action to take." - Ritesh

"Rather than thinking of this as an AI versus an open source thing, it's about a question of what work is the most business critical and how do you go 100% behind it." - Ritesh

"Not everybody is working on AI initiatives at this point, but where it makes sense according to our business strategy, if it aligns with it, then obviously we go and invest." - Ritesh

"If you are the one who's going to take on the initiative, probably spend a few hours outside of what you're already working on, and that is how you will discover AI and the tooling for it." - Ritesh

"We are trying to consolidate into a single direction of providing different kinds of models so that you are easily able to integrate and focus on the value you want to provide to your customers." - Ritesh

Resources

Connect on LinkedIn:

Ritesh Varyani - https://www.linkedin.com/in/riteshvaryani/
Benjamin Wagner - https://www.linkedin.com/in/wagjamin/
Eldad Farkash - https://www.linkedin.com/in/eldadfarkash/

Websites:

Lyft - https://www.lyft.com

Tools & Platforms:

Apache Spark – Batch processing engine for ML training jobs, large-scale data processing, and GDPR operations
Trino – Query engine for BI dashboarding, ETL workflows, and SQL-based data access
ClickHouse – Columnar database for sub-second query latency and real-time analytics
Amazon S3 – Data lake storage for parquet tables and offline data processing
AWS EKS (Elastic Kubernetes Service) – Kubernetes infrastructure for hosting Spark and Trino
ClickHouse Cloud – Managed ClickHouse offering used by Lyft
Hive Table Format – Current table format for organizing parquet files in S3
Kubernetes Operators – Infrastructure for managing ClickHouse deployments

60 Billion Predictions Daily: Inside Credit Karma’s Agentic Data Layer with Maddie Daianu

The Firebolt Data Bros — Wed, 19 Nov 2025 11:00:00 +0000

What does MLOps look like when you are deploying 60 billion machine learning predictions a day?

Maddie Daianu, Head of Data and AI at Intuit Credit Karma, joins the Data Bros to pull back the curtain on one of the most high-volume data environments in FinTech. With a 100-person team serving 140 million members, standard data practices break down.

Maddie shares how her team manages terabytes of daily data on Google Cloud and explains the massive strategic pivot they are undertaking right now: The move from "Information" to "Agency."

What You'll Learn:

Extreme Scale: How to architect a system that handles 60 billion daily predictions without latency.
The Unified Consumer Profile: The hackathon project that unlocked real-time personalization across Credit Karma and TurboTax.
The "Done-For-You" Future: Why they are building an "Agentic Data Layer" to move from recommending financial products to actively managing them for the user.

If you want to know what the future of high-scale AI infrastructure looks like, this is the blueprint.

About the Guest(s)

Maddie Daianu is the Head of Data and AI at Intuit Credit Karma, where she leads the teams responsible for AI science, machine learning engineering, data engineering, and the experimentation platform. She brings a background that spans academic research in biomedical engineering and machine learning, and experience at both smaller companies and Meta. Her current focus is on building the data and AI infrastructure that drives highly personalized financial experiences for Credit Karma's 140 million members and contributes to Intuit's broader consumer ecosystem.

Quotes

"The key elements and ingredients of making this app successful is data and AI." - Maddie

"We have and we process and transform multiple terabytes of information daily for our 140,000,000 members every single day." - Maddie

"We have our models that essentially, lead to almost 60,000,000,000 daily predictions for our 140,000,000 member base every single day." - Maddie

"We want to take this to the next level. So Intuit as a whole believes... in creating done for you experiences for our users." - Maddie

"If you don't structure your data in a semantically, well structured way, you are not likely able to provide the most highly relevant and personalized experiences for users." - Maddie

"One thing that we've been building, in the last year or so it's called the unified consumer profile." - Maddie

"Intuit has been investing in tremendously over the last, few years... the generative AI operating system... to move fast and continuously disrupt ourselves, especially in the age of AI." - Maddie

Resources

Websites:

Credit Karma

Tools & Platforms:

BigQuery – Data warehouse for processing multiple terabytes of information daily
Bigtable – Operational serving layer for real-time data access
Vertex AI – Machine learning platform for model training and deployment
Alchemy – Online feature store for real-time transformations and aggregations
Generative AI Operating System – Centralized platform for democratizing Gen AI adoption across Intuit products

Products & Services Mentioned:

TurboTax – Tax preparation and filing software
Debt Agent – AI-powered tool for debt consolidation and management assistance
Unified Consumer Profile – Semantic graph depicting financial journey across Credit Karma and TurboTax

Block Bad Data Before the Write with Nike’s Ashok Singamaneni

The Firebolt Data Bros — Tue, 07 Oct 2025 11:00:00 +0000

In this episode of The Data Engineering Show, Benjamin and Eldad are joined by Ashok Singamaneni, a Principal Data Engineer at Nike. Ashok dives deep into his work on the open-source projects BrickFlow and Spark Expectations. He shares his journey from mechanical engineering to data engineering and the lessons learned over a decade of tackling production data quality issues that lead to costly recomputes.

Ashok explains the philosophy behind Spark Expectations: treating the ingestion and transformation layers of a data pipeline (Bronze/Silver) as a software product rather than just a data engineering product. This means implementing rigorous checks like data quality, unit testing, and integration testing before the data is written to the final layer. He details the implementation using a Python decorator pattern within Spark jobs, allowing engineers to define rules that check for everything from basic column validation to complex referential integrity and aggregation consistency. The discussion also covers the trade-offs of using generative AI tools like Cursor for data engineering and the growing industry trend of prioritizing upfront data quality due to the rise of AI-powered analytics and direct leadership access to data.

What You'll Learn:

Why the ingestion and transformation layers (Bronze/Silver) of a data pipeline should be treated as a software product with rigorous testing.
How Spark Expectations moves data quality checks to before data is written to the final tables to prevent mission-critical failures and recomputes.
The three types of checks in Spark Expectations: row-level, aggregation-level, and query DQ (for referential integrity).
How the tool handles failures with options to ignore, drop the record, or fail the entire job.
Why big data quality is becoming a prime focus across the industry due to AI integrations and direct executive-level access to data.
Ashok’s lessons on using Generative AI tools (like Cursor/Claude Code) in data engineering projects and the necessity of restrictive permissions.
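
The decorator pattern and the ignore/drop/fail options described above can be sketched in plain Python. This is an illustration of the idea only and does not use the Spark Expectations API: `with_row_checks`, `DataQualityError`, the rule tuples, and `build_silver` are hypothetical names, and rows are plain dictionaries rather than Spark DataFrames.

```python
from functools import wraps
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# A rule is (name, predicate, action); action is "ignore", "drop", or "fail".
Rule = Tuple[str, Callable[[Row], bool], str]


class DataQualityError(Exception):
    """Raised when a rule with the "fail" action is violated."""


def with_row_checks(rules: List[Rule], error_sink: List[Row]):
    """Validate the rows a function returns before they are written anywhere."""
    def decorator(produce: Callable[..., List[Row]]):
        @wraps(produce)
        def wrapper(*args, **kwargs) -> List[Row]:
            clean: List[Row] = []
            for row in produce(*args, **kwargs):
                keep = True
                for name, predicate, action in rules:
                    if predicate(row):
                        continue
                    if action == "fail":
                        # Mission-critical rule: stop before anything is written.
                        raise DataQualityError(f"rule {name!r} failed for {row}")
                    if action == "drop":
                        # Route the bad record to an error sink instead of the output.
                        error_sink.append({**row, "failed_rule": name})
                        keep = False
                        break
                    # "ignore": the row is kept despite the violation.
                if keep:
                    clean.append(row)
            return clean
        return wrapper
    return decorator


errors: List[Row] = []
rules: List[Rule] = [
    ("price_positive", lambda r: r["price"] > 0, "drop"),
    ("sku_present", lambda r: bool(r.get("sku")), "ignore"),
]


@with_row_checks(rules, errors)
def build_silver() -> List[Row]:
    return [
        {"sku": "A1", "price": 10},
        {"sku": "B2", "price": -5},
        {"sku": "", "price": 3},
    ]


clean_rows = build_silver()
```

In this sketch the row with a negative price lands in `errors`, the row with an empty SKU passes through because its rule is set to ignore, and a rule marked `fail` would raise before any output is returned.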

About the Guest(s)

Ashok Singamaneni is a Principal Data Engineer at Nike, with over twelve years of experience in the data space across the banking, healthcare, and retail domains. He is the creator of the popular open-source frameworks Spark Expectations and BrickFlow, which focus on improving data quality and pipeline reliability. Ashok advocates for treating data ingestion and transformation as a software product, ensuring checks and balances are in place early in the pipeline. He holds a background in mechanical engineering.

Quotes

"DLT expectations gave an idea to the industry that you can do data quality before actually writing the data into your final tables." - Ashok

"I think over the time, in my experience, what I learned is this ingestion layer and the transformation layer, you should treat that as a software product, not like a data engineering product." - Ashok

"If it's mission critical, then you fail the job, not process the data, and don't put that data into the final table so that you don't need to recompute that again." - Ashok

"As the scale of the product increases, it becomes even more difficult for us to find exactly where the issue went wrong... it takes time for you to debug and see, like, lot of human effort also involved." - Ashok

"Data observability and quality is becoming prime because of AI integrations that are happening." - Ashok

"Ultimately, at the end of the day, you are responsible when you're checking in the code. It's not Claude or Cursor that will be blamed if something goes wrong." - Ashok

"The leadership is directly looking at the data and if there is something wrong in the data, then there can be some serious repercussions happening on the business decisions." - Ashok

"Rather than having bad data in the tables and then recomputing or reclarifying things, let's not put that data in the first place." - Ashok

"You can drop the record and put that in an error table and give that alert to the engineering team that there is some error in the error table you can look at." - Ashok

"The row DQ checks that happens are very fast. It should happen as a pretty standard checks that happens on the scale." - Ashok

Resources

Projects:

Spark Expectations - Data quality framework
BrickFlow - Open source project for data pipelines

Tools & Technologies:

Apache Spark
Databricks DLT (Delta Live Tables)
Great Expectations - Post-processing data quality tool
Cursor / Claude Code - Generative AI coding tools
SQLMesh

Postgres vs. Elasticsearch: The Unexpected Winner in High-Stakes Search for Instacart with Ankit Mittal

The Firebolt Data Bros — Wed, 17 Sep 2025 16:32:00 +0000

In this episode of The Data Engineering Show, Benjamin Wagner sits down with Ankit Mittal, former Senior Engineer at Instacart, to explore how they revolutionized their search infrastructure by transitioning from Elasticsearch to PostgreSQL. Learn how Instacart tackled the unique challenges of fast-moving grocery inventory, achieved high-performance search capabilities, and leveraged PostgreSQL extensions for complex retrieval operations. Whether you're scaling search functionality or optimizing database performance, this deep dive offers valuable insights into building robust, production-ready search systems using PostgreSQL.

Discover why Instacart moved from Elasticsearch to PostgreSQL for retailer search
Learn about handling real-time inventory updates and search optimization
Explore PostgreSQL extensions, sharding strategies, and data flow architecture
Understand the trade-offs between different search infrastructure approaches

What You'll Learn:

How Instacart managed fast-moving grocery inventory data by consolidating search, ranking, and filtering into a single PostgreSQL cluster
Why pushing compute closer to the data layer can significantly improve search performance and reduce network calls
The architecture decisions behind using PostgreSQL extensions like pgvector and custom solutions for search functionality
How to implement efficient data ingestion through S3-based pipelines and bulk writes instead of real-time updates
Why table maintenance operations like pg_repack are crucial for optimizing read throughput in production environments
The trade-offs between traditional search engines and relational databases for complex search implementations
The challenges of maintaining self-hosted PostgreSQL in a predominantly cloud-managed environment

About the Guest(s)

Ankit is a Software Engineer at ParadeDB and former Senior Engineer at Instacart, where he specialized in PostgreSQL infrastructure and search systems. With extensive experience in database optimization and search architecture, he played a key role in modernizing Instacart's search infrastructure by transitioning from Elasticsearch to a custom PostgreSQL solution. In this episode, Ankit shares deep insights into building and scaling high-performance search systems for e-commerce, particularly focusing on the unique challenges of grocery retail's fast-moving inventory. His work at Instacart revolutionized their single-retailer search functionality, demonstrating how traditional relational databases can be adapted for complex search operations. His expertise in database systems and their practical applications in high-scale environments makes this conversation particularly valuable for engineers interested in modern search architecture and database optimization.

Quotes

"Think about it. If there's a lot of things that you can get the database to do, then the applications become simpler." - Ankit

"My non-Instacart experience has largely been in pre-PMF startups where the approach of abuse your database to its absolute limits works wonders." - Ankit

"Almost everything that we got retrieved had to be filtered out. So we go back to Elasticsearch again." - Ankit

"We traded off the quality of retrieval, hardcore core retrieval, with the whole system reducing the network calls." - Ankit

"It's a place to go to find what item is available, in what store, what item is available, at what price, including full product taxonomy graph and product and ontology." - Ankit

"The grand theme here is that we wanted more control over the cluster, how to spin it off, what kind of disks it would have." - Ankit

"We tell teams who want to have their data in this cluster, create an s3 home, create either a bucket or a home, whatever they want to do, and tell us that we would sync ourselves." - Ankit

"What we found is that the read throughput, we can throw more data if the tables are repacked nicely." - Ankit

"Most engineers who want to work on search, they are more used to the Elasticsearch shape of the query." - Ankit

"The relevance is better because they could join more things in the database. They also saw the cost of the normalized data reduced." - Ankit

Resources

Company Websites:

- Instacart - Grocery delivery platform

- ParadeDB - Database technology company

- Firebolt - Cloud data warehouse (firebolt.io)

Tools & Technologies:

- PostgreSQL - Database system

- Elasticsearch - Search engine

- PgCat/PgDog - PostgreSQL proxy tools

- pgvector - PostgreSQL vector extension

- pg_repack - PostgreSQL table repacking tool

- ClickHouse - Column-oriented DBMS

- Tantivy - Rust-based search engine library

Articles:

- Instacart Search Modernization Blog Posts (Series on hybrid retrieval)

- Target's AlloyDB Migration Blog Post

For Feedback & Discussions on Firebolt Core:

Is Self-Service BI a False Promise? Lei Tang of Fabi.ai Thinks So

The Firebolt Data Bros — Thu, 28 Aug 2025 11:00:00 +0000

Explore the future of AI-powered business intelligence with Lei Tang, CTO and Co-founder of Fabi.ai, as he discusses the evolution from traditional self-service BI to "Vibe-analytics." Learn how AI is transforming data accessibility, enabling anyone to perform sophisticated analytics without deep technical expertise. From building trust in AI-generated insights to creating intelligent semantic layers, discover how modern BI platforms are bridging the gap between data teams and business stakeholders. Tune in to understand why static dashboards are becoming obsolete and how AI agents will soon proactively surface business opportunities and insights.

Key points:

The limitations of traditional self-service BI and how AI is addressing them
Building secure, context-aware AI systems for data analysis
The future of human-AI interaction in business intelligence
Technical insights into modern BI platform architecture
Vision for proactive, AI-driven business insights

What You'll Learn:

Why traditional self-service BI has failed to deliver on its promises and how AI can bridge the gap
How to build an AI-native BI platform that combines SQL, Python, and natural language processing
The framework for implementing "Vibe-analytics" - a new paradigm of AI-powered visual analytics
Why context engineering and semantic understanding are crucial for accurate AI-driven analysis
How to balance security and accessibility when deploying AI-powered analytics tools
The future of BI platforms as proactive insight generators rather than passive dashboards
Why caching and stateful environments are essential for responsive AI-powered analytics
How to leverage AI to translate business questions into accurate technical queries while maintaining data integrity

About the Guest(s)

Lei is the Co-founder and CTO of Fabi.ai, where he leads the development of AI-native business intelligence solutions. With a PhD in machine learning and over a decade of experience in the data domain, Lei has held significant roles, including positions at Yahoo, Walmart, Lyft (as Director of Data Science), and Clari (as Chief Data Scientist). His expertise spans machine learning, data engineering, and business analytics, with a particular focus on making data analysis more accessible and efficient. In this episode, Lei shares insights on the evolution of self-service BI and how AI is transforming business intelligence, drawing from his experience building Fabi.ai, a platform that combines SQL, Python, and AI to democratize data analysis. His work in developing "Vibe AI" (AI-powered BI) represents a significant advancement in making complex data analysis accessible to non-technical users while maintaining data accuracy and trust.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes

"For the past decade, it's really difficult to make sure the self-service BI can work. And then now with AI, the worst part is that it can run properly, but the numbers are wrong." - Lei

"If you talk to anybody working in the BI space, like self-service BI, that has been termed for maybe for the past decade. But I have to say that is a false promise." - Lei

"We're saying that we really want those data team to be able to, like, say, what type of data is exposed to, like, say, less technical folks." - Lei

"In order to build AI native BI, I would say the focus should be how human interact with AI." - Lei

"We believe that, essentially, this BI system or, like, AI BI system would be more like a agent, and then it'll actually looking for, like, business opportunities and insight and surface to you." - Lei

"The one common theme I have been experiencing is that normally would work with other business stakeholders, could be marketing, could be operations, could be sales." - Lei

"We strongly believe that BI should be stored as code." - Lei

"Enterprise data tends to be very noisy, very complex." - Lei

"The semantics of itself becomes part of the context for the AI engine." - Lei

"Most organizations, the data, like the schema, the kind of business, like metrics and logic, has been constantly evolving." - Lei

Resources

Fabi.ai - AI-native BI platform
Firebolt (firebolt.io) - Cloud data warehouse platform

Tools & Technologies:

Firebolt Core - Free self-hosted query engine
Looker - BI Platform
Tableau - BI Platform
Sisense - BI Platform
Snowflake - Data Warehouse
BigQuery - Data Warehouse
PostgreSQL - Database
SQLAlchemy - Database toolkit
Pandas - Data analysis library

Building Uber's AI Assistant: How Genie Revolutionizes On-Call Support with Paarth Chothani from Uber

The Firebolt Data Bros — Tue, 22 Jul 2025 12:00:00 +0000

Journey inside Uber's innovative AI assistant "Genie" with Paarth Chothani, Staff Engineer at Uber, as he shares how they're revolutionizing on-call support using LLMs and vector search. From processing massive amounts of internal documentation to building scalable RAG pipelines, discover how Uber tackles the challenges of implementing AI assistants at scale. Get insights into the evolution from traditional chatbots to agent-based solutions, and learn practical lessons about staying current in the rapidly evolving AI landscape. Whether you're building AI-powered tools or scaling data infrastructure, this episode offers valuable perspectives on balancing innovation with real-world implementation.

• Building and scaling RAG pipelines at enterprise scale

• Evolution from traditional chatbots to AI agents

• Practical insights on data processing and vector search implementation

• Leveraging open-source technologies in production environments

• Navigating rapid technological changes in AI development

What You'll Learn:

How Uber transformed its on-call support system by building an AI assistant that searches across internal documentation, wikis, and code
Why combining multiple data sources with vector databases creates more accurate and contextual responses for enterprise support
The evolution from basic RAG implementation to agent-based architecture for handling complex support scenarios
How to scale AI processing pipelines using Apache Spark for large-scale data chunking and embedding generation
Why customization and internal data sources are crucial for enterprise AI assistant effectiveness
The future of AI assistants: moving from documentation lookup to automated problem resolution through multi-agent systems
How to balance rapid AI innovation with setting realistic customer expectations in fast-moving tech environments
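
To make the chunking step mentioned above more concrete, here is a minimal sketch of splitting a document into overlapping windows before embeddings are generated. It is an illustration only, not Uber's implementation: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions, and in a Spark-based pipeline a function like this would presumably be applied to each document in parallel.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; real pipelines often chunk by
    tokens, sentences, or document structure instead of raw characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    doc = "x" * 450
    parts = chunk_text(doc, chunk_size=200, overlap=50)
    print([len(p) for p in parts])  # [200, 200, 150]
```

The overlap keeps some shared context between neighboring chunks, so a sentence that straddles a boundary still appears whole in at least one chunk.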

Paarth is a Staff Engineer at Uber, where he works on Michelangelo, Uber's machine learning platform. With over four years at Uber, he specializes in feature store development, online serving at scale, and GenAI implementations. He has been instrumental in developing Genie, an AI-powered on-call assistant that revolutionizes how Uber's engineering teams handle support requests and documentation access. In this episode, Paarth shares valuable insights on building and scaling RAG-based systems, vector search implementations, and the evolution of AI assistants from traditional chatbots to sophisticated agent-based solutions. His experience spanning both AWS chatbot development and current GenAI innovations at Uber offers listeners a unique perspective on the rapid advancement of AI-powered enterprise solutions.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes

"Think of Genie as your on-call assistant. Different infra teams have their Slack channels, and because these technologies are widely used, you have to wait a lot." - Paarth

"What we realized is for our engineers to really get help, data sources really should be internal only because we customize lot of these open source engines for making it work at Uber scale." - Paarth

"Instead of building a mega scale pipeline that just ingest all data sources and then keeps a central data source solution, we instead are giving users the flexibility to ingest what data sources they want." - Paarth

"We had to scale our, you can say, the whole infra layer to chunk data faster to be able to create embeddings at scale." - Paarth

"It almost felt like they're doing what EMR was doing. You have your Hadoop and big data technology, and we needed these pipelines to basically process all this data quickly." - Paarth

"We've even evolved from just giving you the right documentation to starting to evolve into a situation where we'll also start taking actions on your behalf." - Paarth

"That intuition that comes from building this kind of bot, I feel like that intuition came again as we were starting to see this technology come, and we're like, hey, this looks like where you can pretty much fit all these pieces together." - Paarth

"What we have seen with several use cases is agentic genie works well when designed well, when you've analyzed the problem of which type of subproblems the bot should resolve per channel, per use case." - Paarth

"I think having a problem in mind always helps that way, the energy is little bit focused and directed." - Paarth

"Whatever you're building is not enough because the expectation has already gone to the next level, so the pace is too fast right now." - Paarth

Resources

Companies & Platforms:
Uber - ML Platform & Engineering
Firebolt - Cloud Data Warehouse (firebolt.io)

Tools & Technologies:

Michelangelo - Uber's ML Platform
Genie - Uber's On-Call Assistant Bot
Cursor - Developer IDE
OpenSearch - Vector Database
LangGraph - Agent Framework

Notable Projects Mentioned:

MetaMate (Meta)
Query Copilot (Uber)
Scale at AI (Meta Meetup)

Company Blogs:

Uber Engineering Blog - Genie and Query Optimization articles

Primary Speakers:

Paarth Chothani - Staff Engineer, Uber
Benjamin - Firebolt
Eldad - Firebolt

From Zero to 100M Users: Inside Notion’s Data Stack and AI Strategy with Sumit Gupta

The Firebolt Data Bros — Tue, 10 Jun 2025 11:00:00 +0000

AI's transformative impact on data engineering and analytics is reshaping how professionals create value, shifting focus from technical skills to strategic thinking and communication.

In this episode of The Data Engineering Show, the bros talk with Sumit Gupta, Lead BI Engineer at Notion, about his journey through prominent tech companies, modern data stacks, and how AI is revolutionizing data workflows and professional development.

What You'll Learn:

How modern data stacks are evolving with tools like Snowflake, dbt, Iceberg, and Hex
Why transferable skills are becoming more crucial than technical expertise in the AI era
How to leverage AI tools strategically
The framework for automating content creation workflows using AI tools and APIs
Why this is "the worst AI will ever be" and how to prepare for accelerating change
How to balance AI automation with authentic human connection in content creation
Why modern data professionals must embrace AI while maintaining ethical considerations
How companies like Notion are implementing AI for improved customer insights and engagement

This episode offers valuable insights into the practical application of AI in data workflows, content creation, and professional development, while addressing both the opportunities and challenges in the evolving tech landscape.

Highlights:

[03:19] - Modern Data Stack Evolution for Scale

[10:05] - AI-Powered Customer Intelligence Platform

[15:17] - Future of Data Careers in the AI Era

[18:49] - Automated Content Creation Workflow

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

About the Guest

Sumit Gupta is a Lead BI Engineer at Notion, where he spearheads reporting and dashboarding initiatives for marketing and sales teams. With over a decade of experience in data and analytics, including notable roles at industry leaders like Snowflake and Dropbox, he brings deep expertise in modern data stack implementation and AI integration. In this episode, Sumit shares valuable insights on the evolution of data engineering, the impact of AI on analytics workflows, and how he leverages various AI tools to enhance productivity both professionally and as a content creator with 21,000+ Instagram followers. His unique perspective on balancing technical expertise with transferable skills in the age of AI, combined with his experience at "Bay Area Darlings" like Notion, Snowflake, and Dropbox, makes this conversation particularly relevant for data professionals navigating the rapidly evolving tech landscape.

Quotes

"The scariest part about the whole AI boom is this is the worst AI will ever be." - Sumit

"If you are someone who's starting new in data field, the value of your technical skills that used to be very valuable until 2021 is not as much - your transferable skills or soft skills comes into picture." - Sumit

"Every bit is expensive - all the servers are cheap, but when you're dealing with hundred million users and trillions of rows of data a day, you have to find that one percent saving." - Sumit

"AI has made me a lot more productive, but at the same time, it has also made me dumber." - Sumit

"If you are someone new, especially in data, scared of AI or skeptic of AI, I would say jump in - if you don't jump onto the bandwagon right now, you might be left out in a year or so." - Sumit

How RisingWave Is Redefining Real-Time Data with Postgres Power

The Firebolt Data Bros — Wed, 07 May 2025 11:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Yingjun Wu, founder and CEO of RisingWave, to explore the evolution of stream processing systems and the innovations his company is bringing to the space.

What you’ll learn:

Yingjun's journey from academic research in stream processing to founding RisingWave, and the challenges of building trust in a new database system.
How RisingWave's architecture, using S3 as primary storage, lets it scale in seconds, while other systems can take hours to scale.
The competitive landscape of stream processing, with RisingWave's Postgres compatibility providing a significant advantage in ease of use.
How one major company reduced its CPU requirements from 20,000 to just 600 by switching from a traditional stream processing system to RisingWave.
The rising importance of Apache Iceberg as a destination for stream processing output, helping companies avoid vendor lock-in.
How streaming systems fit into modern data stacks, especially as companies seek to avoid being locked into proprietary systems.

Yingjun Wu is the founder and CEO of RisingWave, a stream processing system built in Rust and designed with a cloud-native architecture. With a PhD focused on stream processing and database systems, Yingjun previously worked at Redshift and IBM Research before founding RisingWave. His company has developed a system that achieves significant performance and resource efficiency advantages over traditional stream processing solutions, while maintaining Postgres compatibility for ease of use.

Episode Highlights:

The Origins of RisingWave (00:30)

Yingjun shares his background in stream processing from his PhD days and explains how his experience at Redshift revealed the need for better stream processing solutions, especially since many data warehouse workloads involve data ingested from streaming sources like Kinesis or Kafka.

Building a System from Scratch (04:10)

Yingjun describes the challenging first 2-3 years of developing RisingWave without customers, highlighting how trust is a major barrier for new database systems. After 2.5 years, they secured their first customers, including a startup and several larger companies, which helped establish RisingWave's credibility.

The Current Stream Processing Landscape (07:47)

Benjamin asks about the current stream processing space, with Yingjun positioning RisingWave as a leader, particularly for SQL-based workloads. He highlights several key advantages of RisingWave, including its Rust-based implementation and S3-based storage architecture.

S3 as Primary Storage (10:27)

Yingjun explains their decision to use S3 as primary storage from day one, despite its slowness and expense. He discusses how they've optimized for these challenges and would still make the same architectural choice today due to benefits like simplified state management and superior elastic scaling.

The Business Model (13:52)

RisingWave offers open-source, cloud, and on-premise versions of its product. Yingjun notes that many highly regulated industries require on-premise deployment, including customers in the banking and aerospace sectors.

Typical Users and Competitive Advantages (15:01)

When asked about their typical users, Yingjun explains they directly compete with Flink but have advantages in ease of use due to Postgres compatibility. Their users are either new to stream processing or are migrating from systems like Spark Streaming or Flink due to performance issues or development complexity.

Apache Iceberg Integration (19:25)

Yingjun discusses how Apache Iceberg is emerging as an important destination for RisingWave output, as companies seek to avoid vendor lock-in with proprietary data warehouses. He explains how RisingWave typically performs ETL functions before data is sent to Iceberg tables.

The Future of Data Management (32:06)

The conversation concludes with a discussion about Iceberg becoming a "single source of truth" for data, with multiple specialized query engines potentially accessing the same data. Yingjun and Eldad share perspectives on how this shift away from proprietary data lock-in is changing the data ecosystem.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Revolutionizing Data Governance with DataStrato’s Unified Open Source Approach

The Firebolt Data Bros — Tue, 08 Apr 2025 10:00:00 +0000

In this episode of The Data Engineering Show, the bros sit down with Lisa Cao, Product Manager at DataStrato, to explore data catalogs and Apache Gravitino, a unified metadata lake used to manage access and perform data governance across all data sources.

What You’ll Learn:

How Apache Gravitino differs from alternatives like Unity Catalog and Polaris by supporting multiple catalog systems.
What the “Push-Down Permission Management” security model is and how to implement it across different data systems.
How to maintain consistent governance across various query engines like Spark, Trino, and Flink.
Why interoperability, flexibility, and the open-source ecosystem are becoming more important in data infrastructure than performance benchmarking.
How to evaluate new data tools based on their real-world adoption rather than social media hype.

Lisa Cao is a Product Manager at DataStrato, specializing in AI/ML product partnerships and developer relations. With deep expertise in data catalog technologies and open-source ecosystems, she plays a key role in developing Apache Gravitino, an ASF incubating project that provides a unified governance and security layer for diverse data systems. Her work in developing extensible catalog frameworks has helped organizations manage complex data environments across multiple platforms.

Episode Highlights:

What is Apache Gravitino? (01:24)

Apache Gravitino is a meta-catalog that serves as a unified data governance and security layer for managing different data systems. Lisa shares that Gravitino was the first to release an Iceberg REST catalog, which was then open sourced for the general community to use; as time passed, Polaris and Unity Catalog were also announced as open source. She highlights that although Gravitino, Polaris, and Unity Catalog are very similar, Gravitino differs in that it can support multiple catalogs.

Unifying AI/ML and Big Data Stack (03:15)

One of the interesting things about Gravitino is that it offers more than a data catalog: it also provides model catalogs, which are a first step toward merging the worlds of AI/ML and big data catalogs. Lisa describes the goal as effective model management, that is, a system that can store and manage different types of models, track changes to the models, and control access to them.

Simplifying Data Governance (10:49)

Think of Gravitino as a “traffic cop” that helps to manage and secure data from multiple sources. It is crucial to have a system that provides unified access control across all data sources, allowing teams to manage access and data governance so that ML teams don't have to worry about access. Lisa says that Apache Gravitino is the system that makes data accessible to different teams and users while making sure that it is secure and governed appropriately.

Gravitino’s Query Engine Solution (21:34)

Every query engine has its own way of managing data, which makes it difficult to switch between engines - you have to reconfigure everything. Lisa highlights that Gravitino solves the problem by providing a single layer of data governance that works across multiple query engines.

Navigating the Fast-Paced World of Data Engineering (24:41)

Lisa talks about how fast the data engineering space is moving and shares some advice for keeping up:

Don’t try to learn everything at once.
Don’t get too deep into every tool.
Look for real-world adoption.

She warns against social media hype, which can amplify the messaging around a new tool and make it seem as if everyone is using it when actual adoption is hard to verify.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources:

Apache Gravitino website

Database Technology in the Age of AI with DuckDB Labs co-creator Hannes Mühleisen

The Firebolt Data Bros — Wed, 19 Mar 2025 11:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Hannes Mühleisen, CEO of DuckDB Labs and co-creator of DuckDB.

Together, they:

Talk about the journey of DuckDB, an open-source analytical database system designed as a universal wrangling tool.
Explain how DuckDB differs from SQLite, highlighting the analytical and transactional use cases.
Discuss DuckDB’s special features and its approach to innovation, including building its own Parquet reader.
Explore the simple and efficient ecosystem of DuckDB, allowing developers to add custom functionality without changing its core stability.
Consider Hannes' perspective on the role of AI in databases.
Delve into the system’s infrastructure, design choices and the dedication of the team to ensure a continuous, reliable database system.

Hannes Mühleisen is the CEO of DuckDB Labs and a professor in the Netherlands, renowned for co-creating DuckDB, an open-source analytical database system. With a background in database architecture and research from the CWI Database Architectures group, he has pioneered the development of DuckDB as a universal data wrangling tool that can run everywhere from phones to space satellites. Under his leadership, DuckDB has achieved remarkable success, reaching 10 million downloads monthly and becoming a go-to solution for analytical database needs. His commitment to keeping DuckDB lightweight, portable, and hardware-agnostic while maintaining high performance has revolutionized how developers approach analytical database solutions. As both an academic and technology leader, Hannes brings unique insights into database architecture, open-source development, and the future of analytical data processing.

Episode Highlights:

The Purpose of DuckDB (01:04)

Hannes gives a full description of what DuckDB is as well as what it is designed to do. He describes the tool as one that understands SQL and is specifically designed to simplify complex analytical use cases.

SQLite vs DuckDB (02:53)

Hannes compares the two tools: SQLite is an amazing system, but it is meant for transactional use cases rather than analytical queries, while DuckDB is designed specifically for analytical use cases.

The Importance of Collaboration (08:14)

Hannes stresses the need for community collaboration, since the database engine space has only about a hundred brilliant people trying to solve the same problems. He shares his profound admiration for a team in Munich, praising them for implementing concepts that had previously been described only in papers.

The Component-Based Architecture of DuckDB (11:25)

Hannes highlights a special feature of DuckDB: it can be used as a component. He explains that the in-process architecture succeeds because of the in-memory data sharing it makes possible.

The Parquet Reader Journey (17:51)

Hannes explains how he built his own Parquet reader out of necessity, although he would have preferred not to. He shares how Uwe Korn, a developer from Germany, donated an existing reader to the Apache Arrow project, where it became so intertwined with Arrow that it was difficult to use one without the other. Hannes adds that a competent Parquet reader has no choice but to become a database engine, which he finds one of the interesting things about its development.

The Role of AI in Database Interaction (22:41)

Hannes states that he doesn’t think AI has a place inside a database engine; rather, it is needed for optimization because the researchers who built their careers on optimization are out of jobs. He explains that AI should assist with tasks rather than take over execution entirely.

SQL - A Defined Interface (29:20)

Hannes introduces the relational API, a tool for building queries programmatically, and says it simplifies the programmer's work. Although Hannes agrees that a well-defined interface is important for components like databases, he also argues that SQL can provide relatively well-defined behavior within a single system.
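
For readers unfamiliar with the idea of building a query programmatically, the sketch below shows the general pattern: small composable calls produce the SQL instead of hand-written strings. It is not DuckDB's relational API. The `Relation` class and its `project`, `filter`, `to_sql`, and `fetchall` methods are hypothetical names invented for this illustration, and it runs on Python's built-in `sqlite3` module only so that it is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    """Hypothetical, minimal relational-style builder (not DuckDB's API)."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    params: Tuple[object, ...] = ()

    def project(self, *columns: str) -> "Relation":
        # Return a new relation that selects only the given columns.
        return Relation(self.table, columns, self.predicates, self.params)

    def filter(self, predicate: str, *params: object) -> "Relation":
        # Return a new relation with one more WHERE condition.
        return Relation(self.table, self.columns,
                        self.predicates + (predicate,), self.params + params)

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql

    def fetchall(self, conn: sqlite3.Connection) -> list:
        # Values are passed as bound parameters, never spliced into the SQL.
        return conn.execute(self.to_sql(), self.params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, "eu", 10.0), (2, "us", 25.0), (3, "eu", 40.0)])

    rel = (Relation("orders")
           .filter("region = ?", "eu")
           .filter("amount > ?", 20)
           .project("id", "amount"))
    print(rel.to_sql())
    # SELECT id, amount FROM orders WHERE (region = ?) AND (amount > ?)
    print(rel.fetchall(conn))  # [(3, 40.0)]
```

Each call returns a new immutable object, so intermediate relations can be reused and combined without one query accidentally modifying another.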

The Golden Age of Database (38:57)

Hannes concludes the episode by thanking Firebolt and other engineers for taking on core engine work. He shares his excitement for the current golden age of databases, in which teams are showing what is possible.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes:

“DuckDB is a universal data wrangling tool. It is a relational data management system that speaks SQL designed to do well on analytical use cases.”

“We call ourselves the SQLite for analytics because it explains the original design goal of DuckDB very well.”

“Within the database engine space, we are all working to solve the same problems, and that's like, a hundred of us on the planet.”

“It actually turns out in order to make a competent parquet reader, you do need query execution. There is just no way around it.”

“I really like this golden age of databases we are in and personally, as somebody who really likes tables and SQL, I'm quite happy to see things like firebolt and others really working on core engine stuff.”

AI and Data Movement: Trends and Best Practices with Estuary’s Daniel Pálma

The Firebolt Data Bros — Tue, 11 Feb 2025 10:16:00 +0000

In this episode of The Data Engineering Show, the bros sit down with Daniel Pálma, Head of Marketing at Estuary.

Join them as they:

Talk about Daniel’s career transition from data engineering to marketing and how his data engineering background has helped him as a marketer.
Discuss the role of AI in the evolution of data movement, making it faster and easier to create data pipelines.
Shine a light on the challenges of vector databases and structured data in AI applications.
Delve into the future of Apache Iceberg and data lakehouses, highlighting their current challenges.
Share insights on the golden age of data and the need for more data engineers, data analysts, and other data practitioners.

Daniel Pálma serves as Head of Marketing at Estuary, bringing a unique blend of technical expertise and marketing acumen to the data integration space. With nearly a decade of experience as a data engineer across startups, enterprises, and consulting roles, Daniel made a strategic pivot to marketing to help bridge the gap between complex technical solutions and their practical applications for data practitioners. His background in data engineering enables him to deeply understand customers' challenges and create authentic, education-focused marketing content that resonates with technical audiences. Daniel’s thought leadership and content creation in the data engineering space, combined with his hands-on technical experience, position him as a valuable voice in conversations about the evolution of data infrastructure and integration technologies.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

AI and Data Change Management with Chad Sanderson, CEO Gable AI

The Firebolt Data Bros — Tue, 07 Jan 2025 10:00:00 +0000

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Chad Sanderson, CEO and co-founder of Gable AI, to explore the world of data change management.

Join them as they:

Delve into challenges of data quality, how it degrades over time and the one-sided data quality checks on the “last mile” of the data supply chain.
Talk about how Gable works through a three-layer flow: identify data production points, trace the data flow, and communicate the impact of changes before they reach production.
Explain why the gap between data producers and consumers needs to be bridged, and how Gable continues to emphasize effective communication and a shared understanding of data change management across teams.
Shine a light on how AI can enhance data management by extracting semantics from code and effectively managing the translation output.
Discuss Chad’s vision for 2025, which is to help companies start to care about data and how the changes made to data affect other people.
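
For readers who want a concrete picture of the flow described above (find where data is produced, trace where it goes, report who is affected by a change), here is a minimal Python sketch. It is not Gable's implementation; the lineage graph, the asset names, and the helpers `downstream_impact` and `impact_report` are all hypothetical and exist only to illustrate the idea.

```python
from collections import deque

# Hypothetical lineage graph: each key is an asset, each value lists
# the assets that consume its data directly.
LINEAGE = {
    "orders_service.order_created": ["raw.orders"],
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "ml.churn_features"],
    "marts.daily_revenue": [],
    "ml.churn_features": [],
}


def downstream_impact(lineage, changed_asset):
    """Return every asset reachable from changed_asset, breadth-first."""
    seen = set()
    order = []
    queue = deque(lineage.get(changed_asset, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue
        seen.add(asset)
        order.append(asset)
        queue.extend(lineage.get(asset, []))
    return order


def impact_report(lineage, changed_asset):
    """Build a one-line message that could be sent before a change ships."""
    affected = downstream_impact(lineage, changed_asset)
    if not affected:
        return f"Change to {changed_asset} affects no downstream assets."
    return f"Change to {changed_asset} affects: " + ", ".join(affected)


print(impact_report(LINEAGE, "orders_service.order_created"))
```

In a real setup the graph would presumably be derived by scanning code rather than written by hand, and the report would go to the owners of the affected assets.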

Chad Sanderson is the CEO and co-founder of Gable AI, a data change management platform. Chad has over a decade of experience in the data engineering and infrastructure space, holding significant roles at major companies like Microsoft, Oracle, and Sephora, where he focused on data quality and governance challenges. He is a former Head of Data at Convoy, a LinkedIn writer, and a published author. He lives in Seattle, Washington, and is the Chief Operator of the Data Quality Camp. His journey from data scientist to data engineer and ultimately to CEO was driven by a desire to transform how organizations manage and utilize data. Gable AI addresses the complexities of the data supply chain by providing tools for code scanning, data contracts, and governance as code, enabling teams to proactively manage data changes and their impact.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources

Gable AI website
Chad Sanderson on LinkedIn

For Feedback & Discussions on Firebolt Core:

Tech Stacks and Tradeoffs: Xudo's Founder on Picking the Right Tools for BI Success

The Firebolt Data Bros — Tue, 26 Nov 2024 10:15:00 +0000

Wouter Trappers is the founder of Xudo and shares his slightly unconventional path from philosopher to data consultant with the Bros in this latest episode of The Data Engineering Show. Wouter’s grounding in philosophy has proved to be a shaping influence on his approach to business intelligence. Much more than just a software solution, for Wouter, BI is all about change management and aligning leadership with data projects.

They discuss:

From Excel to Expert: From basic Excel tasks to a full mastery of BI tools like QlikView, Wouter has blended his technical and philosophical approaches to data to become a bona fide expert.
Data Strategy as Transformation: Good change management principles have to be adhered to if a BI project is going to bear fruit. Focus on leadership alignment, KPI clarity, and user empowerment instead of simply implementing software.
Challenges of Starting Small: Wouter has some tips to offer smaller companies around bootstrapping their data journey using existing tools, practical education, and even Gen AI.
Balancing Scales: Small startups face a very different set of challenges than large enterprises.

Wouter’s combination of philosophy and pragmatism brings fresh takes to building effective data solutions.

Data Rewind: Conversation Highlights from Zach Wilson, Matthew Housley, Joe Reis, and Krishnan Viswanathan

The Firebolt Data Bros — Thu, 31 Oct 2024 13:41:00 +0000

In this special roundup episode of The Data Engineering Show, the Bros revisit some of the best bits from episodes with data thought leaders Zach Wilson, Matthew Housley, Joe Reis, and Krishnan Viswanathan, spotlighting essential trends and lessons learned across the evolving data engineering landscape. From data observability to bridging academia with real-world practice, this episode covers perspectives on where data engineering is heading and why certain challenges persist.

Topics include:

Foundations of Data Engineering: Zach Wilson emphasizes the importance of core, tech-agnostic skills in data modeling, quality assurance, and storytelling. By sharing his experiences at Airbnb and in education, he reveals that effective data engineering hinges on creating robust data models, quality controls, and persuasive narratives rather than expertise in any single tool or language.
Bridging Academia and Practice: Matthew Housley and Joe Reis delve into the need for better data education, emphasizing hands-on experience and data fundamentals over tool-specific training, and advocate for apprenticeships and real-world collaborations in educational settings.
Legacy Meets Modern in Data Engineering: Krishnan Viswanathan reflects on recurring themes in data engineering and the importance of adapting legacy approaches to new data needs, underscoring the challenges and benefits of vendor-built versus in-house solutions.

Join the Bros for a well-rounded exploration of current themes in data engineering, filled with practical advice for data professionals at any stage of their journey.

The Resurgence of SQL: Insights from Ryanne Dolan from LinkedIn

The Firebolt Data Bros — Tue, 24 Sep 2024 10:00:00 +0000

In this episode of The Data Engineering Show, the bros, Eldad and Benjamin, are joined by Ryanne Dolan from LinkedIn to discuss the innovative Hoptimator (H2) project. This conversation reveals how LinkedIn has improved its data pipelines by automating the setup and management of complex workflows.

Together they cover:

Automated Data Pipelines: Ryanne explains how Hoptimator allows users to create and manage data pipelines using just a simple SQL SELECT query, streamlining the process of setting up Kafka topics, Flink jobs, and schemas.
Integration with Kubernetes: The project utilizes Kubernetes to handle infrastructure tasks, treating Kubernetes as a database for managing state. This integration simplifies the orchestration of data workflows and automates routine tasks.
Consumer-Driven Model: Ryanne discusses the shift from a producer-driven to a consumer-driven data model, emphasizing the importance of understanding and addressing consumer needs to reduce engineering complexity and optimize data systems.
Future of Data Engineering: The conversation touches on the ongoing experimental nature of Hoptimator and its potential to transform data engineering practices, highlighting its impact on LinkedIn's data infrastructure.

Vector Databases Won’t Replace SQL - Andy Pavlo

The Firebolt Data Bros — Tue, 04 Jun 2024 00:25:06 +0000

SQL’s slow. SQL’s stupid. We hear these claims every time a new shiny tool enters the market, only to realize five years later when the hype dies down that SQL is actually a good idea.

In this super techie episode of the Data Engineering Show, Andy Pavlo, Associate Professor at Carnegie Mellon University, joins the bros to delve into database internals and optimization.

Andy discusses leveraging ML for autonomous database optimization, using Postgres for practical applications, tuning production databases safely, and why SQL is here to stay.

How ZoomInfo transitioned from data graveyards to ROI-driven data projects

The Firebolt Data Bros — Tue, 16 Apr 2024 03:49:13 +0000

Too often, expensive resources and man-hours are spent on dashboards no one uses, resulting in zero ROI. Philip Zelitchenko, VP of Data & Analytics at ZoomInfo, met the bros to talk about adopting product management principles to ensure data projects have value, and to provide an unfiltered peek into ZoomInfo’s data stack and unique tech culture.

Matthew Weingarten from Disney Streaming about Data Quality Best Practices

The Firebolt Data Bros — Tue, 26 Mar 2024 00:54:45 +0000

Matthew Weingarten, Lead Data Engineer at Disney Streaming, talks about principles essential for data quality, cost optimization, debugging, and data modeling, as adopted by the world's leading companies.

Joseph Machado, Senior Data Engineer @ LinkedIn talks best practices

The Firebolt Data Bros — Thu, 29 Feb 2024 01:52:57 +0000

Data engineering should be less about the stack and more about best practices. While tools may change, foundational principles will remain constant. Joseph Machado, Senior Data Engineer at LinkedIn, is on The Data Engineering Show to talk about principles that are key to success, leveraging AI for automation, and adopting software engineering methods.

Professors Joe Hellerstein and Joseph Gonzalez on LLMs

The Firebolt Data Bros — Wed, 24 Jan 2024 04:44:14 +0000

Joe Hellerstein is the Jim Gray Professor of Computer Science at Berkeley and Joseph Gonzalez is an Associate Professor in the Electrical Engineering and Computer Science department.

They’ve inspired generations of database enthusiasts (including Benji and Eldad) and have come on the show to talk about all things LLM and about RunLLM, which they co-founded.

If you consider yourself a hardcore engineer, this episode is for you.

Megan Lieu on powerful notebooks that enable collaboration

The Firebolt Data Bros — Mon, 01 Jan 2024 06:43:29 +0000

There are two types of data influencers on LinkedIn:

1. Those who talk directly about the products and companies they work for
2. Those that provide more general guidance, tips and opinions

Can influencers actually be passionate about the products they’re developing and straightforwardly talk about them without sounding salesy?

We’re kicking off 2024 with the amazing Megan Lieu on a new Data Engineering Show episode.

Megan is one of those influencers that combine the two approaches, and with almost 100K followers, her content seems to be resonating with many data folks.

She talked to the bros about her approach to data advocacy as well as the power of notebooks, especially when they become broader and enable collaboration.

Transitioning from software engineering to data engineering

The Firebolt Data Bros — Wed, 22 Nov 2023 06:50:27 +0000

Every data team should have at least one data engineer with a software engineering background. This time on The Data Engineering Show, the guest is Xiaoxu Gao, an inspiring Python and data engineering expert with 10.6K followers on Medium.

She’s a data engineer at Adyen with a software engineering background, and she met the bros to talk about why both software and data engineering skills are so important.

Without software engineering skills you’ll be limited to the rigid capabilities of your stack. But without data engineering skills you’ll find it hard to be cost effective and see the bigger picture.

Vin Vashishta explains why we should stop using dashboards

The Firebolt Data Bros — Wed, 04 Oct 2023 03:59:27 +0000

Vin Vashishta, the guy we all love to follow, has never seen a dashboard with positive ROI. This time on The Data Engineering Show, he met the bros to talk about the difference between BI dashboards and analytics that actually introduce knowledge. It’s no longer just about the data volume, it’s about quality and relevance.

Joe Reis and Matt Housley on the fundamentals of data engineering

The Firebolt Data Bros — Wed, 06 Sep 2023 04:38:25 +0000

After co-writing the best-selling book ‘Fundamentals of Data Engineering’, Joe Reis and Matt Housley joined the bros for some much-needed ranting, priceless data advice, and good laughs. So why are we still talking about providing business value and dashboards, even though we don’t really have anything new to say? If there are so many great tools in the data stack, why are we still so troubled? How can we focus more on things like data governance and data quality that’ll actually push the industry forward?

Bill Inmon, the Godfather of Data Warehousing

The Firebolt Data Bros — Tue, 08 Aug 2023 04:07:23 +0000

As people in the data industry go, Bill Inmon is among the top, often seen as the godfather of the data warehouse. In this Data Engineering Show episode, Bill Inmon talks about surviving rabbit holes throughout the evolution of data, the data modeling renaissance, and why ChatGPT is not Textual ETL.

Large-scale data engineering at Momentive.ai - Meenal Iyer

The Firebolt Data Bros — Wed, 12 Jul 2023 01:06:52 +0000

As companies scale, data gets messy. The data team says one thing, the business team says something completely different. Meenal Iyer, VP Data at Momentive.ai, met the Data Bros to talk about enforcing collaboration in large organizations to ensure what she considers the three most important data factors: Adoption, Trust, and Value.

Data engineering from the early 2000s till today - BlackRock

The Firebolt Data Bros — Thu, 08 Jun 2023 06:55:59 +0000

When it comes to data management, have we come a long way since the early 2000s? Or has it simply taken us 20 years to finally realize that you can’t scale properly without data modeling? With over 20 years of experience in the data space, leading engineering teams at Cisco, Oracle, Greenplum, and now as Sr. Director of Engineering at BlackRock, Krishnan Viswanathan talks about the data engineering challenges that existed two decades ago and still exist today.

Zach Wilson on what makes a great data engineer

The Firebolt Data Bros — Thu, 27 Apr 2023 01:59:31 +0000

How good you are at Spark or Flink ≠ how good you are at data engineering. After years of data engineering experience at Airbnb, Netflix, and Facebook, Zach Wilson is now focused on spreading the knowledge in EcZachly and all over social media. He met Benjamin Wagner to explain why data modeling and storytelling are more important than the actual tech, why data engineering is going to see more job growth than data science, and what brought him to start creating content, reaching over 250K followers on LinkedIn.

How ZipRecruiter and Yotpo power self-service data platforms that work

The Firebolt Data Bros — Thu, 23 Mar 2023 05:57:24 +0000

Data engineers are not paid to do support. Liran Yogev, Director of Engineering at ZipRecruiter, and Doron Porat, Director of Infrastructure at Yotpo, talk about building resilient self-service products that keep customers happy and engineers calm.

They walked the bros through their data stacks and explained how ZipRecruiter is completely rebuilding its data layer from scratch.

Data Observability with Millions of Users - Barr Moses

The Firebolt Data Bros — Wed, 08 Feb 2023 05:50:54 +0000

Barr Moses, CEO of Monte Carlo, explains the difference between data quality and data observability, and how to make sure your data is accurate in a world where so many different teams are accessing it.

How Amplitude Engineers Process 5 Trillion Real-time Events

The Firebolt Data Bros — Thu, 05 Jan 2023 01:39:23 +0000

Weichen Wang, Senior Engineering Manager at Amplitude, came to meet the bros to talk about Amplitude's cutting-edge data stack and how it processes 5 Trillion real-time events while dealing with mutable data and massive scale.

Making Observability a Key Business Driver

The Firebolt Data Bros — Tue, 29 Nov 2022 02:44:27 +0000

80% of the code that you write doesn’t work on the first try. And that’s fine. But knowing which 80% is not working and which 20% is working is the actual challenge. After 10 years at Facebook, managing and scaling the Seattle site to over 6000 engineers(!), Vijaye Raji founded Statsig to make observability automated and real-time. How is the semantic layer managed? How was the Statsig team able to build an observability product that handles real-time, ever-changing metadata? What are Vijaye’s main takeaways from engineering at Facebook? Tune in.

A ClickHouse Review from a Practitioner’s Point of View

The Firebolt Data Bros — Thu, 01 Sep 2022 03:05:05 +0000

Sudeep Kumar, Principal Engineer at Salesforce, is a ClickHouse fan. He considers the shift to ClickHouse one of his biggest accomplishments during his eBay days and walks Boaz through his experience with the platform: how on the one hand it handled 2B events per minute, but on the other it required rollups, which compromised granularity when extending time windows.
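
To make the rollup trade-off mentioned above concrete, here is a small sketch in plain Python (it does not use ClickHouse or reflect the eBay setup). The per-minute counts and the `rollup_hourly` helper are made up for illustration. The point is only that once minutes are summed into hours, a one-minute spike is no longer visible in the stored data.

```python
from collections import defaultdict

# Hypothetical per-minute event counts, keyed by (hour, minute).
minute_counts = {
    (9, 0): 120,
    (9, 1): 4000,  # a one-minute spike
    (9, 2): 110,
    (10, 0): 130,
    (10, 1): 125,
}


def rollup_hourly(counts):
    """Sum per-minute counts into per-hour totals, discarding the minute."""
    hourly = defaultdict(int)
    for (hour, _minute), n in counts.items():
        hourly[hour] += n
    return dict(hourly)


hourly = rollup_hourly(minute_counts)
print(hourly)  # {9: 4230, 10: 255}
```

The hourly table is much smaller and cheaper to scan over long time windows, but nothing in it says whether hour 9 was steadily busy or quiet with a single burst.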

Besides a ClickHouse review from a practitioner’s point of view, Sudeep tells us about interesting use-cases he’s working on at Salesforce.

The Creator of Airflow About His Recipe for Smart Data-Driven Companies

The Firebolt Data Bros — Wed, 03 Aug 2022 00:43:50 +0000

According to Maxime Beauchemin, CEO & Founder at Preset and Creator of Apache Superset and Apache Airflow, it's not so straightforward to understand what you're really getting into and the vastness of the skills that are required in order to build a thriving company.

Picking the right system and services is key for a successful start, and can help you avoid the chaos of having too many tools spread across multiple teams.

Plus, Max walks the bros through the genesis of Airflow, Superset & Presto, and Airflow's old school marketing approach that won the hearts of developers across the world. And just like the terminator, once the machine takes over, you can't stop.

How Similarweb Delivers Customer Facing Analytics Over 100s of TBs

The Firebolt Data Bros — Wed, 13 Jul 2022 23:56:28 +0000

According to Yoav Shmaria, VP R&D Platform at Similarweb, the best way to manage data warehouse costs is to tag every table, database or ETL running to have good granularity over every feature.
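
As a rough illustration of the tagging idea above, the following Python sketch groups spend by a per-resource feature tag. The records, field names, and the `cost_by_feature` helper are hypothetical and do not reflect Similarweb's actual tooling or any specific cloud billing export.

```python
from collections import defaultdict

# Hypothetical billing records: every table, database, or ETL job
# carries a tag naming the product feature it serves.
usage = [
    {"resource": "table:page_views", "feature": "traffic_reports", "cost_usd": 120.0},
    {"resource": "etl:sessionize", "feature": "traffic_reports", "cost_usd": 80.0},
    {"resource": "table:keywords", "feature": "keyword_research", "cost_usd": 45.5},
    {"resource": "db:scratch", "feature": None, "cost_usd": 30.0},
]


def cost_by_feature(records):
    """Total cost per feature tag; untagged resources are grouped together."""
    totals = defaultdict(float)
    for record in records:
        totals[record["feature"] or "untagged"] += record["cost_usd"]
    return dict(totals)


for feature, total in sorted(cost_by_feature(usage).items()):
    print(f"{feature}: ${total:.2f}")
```

Keeping an explicit "untagged" bucket makes it easy to see how much spend still has no owner.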

Besides handy cost management tips, Yoav walks the bros through the tech stack he implemented to analyze 100s of TBs of web data to serve fast customer-facing analytics.

Full disclosure, Similarweb is a Firebolt customer, but the bros kept it objective, and there’s no Firebolt talk in this episode.

How Klarna Designed a New Data Platform in the Cloud

The Firebolt Data Bros — Thu, 09 Jun 2022 04:51:21 +0000

Klarna is one of the leading fintech companies in the world, valued at $45B.

While many corporations are “stuck” on-prem, Klarna made the move and today is a cloud-only company. Gunnar Tangring, Klarna’s Lead Data Engineer, tells Boaz what this new modernized stack looks like.

How Eventbrite is Modernizing its Data Stack

The Firebolt Data Bros — Mon, 23 May 2022 02:46:02 +0000

Archana shares Eventbrite’s data stack modernization process, and how you get engineers to adopt new technologies like dbt which may be outside their comfort zone.

A Deep Dive into Slack's Data Architecture

The Firebolt Data Bros — Tue, 10 May 2022 23:15:55 +0000

Growing from a startup to an IPOed and then an acquired company meant that Slack’s sales org was scaling rapidly.
Apun Hiran, Slack’s Director of Software Engineering, explains how the data stack and architecture evolved to support this growth with more reliable and timely metrics.

Speaker: Apun Hiran, Director of Software Engineering (Data), Slack
Hosts: Eldad and Boaz Farkash, CEO and CPO, Firebolt

Transitioning Scopely’s 5.5 PB Data Platform to the Modern Data Stack

The Firebolt Data Bros — Tue, 12 Apr 2022 05:08:42 +0000

Should data engineering AND BI be handled by the same people? According to Jonathan Palmer, VP Data Platform at Scopely – YES. By Analytics Engineers. His team of Analytics Engineers is in the final stages of transitioning 5.5 PBs of data, which include 15B events per day, to the modern data stack. Tune in to learn how they did it.

Getting rid of raw data with Jens Larsson

The Firebolt Data Bros — Tue, 22 Mar 2022 00:32:47 +0000

Why would you create ugly data? According to Jens Larsson, don’t even go near raw data. Jens started off at Google, continued to manage data science at Spotify, caught the startup bug at Tink, and recently joined an exciting new company called Ark Kapital, together with Spotify’s former VP Analytics. Jens explains how he and his team killed the notion of raw data at Tink and walks us through the Google, Spotify and Ark Kapital data stacks.

How Zendesk engineers manage customer-facing data applications

The Firebolt Data Bros — Thu, 17 Feb 2022 06:15:21 +0000

This time on The Data Engineering Show, Eldad abandoned his brother Boaz, but it’s OK because Boaz got the full 30 minutes to talk to one of the most interesting people in the data space.

Ananth Packkildurai is Principal Software Engineer at Zendesk and runs one of the strongest newsletters in data – Data Engineering Weekly. He talked about data applications at Zendesk and how they’re built, technologies that excite him like data lineage and data catalog, and the best routes for software engineers to get their hands dirty in the data world.

INTERVIEWER: Boaz Farkash.

ZENDESK GUEST: Ananth Packkildurai - Principal Software Engineer.

How are those data intensive customer facing apps engineered at Gong?

The Firebolt Data Bros — Thu, 20 Jan 2022 05:37:19 +0000

Gong manages hundreds of thousands of videoconferences and millions of emails PER DAY, which add up to hundreds of TBs.

The Data Bros met Yarin Benado, Gong’s engineering manager to understand what is required to move to a modern data stack to support all this, what this stack looks like, and why it all comes down to data quality at the end of the day.

How Bolt Engineers Are Designing Its Next-Gen Data Platform

The Firebolt Data Bros — Tue, 14 Dec 2021 02:03:59 +0000

Bolt's ride-hailing app serves 2B users in Europe and Africa and handles 500K queries every day.

Erik Heintare along with Bolt's engineering team is in the midst of designing a new next-gen data platform and is sharing how it's going to solve their biggest data challenges.

Guest: Erik Heintare - Senior Analytics Engineer at Bolt
Hosts: Eldad and Boaz Farkash, AKA The Data Bros

How did Agoda scale its data platform to support 1.5T events per day?

The Firebolt Data Bros — Tue, 23 Nov 2021 05:49:12 +0000

Scaling a data platform to support 1.5T events per day requires complicated technical migrations and alignment between hundreds of engineers. Want to see how Agoda did it?

Guests:
Amir Arad, Director of Machine Learning, Agoda
Shaun Sit, Senior Dev Manager, Agoda

Hosts:
The Data Bros - Eldad and Boaz Farkash

Diving Into GitHub's Data Stack

The Firebolt Data Bros — Thu, 21 Oct 2021 05:46:21 +0000

It’s the mother of all development projects. You use it daily. And so do 65M developers around the world. This time on the Data Engineering Show – A deep dive into GitHub’s data stack. Arfon Smith and KimYen (Truong) Ladia shared GitHub’s data engineering challenges and solutions and explained why every developer should know and adopt the ADR protocol.

Building Data Products For Data Engineers

The Firebolt Data Bros — Thu, 09 Sep 2021 04:55:32 +0000

What does a tech stack that always needs to be at the forefront of technology look like? Roy Miara from Explorium talks about building data products for the audience that can’t be fooled – Data Engineers.

How Vimeo Keeps Data Intact with 85B Events Per Month

The Firebolt Data Bros — Wed, 18 Aug 2021 07:10:08 +0000

How does the Vimeo data team deal with 2 PBs of data and 85B events per month? What made them recently build a data ops team? What data tool does the team love? And why (the hell) did they call their legacy platform Fatal Attraction?
Guest: Lior Solomon, VP Data Engineering at Vimeo.

How Substack's Data Stack Supports 500K Paying Subscribers

The Firebolt Data Bros — Tue, 03 Aug 2021 07:11:54 +0000

Substack is an amazing, if not the most amazing, content publishing platform out there. Essentially, it allows anyone to become a journalist or to start their own newsletters and charge subscriptions for them. So how did they build a data stack that can support all of their 500K paying subscribers?

Guest: Mike Cohen, Data Engineer at Substack
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt

A Technical Deep Dive to Yelp's Data Infrastructure - With Steven Moy

The Firebolt Data Bros — Tue, 11 May 2021 23:50:24 +0000

As an expert in query engines and performance-related challenges, Steven Moy explains how Yelp handled its huge data growth in the past ten years.

Guest: Steven Moy, Software Engineer at Yelp
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt

How Canva's Data Engineers and Analysts Support 55M Active Users

The Firebolt Data Bros — Tue, 11 May 2021 23:47:05 +0000

Canva is one of the hottest, if not the hottest, graphic design platforms out there. Only a week ago it was announced that they reached a staggering 16 Billion dollar valuation, after having seen even stronger growth during the pandemic. With 55 million active users and around 500 million dollars in annual revenue, it seems that Canva is unstoppable. So how do Canva analysts and engineers scale their data platforms to meet the company's insane growth?

Guest: Krishna Naidu, Data Engineer at Canva
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt

How AppsFlyer Delivers Sub-Second BI to 1000 Looker Users - With Alexandra Sudilovsky

The Firebolt Data Bros — Tue, 11 May 2021 23:45:33 +0000

AppsFlyer has exploded in size, growing from a small company of 200 people to 1000 people in just three years. Dealing not only with a huge amount of data on a daily basis but doing so while growing quickly as a company can come with many challenges.

Guest: Alexandra Sudilovsky, Senior BI Expert at AppsFlyer
Hosts: The Data Bros, Eldad and Boaz Farkash, CEO and CPO at Firebolt

The Data Engineering Show - Coming Soon...

The Firebolt Data Bros — Mon, 05 Apr 2021 06:57:51 +0000

The Data Engineering Show is a podcast for data engineering and BI practitioners to go beyond theory, and learn from the biggest influencers in tech about their practical day to day data challenges and solutions in a casual and fun setting.