The Azure Data Engineer Gauntlet: 15 Real-World Scenario-Based Interview Questions (With Proven Solutions)
SELECT * FROM table won’t cut it. As an Azure Data Engineer, your value lies in solving chaos: handling late-arriving facts, optimising hot partitions, and securing data across geopolitical boundaries.
I have conducted over 200 Azure Data Engineering interviews. Below are the top 15 scenario-based questions designed to separate the architects from the operators. For each, I describe the technical friction point and a production-ready solution.
1 The “Drift in the Lakehouse” – Schema Enforcement
Question: Your marketing team uses Azure Databricks to write raw JSON clickstream data to the Bronze layer of your Data Lake. They frequently add nested fields without warning, causing your downstream Synapse Serverless SQL queries to fail with "Invalid column name". How do you enforce schema evolution without breaking existing pipelines?
Use Auto Loader with schemaHints and the rescued data column: new fields are captured in _rescued_data instead of breaking downstream readers.
df = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/bronze/schema_registry/clickstream")
.option("cloudFiles.schemaEvolutionMode", "rescue") # Captures new columns
.load(input_path)
)
# Write to Delta with merged schema (a checkpoint path is required for streaming writes)
(df.writeStream.format("delta")
   .option("mergeSchema", "true")
   .option("checkpointLocation", "/mnt/bronze/_checkpoints/clickstream")
   .outputMode("append")
   .toTable("bronze.clickstream"))
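The rescue behaviour can be illustrated with a small plain-Python sketch (the schema and field names here are hypothetical): fields matching the known schema become typed columns, while surprise nested fields land in a _rescued_data column instead of failing the load.

```python
import json

KNOWN_SCHEMA = {"user_id", "page", "ts"}  # hypothetical Bronze columns

def rescue(record):
    """Split a raw clickstream record into known columns plus rescued extras."""
    row = {k: record.get(k) for k in KNOWN_SCHEMA}
    extras = {k: v for k, v in record.items() if k not in KNOWN_SCHEMA}
    # Unknown fields are preserved as JSON, mirroring Auto Loader's _rescued_data
    row["_rescued_data"] = json.dumps(extras, sort_keys=True) if extras else None
    return row

raw = {"user_id": 1, "page": "/home", "ts": "2024-01-01T00:00:00Z",
       "utm": {"source": "ad"}}  # marketing added "utm" without warning
print(rescue(raw)["_rescued_data"])  # → {"utm": {"source": "ad"}}
```

Downstream Synapse queries keep working because the contract columns never change; new fields accumulate safely in the rescue column until you promote them deliberately.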
2 The GDPR “Right to be Forgotten” Nightmare
Question: A user deletes their profile in the OLTP system. You have 72 hours to delete that user’s data from the Azure Synapse Dedicated SQL Pool (10TB, clustered columnstore index). A standard DELETE runs for 6 hours and fails due to transaction log overflow. What is the exact strategy?
-- Step 1: Create staging table with same distribution key
CREATE TABLE dbo.UserData_Staging
WITH (DISTRIBUTION = HASH(UserID), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.UserData WHERE UserID != 'ToDelete';
-- Step 2: Metadata-only rename swap (seconds, not hours)
RENAME OBJECT dbo.UserData TO UserData_Old;
RENAME OBJECT dbo.UserData_Staging TO UserData;
Result: the deletion completes in ~90 seconds instead of 6 hours, because CTAS rebuilds the table minus the offending rows with minimal logging, and the rename is a metadata-only operation.
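The rebuild-and-swap pattern itself can be sketched in plain Python (table and key names are illustrative): instead of deleting in place, build a new structure without the targeted rows, then swap names.

```python
tables = {"UserData": [{"UserID": "A", "v": 1},
                       {"UserID": "ToDelete", "v": 2},
                       {"UserID": "B", "v": 3}]}

def delete_by_rebuild(tables, table, key, value):
    """CTAS-style delete: rebuild minus the rows, then rename over the original."""
    staging = [r for r in tables[table] if r[key] != value]  # one sequential scan
    tables[table + "_old"] = tables[table]                   # keep for rollback
    tables[table] = staging                                  # metadata-only 'rename'

delete_by_rebuild(tables, "UserData", "UserID", "ToDelete")
print([r["UserID"] for r in tables["UserData"]])  # → ['A', 'B']
```

The expensive step is a single scan-and-copy rather than a logged transactional delete, which is why it sidesteps the transaction-log overflow entirely.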
3 The ADF “Orchestration Debt” Spiral
Question: You have 500 Azure Data Factory pipelines. Development is slow, and a junior engineer broke production by forgetting to enable staging on a Copy activity. How do you eliminate the code duplication?
Collapse the 500 pipelines into one metadata-driven pipeline: a control table defines each source, sink, and option, and a single parameterized pipeline loops over it. Cross-cutting logic such as error handling is then expressed exactly once:
@if(equals(activity('Copy').output.errors[0].Message, 'Type mismatch'),
'Redirect_to_Error_Table',
'Fail_Pipeline')
Outcome: new sources are onboarded by adding a control-table row and deployed via CI/CD in 3 minutes. Centralized orchestration eliminates copy-paste logic errors.
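A metadata-driven framework boils down to one generic loop over a control table. A minimal Python sketch (the control-table columns and copy function are hypothetical stand-ins for the ADF ForEach + Copy activity):

```python
# Hypothetical control table: one row per source instead of one pipeline per source
CONTROL_TABLE = [
    {"source": "sql.sales", "sink": "lake/bronze/sales", "staging": True},
    {"source": "sql.stock", "sink": "lake/bronze/stock", "staging": True},
]

def copy_activity(source, sink, staging):
    # Stand-in for the ADF Copy activity; staging is validated centrally,
    # so an engineer can no longer forget to enable it on one pipeline.
    if not staging:
        raise ValueError("staging must be enabled for PolyBase loads")
    return f"copied {source} -> {sink} (staged)"

results = [copy_activity(**row) for row in CONTROL_TABLE]
print(results[0])  # → copied sql.sales -> lake/bronze/sales (staged)
```

Because the validation lives in one place, a misconfigured row fails loudly at orchestration time instead of silently degrading a single hand-built pipeline.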
4 The Delta Lake “Small File” Tsunami
Question: IoT streaming every 5 seconds creates 500,000 small parquet files. SELECT COUNT(*) takes 2 minutes. You cannot stop streaming. Fix it.
# Enable optimized writes and auto-compaction at the session level
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
-- Run every 6 hours
OPTIMIZE gold.iot_metrics ZORDER BY (device_id, timestamp);
VACUUM gold.iot_metrics RETAIN 168 HOURS;
Why it works: ZORDER BY colocation reduces scanned files from 500k to ~50.
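Conceptually, compaction bin-packs many small files into fewer target-sized files. A simplified greedy sketch in Python (sizes in MB; the target size is illustrative, not Delta's actual default):

```python
def compact(file_sizes_mb, target_mb=128):
    """Greedily pack small files into bins of roughly target_mb each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current_size + size > target_mb and current:
            bins.append(current)          # bin full: emit one compacted file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [0.05] * 2000  # 2,000 tiny IoT files of ~50 KB each
print(len(compact(small_files)))  # → 1: they collapse into a single ~100 MB file
```

Fewer files means fewer file-open operations and fewer footer reads per query, which is where the 2-minute COUNT(*) was going.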
5 The Cross-Tenant Firewall Stalemate
Question: Synapse (Tenant A, public access disabled) and source ERP (Tenant B, private IP). Managed VNet cannot peer. Move data without VPN.
Install a Self-Hosted Integration Runtime (SHIR) inside Tenant B’s VNet. Connect it to Synapse via Private Link over the Microsoft backbone. No public exposure. Data path: ERP VM → SHIR → Private Link (Microsoft backbone) → Synapse.
6 The Slowly Changing Dimension (SCD) Type 2 Meltdown
Question: Track history for 50M row Customer dimension. Source sends full load every hour. ADF Upsert takes 45 min. Need <5 min.
-- 1. Generate a SHA-256 hash over the business attributes of each row
-- 2. Full outer join staging to the dimension on BusinessKey
-- 3. In one set-based pass: insert new keys, expire changed rows (set EndDate), insert their new versions
Runtime drops to 2.5 minutes, because a single set-based comparison of hashes replaces 50M row-by-row lookups.
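The row-hash comparison at the heart of this pattern, sketched in Python (the tracked attribute names are illustrative):

```python
import hashlib

ATTRS = ["name", "city"]  # hypothetical tracked business attributes

def row_hash(row):
    """SHA-256 over the business attributes: one cheap comparison per row."""
    payload = "|".join(str(row[a]) for a in ATTRS)
    return hashlib.sha256(payload.encode()).hexdigest()

current = {1: {"name": "Ada", "city": "London"}}   # dimension's current rows
incoming = {1: {"name": "Ada", "city": "Paris"},   # changed -> expire + insert
            2: {"name": "Bob", "city": "Berlin"}}  # new -> insert

changed = [k for k in incoming
           if k in current and row_hash(incoming[k]) != row_hash(current[k])]
new = [k for k in incoming if k not in current]
print(changed, new)  # → [1] [2]
```

Comparing one hash per key replaces a column-by-column comparison across all 50M rows, which is what makes the set-based MERGE fast.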
7 The “Hot Partition” in Event Hubs
Question: 32 partitions, VIP user causes 60% traffic to partition 16 → throttling. Cannot change partition key.
from pyspark.sql.functions import col, concat, lit, rand

df_with_salt = df.withColumn("SaltedKey", concat(col("UserId"), lit("_"), (rand() * 10).cast("int")))
df_with_salt.writeStream.partitionBy("SaltedKey")...
Hot user’s data salted across 10 sub‑partitions → throttling eliminated.
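A quick plain-Python sanity check of the salt (mirroring the concat of UserId, "_", and a random digit 0–9): even a single hot user fans out across 10 sub-keys.

```python
import random

random.seed(42)  # deterministic for the sketch
hot_user = "vip_user"
# Mirror concat(UserId, "_", int(rand() * 10)): append a random salt 0-9
salted = [f"{hot_user}_{random.randrange(10)}" for _ in range(10_000)]

print(len(set(salted)))  # → 10: one hot key fans out into 10 sub-partitions
```

The trade-off: consumers that aggregate per user must strip the salt (split on "_") and re-group, so salting belongs at a layer where that re-aggregation is acceptable.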
8 The Power BI “DirectQuery” Death Spiral
Question: Power BI on Synapse Serverless scans 2TB per slicer → $50/hour. Finance refuses Import mode.
RESULT_SET_CACHING is a Dedicated SQL Pool feature:
ALTER DATABASE [FinanceDB] SET RESULT_SET_CACHING = ON;
On Serverless, achieve the same effect by materialising the hot aggregates once with CETAS (data source and file format names below are placeholders) and pointing the report at the small result set:
CREATE EXTERNAL TABLE dbo.FinanceAgg
WITH (LOCATION = '/gold/finance_agg/', DATA_SOURCE = lake_ds, FILE_FORMAT = parquet_ff)
AS SELECT Region, SUM(Amount) AS Amount FROM dbo.FinanceFacts GROUP BY Region;
Slicer scans drop from 2 TB of raw files to megabytes of pre-aggregated results. Cost near zero.
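Result-set caching is essentially memoisation keyed on the query text and invalidated when the underlying data changes. A minimal Python sketch of that idea (the data-version counter standing in for invalidation is an assumption for illustration):

```python
class ResultCache:
    """Serve repeated identical queries from cache until the data version moves."""
    def __init__(self):
        self.cache = {}
        self.data_version = 0  # bumped whenever underlying data changes
        self.scans = 0         # how many times the expensive 2 TB scan ran

    def query(self, sql, run_scan):
        key = (sql, self.data_version)
        if key not in self.cache:
            self.scans += 1            # cache miss: pay for the full scan once
            self.cache[key] = run_scan()
        return self.cache[key]

db = ResultCache()
for _ in range(100):  # 100 identical slicer clicks
    db.query("SELECT Region, SUM(Amount) ...", lambda: "aggregates")
print(db.scans)  # → 1
```

One scan serves 99 subsequent identical slicer clicks; only a data change forces a re-scan.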
9 The DevOps “Secret Sprawl” Disaster
Question: Hardcoded storage keys in YAML pipeline exposed after a public fork. Enforce zero‑trust.
- task: AzureResourceManagerTemplateDeployment@3
  inputs:
    overrideParameters: '-storageKey "$(StorageKey)"'  # fetched at runtime from Key Vault
No secrets in code. Pipeline uses Managed Identity with short‑lived tokens.
10 The “Zombie File” Storage Cost Attack
Question: After retention cleanup, costs remain high due to invisible soft‑deleted blobs in ADLS Gen2 (HNS). Lifecycle policy ignores them.
Soft-deleted blobs don’t appear in normal listings, so enumerate them explicitly with -IncludeDeleted and purge anything past retention:
$blobs = Get-AzStorageBlob -Container 'container' -IncludeDeleted
$blobs | Where-Object {$_.IsDeleted -and $_.DeletedOn -lt (Get-Date).AddDays(-30)} |
Remove-AzStorageBlob -DeleteSnapshot
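The filter in that script can be sketched as testable Python (the blob records are hypothetical stand-ins for what Get-AzStorageBlob returns):

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
blobs = [  # hypothetical listing, including soft-deleted entries
    {"name": "a.parquet", "is_deleted": False, "deleted_on": None},
    {"name": "b.parquet", "is_deleted": True, "deleted_on": now - timedelta(days=45)},
    {"name": "c.parquet", "is_deleted": True, "deleted_on": now - timedelta(days=5)},
]

cutoff = now - timedelta(days=30)
# Purge only soft-deleted blobs whose deletion is past the retention cutoff
to_purge = [b["name"] for b in blobs if b["is_deleted"] and b["deleted_on"] < cutoff]
print(to_purge)  # → ['b.parquet']
```

Blobs still inside the retention window are deliberately left alone so soft-delete keeps protecting against accidental deletion.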
11 The Spark Executor OOM Puzzle
Question: Broadcasting a 500MB dimension table crashes executors with "GC Overhead limit exceeded". Why?
Because the broadcast table is fully materialised in every executor’s heap. Pre-bucket both sides on the join key instead:
df_dim.write.bucketBy(50, "dim_key").saveAsTable("dim_bucketed")
df_fact.write.bucketBy(50, "dim_fk").saveAsTable("fact_bucketed")
Rule of thumb: broadcast only tables under ~100MB. The bucketed join removes both the broadcast and the shuffle, eliminating the OOM.
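The sizing rule can be captured as a simple decision function (the 100 MB threshold mirrors the rule of thumb above; function and parameter names are illustrative):

```python
BROADCAST_LIMIT_MB = 100  # rule of thumb: never broadcast above this

def choose_join(dim_size_mb, dim_bucketed, fact_bucketed):
    """Pick a join strategy from dimension size and physical layout."""
    if dim_size_mb < BROADCAST_LIMIT_MB:
        return "broadcast"       # small enough to ship to every executor
    if dim_bucketed and fact_bucketed:
        return "bucketed"        # co-partitioned: shuffle-free sort-merge join
    return "sort_merge"          # default: full shuffle on both sides

print(choose_join(500, True, True))  # → bucketed (the 500 MB table in question)
```

Encoding the rule this way makes the failure mode obvious: the 500 MB table fails the broadcast test, and bucketing is what rescues it from a shuffle.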
12 The “Eventually Consistent” GDPR Export
Question: Cosmos DB globally distributed – reads may be stale. Need consistent point‑in‑time snapshot across 5 microservices.
Use Azure Synapse Link for Cosmos DB: the analytical store is a column-oriented copy that is auto-synced (typically within minutes) and fully isolated from the transactional workload, so the export never competes with OLTP reads. Join it with the other sources via an ADF Data Flow to assemble the snapshot.
13 The Purview Lineage Gap
Question: Purview shows a file but cannot trace that it came from a Databricks notebook reading Kafka. Need automated end‑to‑end lineage.
/databricks/python/bin/pip install openlineage-spark
# In the cluster Spark config (the listener must be registered at startup):
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.url https://[purview].purview.azure.com
Every read/write automatically appears as Process → Input → Output in Purview.
14 The Synapse Pipeline “Idempotency” Lie
Question: REST API pagination fails at page 87; retry reloads pages 1–86 → duplicates. API lastModified has only second precision.
RunId = formatDateTime(utcNow(), 'yyyyMMddHHmmss')
Composite key = (API_RowId, RunId) → Alter Row for upsert
Resume logic: start from last successful page +1.
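The composite-key upsert plus resume logic can be sketched in Python (the page range, run-id format, and in-memory target are illustrative):

```python
target = {}  # keyed by (api_row_id, run_id): a retry overwrites, never duplicates
run_id = "20240101120000"  # formatDateTime(utcNow(), 'yyyyMMddHHmmss')
last_good_page = 86        # watermark persisted by the pipeline

def load_page(page):
    rows = [(f"row_{page}_{i}", {"page": page}) for i in range(2)]
    for row_id, payload in rows:
        target[(row_id, run_id)] = payload  # upsert on the composite key

for page in range(last_good_page + 1, 90):  # resume at page 87, not page 1
    load_page(page)

before = len(target)
load_page(87)                 # simulate an accidental retry
print(len(target) == before)  # → True: no duplicates
```

Idempotency here comes from the key design, not from trusting the API's second-precision lastModified: replaying a page is a harmless overwrite.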
15 The Streaming “Event Ordering” Paradox
Question: IoT messages arrive out‑of‑order (timestamp 12:00:03 before 12:00:02). Tumbling window drops the late event.
"eventsOutOfOrderPolicy": "adjust",
"eventsOutOfOrderMaxDelayInSeconds": 10,
"eventsLateArrivalMaxDelayInSeconds": 5
With the "adjust" policy, ASA buffers events for up to 10 seconds to restore timestamp order before the window closes; events arriving beyond that tolerance have their timestamps adjusted forward rather than being dropped.
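The adjust policy can be mimicked with a small watermark sketch in Python (timestamps simplified to integer seconds; the 10-second tolerance mirrors the config above):

```python
def adjust_policy(events, max_delay_s):
    """Mimic ASA's 'adjust' out-of-order policy: events inside the tolerance
    pass through unchanged; events older than the watermark get their
    timestamp adjusted forward instead of being dropped."""
    out, high_water = [], float("-inf")
    for ts in events:  # events in arrival order, timestamps in seconds
        high_water = max(high_water, ts)
        watermark = high_water - max_delay_s
        out.append(ts if ts >= watermark else watermark)
    return out

# 12:00:03 arrives before 12:00:02: within the 10 s tolerance, both kept as-is
print(adjust_policy([3, 2], max_delay_s=10))  # → [3, 2]
# An event 15 s behind the newest timestamp is adjusted forward, not dropped
print(adjust_policy([20, 5], max_delay_s=10))  # → [20, 10]
```

The key property: no event is ever silently lost; the worst case is a bounded timestamp adjustment, which is usually acceptable for IoT aggregates.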
🏆 Conclusion
If you can solve these 15 scenarios, you aren’t just writing ETL—you are architecting resilient, cost-aware, and secure data platforms. Bookmark this guide for your next Azure Data Engineer interview or certification prep.