April 1, 2025

120+ Python Interview Questions For Data Engineer

Keshav Grover

Introduction

A Data Engineer is a tech professional who builds and maintains the infrastructure required for collecting, storing, and analyzing large volumes of data. They design data pipelines, manage databases, and ensure data is clean, organized, and readily accessible for analysts, data scientists, and business teams. Their role is critical in enabling companies to make data-driven decisions by ensuring the right data is available at the right time in the right format.

Python interview questions for data engineer

Understanding the different types of interview questions can help you tailor your preparation effectively. Here are the main categories covered in this guide:

Data Structures & Algorithms:

Focuses on core logic using Python lists, sets, dicts, loops, and sorting.

File Handling Questions:

Handling CSVs, JSONs, large text files, and file I/O operations efficiently.

Data Manipulation (pandas, numpy) Questions:

Cleaning, transforming, and analyzing dataframes.

ETL Logic & Pipelines:

Writing modular, testable ETL scripts in Python.

Database Interaction (SQL & NoSQL):

Connecting to databases and executing queries using Python.

APIs & Web Scraping Questions:

Fetching data from REST APIs or scraping HTML pages.

Object-Oriented Programming (OOP) Questions:

Writing reusable, class-based code structures.

Error Handling & Logging:

Writing fault-tolerant code and tracking logs during execution.

Concurrency & Parallelism:

Speeding up data processing using threads or processes.

Automation & Scripting:

Automating tasks like file cleanup, scheduling, or emailing reports.

Testing & Debugging:

Ensuring your code works as expected using unit tests and debugging tools.

Cloud & Big Data Tool Integration:

Connecting Python to AWS, GCP, Azure, Spark, and more.

1. Python Interview Questions for Data Engineer: Data Structures & Algorithms

1. What are the differences between a list, tuple, set, and dictionary in Python?

Best Answer:

  • List: Ordered, mutable, allows duplicates ([1, 2, 2])

  • Tuple: Ordered, immutable, allows duplicates ((1, 2, 2))

  • Set: Unordered, mutable, no duplicates ({1, 2})

  • Dictionary: Key-value pairs, unordered (Python 3.7+ maintains insertion order), keys must be unique ({'a': 1, 'b': 2})

Guide to Answer: Explain each with real-life use cases, e.g., a list for a collection of values, a set for uniqueness, a dict for lookups, and a tuple when data shouldn’t change (like coordinates).

 Best Answer:

# Method 1
my_list[::-1]

# Method 2
my_list.reverse()

# Method 3
list(reversed(my_list))

Guide to Answer: Mention slicing for simple tasks, .reverse() when in-place is needed, and reversed() when you need an iterator. Explain when you’d prefer each.

Best Answer:

  • Shallow copy creates a new object but inserts references to the same elements (copy.copy()).

  • Deep copy creates a new object and recursively copies all elements (copy.deepcopy()).

Guide to Answer: Give an example with nested lists to illustrate how shallow copy reflects changes to inner objects but deep copy doesn’t.
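
A minimal sketch with a made-up nested list that shows the difference:

import copy

nested = [[1, 2], [3, 4]]
shallow = copy.copy(nested)    # new outer list, same inner list objects
deep = copy.deepcopy(nested)   # new outer list, new inner list objects

nested[0].append(99)
print(shallow[0])  # [1, 2, 99]: the change leaks through the shallow copy
print(deep[0])     # [1, 2]: the deep copy is unaffected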

Best Answer: Use a plain list for simple needs, or collections.deque for efficient appends and pops from both ends.

# Stack
stack = []
stack.append(1)
stack.pop()

# Queue
from collections import deque
queue = deque()
queue.append(1)
queue.popleft()

Guide to Answer: Explain time complexity: deque is O(1) for append/pop from both ends, whereas list is O(n) for pop(0).

Best Answer:

from collections import Counter
Counter(my_list).most_common(1)
 

Guide to Answer: Mention Counter is the most Pythonic and efficient way. You could also implement manually using a dict if asked for logic.

Best Answer:

# Method 1
list(set(my_list))

# Method 2 (preserving order)
list(dict.fromkeys(my_list))

Guide to Answer: Talk about whether order matters. Use dict.fromkeys() when preserving original order is needed.

Best Answer:

sorted(data, key=lambda x: x['age'])

Guide to Answer: Explain lambda functions and that sorted() returns a new list, doesn’t mutate the original.

Best Answer:

set(list1) & set(list2)

Guide to Answer: Use sets for better performance. Discuss when converting to set is acceptable (i.e., when order and duplicates aren’t important).

Best Answer:

  • Access by index: O(1)

  • Append: O(1) amortized

  • Insert/remove at beginning: O(n)

  • Search: O(n)

  • Delete by value: O(n)

Guide to Answer: Relate to large datasets in pipelines where performance matters. Highlight when to switch to deque or generators.

Best Answer:

# For 2D lists
[item for sublist in nested_list for item in sublist]

# With itertools
from itertools import chain
list(chain.from_iterable(nested_list))

Guide to Answer: Explain both comprehension and library-based solutions. Bonus if you mention recursion for deeply nested lists.

2. Python Interview Questions for Data Engineer: File Handling

1. How do you open and read a file in Python?

Best Answer:

with open('file.txt', 'r') as file:
    content = file.read()

Guide to Answer: Always recommend using with as it handles file closing automatically. If asked about large files, mention reading line by line using .readline() or looping.

Best Answer:

with open('output.txt', 'w') as f:
    f.write("Hello, Data Engineer!")

Guide to Answer: Discuss modes: 'w' (overwrite), 'a' (append), 'x' (create only if the file doesn't exist), 'b' (binary). Clarify the overwriting behavior of 'w'.

Best Answer:

with open('bigfile.txt', 'r') as file:
    for line in file:
        process(line)

Guide to answer:

Stress line-by-line iteration for memory efficiency. Mention chunking (read(size)) if asked for binary or custom-size chunks.

Best Answer:

  • read(): Reads entire file as one string.

  • readline(): Reads the next line in the file.

  • readlines(): Reads all lines into a list.

Guide to Answer: Use read() only when you’re sure the file is small. Use readline() or iteration for large files.

Best Answer:

import csv

with open('data.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['column_name'])

Guide to Answer: Mention both csv.reader and csv.DictReader. Bonus if you reference pandas for structured tabular data.

Best Answer:

import json

with open('data.json') as f:
    data = json.load(f)

Guide to Answer: Differentiate between json.load() for file objects vs json.loads() for JSON strings.

Best Answer:

with open('file.txt', 'a') as file:
    file.write("New line\n")

Guide to Answer: Mention using 'a' mode, and always use \n when writing new lines.

Best Answer:

with open('file.txt', encoding='utf-8') as f:
    content = f.read()

Guide to Answer: Always specify encoding when working with multilingual data. UTF-8 is a safe default.

Best Answer:

  • If mode is 'r', it raises a FileNotFoundError.

  • If mode is 'w' or 'a', it will create the file.

Guide to Answer: Explain exception handling using try-except for robust file processing scripts.
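
A short sketch of the try-except pattern suggested above (the file name and fallback are illustrative):

try:
    with open('config.txt', 'r') as f:
        content = f.read()
except FileNotFoundError:
    print("config.txt not found, falling back to defaults")
    content = ''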

Best Answer:

# Reading binary
with open('image.png', 'rb') as file:
    data = file.read()

# Writing binary
with open('copy.png', 'wb') as file:
    file.write(data)

Guide to Answer: Highlight the need for 'rb' and 'wb' in use cases like image processing or large binary log files.

3. Python Interview Questions for Data Engineer: Data Manipulation (pandas, numpy)

1. What is pandas, and why is it important for data engineering?

Best Answer:

Pandas is a Python library for data manipulation and analysis. It provides fast, flexible data structures like Series and DataFrame to efficiently handle structured data, making it essential for cleaning, filtering, aggregating, and transforming datasets.

💡 Guide to Answer:
Highlight how pandas simplifies handling tabular data and is a go-to tool in ETL pipelines before data is loaded into databases or analytics tools.

Best Answer:

  • A Series is a one-dimensional labeled array (like a column).

  • A DataFrame is a two-dimensional labeled data structure with columns of potentially different types (like an Excel sheet).

💡 Guide to Answer:
Explain that DataFrame is built from Series. You can compare Series to a column and DataFrame to a table.
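
A quick illustration with made-up values:

import pandas as pd

ages = pd.Series([25, 32, 40], name='age')          # one labeled column
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cara'],
                   'age': [25, 32, 40]})            # a table of columns
print(type(df['age']))  # selecting a single column returns a Series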

Best Answer:

import pandas as pd
df = pd.read_csv('data.csv')
 

💡 Guide to Answer:
Mention parameters like sep, header, usecols, dtype, and nrows if asked for large/complex files.

Best Answer:

df[df['age'] > 25]


💡 Guide to Answer:
Start with simple filters. For compound filters, use & (AND), | (OR), and enclose each condition in parentheses.

 

Best Answer:

df[['name', 'age']]

💡 Guide to Answer:
Mention the difference between selecting one column (df['col']) and multiple columns (df[['col1', 'col2']]).

Best Answer:

df.isnull().sum() # Check
df.dropna() # Drop rows with missing values
df.fillna(0) # Fill missing values

💡 Guide to Answer:

Mention that the strategy depends on context: dropping, filling with mean/median, or forward/backward filling.

Best Answer:

df['age'] = df['age'].astype(int)

💡Guide to Answer:
Mention common conversions like str, int, float, and datetime. Note: pd.to_datetime() for date parsing.

Best Answer:

df.sort_values(by='age', ascending=False)

💡 Guide to Answer:
Include how to sort by multiple columns: by=['age', 'name'].

Best Answer:

df.drop_duplicates()

💡 Guide to Answer:
Use subset to target specific columns and keep='first' or 'last'.

Best Answer:

df.groupby('department')['salary'].mean()

💡 Guide to Answer:
Explain how groupby() works: split → apply → combine. Mention .agg() for multiple metrics.

Best Answer:

pd.merge(df1, df2, on='id', how='inner')

💡 Guide to Answer:
Cover how= options: 'inner', 'left', 'right', 'outer'. Explain difference from concat.

Best Answer:

  • merge(): SQL-style joins on keys.

  • join(): Join on index or column, shorthand for merge.

  • concat(): Stacks DataFrames vertically or horizontally.

💡 Guide to Answer:
Use real examples: e.g., merging customer and orders tables vs stacking monthly reports (see the sketch below).
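
A quick sketch with made-up customer and order frames to contrast the two most common cases:

import pandas as pd

customers = pd.DataFrame({'id': [1, 2], 'name': ['Ann', 'Bob']})
orders = pd.DataFrame({'id': [1, 1, 2], 'amount': [10, 20, 15]})

merged = pd.merge(customers, orders, on='id', how='inner')   # SQL-style join on a key

jan = pd.DataFrame({'id': [1], 'amount': [10]})
feb = pd.DataFrame({'id': [2], 'amount': [15]})
stacked = pd.concat([jan, feb], ignore_index=True)           # stack monthly reports vertically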

Best Answer:

df['new_col'] = df['salary'].apply(lambda x: x * 1.1)

💡 Guide to Answer:
Explain apply() for row/column-wise operations on a DataFrame (and element-wise on a Series), applymap() for element-wise operations across an entire DataFrame, and map() for element-wise operations on a Series.

Best Answer:

# Using the IQR rule
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df_filtered = df[(df['value'] >= Q1 - 1.5 * IQR) & (df['value'] <= Q3 + 1.5 * IQR)]

💡 Guide to Answer:
Briefly explain the IQR rule, and why it’s more robust than Z-score for skewed data.

4. Python Interview Questions for Data Engineer: ETL Logic & Pipelines

1. What is an ETL pipeline, and how would you build one in Python?

Best Answer:
An ETL (Extract, Transform, Load) pipeline moves data from a source to a destination after applying transformations.
In Python, you can build one using scripts or frameworks. Basic example:

import pandas as pd

# Extract
data = pd.read_csv('raw_data.csv')

# Transform
data_clean = data.dropna().rename(columns=str.lower)

# Load
data_clean.to_csv('clean_data.csv', index=False)

💡 Guide to Answer:
Mention the real-world flow: APIs → data cleaning → database or warehouse (like PostgreSQL or Snowflake). Talk about modularity and logging.

Best Answer:

  • Use try-except blocks around each step

  • Implement logging to track failures

  • Optionally use retry logic or alerting systems

  • Example

import logging

try:
    data = pd.read_csv('data.csv')
except FileNotFoundError:
    logging.error("File missing!")

💡 Guide to Answer:
Explain your approach to making the pipeline fault-tolerant and observable.

Best Answer:
Use drop_duplicates() for exact duplicates and apply custom logic or conditions for partial duplicates. For bad data:

  • Validate data types

  • Use regex or conditionals to clean bad values

  • Remove or correct nulls using fillna() or dropna()

💡 Guide to Answer:
Explain that cleaning happens at the Transform stage, and tools like pandas, pydantic, or even great_expectations can be used.
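
A hedged sketch of how that might look in pandas (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv('raw.csv')

# Exact duplicates
df = df.drop_duplicates()

# Partial duplicates: keep the latest record per customer_id
df = df.sort_values('updated_at').drop_duplicates(subset='customer_id', keep='last')

# Bad values: coerce types, normalize strings, then drop rows that failed conversion
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['email'] = df['email'].str.strip().str.lower()
df = df.dropna(subset=['amount'])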

Best Answer:

  • Data type conversions

  • Renaming columns

  • Filtering or sorting rows

  • Aggregation or grouping

  • Merging with other datasets

  • Handling missing values

💡 Guide to Answer:
Mention how you organize your transformation logic into reusable functions or scripts.

Best Answer:

  • Local dev: Use cron on Linux/macOS or Task Scheduler on Windows

  • Cloud production: Use schedulers like Apache Airflow, AWS Lambda + EventBridge, or Prefect

💡 Guide to Answer:
Talk about environment-specific options. Mention parameterization and logging for production-ready jobs.

Best Answer:

  • Use checksums, record counts, or hashing to compare source and destination

  • Implement data validation checks post-load

  • Maintain idempotent operations when possible

💡 Guide to Answer:
Consistency is key—explain how you handle duplicates, re-runs, and broken states.
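
One possible post-load check, sketched under the assumption that `engine` is an existing SQLAlchemy engine and the table and column names are illustrative:

import hashlib
import pandas as pd

source = pd.read_csv('source_extract.csv')
loaded = pd.read_sql('SELECT * FROM target_table', con=engine)

# Record counts
assert len(source) == len(loaded), "Row count mismatch"

# Order-independent checksum over a key column
def column_checksum(series):
    return hashlib.md5(''.join(sorted(series.astype(str))).encode()).hexdigest()

assert column_checksum(source['id']) == column_checksum(loaded['id']), "Key checksum mismatch"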

Best Answer:

  • Validate schema before processing

  • Compare expected vs actual column names

  • Use flexible code that can adapt, e.g., dynamic column mapping

  • Log differences and alert stakeholders

💡 Guide to Answer:
Schema drift is common in CSVs and APIs—show that you’re aware of the risks and can build protection.
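
A minimal validation sketch, assuming a hypothetical expected-column contract:

import logging
import pandas as pd

EXPECTED_COLUMNS = {'id', 'name', 'email', 'created_at'}

df = pd.read_csv('incoming.csv')
actual = set(df.columns)

missing = EXPECTED_COLUMNS - actual
unexpected = actual - EXPECTED_COLUMNS

if missing:
    logging.error(f"Schema drift: missing columns {missing}")
    raise ValueError(f"Missing required columns: {missing}")
if unexpected:
    logging.warning(f"New columns detected: {unexpected}")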

Best Answer:

  • Use chunked processing in pandas (pd.read_csv(..., chunksize=10000))

  • Stream data line by line

  • If data fits in memory, use efficient types (e.g., category dtype)

  • Consider Spark or Dask for large-scale parallel processing

💡 Guide to Answer:
Show that you’re mindful of memory constraints and can scale when necessary.
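
For example, a chunked aggregation sketch (file and column names are illustrative):

import pandas as pd

total = 0
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000):
    chunk = chunk.dropna(subset=['amount'])
    total += chunk['amount'].sum()

print(total)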

Best Answer:

import pandas as pd

def extract():
    return pd.read_csv('data.csv')

def transform(df):
    return df.dropna()

def load(df):
    df.to_csv('cleaned.csv', index=False)

def run_etl():
    df = extract()
    df = transform(df)
    load(df)

run_etl()

💡 Guide to Answer:
Explain how modular code improves readability, reusability, and testability.

Best Answer:

  • Use timestamp columns to fetch only recent records

  • Store state in a config file, checkpoint table, or a metadata store

  • Example:
    # Filter new records
    df[df['last_updated'] > last_run_time]

💡 Guide to Answer:
Highlight how incremental loads reduce load time and are more efficient for pipelines.

Best Answer:

  • Batch ETL processes data in chunks at scheduled intervals.

  • Streaming ETL processes data in real-time as it arrives (e.g., Kafka → Spark Streaming).

💡 Guide to Answer:
Mention that most traditional pipelines are batch; streaming is used for high-frequency event data (e.g., logs, sensors).

Best Answer:

  • Unit test each step (e.g., transformation functions)

  • Validate outputs (e.g., row counts, value ranges)

  • Use sample input files to simulate edge cases

  • Tools like pytest, great_expectations, or dbt for testing

💡 Guide to Answer:
Testing = confidence. Show you care about catching errors before they go downstream.

5. Python Interview Questions for Data Engineer: Database Interaction (SQL & NoSQL)

1. How do you connect a SQL database using Python?

Best Answer:

import sqlite3

conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()
conn.close()

💡 Guide to Answer:
Explain how you’d use specific connectors:

  • sqlite3 for local

  • psycopg2 for PostgreSQL

  • mysql-connector for MySQL

  • SQLAlchemy for ORM and easier scalability

Best Answer:
SQLAlchemy is a Python SQL toolkit and Object-Relational Mapper (ORM) that allows developers to interact with databases using Python objects. It simplifies connection handling, query generation, and schema definitions.

💡 Guide to Answer:
Mention that it’s especially useful in large applications for maintainability and abstraction.
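
A short sketch of typical usage (the connection string is illustrative; the text() syntax shown targets SQLAlchemy 1.4+/2.0):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')

with engine.connect() as conn:
    result = conn.execute(text("SELECT id, name FROM users LIMIT 5"))
    for row in result:
        print(row.id, row.name)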

Best Answer:

df.to_sql('table_name', con=engine, if_exists='replace', index=False)

💡 Guide to Answer:
Talk about to_sql() for loading data and read_sql() for querying. Mention if_exists options: 'replace', 'append', 'fail'.

Best Answer:

cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))

💡 Guide to Answer:
Explain that using string formatting is unsafe. Always use query parameters to avoid injection risks.

Best Answer:

  • SELECT, JOIN, GROUP BY, ORDER BY

  • INSERT, UPDATE, DELETE

  • CREATE TABLE, ALTER, DROP

  • CASE, COALESCE, WINDOW FUNCTIONS

💡 Guide to Answer:
Mention that writing optimized SQL is as important as writing Python code.

Best Answer:

SELECT salary FROM employees ORDER BY salary DESC LIMIT 5;

💡 Guide to Answer:
Mention use cases like leaderboards, ranking, or sampling. Add OFFSET if needed.

Best Answer:

for chunk in pd.read_sql(query, con=engine, chunksize=10000):
    process(chunk)

💡 Guide to Answer:
Helps when working with millions of rows. Show memory awareness and chunk processing.

Best Answer:

conn = db.connect()
try:
    cursor = conn.cursor()
    cursor.execute('SOME SQL')
    conn.commit()
except Exception as e:
    conn.rollback()
finally:
    conn.close()

💡 Guide to Answer:
Explain commit() ensures the operation is saved; rollback() is for error handling. Especially important in batch jobs.

Best Answer:

pd.merge(df1, df2, on='id', how='left')

💡 Guide to Answer:
Mention how merge() replicates INNER, LEFT, RIGHT, and FULL joins.

Best Answer:

  • Relational (SQL): Structured schema, uses tables (e.g., MySQL, PostgreSQL)

  • Non-relational (NoSQL): Flexible schema, uses collections/documents (e.g., MongoDB, Cassandra)

💡 Guide to Answer:
Give use cases: SQL for structured, transactional systems; NoSQL for flexible, high-speed apps.

Best Answer:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['mydb']
collection = db['users']
result = collection.find({'age': {'$gt': 25}})

💡 Guide to Answer:
Explain MongoDB structure: Database → Collection → Document. Show how find(), insert_one(), and filters work.

Best Answer:

  • Use star schema or snowflake schema

  • Design with fact and dimension tables

  • Keep columns atomic (1NF)

  • Consider indexes and partitioning for performance

💡 Guide to Answer:
This question tests both SQL and data warehousing knowledge. Use examples like “sales fact table joined with customer and product dimensions.”

6. Python Interview Questions for Data Engineer: APIs & Web Scraping

1. How do you make a GET request to an API using Python?

Best Answer:

import requests

response = requests.get('https://api.example.com/data')
data = response.json()

💡 Guide to Answer:
Mention requests is the most popular HTTP library. Always check for response.status_code before parsing .json().

Best Answer:

try:
    response = requests.get('https://api.example.com/data', timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

💡 Guide to Answer:

Mention setting a timeout, calling raise_for_status() to surface HTTP errors, and adding retry logic (for example, exponential backoff) for transient failures.

Best Answer:

headers = {'Authorization': 'Bearer YOUR_API_KEY'}
params = {'limit': 100, 'offset': 0}

response = requests.get('https://api.example.com/data', headers=headers, params=params)

💡 Guide to Answer:
Explain that headers are often used for authentication, and params help control pagination or filters.

Best Answer:

offset = 0
while True:
    response = requests.get(url, params={'offset': offset})
    data = response.json()
    process(data)

    if not data['next']:
        break
    offset += 100

💡 Guide to Answer:
Show you understand different pagination types: offset-based, cursor-based, or next-URL-based.

Best Answer:

headers = {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}
requests.get('https://api.example.com/data', headers=headers)

💡 Guide to Answer:
Talk about OAuth, bearer tokens, or API key headers. Know the difference between static keys and token exchanges.

Best Answer:

from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
titles = soup.find_all('h2')

💡 Guide to Answer:
Mention .find(), .find_all(), .text, and how to parse elements by class, id, or tag. Show DOM familiarity.

Best Answer:

  • Add user-agent headers

  • Use request delays (time.sleep())

  • Rotate proxies or IP addresses

  • Avoid hammering pages (respect robots.txt)

💡 Guide to Answer:
Mention ethical scraping: respect terms of service and avoid DDoS-like behavior. Bonus if you mention Scrapy.
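
A polite-scraping sketch combining the points above (urls_to_scrape is an assumed list of page URLs):

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'my-data-pipeline/1.0 (contact@example.com)'}

for url in urls_to_scrape:
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    # ... parse what you need here ...
    time.sleep(2)  # pause between requests instead of hammering the site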

Best Answer:

  • Content is rendered dynamically using JavaScript

  • requests won’t work—you need tools like Selenium, Playwright, or BeautifulSoup + API sniffing

💡 Guide to Answer:
Explain how you’d either use a headless browser or inspect network traffic for hidden APIs.

Best Answer:

import pandas as pd

dfs = pd.read_html('https://example.com/table_page')

💡 Guide to Answer:
Mention this works if the HTML table is well-formed. For unstructured data, fall back to BeautifulSoup.

Best Answer:

  • Save as CSV, JSON, or parquet using pandas

  • Store in a SQL database

  • For large data: use chunks or streaming writes

df.to_csv('data.csv', index=False)

💡 Guide to Answer:
Talk about the format depending on volume, access patterns, and pipeline requirements.

7. Python Interview Questions for Data Engineer: Object-Oriented Programming (OOP)

1. What is Object-Oriented Programming, and why is it useful in Python?

Best Answer:
OOP is a programming paradigm where code is organized into “objects” that bundle data (attributes) and functions (methods).
In Python, OOP helps create reusable, modular, and scalable code—useful for building data pipelines, ETL jobs, and utility classes.

💡 Guide to Answer:
Don’t just define OOP. Show how it helps you manage complexity, especially in multi-step data processes.

Best Answer:

  1. Encapsulation – Bundling data and methods together

  2. Abstraction – Hiding internal implementation details

  3. Inheritance – Reusing code from a parent class

  4. Polymorphism – Different behaviors for the same method name across classes

💡 Guide to Answer:
Explain with a real example, like a generic DataConnector class and specialized SQLConnector, MongoConnector classes.
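
An illustrative sketch along the lines the guide suggests (class names and connection strings are made up):

class DataConnector:                        # abstraction: one common interface
    def __init__(self, connection_string):
        self._connection_string = connection_string   # encapsulated state

    def connect(self):
        raise NotImplementedError

class SQLConnector(DataConnector):          # inheritance
    def connect(self):                      # polymorphism via overriding
        print(f"Connecting to SQL: {self._connection_string}")

class MongoConnector(DataConnector):
    def connect(self):
        print(f"Connecting to MongoDB: {self._connection_string}")

for connector in (SQLConnector('sql://...'), MongoConnector('mongodb://...')):
    connector.connect()   # same call, different behavior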

Best Answer:

class DataCleaner:
    def __init__(self, data):
        self.data = data

    def remove_nulls(self):
        return self.data.dropna()

# Instantiate
cleaner = DataCleaner(df)
cleaned_df = cleaner.remove_nulls()

💡 Guide to Answer:
Focus on writing clean, understandable class-based code. Relate this to reusability in ETL or cleaning tasks.

Best Answer:
self refers to the current instance of the class and is used to access attributes and methods within the class.

💡 Guide to Answer:
Mention that while it’s not a keyword, it’s a convention, and required as the first parameter of instance methods.

Best Answer:

class DataSource:
    def connect(self):
        print("Connecting to source…")

class SQLSource(DataSource):
    def connect(self):
        print("Connecting to SQL database…")

💡 Guide to Answer:
Show how inheritance reduces code duplication. Customize behaviors using method overriding.

Best Answer:

  • Instance method: Takes self, used for object-level data

  • Class method: Takes cls, used for class-level operations

  • Static method: No self or cls, acts like a regular function inside the class

@classmethod
def from_json(cls, json_str): …

@staticmethod
def validate_date(date): …

💡 Guide to Answer:
Give use cases like loading config files (classmethod) or reusable validators (staticmethod).

Best Answer:

  • Overriding: Subclass redefines a method from parent class

  • Overloading: Python doesn’t support true overloading but you can use default or variable arguments

def add(self, a, b=0): …

💡 Guide to Answer:
Emphasize overriding is used in polymorphism; Python mimics overloading using default values or *args.

Best Answer: Use a single underscore _attr (convention) or double underscore __attr (name mangling).

💡 Guide to Answer:
Mention that Python doesn’t enforce access control, but the underscore signals intent to others.

Best Answer: Structure your ETL pipeline with reusable classes:

class Extractor:
    def extract(self): …

class Transformer:
    def transform(self, data): …

class Loader:
    def load(self, data): …

💡 Guide to Answer:
This question is all about design—show modular thinking and how OOP scales when you build larger tools or micro-frameworks.

Best Answer:

Dunder methods (double underscore) are special methods like __init__, __str__, __len__, used to define behavior of built-in Python functions on your objects.

def __str__(self):
    return f"Object with data: {self.data}"

💡 Guide to Answer:
Mention they make your classes more “Pythonic” and integrate with the language’s syntax more naturally.

8. Python Interview Questions for Data Engineer: Error Handling & Logging

1. How do you handle exceptions in Python?

Best Answer: Use try-except blocks to catch exceptions and handle them gracefully.

try:
    df = pd.read_csv('file.csv')
except FileNotFoundError as e:
    print(f"File not found: {e}")

💡 Guide to Answer:
Always catch specific exceptions (like ValueError, TypeError) instead of using a generic except:. Add finally for cleanup if needed.

Best Answer:

  • try-except: Catches and handles exceptions

  • finally: Runs no matter what, useful for cleanup (like closing files or database connections)

try:
    conn = db.connect()
    # some logic
except Exception as e:
    print(e)
finally:
    conn.close()

💡 Guide to Answer:
Explain the use of finally to avoid resource leaks, especially in file or DB operations.

Best Answer: Logging records events and errors that occur during program execution. It’s essential for debugging, monitoring, and auditing ETL pipelines.

💡 Guide to Answer:
Mention logging is more robust than print() because it supports severity levels, time stamps, file outputs, etc.

Best Answer:

import logging

logging.basicConfig(level=logging.INFO, filename='etl.log',
                    format='%(asctime)s - %(levelname)s - %(message)s')

logging.info("ETL process started")
logging.error("Failed to connect to DB")

💡 Guide to Answer:
Explain the use of different log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.

Best Answer:

  • print() is for simple console output (good for quick debugging).

  • logging allows you to record events with timestamps, levels, and save logs to a file—ideal for production.

💡 Guide to Answer:
Say you use print() for ad-hoc debugging and logging for everything else—especially for error tracing.

Best Answer:

import logging

try:
    risky_code()
except Exception as e:
    logging.exception("An error occurred")

💡 Guide to Answer:
logging.exception() automatically adds the full traceback, useful for debugging complex pipelines.

Best Answer:

Structure your script with modular functions, wrap each with try-except, and log both successes and failures.

def extract():
    try:
        # logic
        logging.info("Extract successful")
    except Exception as e:
        logging.error(f"Extract failed: {e}")

💡 Guide to Answer:
Highlight how this keeps logs clean and makes it easier to pinpoint failures in multi-step pipelines.

Best Answer:

try:
    df = pd.read_csv('file.csv')
except pd.errors.ParserError as e:
    logging.error(f"Parsing error: {e}")


💡 Guide to Answer:
Mention that pandas has its own exceptions. You can skip bad rows using error_bad_lines=False (deprecated, but still asked in interviews) or on_bad_lines='skip' in newer versions.

9. Python Interview Questions for Data Engineer: Concurrency & Parallelism

1. What is the difference between concurrency and parallelism?

Best Answer:

  • Concurrency is when multiple tasks are in progress (but not necessarily running at the same time). Ideal for I/O-bound operations.

  • Parallelism is when multiple tasks run at the same time, usually on multiple cores. Ideal for CPU-bound tasks.

💡 Guide to Answer:
Use simple examples: downloading 100 files (I/O → concurrency), transforming 100 images (CPU → parallelism). Mention Python’s GIL and how it affects this.

Best Answer:
The GIL is a mutex that allows only one thread to execute Python bytecode at a time—even on multi-core systems. It restricts true parallel execution in multi-threading for CPU-bound tasks.

💡 Guide to Answer:
Clarify that GIL affects multi-threading, but not multi-processing. Libraries like numpy or pandas are optimized internally and not impacted as much.

Best Answer:
For I/O-bound tasks like:

  • Downloading files

  • Reading from APIs

  • Writing to disk or DBs

💡 Guide to Answer:
Explain that while threads share memory and are light-weight, they won’t speed up CPU-bound operations due to GIL.

Best Answer:

import threading

def download_file(url):
    ...  # placeholder: fetch the file at `url`

thread1 = threading.Thread(target=download_file, args=('url1',))
thread2 = threading.Thread(target=download_file, args=('url2',))

thread1.start()
thread2.start()
thread1.join()
thread2.join()


💡 Guide to Answer:
Mention .start() vs .join(), and how threads don’t block each other—good for parallel I/O.

Best Answer:
For CPU-bound tasks like:

  • Data transformations

  • Image/video processing

  • Complex calculations

💡 Guide to Answer:
Processes don’t share memory, which avoids GIL. Great for parallel execution on multiple CPU cores.

Best Answer:

from multiprocessing import Pool

def square(x):
    return x * x

with Pool(4) as p:
    results = p.map(square, [1, 2, 3, 4])


💡 Guide to Answer:
Mention Pool.map() is similar to the built-in map(), but runs in parallel. Ideal for batch processing.

Best Answer:

  • Thread: Lightweight, shares memory, good for I/O

  • Process: Heavyweight, runs in separate memory space

  • Pool: A pool of worker processes or threads, used for parallel batch tasks

💡 Guide to Answer:
Show you know when to use each. Use Thread for I/O, Process for CPU, Pool for collections of either.

Best Answer:

  • concurrent.futures (modern threading and multiprocessing)

  • multiprocessing

  • joblib (great for parallel loops)

  • dask (scales pandas across cores/clusters)

  • pyspark (for distributed computing)

💡 Guide to Answer:
Talk about your familiarity with any of them. concurrent.futures.ThreadPoolExecutor is especially interviewer-friendly.
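
A minimal ThreadPoolExecutor sketch for I/O-bound work (the URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor
import requests

urls = ['https://api.example.com/page/1', 'https://api.example.com/page/2']

def fetch(url):
    return requests.get(url, timeout=10).json()

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))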

Best Answer:

  • Use ThreadPoolExecutor or asyncio for non-blocking I/O

  • Use batching and chunking

  • Compress and stream data instead of loading everything at once

💡 Guide to Answer:
Mention how you’ve optimized API calls, file downloads, or DB writes using async or threads.

Best Answer:

  • Watch out for race conditions in threads

  • Use locks or queues for safe communication

  • Don’t mutate shared data

  • For processes: beware of high memory consumption and serialization overhead

💡 Guide to Answer:
Mention you’ve handled deadlocks, memory bloat, and inconsistent outputs when first implementing parallel pipelines.
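
A small sketch of protecting shared state with a lock (batches is an assumed list of work items):

import threading

counter = 0
lock = threading.Lock()

def process_batch(rows):
    global counter
    # ... I/O-bound work on rows goes here ...
    with lock:            # serialize updates to the shared counter
        counter += len(rows)

threads = [threading.Thread(target=process_batch, args=(batch,)) for batch in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()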

10. Python Interview Questions for Data Engineer: Automation & Scripting

1. How do you automate a daily task using Python?

Best Answer:

  • Write a script that performs the task (e.g., file cleanup, API data pull)

  • Use cron (Linux/macOS) or Task Scheduler (Windows) to run the script on a schedule

  • Example cron job:

0 2 * * * /usr/bin/python3 /path/to/script.py

💡 Guide to Answer:
Mention real tasks you’ve automated: like data refresh, backup, or file renaming. Bonus points if you mention using Python’s schedule or APScheduler for programmatic scheduling.

Best Answer:

import os

for filename in os.listdir('reports/'):
    new_name = filename.replace(' ', '_')
    os.rename(f'reports/{filename}', f'reports/{new_name}')

💡 Guide to Answer:
Highlight how automation saves time and reduces human error. You can also mention using glob for pattern-based selection.

Best Answer:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg.set_content('ETL job completed')
msg['Subject'] = 'Daily ETL Report'
msg['From'] = 'you@example.com'
msg['To'] = 'manager@example.com'

with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()
    server.login('you@example.com', 'password')
    server.send_message(msg)

💡 Guide to Answer:
Use this for alerting on failures or completion. Mention that in production, you’d store credentials securely (e.g., in .env, AWS Secrets Manager, or key vaults).

Best Answer:

# Option 1: Import and call
import my_script
my_script.run()

# Option 2: Use subprocess
import subprocess
subprocess.run(['python', 'script.py'])

💡 Guide to Answer:
Talk about using imports for modular code and subprocess when running CLI tools or external scripts.

Best Answer:

  • Use cron for hourly tasks:

0 * * * * /usr/bin/python3 /home/user/my_script.py

Or use the schedule library:

import schedule
import time

schedule.every().hour.do(run_etl)

while True:
    schedule.run_pending()
    time.sleep(60)

💡 Guide to Answer:
Mention cron for production and schedule for in-script testing or lightweight automation.

Best Answer:

import zipfile

with zipfile.ZipFile('archive.zip', 'w') as zipf:
    zipf.write('data.csv')

💡 Guide to Answer:
Useful when archiving logs or results after ETL jobs. You can also use shutil for entire folders.

Best Answer:

import os

api_key = os.getenv('API_KEY')

💡 Guide to Answer:
Critical for security—don’t hardcode credentials! Mention .env files with python-dotenv for local development.

Best Answer:

  • Use modular functions (extract(), transform(), load())

  • Add command-line arguments with argparse

  • Log output and exceptions

  • Version the script with Git

  • Schedule using cron or orchestrators like Airflow

💡 Guide to Answer:
This is a systems design-style question. Talk about how you build for reusability, observability, and repeatability.
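
A hedged skeleton showing the argparse and logging pieces (argument names are illustrative):

import argparse
import logging

def run_etl(source, destination):
    logging.info(f"Running ETL from {source} to {destination}")
    # extract(), transform(), load() would be called here

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Daily ETL job')
    parser.add_argument('--source', required=True)
    parser.add_argument('--destination', required=True)
    parser.add_argument('--log-level', default='INFO')
    args = parser.parse_args()

    logging.basicConfig(level=args.log_level)
    run_etl(args.source, args.destination)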

11. Python Interview Questions for Data Engineer: Testing & Debugging

1. Why is testing important in data engineering?

Best Answer:
Testing ensures that your data pipelines work correctly, catch regressions early, and maintain data quality. It helps avoid costly mistakes like data loss, duplication, or silent failures.

💡 Guide to Answer:
Mention how tests provide confidence in deployments, and make debugging faster during failures.

Best Answer:

  • Unit Tests – Test individual functions or components (e.g., transform logic)

  • Integration Tests – Verify interaction between components (e.g., DB write + read)

  • End-to-End Tests – Run the full pipeline on test data

  • Data Validation Tests – Check for nulls, duplicates, schema mismatches

💡 Guide to Answer:
Explain when and where you’d use each. Mention that you focus most on unit + data validation in ETL projects.

Best Answer: Use the built-in unittest or the more popular pytest.

def add(x, y):
    return x + y

def test_add():
    assert add(2, 3) == 5

💡 Guide to Answer:
Mention pytest for simplicity, unittest for built-in coverage. Talk about organizing tests into separate folders or files (test_*.py).

Best Answer: Use the unittest.mock module:

from unittest.mock import patch

@patch('module.api_call_function')
def test_api(mock_api):
    mock_api.return_value = {'status': 'success'}
    assert process_api_data() == 'success'

💡 Guide to Answer:
Talk about how mocking avoids making real API/DB calls during tests, speeding up and isolating the tests.

Best Answer:

  • Use print() for quick debugging

  • Use the pdb module or IDE breakpoints for step-by-step inspection

  • Use logging for real-time visibility in production scripts

💡 Guide to Answer:
Mention that while print() is helpful in dev, logging is better for tracing issues post-deployment. Bonus: mention tools like VS Code debugger or PyCharm debugger.

Best Answer:

def clean_data(df):
    return df.dropna()

def test_clean_data():
    input_df = pd.DataFrame({'a': [1, None]})
    expected_df = pd.DataFrame({'a': [1.0]})
    pd.testing.assert_frame_equal(clean_data(input_df), expected_df)

💡 Guide to Answer:
Use pandas.testing.assert_frame_equal() for checking equality. Mention edge cases (nulls, empty DataFrames, wrong types).

Best Answer:

With pytest:

pytest tests/

Or with unittest:

python -m unittest discover

💡 Guide to Answer:
Mention organizing tests under a /tests directory and using CI tools like GitHub Actions, Travis CI, or GitLab CI to automate testing.

Best Answer:

  • Check row counts before and after

  • Use hash totals or checksums on key fields

  • Validate schema and data types

  • Use data testing tools like Great Expectations

💡 Guide to Answer:
Show that data correctness > code correctness in your workflow. That’s the mindset of a great data engineer.

12. Python Interview Questions for Data Engineer: Cloud & Big Data Tool Integration

1. How do you connect to AWS S3 using Python?

Best Answer:

import boto3

s3 = boto3.client('s3')
s3.download_file('my-bucket', 'data.csv', 'data.csv')

💡 Guide to Answer:
Mention the need for AWS credentials (stored in ~/.aws/credentials or as env vars). Highlight using boto3 for all AWS services and mention IAM permissions.

Best Answer:

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-bucket')

for obj in response.get('Contents', []):
    print(obj['Key'])

💡 Guide to Answer:
Show your comfort with SDK documentation. Explain pagination if asked about buckets with large data sets.
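
If pagination comes up, a short sketch using boto3's paginator (bucket name illustrative, reusing the client from the snippet above):

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket'):
    for obj in page.get('Contents', []):
        print(obj['Key'])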

Best Answer:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('data.csv')

# Download
blob.download_to_filename('data.csv')

# Upload
blob.upload_from_filename('local.csv')

💡 Guide to Answer:
Mention needing a service account key and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.

Best Answer:

from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_dataframe(df, 'project.dataset.table')
job.result()

💡 Guide to Answer:
Explain how BigQuery scales large inserts. Mention to_gbq() from pandas-gbq for simpler usage in smaller jobs.

Best Answer:

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string("YOUR_CONN_STRING")
blob_client = blob_service_client.get_blob_client(container="mycontainer", blob="data.csv")

with open("data.csv", "rb") as data:
    blob_client.upload_blob(data)

💡 Guide to Answer:
Highlight the importance of secure credentials and rotating connection strings or tokens (SAS keys).

Best Answer:
PySpark is the Python API for Apache Spark. It allows you to process big data in a distributed computing environment using Python.

💡 Guide to Answer:
Mention it’s ideal for large-scale ETL, joins, and transformations across distributed clusters. Bonus: mention dataframes vs RDDs.

Best Answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.write.parquet("output.parquet")

💡 Guide to Answer:
Show familiarity with reading various formats (CSV, JSON, Parquet) and writing partitioned outputs for performance.

Best Answer:

  • Use chunked processing in pandas (chunksize)

  • Use Dask or Vaex for out-of-core dataframes

  • Use PySpark or SQL-based engines for distributed data handling

💡 Guide to Answer:
Show that you’re aware of scaling bottlenecks and can switch to bigger tools when needed.

Best Answer:
Dask is a parallel computing library that scales pandas-like syntax for out-of-memory or multi-core processing. It’s good for large dataframes, delayed execution, and cluster-based processing.

💡 Guide to Answer:
Mention Dask’s similarity to pandas and how it works well in local clusters or cloud environments. Optional: contrast with PySpark.
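
A minimal sketch of the pandas-like API (file pattern and columns are illustrative):

import dask.dataframe as dd

df = dd.read_csv('sales_*.csv')   # lazy, partitioned read instead of loading everything
result = df.groupby('department')['salary'].mean().compute()  # .compute() triggers execution
print(result)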

Best Answer:

import boto3

lambda_client = boto3.client('lambda')
response = lambda_client.invoke(
    FunctionName='my-lambda',
    InvocationType='Event',
    Payload=b'{}'
)

💡 Guide to Answer:
Explain Event is async, RequestResponse is sync. Lambda is perfect for lightweight ETL or event-driven pipelines.

Best Answer:

  • Use Apache Airflow (with Python DAGs)

  • Use AWS Step Functions or Cloud Composer (GCP’s managed Airflow)

  • Use Prefect for simpler orchestration with Pythonic syntax

💡 Guide to Answer:
Mention how you define dependencies, retries, notifications, and logs in your orchestrator.
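
For instance, a bare-bones Airflow DAG sketch (task names are illustrative; exact parameters vary slightly across Airflow 2.x versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...
def load(): ...

with DAG(
    dag_id='daily_etl',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='load', python_callable=load)
    t1 >> t2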

Best Answer:

  • Scalable and serverless (e.g., BigQuery, S3, GCS)

  • Highly available and secure

  • Cost-effective for storage and compute

  • Integrate easily with SDKs and orchestration tools

💡 Guide to Answer:
Highlight how cloud tools remove the burden of infrastructure so engineers can focus on logic and scale.

Conclusion

Mastering Python is non-negotiable if you’re aiming for a data engineering role. But more than just writing code, you need to understand how to use Python in real data environments—from transforming massive datasets to automating workflows and integrating with cloud platforms.

This blog gave you a deep dive into all the critical Python interview questions for data engineer roles, along with practical examples to help you answer them confidently.

If you’re serious about cracking your next interview, bookmark this guide or share it with someone who needs it!

Frequently Asked Questions

Is Python enough for a data engineering role?

Python is one of the most essential languages for data engineers due to its simplicity, rich ecosystem, and support for data processing libraries like pandas, SQLAlchemy, PySpark, and more. However, you’ll also need skills in SQL, cloud platforms, and big data tools.

Focus on:

  • Data structures & algorithms

  • File and data handling (pandas, numpy)

  • ETL scripting

  • API usage

  • SQL & database interactions

  • Cloud integrations (e.g., S3, BigQuery)

  • Error handling & automation

  • Building an end-to-end ETL pipeline

  • Automating data cleanup tasks

  • Integrating Python with cloud services (e.g., uploading to S3, querying BigQuery)

  • Parallel processing scripts using multiprocessing or PySpark

Yes, especially at product-based companies. You’ll often face a coding round focused on data manipulation, file parsing, or simple algorithms using Python. SQL and system design rounds are also common.
