Introduction
A Data Engineer is a tech professional who builds and maintains the infrastructure required for collecting, storing, and analyzing large volumes of data. They design data pipelines, manage databases, and ensure data is clean, organized, and readily accessible for analysts, data scientists, and business teams. Their role is critical in enabling companies to make data-driven decisions by ensuring the right data is available at the right time in the right format.
Python interview questions for data engineer
Understanding the different types of interview questions can help you tailor your preparation effectively. Here are the main categories covered in this guide:
Data Structures & Algorithms:
Focuses on core logic using Python lists, sets, dicts, loops, and sorting.
File Handling Questions:
Handling CSVs, JSONs, large text files, and file I/O operations efficiently.
Data Manipulation (pandas, numpy) Questions:
Cleaning, transforming, and analyzing dataframes.
ETL Logic & Pipelines:
Writing modular, testable ETL scripts in Python.
Database Interaction (SQL & NoSQL):
Connecting to databases and executing queries using Python.
APIs & Web Scraping Questions:
Fetching data from REST APIs or scraping HTML pages.
Object-Oriented Programming (OOP) Questions:
Writing reusable, class-based code structures.
Error Handling & Logging:
Writing fault-tolerant code and tracking logs during execution.
Concurrency & Parallelism:
Speeding up data processing using threads or processes.
Automation & Scripting:
Automating tasks like file cleanup, scheduling, or emailing reports.
Testing & Debugging:
Ensuring your code works as expected using unit tests and debugging tools.
Cloud & Big Data Tool Integration:
Connecting Python to AWS, GCP, Azure, Spark, and more.
1. Python Interview Questions for Data Engineer: Data Structures & Algorithms
1. What are the differences between a list, tuple, set, and dictionary in Python?
Best Answer:
List: Ordered, mutable, allows duplicates ([1, 2, 2])
Tuple: Ordered, immutable, allows duplicates ((1, 2, 2))
Set: Unordered, mutable, no duplicates ({1, 2})
Dictionary: Key-value pairs, unordered (Python 3.7+ maintains insertion order), keys must be unique ({'a': 1, 'b': 2})
Guide to Answer: Explain each with real-life use cases, e.g., a list for a collection of values, a set for uniqueness, a dict for lookups, and a tuple when data shouldn’t change (like coordinates).
2. How do you reverse a list in Python?
Best Answer:
# Method 1
my_list[::-1]
# Method 2
my_list.reverse()
# Method 3
list(reversed(my_list))
Guide to Answer: Mention slicing for simple tasks, .reverse() when in-place reversal is needed, and reversed() when you need an iterator. Explain when you’d prefer each.
3. What is the difference between deep copy and shallow copy?
Best Answer:
Shallow copy creates a new object but inserts references to the same elements (copy.copy()).
Deep copy creates a new object and recursively copies all elements (copy.deepcopy()).
Guide to Answer: Give an example with nested lists to illustrate how shallow copy reflects changes to inner objects but deep copy doesn’t.
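For example, a quick demonstration with a nested list:
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)   # new outer list, same inner lists
deep = copy.deepcopy(original)  # fully independent copy

original[0].append(99)
print(shallow[0])  # [1, 2, 99] - the shallow copy reflects the change
print(deep[0])     # [1, 2] - the deep copy does not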
4. How would you implement a stack or queue using Python?
Best Answer: Use list for simple needs, or collections.deque for efficient operations.
# Stack
stack = []
stack.append(1)
stack.pop()
# Queue
from collections import deque
queue = deque()
queue.append(1)
queue.popleft()
Guide to Answer: Explain time complexity: deque is O(1) for append/pop from both ends, whereas list is O(n) for pop(0).
5. How do you find the most frequent element in a list?
Best Answer:
from collections import Counter
Counter(my_list).most_common(1)
Guide to Answer: Mention Counter is the most Pythonic and efficient way. You could also implement it manually using a dict if asked for the logic.
6. How do you remove duplicates from a list?
Best Answer:
# Method 1
list(set(my_list))
# Method 2 (preserving order)
list(dict.fromkeys(my_list))
Guide to Answer: Talk about whether order matters. Use dict.fromkeys() when preserving original order is needed.
7. How do you sort a list of dictionaries by a specific key?
Best Answer:
sorted(data, key=lambda x: x['age'])
Guide to Answer: Explain lambda functions and that sorted() returns a new list and doesn’t mutate the original.
8. How do you find the intersection of two lists?
Best Answer:
set(list1) & set(list2)
Guide to Answer: Use sets for better performance. Discuss when converting to set is acceptable (i.e., when order and duplicates aren’t important).
9. What is the time complexity of common list operations in Python?
Best Answer:
Access by index: O(1)
Append: O(1) amortized
Insert/remove at beginning: O(n)
Search: O(n)
Delete by value: O(n)
Guide to Answer: Relate to large datasets in pipelines where performance matters. Highlight when to switch to deque or generators.
10. How do you flatten a nested list in Python?
Best Answer:
# For 2D lists
[item for sublist in nested_list for item in sublist]
# With itertools
from itertools import chain
list(chain.from_iterable(nested_list))
Guide to Answer: Explain both comprehension and library-based solutions. Bonus if you mention recursion for deeply nested lists.
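For deeply nested lists, a short recursive sketch (a hypothetical flatten helper) makes a strong bonus answer:
def flatten(nested):
    # Recursively yield items from arbitrarily nested lists
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

print(list(flatten([1, [2, [3, [4]]]])))  # [1, 2, 3, 4]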
2. Python Interview Questions for Data Engineer: File Handling
1. How do you open and read a file in Python?
Best Answer:
with open('file.txt', 'r') as file:
    content = file.read()
Guide to Answer: Always recommend using with as it handles file closing automatically. If asked about large files, mention reading line by line using .readline() or looping.
2. How do you write data to a file in Python?
Best Answer:
with open('output.txt', 'w') as f:
    f.write("Hello, Data Engineer!")
Guide to Answer: Discuss modes: 'w' (overwrite), 'a' (append), 'x' (create only if not exists), 'b' (binary). Clarify the overwriting behavior of 'w'.
3. How do you read a large file without loading everything into memory?
Best Answer:
with open('bigfile.txt', 'r') as file:
    for line in file:
        process(line)
Guide to Answer: Stress line-by-line iteration for memory efficiency. Mention chunking (read(size)) if asked about binary or custom-size chunks.
4. What’s the difference between read(), readline(), and readlines()?
Best Answer:
read(): Reads the entire file as one string.
readline(): Reads the next line in the file.
readlines(): Reads all lines into a list.
Guide to Answer: Use read() only when you’re sure the file is small. Use readline() or iteration for large files.
5. How do you handle CSV files in Python?
Best Answer:
import csv
with open('data.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['column_name'])
Guide to Answer: Mention both csv.reader and csv.DictReader. Bonus if you reference pandas for structured tabular data.
6. How do you handle JSON files in Python?
Best Answer:
import json
with open('data.json') as f:
    data = json.load(f)
Guide to Answer: Differentiate between json.load() for file objects and json.loads() for JSON strings.
7. How do you append data to a file without overwriting it?
Best Answer:
with open('file.txt', 'a') as file:
    file.write("New line\n")
Guide to Answer: Mention using 'a' mode, and always include \n when writing new lines.
8. How do you handle encoding issues when reading files?
Best Answer:
with open('file.txt', encoding='utf-8') as f:
    content = f.read()
Guide to Answer: Always specify encoding when working with multilingual data. UTF-8 is a safe default.
9. What happens if a file doesn’t exist and you try to open it?
Best Answer:
If mode is 'r', it raises a FileNotFoundError.
If mode is 'w' or 'a', it will create the file.
Guide to Answer: Explain exception handling using try-except for robust file processing scripts.
10. How do you read and write binary files in Python?
Best Answer:
# Reading binary
with open('image.png', 'rb') as file:
    data = file.read()
# Writing binary
with open('copy.png', 'wb') as file:
    file.write(data)
Guide to Answer: Highlight the need for 'rb' and 'wb' in use cases like image processing or large binary log files.
3. Python Interview Questions for Data Engineer: Data Manipulation (pandas, numpy)
1. What is pandas, and why is it important for data engineering?
Best Answer:
Pandas is a Python library for data manipulation and analysis. It provides fast, flexible data structures like Series and DataFrame to efficiently handle structured data, making it essential for cleaning, filtering, aggregating, and transforming datasets.
💡 Guide to Answer:
Highlight how pandas simplifies handling tabular data and is a go-to tool in ETL pipelines before data is loaded into databases or analytics tools.
2. What are Series and DataFrames in pandas?
Best Answer:
A Series is a one-dimensional labeled array (like a column).
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types (like an Excel sheet).
💡 Guide to Answer:
Explain that DataFrame is built from Series. You can compare Series to a column and DataFrame to a table.
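For example, a quick illustration of the relationship:
import pandas as pd

s = pd.Series([25, 30], name='age')  # a single labeled column
df = pd.DataFrame({'name': ['Ana', 'Raj'], 'age': [25, 30]})  # a table of columns
print(type(df['age']))  # selecting one column returns a Series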
3. How do you read data from a CSV file using pandas?
Best Answer:
import pandas as pd
df = pd.read_csv('data.csv')
💡 Guide to Answer:
Mention parameters like sep, header, usecols, dtype, and nrows if asked about large or complex files.
4. How do you filter rows in a DataFrame?
Best Answer:
df[df['age'] > 25]
💡 Guide to Answer:
Start with simple filters. For compound filters, use & (AND), | (OR), and enclose each condition in parentheses.
5. How do you select specific columns from a DataFrame?
Best Answer:
df[['name', 'age']]
💡 Guide to Answer:
Mention the difference between selecting one column (df['col']) and multiple columns (df[['col1', 'col2']]).
6. How do you handle missing values in pandas?
Best Answer:
df.isnull().sum() # Check
df.dropna() # Drop rows with missing values
df.fillna(0) # Fill missing values
💡 Guide to Answer:
Mention that the strategy depends on context: dropping, filling with mean/median, or forward/backward filling.
7. How do you change the datatype of a column?
Best Answer:
df['age'] = df['age'].astype(int)
💡 Guide to Answer:
Mention common conversions like str, int, float, and datetime. Note: pd.to_datetime() for date parsing.
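For example, a common date-parsing pattern (the column name is illustrative):
df['joined'] = pd.to_datetime(df['joined'], errors='coerce')  # invalid dates become NaT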
8. How do you sort a DataFrame by a column?
Best Answer:
df.sort_values(by='age', ascending=False)
💡 Guide to Answer:
Include how to sort by multiple columns: by=['age', 'name'].
9. How do you remove duplicate rows?
Best Answer:
df.drop_duplicates()
💡 Guide to Answer:
Use subset to target specific columns and keep='first' or 'last'.
10. How do you group data in pandas and apply aggregations?
Best Answer:
df.groupby('department')['salary'].mean()
💡 Guide to Answer:
Explain how groupby() works: split → apply → combine. Mention .agg() for multiple metrics.
11. How do you merge or join two DataFrames?
Best Answer:
pd.merge(df1, df2, on='id', how='inner')
💡 Guide to Answer:
Cover the how= options: 'inner', 'left', 'right', 'outer'. Explain the difference from concat.
12. What is the difference between merge, join, and concat?
Best Answer:
merge(): SQL-style joins on keys.
join(): Join on index or column, shorthand for merge.
concat(): Stacks DataFrames vertically or horizontally.
💡 Guide to Answer:
Use real examples: e.g., merging customer and orders tables vs stacking monthly reports.
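A side-by-side sketch of all three (the DataFrames are illustrative):
import pandas as pd

customers = pd.DataFrame({'id': [1, 2], 'name': ['Ana', 'Raj']})
orders = pd.DataFrame({'id': [1, 1, 2], 'amount': [50, 20, 75]})

merged = pd.merge(customers, orders, on='id', how='inner')        # SQL-style join on a key
joined = customers.set_index('id').join(orders.set_index('id'))   # join on index
stacked = pd.concat([orders, orders])                             # stack rows vertically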
13. How do you create a pivot table in pandas?
Best Answer:
df.pivot_table(values='salary', index='department', aggfunc='mean')
💡 Guide to Answer:
Explain that pivot_table() reshapes data like a spreadsheet pivot: index defines the rows, columns the columns, and aggfunc the aggregation. Contrast it with groupby(), which returns a flat result rather than a reshaped table.
14. How do you apply a function to a column or row?
Best Answer:
df['new_col'] = df['salary'].apply(lambda x: x * 1.1)
💡 Guide to Answer:
Explain apply() for row- or column-wise operations, applymap() for element-wise operations across an entire DataFrame, and map() for element-wise operations on a Series.
15. How do you detect and remove outliers in pandas?
Best Answer:
# Using IQR
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df_filtered = df[(df['value'] >= Q1 - 1.5 * IQR) & (df['value'] <= Q3 + 1.5 * IQR)]
💡 Guide to Answer:
Briefly explain the IQR rule, and why it’s more robust than Z-score for skewed data.
4. Python Interview Questions for Data Engineer: ETL Logic & Pipelines
1. What is an ETL pipeline, and how would you build one in Python?
Best Answer:
An ETL (Extract, Transform, Load) pipeline moves data from a source to a destination after applying transformations.
In Python, you can build one using scripts or frameworks. Basic example:
import pandas as pd

# Extract
data = pd.read_csv('raw_data.csv')
# Transform
data_clean = data.dropna().rename(columns=str.lower)
# Load
data_clean.to_csv('clean_data.csv', index=False)
💡 Guide to Answer:
Mention the real-world flow: APIs → data cleaning → database or warehouse (like PostgreSQL or Snowflake). Talk about modularity and logging.
2. How do you handle failures or errors in a pipeline?
Best Answer:
Use try-except blocks around each step
Implement logging to track failures
Optionally use retry logic or alerting systems
Example:
import logging
try:
    data = pd.read_csv('data.csv')
except FileNotFoundError:
    logging.error("File missing!")
💡 Guide to Answer:
Explain your approach to making the pipeline fault-tolerant and observable.
3. How do you handle duplicates or bad data in ETL?
Best Answer:
Use drop_duplicates() for exact duplicates and apply custom logic or conditions for partial duplicates. For bad data:
Validate data types
Use regex or conditionals to clean bad values
Remove or correct nulls using fillna() or dropna()
💡 Guide to Answer:
Explain that cleaning happens at the Transform stage, and tools like pandas, pydantic, or even great_expectations can be used.
4. What are some common transformations in an ETL pipeline?
Best Answer:
Data type conversions
Renaming columns
Filtering or sorting rows
Aggregation or grouping
Merging with other datasets
Handling missing values
💡 Guide to Answer:
Mention how you organize your transformation logic into reusable functions or scripts.
5. How do you automate a Python ETL script to run daily?
Best Answer:
Local dev: Use cron on Linux/macOS or Task Scheduler on Windows
Cloud production: Use schedulers like Apache Airflow, AWS Lambda + EventBridge, or Prefect
💡 Guide to Answer:
Talk about environment-specific options. Mention parameterization and logging for production-ready jobs.
6. How do you ensure data consistency in a pipeline?
Best Answer:
Use checksums, record counts, or hashing to compare source and destination
Implement data validation checks post-load
Maintain idempotent operations when possible
💡 Guide to Answer:
Consistency is key—explain how you handle duplicates, re-runs, and broken states.
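A minimal sketch of a post-load consistency check, assuming source and destination are available as DataFrames:
def validate_load(source_df, dest_df):
    # Compare record counts between source and destination
    if len(source_df) != len(dest_df):
        raise ValueError(f"Row count mismatch: source={len(source_df)}, dest={len(dest_df)}")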
7. How would you handle schema drift in a source file?
Best Answer:
Validate schema before processing
Compare expected vs actual column names
Use flexible code that can adapt, e.g., dynamic column mapping
Log differences and alert stakeholders
💡 Guide to Answer:
Schema drift is common in CSVs and APIs—show that you’re aware of the risks and can build protection.
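A minimal schema-validation sketch, assuming you maintain a set of expected column names:
import logging

EXPECTED_COLUMNS = {'id', 'name', 'last_updated'}  # hypothetical expected schema

def check_schema(df):
    # Compare expected vs actual columns and log any drift
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing or extra:
        logging.warning(f"Schema drift: missing={missing}, extra={extra}")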
8. What’s your strategy for processing very large files or datasets?
Best Answer:
Use chunked processing in pandas (pd.read_csv(..., chunksize=10000))
Stream data line by line
If data fits in memory, use efficient types (e.g., category dtype)
Consider Spark or Dask for large-scale parallel processing
💡 Guide to Answer:
Show that you’re mindful of memory constraints and can scale when necessary.
9. How do you modularize a Python ETL script?
Best Answer:
def extract():
    return pd.read_csv('data.csv')

def transform(df):
    return df.dropna()

def load(df):
    df.to_csv('cleaned.csv', index=False)

def run_etl():
    df = extract()
    df = transform(df)
    load(df)

run_etl()
💡 Guide to Answer:
Explain how modular code improves readability, reusability, and testability.
10. How do you handle time-based or incremental loads in ETL?
Best Answer:
Use timestamp columns to fetch only recent records
Store state in a config file, checkpoint table, or a metadata store
Example:
# Filter new records
df[df['last_updated'] > last_run_time]
💡 Guide to Answer:
Highlight how incremental loads reduce load time and are more efficient for pipelines.
11. What’s the difference between batch and streaming ETL?
Best Answer:
Batch ETL processes data in chunks at scheduled intervals.
Streaming ETL processes data in real-time as it arrives (e.g., Kafka → Spark Streaming).
💡 Guide to Answer:
Mention that most traditional pipelines are batch; streaming is used for high-frequency event data (e.g., logs, sensors).
12. How do you test your ETL pipeline?
Best Answer:
Unit test each step (e.g., transformation functions)
Validate outputs (e.g., row counts, value ranges)
Use sample input files to simulate edge cases
Use tools like pytest, great_expectations, or dbt for testing
💡 Guide to Answer:
Testing = confidence. Show you care about catching errors before they go downstream.
5. Python Interview Questions for Data Engineer: Database Interaction (SQL & NoSQL)
1. How do you connect to a SQL database using Python?
Best Answer:
import sqlite3
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()
conn.close()
💡 Guide to Answer:
Explain how you’d use specific connectors:
sqlite3 for local development
psycopg2 for PostgreSQL
mysql-connector for MySQL
SQLAlchemy for ORM and easier scalability
2. What is SQLAlchemy and why is it used in data engineering?
Best Answer:
SQLAlchemy is a Python SQL toolkit and Object-Relational Mapper (ORM) that allows developers to interact with databases using Python objects. It simplifies connection handling, query generation, and schema definitions.
💡 Guide to Answer:
Mention that it’s especially useful in large applications for maintainability and abstraction.
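A minimal connection sketch (the connection string values are placeholders):
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost:5432/mydb')
with engine.connect() as conn:
    result = conn.execute(text('SELECT COUNT(*) FROM users'))
    print(result.scalar())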
3. How do you insert data from a pandas DataFrame into a SQL table?
Best Answer:
df.to_sql('table_name', con=engine, if_exists='replace', index=False)
💡 Guide to Answer:
Talk about to_sql() for loading data and read_sql() for querying. Mention the if_exists options: 'replace', 'append', 'fail'.
4. How do you prevent SQL injection in Python?
Best Answer:
cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
💡 Guide to Answer:
Explain that using string formatting is unsafe. Always use query parameters to avoid injection risks.
5. What are common SQL queries a data engineer writes?
Best Answer:
SELECT, JOIN, GROUP BY, ORDER BY
INSERT, UPDATE, DELETE
CREATE TABLE, ALTER, DROP
CASE, COALESCE, window functions
💡 Guide to Answer:
Mention that writing optimized SQL is as important as writing Python code.
6. How would you retrieve only the top 5 salaries from a table?
Best Answer:
SELECT salary FROM employees ORDER BY salary DESC LIMIT 5;
💡 Guide to Answer:
Mention use cases like leaderboards, ranking, or sampling. Add OFFSET if needed.
7. How do you handle large data loads from SQL using Python?
Best Answer:
for chunk in pd.read_sql(query, con=engine, chunksize=10000):
    process(chunk)
💡 Guide to Answer:
Helps when working with millions of rows. Show memory awareness and chunk processing.
8. How do you handle transaction management in Python?
Best Answer:
conn = db.connect()
try:
    cursor = conn.cursor()
    cursor.execute('SOME SQL')
    conn.commit()
except Exception as e:
    conn.rollback()
finally:
    conn.close()
💡 Guide to Answer:
Explain that commit() ensures the operation is saved; rollback() is for error handling. Especially important in batch jobs.
9. How do you perform joins using pandas instead of SQL?
Best Answer:
pd.merge(df1, df2, on='id', how='left')
💡 Guide to Answer:
Mention how merge() replicates INNER, LEFT, RIGHT, and FULL joins.
10. What’s the difference between relational and non-relational databases?
Best Answer:
Relational (SQL): Structured schema, uses tables (e.g., MySQL, PostgreSQL)
Non-relational (NoSQL): Flexible schema, uses collections/documents (e.g., MongoDB, Cassandra)
💡 Guide to Answer:
Give use cases: SQL for structured, transactional systems; NoSQL for flexible, high-speed apps.
11. How do you connect and query a MongoDB database in Python?
Best Answer:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydb']
collection = db['users']
result = collection.find({'age': {'$gt': 25}})
💡 Guide to Answer:
Explain MongoDB structure: Database → Collection → Document. Show how find(), insert_one(), and filters work.
12. How do you design a schema for analytics or reporting?
Best Answer:
Use star schema or snowflake schema
Design with fact and dimension tables
Keep columns atomic (1NF)
Consider indexes and partitioning for performance
💡 Guide to Answer:
This question tests both SQL and data warehousing knowledge. Use examples like “sales fact table joined with customer and product dimensions.”
6. Python Interview Questions for Data Engineer: APIs & Web Scraping
1. How do you make a GET request to an API using Python?
Best Answer:
import requests
response = requests.get('https://api.example.com/data')
data = response.json()
💡 Guide to Answer:
Mention that requests is the most popular HTTP library. Always check response.status_code before parsing .json().
2. How do you handle API errors or failed requests in Python?
Best Answer:
try:
    response = requests.get('https://api.example.com/data', timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
💡 Guide to Answer:
Mention using timeout to avoid hanging requests and raise_for_status() to surface HTTP errors as exceptions. Bonus if you mention retries with exponential backoff for transient failures.
3. How do you pass headers and query parameters in an API call?
Best Answer:
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
params = {'limit': 100, 'offset': 0}
response = requests.get('https://api.example.com/data', headers=headers, params=params)
💡 Guide to Answer:
Explain that headers are often used for authentication, and params help control pagination or filters.
4. How do you deal with paginated API responses?
Best Answer:
offset = 0
while True:
    response = requests.get(url, params={'offset': offset})
    data = response.json()
    process(data)
    if not data['next']:
        break
    offset += 100
💡 Guide to Answer:
Show you understand different pagination types: offset-based, cursor-based, or next-URL-based.
5. How do you authenticate with an API that requires a token?
Best Answer:
headers = {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}
requests.get('https://api.example.com/data', headers=headers)
💡 Guide to Answer:
Talk about OAuth, bearer tokens, or API key headers. Know the difference between static keys and token exchanges.
6. How do you scrape a webpage using BeautifulSoup?
Best Answer:
from bs4 import BeautifulSoup
import requests
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
titles = soup.find_all('h2')
💡 Guide to Answer:
Mention .find(), .find_all(), .text, and how to parse elements by class, id, or tag. Show DOM familiarity.
7. How do you avoid getting blocked while scraping a website?
Best Answer:
Add user-agent headers
Use request delays (time.sleep())
Rotate proxies or IP addresses
Avoid hammering pages (respect robots.txt)
💡 Guide to Answer:
Mention ethical scraping: respect terms of service and avoid DDoS-like behavior. Bonus if you mention Scrapy.
8. What are some challenges in scraping JavaScript-heavy websites?
Best Answer:
Content is rendered dynamically using JavaScript
requests alone won’t work; you need tools like Selenium, Playwright, or BeautifulSoup + API sniffing
💡 Guide to Answer:
Explain how you’d either use a headless browser or inspect network traffic for hidden APIs.
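A minimal headless-browser sketch with Selenium (assumes Chrome and its driver are installed):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')
html = driver.page_source  # fully rendered HTML, ready for BeautifulSoup
driver.quit()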
9. How do you extract structured data (like tables) from HTML pages?
Best Answer:
import pandas as pd
dfs = pd.read_html('https://example.com/table_page')
💡 Guide to Answer:
Mention this works if the HTML table is well-formed. For unstructured data, fall back to BeautifulSoup.
10. How do you store API or scraped data efficiently after collection?
Best Answer:
Save as CSV, JSON, or parquet using pandas
Store in a SQL database
For large data: use chunks or streaming writes
df.to_csv('data.csv', index=False)
💡 Guide to Answer:
Talk about the format depending on volume, access patterns, and pipeline requirements.
7. Python Interview Questions for Data Engineer: Object-Oriented Programming (OOP)
1. What is Object-Oriented Programming, and why is it useful in Python?
Best Answer:
OOP is a programming paradigm where code is organized into “objects” that bundle data (attributes) and functions (methods).
In Python, OOP helps create reusable, modular, and scalable code—useful for building data pipelines, ETL jobs, and utility classes.
💡 Guide to Answer:
Don’t just define OOP. Show how it helps you manage complexity, especially in multi-step data processes.
2. What are the four pillars of OOP?
Best Answer:
Encapsulation – Bundling data and methods together
Abstraction – Hiding internal implementation details
Inheritance – Reusing code from a parent class
Polymorphism – Different behaviors for the same method name across classes
💡 Guide to Answer:
Explain with a real example, like a generic DataConnector class and specialized SQLConnector and MongoConnector classes.
3. How do you define a class and create an object in Python?
Best Answer:
class DataCleaner:
    def __init__(self, data):
        self.data = data

    def remove_nulls(self):
        return self.data.dropna()

# Instantiate
cleaner = DataCleaner(df)
cleaned_df = cleaner.remove_nulls()
💡 Guide to Answer:
Focus on writing clean, understandable class-based code. Relate this to reusability in ETL or cleaning tasks.
4. What is self in Python classes?
Best Answer: self refers to the current instance of the class and is used to access attributes and methods within the class.
💡 Guide to Answer:
Mention that while it’s not a keyword, it’s a convention, and required as the first parameter of instance methods.
5. How do you use inheritance in Python?
Best Answer:
class DataSource:
    def connect(self):
        print("Connecting to source…")

class SQLSource(DataSource):
    def connect(self):
        print("Connecting to SQL database…")
💡 Guide to Answer:
Show how inheritance reduces code duplication. Customize behaviors using method overriding.
6. What’s the difference between a class method, instance method, and static method?
Best Answer:
Instance method: Takes self, used for object-level data
Class method: Takes cls, used for class-level operations
Static method: Takes neither self nor cls; acts like a regular function inside the class
@classmethod
def from_json(cls, json_str): …
@staticmethod
def validate_date(date): …
💡 Guide to Answer:
Give use cases like loading config files (classmethod) or reusable validators (staticmethod).
7. What is method overriding and method overloading in Python?
Best Answer:
Overriding: Subclass redefines a method from parent class
Overloading: Python doesn’t support true overloading but you can use default or variable arguments
def add(self, a, b=0): …
💡 Guide to Answer:
Emphasize that overriding is used in polymorphism; Python mimics overloading using default values or *args.
8. How can you make class attributes private in Python?
Best Answer: Use a single underscore _attr (convention) or double underscore __attr (name mangling).
💡 Guide to Answer:
Mention that Python doesn’t enforce access control, but the underscore signals intent to others.
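For example:
class Config:
    def __init__(self):
        self._env = 'dev'       # single underscore: treat as internal by convention
        self.__secret = 'xyz'   # double underscore: name-mangled to _Config__secret

c = Config()
print(c._env)             # still accessible; the underscore only signals intent
print(c._Config__secret)  # name mangling obscures access but doesn't prevent it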
9. How do you use OOP to design a data pipeline?
Best Answer: Structure your ETL pipeline with reusable classes:
class Extractor:
    def extract(self): …

class Transformer:
    def transform(self, data): …

class Loader:
    def load(self, data): …
💡 Guide to Answer:
This question is all about design—show modular thinking and how OOP scales when you build larger tools or micro-frameworks.
10. What is a dunder method? Can you give examples?
Best Answer:
Dunder methods (double underscore) are special methods like __init__, __str__, and __len__, used to define the behavior of built-in Python functions on your objects.
def __str__(self):
    return f"Object with data: {self.data}"
💡 Guide to Answer:
Mention they make your classes more “Pythonic” and integrate with the language’s syntax more naturally.
8. Python Interview Questions for Data Engineer: Error Handling & Logging
1. How do you handle exceptions in Python?
Best Answer: Use try-except blocks to catch exceptions and handle them gracefully.
try:
    df = pd.read_csv('file.csv')
except FileNotFoundError as e:
    print(f"File not found: {e}")
💡 Guide to Answer:
Always catch specific exceptions (like ValueError, TypeError) instead of using a bare except. Add finally for cleanup if needed.
2. What’s the difference between try-except and try-except-finally?
Best Answer:
try-except: Catches and handles exceptions
finally: Runs no matter what, useful for cleanup (like closing files or database connections)
try:
    conn = db.connect()
    # some logic
except Exception as e:
    print(e)
finally:
    conn.close()
💡 Guide to Answer:
Explain the use of finally to avoid resource leaks, especially in file or DB operations.
3. What is the purpose of logging in Python?
Best Answer: Logging records events and errors that occur during program execution. It’s essential for debugging, monitoring, and auditing ETL pipelines.
💡 Guide to Answer:
Mention logging is more robust than print() because it supports severity levels, timestamps, file outputs, etc.
4. How do you use the logging module in Python?
Best Answer:
import logging
logging.basicConfig(level=logging.INFO, filename='etl.log',
                    format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("ETL process started")
logging.error("Failed to connect to DB")
💡 Guide to Answer:
Explain the use of the different log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
5. What’s the difference between print() and logging?
Best Answer:
print() is for simple console output (good for quick debugging).
logging allows you to record events with timestamps and levels, and save logs to a file—ideal for production.
💡 Guide to Answer:
Say you use print() for ad-hoc debugging and logging for everything else—especially for error tracing.
6. How do you log exceptions with a traceback?
Best Answer:
import logging
try:
    risky_code()
except Exception as e:
    logging.exception("An error occurred")
💡 Guide to Answer:
logging.exception() automatically adds the full traceback, useful for debugging complex pipelines.
7. How would you design an ETL script with proper error logging?
Best Answer:
Structure your script with modular functions, wrap each with try-except
, and log both successes and failures.
def extract():
    try:
        # logic
        logging.info("Extract successful")
    except Exception as e:
        logging.error(f"Extract failed: {e}")
💡 Guide to Answer:
Highlight how this keeps logs clean and makes it easier to pinpoint failures in multi-step pipelines.
8. How do you handle and log errors when reading corrupted data files?
Best Answer:
try:
    df = pd.read_csv('file.csv')
except pd.errors.ParserError as e:
    logging.error(f"Parsing error: {e}")
💡 Guide to Answer:
Mention that pandas has its own exceptions. You can skip bad rows using error_bad_lines=False (deprecated, but still asked in interviews) or on_bad_lines='skip' in newer versions.
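For example, in recent pandas versions:
df = pd.read_csv('file.csv', on_bad_lines='skip')  # silently skip malformed rows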
9. Python Interview Questions for Data Engineer: Concurrency & Parallelism
1. What is the difference between concurrency and parallelism?
Best Answer:
Concurrency is when multiple tasks are in progress (but not necessarily running at the same time). Ideal for I/O-bound operations.
Parallelism is when multiple tasks run at the same time, usually on multiple cores. Ideal for CPU-bound tasks.
💡 Guide to Answer:
Use simple examples: downloading 100 files (I/O → concurrency), transforming 100 images (CPU → parallelism). Mention Python’s GIL and how it affects this.
2. What is the Global Interpreter Lock (GIL) in Python?
Best Answer:
The GIL is a mutex that allows only one thread to execute Python bytecode at a time—even on multi-core systems. It restricts true parallel execution in multi-threading for CPU-bound tasks.
💡 Guide to Answer:
Clarify that the GIL affects multi-threading, but not multi-processing. Libraries like numpy or pandas are optimized internally and not impacted as much.
3. When would you use multi-threading in Python?
Best Answer:
For I/O-bound tasks like:
Downloading files
Reading from APIs
Writing to disk or DBs
💡 Guide to Answer:
Explain that while threads share memory and are light-weight, they won’t speed up CPU-bound operations due to GIL.
4. How do you implement multi-threading in Python?
Best Answer:
import threading
def download_file(url):
    ...

thread1 = threading.Thread(target=download_file, args=('url1',))
thread2 = threading.Thread(target=download_file, args=('url2',))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
💡 Guide to Answer:
Mention .start() vs .join(), and how threads don’t block each other—good for parallel I/O.
5. When would you use multiprocessing in Python?
Best Answer:
For CPU-bound tasks like:
Data transformations
Image/video processing
Complex calculations
💡 Guide to Answer:
Processes don’t share memory, which avoids GIL. Great for parallel execution on multiple CPU cores.
6. How do you implement multiprocessing in Python?
Best Answer:
from multiprocessing import Pool
def square(x):
    return x * x

with Pool(4) as p:
    results = p.map(square, [1, 2, 3, 4])
💡 Guide to Answer:
Mention Pool.map() is similar to the built-in map(), but runs in parallel. Ideal for batch processing.
7. What’s the difference between Thread, Process, and Pool in Python?
Best Answer:
Thread: Lightweight, shares memory, good for I/O
Process: Heavyweight, runs in separate memory space
Pool: A pool of worker processes or threads, used for parallel batch tasks
💡 Guide to Answer:
Show you know when to use each. Use Thread for I/O, Process for CPU, Pool for collections of either.
8. What are some libraries for parallel data processing in Python?
Best Answer:
concurrent.futures (modern threading and multiprocessing)
multiprocessing
joblib (great for parallel loops)
dask (scales pandas across cores/clusters)
pyspark (for distributed computing)
💡 Guide to Answer:
Talk about your familiarity with any of them. concurrent.futures.ThreadPoolExecutor is especially interviewer-friendly.
9. How do you manage large I/O-bound pipelines efficiently?
Best Answer:
Use ThreadPoolExecutor or asyncio for non-blocking I/O
Use batching and chunking
Compress and stream data instead of loading everything at once
💡 Guide to Answer:
Mention how you’ve optimized API calls, file downloads, or DB writes using async or threads.
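A minimal sketch with ThreadPoolExecutor (download_file and urls are assumed to be defined):
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    # map() runs the I/O-bound calls concurrently and preserves input order
    results = list(executor.map(download_file, urls))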
10. How do you avoid common issues in parallel processing?
Best Answer:
Watch out for race conditions in threads
Use locks or queues for safe communication
Don’t mutate shared data
For processes: beware of high memory consumption and serialization overhead
💡 Guide to Answer:
Mention you’ve handled deadlocks, memory bloat, and inconsistent outputs when first implementing parallel pipelines.
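For example, guarding shared state with a lock:
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # only one thread can update the counter at a time
        counter += 1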
10. Python Interview Questions for Data Engineer: Automation & Scripting
1. How do you automate a daily task using Python?
Best Answer:
Write a script that performs the task (e.g., file cleanup, API data pull)
Use cron (Linux/macOS) or Task Scheduler (Windows) to run the script on a schedule
Example cron job:
0 2 * * * /usr/bin/python3 /path/to/script.py
💡 Guide to Answer:
Mention real tasks you’ve automated, like data refresh, backup, or file renaming. Bonus points if you mention using Python’s schedule or APScheduler for programmatic scheduling.
2. How do you rename a batch of files in a folder using Python?
Best Answer:
import os
for filename in os.listdir('reports/'):
    new_name = filename.replace(' ', '_')
    os.rename(f'reports/{filename}', f'reports/{new_name}')
💡 Guide to Answer:
Highlight how automation saves time and reduces human error. You can also mention using glob for pattern-based selection.
3. How do you send an email alert from a Python script?
Best Answer:
import smtplib
from email.message import EmailMessage
msg = EmailMessage()
msg.set_content('ETL job completed')
msg['Subject'] = 'Daily ETL Report'
msg['From'] = 'you@example.com'
msg['To'] = 'manager@example.com'

with smtplib.SMTP('smtp.example.com', 587) as server:
    server.starttls()
    server.login('you@example.com', 'password')
    server.send_message(msg)
💡 Guide to Answer:
Use this for alerting on failures or completion. Mention that in production, you’d store credentials securely (e.g., in .env files, AWS Secrets Manager, or key vaults).
4. How do you run a Python script from another Python script?
Best Answer:
# Option 1: Import and call
import my_script
my_script.run()
# Option 2: Use subprocess
import subprocess
subprocess.run(['python', 'script.py'])
💡 Guide to Answer:
Talk about using imports for modular code and subprocess when running CLI tools or external scripts.
5. How do you schedule a Python script to run hourly?
Best Answer:
Use cron for hourly tasks:
0 * * * * /usr/bin/python3 /home/user/my_script.py
Or use the schedule library:
import schedule
import time

schedule.every().hour.do(run_etl)

while True:
    schedule.run_pending()
    time.sleep(60)
💡 Guide to Answer:
Mention cron for production and schedule for in-script testing or lightweight automation.
6. How do you compress and archive files using Python?
Best Answer:
import zipfile
with zipfile.ZipFile('archive.zip', 'w') as zipf:
    zipf.write('data.csv')
💡 Guide to Answer:
Useful when archiving logs or results after ETL jobs. You can also use shutil for entire folders.
7. How do you read environment variables in a Python script?
Best Answer:
import os
api_key = os.getenv('API_KEY')
💡 Guide to Answer:
Critical for security—don’t hardcode credentials! Mention .env files with python-dotenv for local development.
8. What is your approach to writing reusable scripts for automation?
Best Answer:
Use modular functions (extract(), transform(), load())
Add command-line arguments with argparse
Log output and exceptions
Version the script with Git
Schedule using cron or orchestrators like Airflow
💡 Guide to Answer:
This is a systems design-style question. Talk about how you build for reusability, observability, and repeatability.
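A minimal sketch of a parameterized, reusable script (the argument names and run_etl() are illustrative):
import argparse

parser = argparse.ArgumentParser(description='Run the ETL job')
parser.add_argument('--input', required=True, help='Path to the input file')
parser.add_argument('--dry-run', action='store_true', help='Validate without loading')
args = parser.parse_args()

run_etl(args.input, dry_run=args.dry_run)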
11. Python Interview Questions for Data Engineer: Testing & Debugging
1. Why is testing important in data engineering?
Best Answer:
Testing ensures that your data pipelines work correctly, catch regressions early, and maintain data quality. It helps avoid costly mistakes like data loss, duplication, or silent failures.
💡 Guide to Answer:
Mention how tests provide confidence in deployments, and make debugging faster during failures.
2. What are the different types of tests you write for a Python data pipeline?
Best Answer:
Unit Tests – Test individual functions or components (e.g., transform logic)
Integration Tests – Verify interaction between components (e.g., DB write + read)
End-to-End Tests – Run the full pipeline on test data
Data Validation Tests – Check for nulls, duplicates, schema mismatches
💡 Guide to Answer:
Explain when and where you’d use each. Mention that you focus most on unit + data validation in ETL projects.
3. How do you write unit tests in Python?
Best Answer: Use the built-in unittest or the more popular pytest.
def add(x, y):
    return x + y

def test_add():
    assert add(2, 3) == 5
💡 Guide to Answer:
Mention pytest for simplicity, unittest for built-in coverage. Talk about organizing tests into separate folders or files (test_*.py).
4. How do you mock external dependencies (like APIs or DBs) during testing?
Best Answer: Use the unittest.mock module:
from unittest.mock import patch

@patch('module.api_call_function')
def test_api(mock_api):
    mock_api.return_value = {'status': 'success'}
    assert process_api_data() == 'success'
💡 Guide to Answer:
Talk about how mocking avoids making real API/DB calls during tests, speeding up and isolating the tests.
5. How do you debug Python code efficiently?
Best Answer:
Use print() for quick debugging
Use the pdb module or IDE breakpoints for step-by-step inspection
Use logging for real-time visibility in production scripts
💡 Guide to Answer:
Mention that while print() is helpful in dev, logging is better for tracing issues post-deployment. Bonus: mention tools like the VS Code debugger or PyCharm debugger.
6. How do you test a pandas DataFrame transformation?
Best Answer:
def clean_data(df):
    return df.dropna()

def test_clean_data():
    input_df = pd.DataFrame({'a': [1, None]})
    expected_df = pd.DataFrame({'a': [1.0]})
    pd.testing.assert_frame_equal(clean_data(input_df), expected_df)
💡 Guide to Answer:
Use pandas.testing.assert_frame_equal() for checking equality. Mention edge cases (nulls, empty DataFrames, wrong types).
7. How do you run a group of tests in Python?
Best Answer:
With pytest:
pytest tests/
Or with unittest:
python -m unittest discover
💡 Guide to Answer:
Mention organizing tests under a /tests directory and using CI tools like GitHub Actions, Travis CI, or GitLab CI to automate testing.
8. How do you verify data integrity during or after an ETL run?
Best Answer:
Check row counts before and after
Use hash totals or checksums on key fields
Validate schema and data types
Use data testing tools like Great Expectations
💡 Guide to Answer:
Show that data correctness > code correctness in your workflow. That’s the mindset of a great data engineer.
12. Python Interview Questions for Data Engineer: Cloud & Big Data Tool Integration
1. How do you connect to AWS S3 using Python?
Best Answer:
import boto3
s3 = boto3.client('s3')
s3.download_file('my-bucket', 'data.csv', 'data.csv')
💡 Guide to Answer:
Mention the need for AWS credentials (stored in ~/.aws/credentials or as env vars). Highlight using boto3 for all AWS services and mention IAM permissions.
2. How do you list all files in an S3 bucket using Python?
Best Answer:
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
💡 Guide to Answer:
Show your comfort with SDK documentation. Explain pagination if asked about buckets with large data sets.
3. How do you upload/download a file to/from Google Cloud Storage using Python?
Best Answer:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('data.csv')
# Download
blob.download_to_filename('data.csv')
# Upload
blob.upload_from_filename('local.csv')
💡 Guide to Answer:
Mention needing a service account key and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.
4. How do you write to BigQuery from pandas using Python?
Best Answer:
from google.cloud import bigquery
client = bigquery.Client()
job = client.load_table_from_dataframe(df, 'project.dataset.table')
job.result()
💡 Guide to Answer:
Explain how BigQuery scales large inserts. Mention to_gbq() from pandas-gbq for simpler usage in smaller jobs.
5. How do you connect Python to Azure Blob Storage?
Best Answer:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string("YOUR_CONN_STRING")
blob_client = blob_service_client.get_blob_client(container="mycontainer", blob="data.csv")
with open("data.csv", "rb") as data:
    blob_client.upload_blob(data)
💡 Guide to Answer:
Highlight the importance of secure credentials and rotating connection strings or tokens (SAS keys).
6. What is PySpark, and how is it used in data engineering?
Best Answer:
PySpark is the Python API for Apache Spark. It allows you to process big data in a distributed computing environment using Python.
💡 Guide to Answer:
Mention it’s ideal for large-scale ETL, joins, and transformations across distributed clusters. Bonus: mention dataframes vs RDDs.
7. How do you read and write a file using PySpark?
Best Answer:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.write.parquet("output.parquet")
💡 Guide to Answer:
Show familiarity with reading various formats (CSV, JSON, Parquet) and writing partitioned outputs for performance.
8. How do you handle large datasets that don’t fit in memory using Python?
Best Answer:
Use chunked processing in pandas (chunksize)
Use Dask or Vaex for out-of-core dataframes
Use PySpark or SQL-based engines for distributed data handling
💡 Guide to Answer:
Show that you’re aware of scaling bottlenecks and can switch to bigger tools when needed.
9. What is Dask and how does it compare to pandas?
Best Answer:
Dask is a parallel computing library that scales pandas-like syntax for out-of-memory or multi-core processing. It’s good for large dataframes, delayed execution, and cluster-based processing.
💡 Guide to Answer:
Mention Dask’s similarity to pandas and how it works well in local clusters or cloud environments. Optional: contrast with PySpark.
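A minimal sketch showing how close the syntax stays to pandas (the file pattern is illustrative):
import dask.dataframe as dd

df = dd.read_csv('logs-*.csv')  # lazily reads many files in parallel
result = df.groupby('status').size().compute()  # .compute() triggers execution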
10. How do you trigger a Lambda function using Python?
Best Answer:
import boto3
lambda_client = boto3.client('lambda')
response = lambda_client.invoke(
    FunctionName='my-lambda',
    InvocationType='Event',
    Payload=b'{}'
)
💡 Guide to Answer:
Explain that Event is async, RequestResponse is sync. Lambda is perfect for lightweight ETL or event-driven pipelines.
11. How do you orchestrate data workflows in the cloud?
Best Answer:
Use Apache Airflow (with Python DAGs)
Use AWS Step Functions or Cloud Composer (GCP’s managed Airflow)
Use Prefect for simpler orchestration with Pythonic syntax
💡 Guide to Answer:
Mention how you define dependencies, retries, notifications, and logs in your orchestrator.
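A minimal Airflow DAG sketch (the task callables are assumed to be defined elsewhere):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('daily_etl', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    load = PythonOperator(task_id='load', python_callable=load_data)
    extract >> load  # load runs only after extract succeeds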
12. What are the benefits of using cloud-native tools in data engineering?
Best Answer:
Scalable and serverless (e.g., BigQuery, S3, GCS)
Highly available and secure
Cost-effective for storage and compute
Integrate easily with SDKs and orchestration tools
💡 Guide to Answer:
Highlight how cloud tools remove the burden of infrastructure so engineers can focus on logic and scale.
Conclusion
Mastering Python is non-negotiable if you’re aiming for a data engineering role. But more than just writing code, you need to understand how to use Python in real data environments—from transforming massive datasets to automating workflows and integrating with cloud platforms.
This blog gave you a deep dive into all the critical Python interview questions for data engineer roles, along with practical examples to help you answer them confidently.
If you’re serious about cracking your next interview, bookmark this guide or share it with someone who needs it!
Frequently Asked Questions
Is Python enough for a data engineering role?
Python is one of the most essential languages for data engineers due to its simplicity, rich ecosystem, and support for data processing libraries like pandas, SQLAlchemy, PySpark, and more. However, you’ll also need skills in SQL, cloud platforms, and big data tools.
What Python topics should I focus on for a data engineering interview?
Focus on:
Data structures & algorithms
File and data handling (pandas, numpy)
ETL scripting
API usage
SQL & database interactions
Cloud integrations (e.g., S3, BigQuery)
Error handling & automation
What kind of Python projects impress interviewers?
Building an end-to-end ETL pipeline
Automating data cleanup tasks
Integrating Python with cloud services (e.g., uploading to S3, querying BigQuery)
Parallel processing scripts using multiprocessing or PySpark
Are Python coding rounds common for data engineers?
Yes, especially at product-based companies. You’ll often face a coding round focused on data manipulation, file parsing, or simple algorithms using Python. SQL and system design rounds are also common.