
Data Science Job Interview Warm‑Up: 30 Real Coding & System‑Design Questions
Data science has become one of the most sought‑after fields in technology, leveraging mathematics, statistics, machine learning, and programming to derive valuable insights from data. Organisations across every sector—finance, healthcare, retail, government—rely on data scientists to build predictive models, understand patterns, and shape strategy with data‑driven decisions.
If you’re gearing up for a data science interview, expect a well‑rounded evaluation. Beyond statistics and algorithms, many roles also require data wrangling, visualisation, software engineering, and communication skills. Interviewers want to see if you can slice and dice messy datasets, design experiments, and scale ML models to production.
In this guide, we’ll explore 30 real coding & system‑design questions commonly posed in data science interviews. You’ll find challenges ranging from algorithmic coding and statistical puzzle‑solving to the architectural side of building data science platforms in real‑world settings. By practising with these questions, you’ll gain the confidence and clarity needed to stand out among competitive candidates.
And if you’re actively seeking data science opportunities in the UK, be sure to visit www.datascience-jobs.co.uk. It’s a comprehensive hub featuring junior, mid‑level, and senior data science vacancies—spanning start‑ups to FTSE 100 companies. Let’s dive into what you need to know.
1. Why Data Science Interview Preparation Matters
In data science, technical prowess is essential, but it’s only part of the equation. Employers also want to gauge your ability to interpret findings, communicate insights, and collaborate with cross‑functional teams. Here’s why structured interview prep is a game‑changer:
Demonstrate Breadth & Depth
A data scientist’s toolkit often spans Python, R, SQL, machine learning libraries, and data visualisation frameworks.
Interviewers look for strong foundational knowledge (e.g., statistics, linear algebra) and hands‑on coding experience.
Show You Can Tackle Messy, Real-World Data
The majority of real‑world data is unclean—filled with missing values, outliers, and ambiguities.
Employers will test your ability to wrangle data effectively and document your approach.
Highlight Your Problem-Solving Methodology
Data scientists frequently work with uncertain or ambiguous questions.
Interviewers look for how you break down complex tasks, iterate on solutions, and validate assumptions.
Understand Model Deployment & Lifecycle
In modern data science roles, building models isn’t enough; you must also understand how to deploy and monitor them in production.
Knowledge of MLOps or at least some concept of model lifecycle is increasingly important.
Showcase Communication Skills
Data scientists interact with product managers, engineers, and executives who need clear and compelling narratives from the data.
Interviewers often assess how you’d explain a complex analysis to a non‑technical audience.
Preparation ensures you can handle both technical and behavioural elements of data science interviews. Next, let’s delve into 15 coding‑focused questions.
2. 15 Real Coding Interview Questions
In data science, coding challenges often revolve around data manipulation, feature engineering, and machine learning tasks. Below are 15 coding questions that typically arise in data science interviews. Note that some solutions may be open‑ended or require libraries like NumPy, pandas, or scikit‑learn in Python.
Coding Question 1: Exploratory Data Analysis (EDA)
Question: You have a dataset (CSV) containing housing prices with features like location, square footage, and number of bedrooms. How would you load it, handle missing values, and compute summary statistics in Python?
What to focus on:
Using pandas for reading CSV data.
Dealing with nulls (e.g., drop or impute).
Generating descriptive stats (.describe()) and identifying outliers.
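A minimal sketch of those steps in pandas (the file path and column names are illustrative):

```python
import pandas as pd

# Load the CSV (path is illustrative)
df = pd.read_csv("housing.csv")

# Inspect missingness before choosing to drop or impute
print(df.isna().sum())

# Example policy: impute numeric gaps with the median, drop rows missing the target
df["square_footage"] = df["square_footage"].fillna(df["square_footage"].median())
df = df.dropna(subset=["price"])

# Summary statistics, plus a quick outlier check using the IQR rule
print(df.describe())
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential price outliers")
```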
Coding Question 2: SQL Grouping & Aggregation
Question: Write a SQL query to retrieve the average order amount for each customer in a table called orders with columns customer_id, order_id, and order_total.
What to focus on:
Basic GROUP BY syntax.
Aggregation (AVG, SUM, COUNT).
Edge cases (e.g., ignoring null values, naming the resulting columns).
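One possible answer (AVG already skips NULL order totals, and aliases keep the output columns readable):

```sql
SELECT
    customer_id,
    AVG(order_total) AS avg_order_amount,
    COUNT(order_id)  AS order_count
FROM orders
GROUP BY customer_id;
```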
Coding Question 3: Feature Extraction
Question: You have textual data in a column called review_text. Show how to convert these into TF‑IDF vectors for a classification task.
What to focus on:
Tokenisation and normalisation (lowercasing, removing punctuation).
Use of scikit‑learn (TfidfVectorizer).
Handling the resulting sparse matrix for further analysis.
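A short sketch with scikit‑learn; the sample reviews are placeholders:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"review_text": ["Great product, fast delivery!",
                                   "Terrible support, would not recommend.", None]})

# lowercase=True plus the default token pattern covers basic normalisation
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(df["review_text"].fillna(""))

# X is a sparse matrix that most scikit-learn classifiers accept directly
print(X.shape, vectorizer.get_feature_names_out())
```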
Coding Question 4: Simple Linear Regression
Question: Implement a simple linear regression from scratch, given a list of (x, y) data points. Return the best‑fit slope and intercept.
What to focus on:
Deriving slope (m) and intercept (b) using the least‑squares approach.
Summation formula or matrix approach for small data.
Handling edge cases (vertical lines, single data point).
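A from‑scratch sketch using the summation form of least squares, with the edge cases handled explicitly:

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("Need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("All x values are equal (vertical line)")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

print(fit_line([(1, 2), (2, 4.1), (3, 5.9)]))  # roughly (1.95, 0.1)
```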
Coding Question 5: Data Cleaning & Transformation
Question: Given a messy dataset with multiple categorical columns, missing values in numeric columns, and inconsistent date formats, write pseudocode or Python code describing your cleaning steps.
What to focus on:
Converting categorical data to one‑hot encodings or ordinal values.
Different strategies for imputing missing numeric values (mean, median).
Converting all date fields into a consistent datetime format.
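A condensed sketch in pandas; the column names (signup_date, region, plan_type) are illustrative stand‑ins for the messy dataset:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Impute numeric gaps with the median (robust to outliers)
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # 2. Parse inconsistent date strings; unparseable values become NaT
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

    # 3. One-hot encode the categorical columns
    return pd.get_dummies(df, columns=["region", "plan_type"], drop_first=True)
```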
Coding Question 6: Time Series Split
Question: Show how you’d split a time series dataset into training and validation sets, ensuring chronological order is respected.
What to focus on:
Avoiding random splits.
Possibly creating a rolling window or fixed cutoff date.
Handling partial data for future predictions.
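A sketch of both approaches on an illustrative daily series:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Illustrative frame: one observation per day
df = pd.DataFrame({"timestamp": pd.date_range("2023-01-01", periods=365),
                   "value": range(365)}).sort_values("timestamp")

# Option 1: fixed cutoff date; everything on or after the cutoff is validation
cutoff = pd.Timestamp("2023-10-01")
train = df[df["timestamp"] < cutoff]
valid = df[df["timestamp"] >= cutoff]

# Option 2: rolling-origin cross-validation
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(df):
    assert train_idx.max() < valid_idx.min()  # each fold respects chronology
```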
Coding Question 7: Logistic Regression for Classification
Question: Train a logistic regression model (e.g., in Python) to classify whether a user will churn. Use columns like usage frequency, subscription length, and engagement metrics.
What to focus on:
Data pre‑processing (scaling, dummy variables).
sklearn.linear_model.LogisticRegression usage.
Evaluating model performance (accuracy, confusion matrix, ROC AUC).
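A hedged sketch with scikit‑learn; the synthetic frame stands in for a real churn table:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for a real churn table
rng = np.random.default_rng(0)
df = pd.DataFrame({"usage_frequency": rng.poisson(5, 1000),
                   "subscription_length": rng.integers(1, 36, 1000),
                   "engagement_score": rng.random(1000)})
df["churned"] = (rng.random(1000) < 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="churned"), df["churned"],
    test_size=0.2, stratify=df["churned"], random_state=42)

# Scaling matters once regularisation is involved
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```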
Coding Question 8: K‑Means Clustering
Question: Implement K‑means from scratch or using a library. Show how you’d choose an optimal value of k (number of clusters).
What to focus on:
Centroid initialisation, assignment, and update steps.
The Elbow or Silhouette method for selecting k.
Convergence criteria and iteration limit.
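A sketch of the Elbow method using scikit‑learn's KMeans on synthetic blobs:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for a range of k values
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

# The "elbow" is where the curve flattens; here it should sit near k = 4
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()
```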
Coding Question 9: Random Forest Feature Importance
Question: Train a random forest on a dataset and extract the top 3 features by importance. Write the code snippet in Python.
What to focus on:
Using scikit‑learn (RandomForestClassifier or RandomForestRegressor).
Accessing .feature_importances_.
Sorting features and selecting the top 3.
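A minimal sketch on synthetic data with named features:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances; take the indices of the three largest values
top3 = np.argsort(rf.feature_importances_)[::-1][:3]
for i in top3:
    print(X.columns[i], round(rf.feature_importances_[i], 3))
```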
Coding Question 10: Evaluating Model Performance
Question: Write code to compute precision, recall, and F1 score for a binary classification model.
What to focus on:
Calculating metrics from confusion matrix (TP, TN, FP, FN).
Alternatively, using scikit‑learn’s precision_score, recall_score, and f1_score.
Explaining the trade‑offs between these metrics.
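A from‑scratch sketch computed directly from the confusion‑matrix counts (labels assumed to be 0/1):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary-classification metrics from raw confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.667, 0.667, 0.667)
```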
Coding Question 11: Hyperparameter Tuning
Question: Show how you’d perform a grid search to find the best hyperparameters for an SVM classifier.
What to focus on:
GridSearchCV usage from scikit‑learn.
Defining a parameter grid for C and kernel.
Cross‑validation approach and scoring metric.
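A sketch using the iris dataset; the parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}

# 5-fold cross-validated grid search; n_jobs=-1 parallelises across cores
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```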
Coding Question 12: Handling Class Imbalance
Question: You have a dataset where only 5% of users have churned, leading to highly imbalanced data. Demonstrate how you’d address this problem.
What to focus on:
Sampling techniques (undersampling, oversampling, SMOTE).
Adjusting class weights in models.
Using alternative metrics (precision/recall, F1, PR AUC).
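A sketch of two of these options; the SMOTE variant assumes the imbalanced‑learn package is installed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positives, mirroring the churn scenario
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# Option 1: reweight the loss so errors on the minority class cost more
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: oversample the minority class with SMOTE (imbalanced-learn package)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Either way, judge the result with precision/recall, F1, or PR AUC, not accuracy
```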
Coding Question 13: NLP Word Embeddings
Question: Convert sentences into vector embeddings using a pre‑trained model (e.g., Word2Vec, GloVe). Illustrate with a short code snippet.
What to focus on:
Tokenisation and mapping to embeddings.
Handling unknown words or out‑of‑vocabulary tokens.
Aggregating word vectors into sentence features if needed.
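A sketch using gensim's pre‑trained GloVe vectors (the model downloads on first use; mean pooling is one simple way to aggregate word vectors into a sentence feature):

```python
import numpy as np
import gensim.downloader as api

# Downloads a pre-trained 50-dimensional GloVe model on first use
glove = api.load("glove-wiki-gigaword-50")

def sentence_vector(sentence):
    # Naive whitespace tokenisation; out-of-vocabulary words are skipped
    words = [w for w in sentence.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

print(sentence_vector("The delivery was fast and reliable").shape)  # (50,)
```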
Coding Question 14: Dimensionality Reduction (PCA)
Question: Show how to perform PCA on a dataset with 50 features to reduce it down to 5 principal components.
What to focus on:
Using PCA from scikit‑learn.
Standardising input data before PCA.
Interpreting explained variance ratio.
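A minimal sketch on random stand‑in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).random((200, 50))  # stand-in for the real data

# Standardise first so high-variance features don't dominate the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)  # shape (200, 5)

# Fraction of the original variance retained by the 5 components
print(pca.explained_variance_ratio_.sum())
```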
Coding Question 15: Building a Data Pipeline
Question: Describe how you’d read data from a source (e.g., CSV or database), transform it (cleaning or feature engineering), and then feed it to a model, all in one end‑to‑end workflow script.
What to focus on:
Clear structure or function calls for each step (extract, transform, load).
Use of frameworks like pandas, numpy, or pipeline classes from scikit‑learn.
Logging and error‑handling best practices.
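A compressed sketch of such a script; the file path and target column name are illustrative:

```python
import logging
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract(path: str) -> pd.DataFrame:
    log.info("Reading %s", path)
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    df = df.dropna(subset=["target"])      # drop rows missing the label
    y = df.pop("target")
    return df.select_dtypes("number").fillna(0), y

def main(path: str) -> None:
    X, y = transform(extract(path))
    model = Pipeline([("scale", StandardScaler()),
                      ("clf", RandomForestClassifier(random_state=42))])
    model.fit(X, y)
    log.info("Training accuracy: %.3f", model.score(X, y))

if __name__ == "__main__":
    main("data.csv")
```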
Ensure your coding approach is clean, modular, and well‑documented. Data scientists must not only deliver answers but also produce maintainable, reproducible code.
3. 15 Data Science Architecture & Design Questions
Data science isn’t just about building one‑off models—it often requires designing scalable and robust infrastructure to handle data ingestion, experimentation, model deployment, and monitoring. Below are 15 questions focusing on the system design and architecture aspects of data science.
Architecture Question 1: Data Ingestion for ML
Scenario: You need to gather data from multiple sources (CRM, web logs, third‑party APIs) to produce daily features for a recommendation system.
Key Points to Discuss:
Orchestration tools (Airflow, Luigi) for scheduling.
Data merging and cleaning strategies.
Batch vs. real‑time trade‑offs.
Architecture Question 2: Real-Time Predictive Analytics
Scenario: A finance company wants to detect fraudulent transactions within seconds.
Key Points to Discuss:
Streaming frameworks (Kafka, Spark Streaming, Flink).
Model serving endpoints or microservices (e.g., Flask, FastAPI, or SageMaker endpoints).
Latency constraints and sliding window updates.
Architecture Question 3: Experiment Tracking & Model Versioning
Scenario: Your organisation demands reproducible experiments and traceable versions for each model iteration.
Key Points to Discuss:
Use of MLflow, DVC, or custom solutions.
Storing hyperparameters, metrics, and data snapshots.
Governance around “model of record” in production.
Architecture Question 4: Feature Store
Scenario: You need a consistent set of features for both training and online inference to ensure data parity.
Key Points to Discuss:
Separation between offline features (batch) and online features (real‑time).
Tools or patterns for data versioning (Delta Lake, feature store solutions).
Preventing training/serving skew.
Architecture Question 5: Large-Scale Model Training
Scenario: You’re training a deep learning model on millions of images and must parallelise the process.
Key Points to Discuss:
Distributed training solutions (Horovod, PyTorch Distributed).
GPU/TPU usage and pipeline parallelism vs. data parallelism.
Synchronisation overhead and checkpointing.
Architecture Question 6: A/B Testing Infrastructure
Scenario: Deploy two model variants (A and B) to compare their performance on user behaviour.
Key Points to Discuss:
Random traffic splitting or user segment approach.
Metrics collection (CTR, conversion rate) and statistical significance.
Rollback strategy if B underperforms.
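A common building block here is deterministic, hash‑based assignment, so a given user always sees the same variant; a sketch (the 50/50 split is illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, b_share: float = 0.5) -> str:
    # Hash user_id together with the experiment name so assignment
    # is stable per user and independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "B" if bucket < b_share else "A"

print(assign_variant("user_42", "ranker_v2"))  # same user, same answer every call
```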
Architecture Question 7: CI/CD for Data Science
Scenario: Models need frequent updates as new data arrives; you want an automated pipeline to retrain and redeploy.
Key Points to Discuss:
Git for version control of code and data config.
Jenkins, GitLab CI, or similar for building, testing, and deploying ML pipelines.
Automated unit tests (data shape, model sanity checks).
Architecture Question 8: Monitoring Deployed Models
Scenario: Once in production, a model’s performance might drift over time. How do you set up ongoing monitoring?
Key Points to Discuss:
Tracking model inputs and outputs for distribution changes.
Alerting if metrics deviate from expected ranges.
Potential auto‑retraining triggers.
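As one concrete illustration, a scheduled job might compare live feature values against the training distribution with a two‑sample Kolmogorov–Smirnov test (the threshold and the alerting hook are illustrative):

```python
from scipy.stats import ks_2samp

def check_drift(train_values, live_values, alpha=0.01):
    """Flag a feature whose live distribution has shifted from training."""
    stat, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # True -> alert, and perhaps trigger retraining

# Hypothetical usage inside a daily monitoring job:
# if check_drift(train_df["usage_frequency"], live_df["usage_frequency"]):
#     alerting.notify("usage_frequency has drifted")  # alerting hook is illustrative
```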
Architecture Question 9: Data Visualisation Dashboards
Scenario: Your organisation wants interactive dashboards to explore model predictions and feature importance.
Key Points to Discuss:
Tools: Tableau, Power BI, or Python libraries (Plotly, Dash).
Designing user‑friendly interfaces for non‑technical stakeholders.
Ensuring data refresh schedules match real-time or daily updates.
Architecture Question 10: Cloud vs. On-Prem Deployment
Scenario: A healthcare provider needs to keep patient data secure while running advanced analytics.
Key Points to Discuss:
Compliance with regulations (GDPR, HIPAA).
Hybrid approaches: on-prem for sensitive data, cloud for large-scale compute.
Secure data transfer, encryption in transit/at rest.
Architecture Question 11: MLOps vs. Traditional DevOps
Scenario: The data science team is trying to unify model deployment with established software engineering practices.
Key Points to Discuss:
Differences in pipeline stages (data prep, feature engineering, model validation).
Tools for MLOps pipelines (Kubeflow, MLflow, AWS SageMaker).
Collaboration between data scientists and DevOps engineers.
Architecture Question 12: Edge Inference
Scenario: Models need to run on IoT devices in remote locations with limited connectivity.
Key Points to Discuss:
Model compression techniques (quantisation, pruning).
On‑device hardware constraints (CPU, memory).
Periodic sync with cloud for updates or batch re‑training.
Architecture Question 13: Handling Multi-Tenancy
Scenario: Your company offers an analytics platform for multiple clients, each with unique data sets.
Key Points to Discuss:
Data isolation (logical or physical) per tenant.
Resource scaling across different usage patterns.
Custom model constraints or shared model usage.
Architecture Question 14: Recommender Systems at Scale
Scenario: You’re building a personalised product recommendation engine for millions of users.
Key Points to Discuss:
Real‑time user event logging (Kafka).
Collaborative filtering vs. content‑based methods.
Approximate nearest neighbour searches for speed.
Architecture Question 15: Ethical & Privacy Considerations
Scenario: You’re working with sensitive personal data (health records, financial info).
Key Points to Discuss:
Data anonymisation or pseudonymisation.
Bias detection in ML models.
Accountability if model decisions impact user outcomes (e.g., loan approvals).
When responding to architecture questions, remember to address scalability, robustness, security, and maintenance. Highlight real-world experiences or examples where you balanced cost with performance or overcame data compliance hurdles.
4. Tips for Conquering Data Science Job Interviews
Data science interviews can be intense, spanning everything from basic statistics to big data frameworks and communication prowess. Here’s how to put your best foot forward:
Revisit Core Concepts
Ensure you’re fluent in linear algebra, calculus, and probability/statistics—these underlie most machine learning.
Be ready to explain bias‑variance trade‑off, overfitting, regularisation, and feature selection strategies.
Practice Python, R, or Your Preferred Language
Familiarise yourself with pandas, NumPy, and scikit‑learn (or tidyverse in R).
Show you can manipulate data frames, pivot tables, and run quick EDA or ML pipelines.
Understand the ML Lifecycle
Interviewers may ask how you’ll handle data drift, model retraining, or performance monitoring.
Mention frameworks or scripts you’ve built to deploy and version models in production.
Brush Up on SQL
Many data science roles involve significant SQL usage—whether for data extraction, wrangling, or feature engineering.
Practise window functions, complex joins, and subqueries.
Showcase Real Projects & Impact
Use the STAR method (Situation, Task, Action, Result) to describe your past accomplishments.
Quantify achievements: “I improved lead conversion by 15% using an ensemble model.”
Keep Current with Industry Trends
Data science evolves quickly; keep tabs on new deep learning architectures, MLOps tooling, or AutoML solutions.
Mention relevant articles, courses, or conferences that shape your perspective.
Prioritise Communication & Storytelling
You’ll likely be asked to present your solutions or findings. Frame insights in a narrative that resonates with non‑technical stakeholders.
Consider how you’d explain a random forest or a clustering approach to a product manager.
Ask Strategic Questions
Near the end of interviews, inquire about the company’s data infrastructure, team composition, or project pipeline.
Asking these questions demonstrates genuine interest and can reveal whether the environment aligns with your career goals.
Prepare for Behavioural & Cultural Fit
Companies often need collaborative data scientists who can champion data literacy across the organisation.
Expect questions like, “Tell me about a time you had conflicting data from two sources,” or “How do you manage tight deadlines on data projects?”
Show Confidence, But Stay Coachable
Offer thoughtful solutions, but don’t be afraid to express uncertainty or ask clarifying questions.
Data scientists often refine hypotheses and learn from new data—it’s a sign of a growth mindset.
By integrating technical expertise, analytical thinking, and strong communication skills, you’ll stand out as a versatile data scientist ready to tackle any challenge.
5. Final Thoughts
Data science has become a cornerstone of modern decision‑making, enabling organisations to transform raw data into actionable insights. But success in a data science role requires more than just the ability to code a model—it demands end‑to‑end problem‑solving, strategic thinking, and effective communication.
The 30 real coding & system‑design questions above—spanning EDA, ML, big data, MLOps, and more—offer a comprehensive prep checklist. Treat them as a springboard to refine your skills, validate your knowledge, and practise explaining solutions logically and concisely. Along the way, remember that data science is a dynamic field—continuous learning is the hallmark of a thriving career.
When you’re ready for your next role, www.datascience-jobs.co.uk awaits with a variety of exciting UK‑based opportunities—ranging from data scientist positions at start‑ups to advanced analytics roles at large enterprises. Armed with rigorous preparation and the right mindset, you’ll be well positioned to make a stellar impression in your interviews and secure a fulfilling data science career.