Master Your Data Analyst Interview
Realistic questions, proven answers, and actionable tips to help you stand out
- 30+ curated technical and behavioral questions
- STAR‑formatted model answers for each question
- Actionable tips and red‑flag warnings
- Practice pack with timed mock rounds
Data Cleaning & Preparation
In my previous role, I received a sales dataset with 15% missing values in several columns.
I needed to prepare the data for a quarterly performance report without biasing the results.
I first profiled the missingness patterns using Python's pandas, then applied appropriate techniques: mean imputation for numeric fields with low variance, mode imputation for categorical fields, and flagging rows with >30% missing for exclusion.
The cleaned dataset improved model accuracy by 4% and the report was delivered on time, receiving commendation from senior management.
- What risks are associated with mean imputation?
- How would you handle missing values in time‑series data?
- Understanding of different imputation techniques
- Ability to justify method choice
- Awareness of impact on downstream analysis
- Suggesting deletion of all rows with any missing value
- No mention of validation
- Profile missingness patterns
- Choose imputation method based on data type and distribution
- Implement imputation in Python/pandas
- Validate by comparing summary statistics before and after
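To make the workflow concrete, here is a minimal pandas sketch of these steps; the file name and the `unit_price` and `region` columns are hypothetical placeholders, not the actual dataset.

```python
import pandas as pd

# Hypothetical sales dataset; file name and column names are illustrative.
df = pd.read_csv("sales.csv")

# 1. Profile missingness patterns (share of missing values per column).
print(df.isna().mean().sort_values(ascending=False))

# 2. Flag rows with more than 30% missing values and exclude them.
df = df.loc[df.isna().mean(axis=1) <= 0.30].copy()

# 3. Keep pre-imputation statistics for validation.
before = df[["unit_price"]].describe()

# 4. Impute: mean for a low-variance numeric field, mode for a categorical field.
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].mean())
df["region"] = df["region"].fillna(df["region"].mode()[0])

# 5. Validate by comparing summary statistics before and after imputation.
print(before.join(df[["unit_price"]].describe(), lsuffix="_before", rsuffix="_after"))
```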
Our marketing team needed a unified view of campaign performance across Google Ads, Facebook Ads, and internal CRM data.
Combine disparate datasets with different schemas and units into a single analytical table.
I built an ETL pipeline in Python: extracted data via APIs, standardized column names, converted currencies to USD using daily exchange rates, and applied Z‑score normalization for numeric metrics. I stored the result in a Snowflake table for downstream reporting.
The unified dataset reduced manual reconciliation time by 70% and enabled cross‑channel ROI analysis that identified a 12% uplift opportunity.
- How would you handle schema changes in one source?
- What alternative normalization methods exist for skewed data?
- Clarity of ETL steps
- Appropriate handling of unit conversion
- Choice of normalization technique
- Skipping unit conversion
- Only using min‑max scaling without checking distribution
- Extract data via APIs or SQL queries
- Standardize schema (column names, data types)
- Convert units/currencies to a common baseline
- Apply statistical normalization (e.g., Z‑score)
- Load into a central warehouse
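A simplified pandas sketch of the transformation steps, assuming hypothetical source frames and a single illustrative exchange rate (a real pipeline would join daily rates by date):

```python
import pandas as pd

# Hypothetical extracted frames; column names, rates, and values are illustrative.
google = pd.DataFrame({"campaign_id": [1], "Cost": [120.0], "Currency": ["EUR"]})
facebook = pd.DataFrame({"campaign_id": [2], "spend": [90.0], "currency": ["USD"]})

# 1. Standardize schema: consistent column names and types across sources.
google = google.rename(columns={"Cost": "spend", "Currency": "currency"})
combined = pd.concat([google, facebook], ignore_index=True)

# 2. Convert spend to USD (simplified: one rate per currency instead of daily rates).
fx_to_usd = {"USD": 1.0, "EUR": 1.08}
combined["spend_usd"] = combined["spend"] * combined["currency"].map(fx_to_usd)

# 3. Z-score normalization for numeric metrics.
combined["spend_z"] = (
    combined["spend_usd"] - combined["spend_usd"].mean()
) / combined["spend_usd"].std()

# 4. Load the result into the central warehouse (e.g., via a Snowflake connector).
```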
While building a sales forecast model, I noticed unusually high values in the 'discount' column.
Identify outliers that could distort the regression coefficients and decide on treatment.
I plotted boxplots and calculated the IQR to flag points beyond 1.5×IQR. For confirmed outliers, I investigated root causes; many were data entry errors, which I corrected. I capped the remaining legitimate extreme values using winsorization and added a binary flag feature to capture their effect.
After outlier treatment, the model’s R² improved from 0.68 to 0.74 and prediction error decreased by 9%.
- When might you keep an outlier instead of removing it?
- How does winsorization affect model interpretability?
- Use of both visual and statistical methods
- Justification for chosen treatment
- Impact on model performance
- Blindly removing all outliers
- No validation of treatment effect
- Visual inspection (boxplot, scatter)
- Statistical detection (IQR, Z‑score)
- Investigate cause of each outlier
- Correct errors or apply winsorization
- Create indicator variable if needed
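A minimal pandas sketch of IQR-based detection, flagging, and winsorization; the `discount` values below are synthetic examples:

```python
import pandas as pd

# Hypothetical frame; the 'discount' column mirrors the scenario above.
df = pd.DataFrame({"discount": [0.05, 0.10, 0.12, 0.08, 0.95, 0.07]})

# 1. Statistical detection: flag points beyond 1.5 * IQR.
q1, q3 = df["discount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["discount"].between(lower, upper)

# 2. Keep an indicator feature so the model can capture the effect of extremes.
df["extreme_discount_flag"] = df["is_outlier"].astype(int)

# 3. Winsorize legitimate extremes by capping at the IQR fences
#    (data-entry errors would instead be corrected upstream).
df["discount_winsorized"] = df["discount"].clip(lower=lower, upper=upper)
```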
I needed to combine daily website traffic logs with monthly sales figures to analyze conversion trends.
Align the two datasets despite differing time granularities without losing information.
I aggregated the daily traffic to monthly totals using SQL GROUP BY, then performed a left join on the month key. For metrics requiring daily granularity, I forward‑filled the monthly sales values and added a weight column to indicate the proportion of the month each day represented. I documented assumptions and validated totals against source reports.
The merged dataset enabled a reliable daily conversion rate analysis, leading to a recommendation that increased conversion by 5% through targeted campaigns.
- What are the risks of forward‑filling monthly data to daily rows?
- How would you handle mismatched fiscal calendars?
- Understanding of aggregation vs. disaggregation
- Clear documentation of assumptions
- Validation steps
- Assuming perfect alignment without checks
- No mention of data validation
- Identify granularity mismatch
- Aggregate finer‑grain data to match coarser level or disaggregate using appropriate assumptions
- Perform join with clear keys
- Create flags/weights for imputed values
- Validate aggregated totals
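A pandas sketch of the aggregation, join, and weighting logic described above, using synthetic daily and monthly frames:

```python
import pandas as pd

# Hypothetical daily traffic and monthly sales frames; names and values are illustrative.
traffic = pd.DataFrame({
    "date": pd.date_range("2024-01-01", "2024-02-29", freq="D"),
    "sessions": 1000,
})
sales = pd.DataFrame({
    "month": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "revenue": [50_000, 42_000],
})

# 1. Aggregate daily traffic to a monthly key to match the coarser grain.
traffic["month"] = traffic["date"].dt.to_period("M").dt.to_timestamp()
monthly_traffic = traffic.groupby("month", as_index=False)["sessions"].sum()

# 2. Join on the shared month key.
merged = monthly_traffic.merge(sales, on="month", how="left")

# 3. For daily-grain analysis, broadcast monthly revenue back to days
#    with a weight column marking each day's share of the month.
daily = traffic.merge(sales, on="month", how="left")
daily["day_weight"] = 1 / daily.groupby("month")["date"].transform("count")

# 4. Validate aggregated totals against the source figures.
assert daily.groupby("month")["sessions"].sum().equals(
    monthly_traffic.set_index("month")["sessions"]
)
```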
Statistical Analysis & Modeling
During an A/B test for a new checkout flow, I needed to interpret the test results for stakeholders.
Clarify the potential errors associated with rejecting or not rejecting the null hypothesis.
I described that a Type I error occurs when we incorrectly reject a true null hypothesis (false positive), while a Type II error happens when we fail to reject a false null hypothesis (false negative). I linked the concepts to our significance level (α) and power (1‑β).
Stakeholders understood the trade‑off and agreed to set α at 5% while aiming for 80% power, ensuring balanced risk.
- How does increasing sample size affect Type II error?
- When might you accept a higher Type I error rate?
- Clear definitions
- Connection to α and power
- Practical implications
- Confusing the two error types
- No mention of significance level
- Define null hypothesis
- Type I error = false positive (α)
- Type II error = false negative (β)
- Relation to significance level and power
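As a rough illustration of the α/power trade-off, a power calculation with statsmodels shows how sample size drives Type II error; the effect size below is an assumed example value, not from the original test:

```python
from statsmodels.stats.power import TTestIndPower

# Fix alpha (Type I risk) at 5% and target 80% power (Type II risk beta = 20%).
analysis = TTestIndPower()

# Required sample size per group for an assumed small effect (Cohen's d = 0.2).
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(round(n_per_group))  # larger samples reduce beta for a fixed alpha
```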
A product team wanted to predict whether a user would churn (yes/no) based on usage metrics.
Select the appropriate modeling technique for a binary outcome.
I explained that logistic regression is suited for binary dependent variables because it models the log‑odds and bounds predicted probabilities between 0 and 1, whereas linear regression can produce predictions outside that range and violates the homoscedasticity assumption when the outcome is binary.
The team adopted logistic regression, achieving an AUC of 0.82 and enabling targeted retention campaigns.
- What are the key assumptions of logistic regression?
- How would you handle imbalanced classes in this scenario?
- Correct identification of outcome type
- Explanation of probability bounds
- Awareness of assumptions
- Suggesting linear regression for binary outcome without justification
- Ignoring class imbalance
- Outcome type (binary vs continuous)
- Logistic regression models probability via log‑odds
- Ensures predictions stay within 0‑1
- Linear regression assumptions not met for binary
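A short scikit-learn sketch on synthetic data, illustrating that logistic regression's `predict_proba` output stays within the 0-1 range:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for usage metrics and a binary churn label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # e.g., logins, session length, support tickets
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict_proba returns probabilities bounded between 0 and 1,
# unlike a linear model fit directly to the 0/1 labels.
churn_prob = model.predict_proba(X)[:, 1]
print(churn_prob.min(), churn_prob.max())
```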
I segmented customers into groups for a marketing campaign using K‑means clustering.
Determine whether the clusters were meaningful and actionable.
I calculated internal metrics such as silhouette score and Davies‑Bouldin index to assess cohesion and separation. I also performed external validation by comparing clusters against known customer segments and conducted a business review to see if each cluster showed distinct purchasing behavior. Finally, I visualized clusters using PCA plots for stakeholder communication.
The chosen K=5 yielded a silhouette score of 0.62 and revealed clear spend‑level differences, leading to a 7% lift in campaign response rates.
- How would you choose the optimal number of clusters?
- What if the silhouette score is low but business impact is high?
- Use of quantitative metrics
- Link to business outcomes
- Visualization awareness
- Relying solely on one metric without context
- No business validation
- Internal metrics: silhouette, Davies‑Bouldin, inertia
- External validation: compare with known labels or business KPIs
- Business relevance: distinct behavior patterns
- Visualization for communication
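A brief scikit-learn sketch of the internal-metric checks, using synthetic data in place of real customer features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features; real inputs would be scaled spend/usage metrics.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit K-means for a candidate K and compute internal validation metrics.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))          # higher is better
print("davies-bouldin:", davies_bouldin_score(X, labels))  # lower is better
```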
While building a predictive model for sales, I noticed unstable coefficient estimates.
Identify and address multicollinearity among predictor variables.
I explained that multicollinearity occurs when independent variables are highly correlated, inflating the variance of coefficient estimates and making them unreliable. I detected it using Variance Inflation Factor (VIF) thresholds (>5) and correlation heatmaps. To remediate, depending on the case, I removed redundant features, combined correlated predictors via PCA, or applied ridge regularization.
After reducing VIF values below 2, the model’s coefficients stabilized and predictive R² improved from 0.71 to 0.76.
- When is it acceptable to keep correlated variables?
- How does regularization help with multicollinearity?
- Clear definition
- Appropriate detection techniques
- Practical mitigation strategies
- Ignoring VIF values
- Suggesting removal without assessing business impact
- Definition of multicollinearity
- Impact on coefficient variance and interpretability
- Detection methods: correlation matrix, VIF, condition index
- Mitigation: drop variables, combine, regularization
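A small sketch of VIF-based detection with statsmodels; the predictor names and the induced correlation are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictor frame with two deliberately correlated columns.
rng = np.random.default_rng(1)
X = pd.DataFrame({"ad_spend": rng.normal(size=200)})
X["impressions"] = X["ad_spend"] * 0.9 + rng.normal(scale=0.1, size=200)
X["seasonality"] = rng.normal(size=200)

# Compute VIF per predictor (constant added so the intercept is accounted for).
Xc = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)  # values above ~5 suggest problematic multicollinearity
```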
Business & Communication
During a quarterly review, I discovered that a specific product line’s churn rate was 18% higher than the company average.
Explain the cause and propose actionable steps to senior leadership without using technical jargon.
I created a concise slide deck highlighting the churn trend, used a simple bar chart to compare segments, and narrated the story: the high churn correlated with a recent price increase. I recommended an A/B price test and a targeted email campaign. I avoided terms like ‘hazard ratio’ and focused on business impact.
Leadership approved the test, which reduced churn by 6% over the next two months and saved $250K in revenue loss.
- How did you handle questions about the statistical significance of your findings?
- What if stakeholders disagreed with your recommendation?
- Clarity of communication
- Use of visual aids
- Actionability of recommendation
- Over‑technical language
- Vague recommendations
- Identify key insight
- Choose simple visual (bar chart)
- Narrate cause‑effect relationship
- Provide clear, actionable recommendation
In Q3, the marketing, finance, and product teams each requested ad‑hoc analyses for upcoming presentations.
Prioritize the requests to meet all deadlines while maintaining quality.
I gathered requirements, estimated effort, and mapped each request to business impact. I communicated the timeline to stakeholders, negotiated scope reductions for lower‑impact tasks, and used a Kanban board to track progress. I also delegated routine data pulls to a junior analyst.
All three deliverables were completed on time; the marketing analysis led to a campaign that increased click‑through rates by 9%.
- What tools do you use to track and communicate progress?
- How do you handle a request that suddenly becomes high priority?
- Prioritization framework
- Stakeholder communication
- Effective delegation
- No mention of impact assessment
- Failing to communicate delays
- Gather requirements and impact assessment
- Estimate effort and create timeline
- Communicate and negotiate scope
- Use task management tools
- Delegate where possible
The executive team was debating whether to expand into a new geographic market.
Provide a data‑driven narrative to support the decision.
I combined market size data, competitor analysis, and internal sales trends into a story arc: market opportunity, risk assessment, and projected ROI. I used a mix of maps, waterfall charts, and a concise executive summary. I highlighted a scenario analysis showing a 12% ROI under conservative assumptions. I rehearsed the presentation with the CRO to anticipate questions.
The board approved a phased entry strategy, allocating $3M to the pilot, which achieved a 15% market share within six months.
- How do you tailor a data story for different audience levels?
- What if the data contradicts senior leadership’s expectations?
- Narrative structure
- Effective visuals
- Strategic relevance
- Overloading slides with raw data
- Lack of clear recommendation
- Gather relevant data sources
- Structure narrative: context, analysis, recommendation
- Visual storytelling (maps, waterfall)
- Scenario analysis for risk
- Rehearse and anticipate questions
While preparing a customer segmentation model, I needed to use personally identifiable information (PII) such as email and phone numbers.
Protect privacy while still delivering useful insights.
I consulted the company’s data governance policy, applied de‑identification techniques (hashing email, removing direct identifiers), and performed analyses on aggregated cohorts. I documented the process, obtained sign‑off from the compliance team, and stored intermediate files on encrypted drives with access controls.
The project proceeded without any compliance issues, and the segmentation model was deployed securely, increasing targeted campaign efficiency by 11%.
- What steps would you take if a data breach were discovered during a project?
- How do you balance data utility with privacy constraints?
- Awareness of privacy regulations
- Practical de‑identification methods
- Collaboration with compliance
- Ignoring policy or compliance sign‑off
- Using raw PII in analysis
- Review data governance policies
- De‑identify or anonymize PII
- Work with aggregated data
- Document and obtain compliance sign‑off
- Secure storage and access controls
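A minimal Python sketch of one de-identification step (salted hashing of direct identifiers); the column names and salt are placeholders, and a real project should use the organization's approved tooling and key management:

```python
import hashlib
import pandas as pd

# Hypothetical customer extract; all values here are synthetic.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "phone": ["555-0100", "555-0101"],
    "monthly_spend": [120.0, 80.0],
})

SALT = "replace-with-a-secret-salt"  # in practice, store outside the codebase

def pseudonymize(value: str) -> str:
    """One-way, salted hash so records can be joined without exposing raw PII."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

df["customer_key"] = df["email"].map(pseudonymize)

# Drop direct identifiers before analysis; work on aggregated cohorts downstream.
df = df.drop(columns=["email", "phone"])
```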
- SQL
- Python
- Data Visualization
- Statistical Analysis
- ETL
- Data Cleaning
- Dashboard
- Power BI
- Tableau
- Regression