Datascience

22 cards — 🟢 4 easy | 🟡 7 medium | 🔴 4 hard

🟢 Easy (4)

1. What are the key differences between Parquet, Avro, and Arrow data formats?

Parquet is a columnar storage format optimized for analytics queries and compression — ideal for data warehouses and Spark jobs. Avro is a row-based format with a built-in schema, designed for data serialization and streaming (common in Kafka). Arrow is an in-memory columnar format designed for zero-copy reads and fast inter-process communication — it is not a storage format but a compute layer used by Pandas, Spark, and DuckDB.
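The row-vs-columnar distinction can be illustrated in plain Python — a toy sketch of the layouts, not Parquet/Avro/Arrow themselves:

```python
# Row-oriented layout (Avro-style): each record is stored together.
rows = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": 20.0},
    {"user": "c", "amount": 30.0},
]

# Column-oriented layout (Parquet/Arrow-style): each field stored together.
columns = {
    "user": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}

# Row layout: an aggregate must visit every record to read one field.
total_rows = sum(r["amount"] for r in rows)

# Columnar layout: the same aggregate scans one contiguous array,
# which is why analytics engines prefer it.
total_cols = sum(columns["amount"])

assert total_rows == total_cols == 60.0
```

The contiguous-column property is also what makes columnar data compress well (runs of similar values) and enables Arrow's zero-copy sharing between processes.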

Remember: reproducibility is essential in data science. Use version control for code, DVC for data, MLflow for experiments, and Docker for environments.

2. What is DVC (Data Version Control) and why is it needed?

DVC is an open-source tool that extends Git to handle large files, datasets, and ML models. Git tracks code but struggles with large binary data. DVC stores metadata (hash pointers) in Git while pushing actual data to remote storage (S3, GCS, Azure Blob, SSH). Key features: (1) Data versioning — track dataset changes alongside code changes. (2) Pipelines — define reproducible ML pipelines as DAGs. (3) Experiment tracking — compare metrics across runs.
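The hash-pointer idea can be sketched in a few lines of standard-library Python — this is the concept only, not DVC's actual file format or API:

```python
import hashlib

# Toy sketch of DVC's core idea: Git tracks a tiny metadata file holding
# the data's hash, while the data itself lives in content-addressed
# remote storage keyed by that hash.
def make_pointer(data: bytes) -> dict:
    md5 = hashlib.md5(data).hexdigest()
    return {"outs": [{"md5": md5, "size": len(data)}]}

data = b"feature1,feature2\n1,2\n3,4\n"
pointer = make_pointer(data)   # this small dict is what Git would version
assert pointer["outs"][0]["size"] == len(data)
```

Because the pointer changes whenever the data's hash changes, checking out an old Git commit tells DVC exactly which dataset version to pull back from remote storage.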

3. What are the key differences between Apache Airflow, Prefect, and Dagster for data pipeline orchestration?

All three orchestrate data pipelines as DAGs but differ in philosophy: (1) Apache Airflow — the most mature and widely adopted. Uses Python DAG definitions, has a rich UI, and a large ecosystem of operators/providers. Downsides: DAGs are not easily testable, scheduling is tightly coupled, and local development is heavy. (2) Prefect — focuses on developer experience. Tasks are plain Python functions with decorators, DAGs can be dynamic at runtime, and local execution is lightweight. (3) Dagster — asset-oriented: pipelines are modeled as software-defined assets with strong typing, built-in testing support, and first-class local development.
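The DAG idea the three tools share can be sketched with Python's standard library — the task names here are illustrative, not any framework's API:

```python
from graphlib import TopologicalSorter

# A pipeline as a dependency graph: each key depends on the tasks in
# its set. An orchestrator resolves this into an execution order.
dag = {
    "clean": {"extract"},           # clean runs after extract
    "train": {"clean"},
    "report": {"train", "clean"},
}

# static_order() yields a valid topological ordering of the tasks.
order = list(TopologicalSorter(dag).static_order())

assert order[0] == "extract"        # no dependencies, runs first
assert order[-1] == "report"        # depends on everything, runs last
```

Real orchestrators add scheduling, retries, and parallel execution of independent tasks on top of exactly this topological-sort core.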

4. What are the key components of an end-to-end MLOps platform, and how do they fit together?

An MLOps platform automates the ML lifecycle. Core components: (1) Data layer — data lake/warehouse, feature store, data versioning (DVC, LakeFS). (2) Experiment tracking — MLflow, W&B for logging parameters, metrics, artifacts. (3) Training infrastructure — GPU clusters, distributed training, hyperparameter tuning (Optuna, Ray Tune). (4) Model registry — versioned model storage with staging/production promotion and approval gates. (5) Serving — model servers (KServe, Seldon, Triton) exposing registered models via APIs or batch jobs. (6) Monitoring — drift and performance tracking that feeds back into retraining, closing the loop.

🟡 Medium (7)

1. Summary of Formulations?

| Algorithm | Mathematical Formula | Loss Function | Performance Metrics |
|-----------|----------------------|---------------|---------------------|

Remember: data science workflow: collect -> clean -> explore -> model -> evaluate -> deploy -> monitor. Most of the time goes into cleaning (roughly 80%), far less into modeling (roughly 20%).

Gotcha: 'data scientist' roles vary wildly: some are analysts (SQL + dashboards), some are ML engineers (Python + models), some are data engineers (pipelines + infrastructure). Clarify expectations.

2. Summary of Loss Functions and Performance Metrics?

| Algorithm | Loss Function | Performance Metrics |
|------------------------|------------------------------------|----------------------------------------|
| Linear Regression | Mean Squared Error | R², MAE |
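The Linear Regression row can be made concrete with minimal pure-Python implementations of its loss and metrics:

```python
# Mean Squared Error: the loss Linear Regression minimizes.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Mean Absolute Error: an evaluation metric robust to outliers.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# R²: fraction of variance explained (1.0 = perfect fit).
def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true, y_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
assert mse(y_true, y_pred) == 0.0 and r2(y_true, y_pred) == 1.0
```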

3. What is ETL and how does it differ from ELT?

ETL (Extract, Transform, Load) extracts data from sources, transforms it in a staging area, then loads it into a target system. ELT (Extract, Load, Transform) loads raw data first, then transforms it in the target system (e.g., a data warehouse like BigQuery or Snowflake). ELT is preferred for cloud-native architectures where the target system has enough compute power for transformations. Common ETL/ELT tools include Apache Airflow, dbt, AWS Glue, and Prefect.
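A minimal ETL sketch using only the standard library — the source records and table name are made up for illustration:

```python
import sqlite3

# Extract: pull messy records from a source system.
raw = [("  Alice ", "42"), ("Bob", "17"), ("  carol", "99")]

# Transform: clean names and cast types before loading.
cleaned = [(name.strip().title(), int(age)) for name, age in raw]

# Load: write the transformed rows into the target system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", cleaned)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
assert count == 3
```

In ELT the order flips: the raw strings would be loaded as-is, and the cleaning step would run as SQL inside the warehouse.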

Remember: ETL = Extract (get data), Transform (clean/reshape), Load (put in warehouse). Modern alternative: ELT — load raw data first, transform in the warehouse (cheaper storage).

4. What are the main patterns for serving ML models in production?

Three primary patterns: (1) Batch inference — run predictions on a schedule (e.g., nightly Spark job), store results in a database. Simple but high latency. (2) Real-time inference via REST/gRPC — deploy model behind an API server (TensorFlow Serving, TorchServe, Triton, Seldon Core, KServe). Low latency, requires scaling and monitoring. (3) Embedded inference — bundle model into the application (e.g., ONNX Runtime in a microservice). Lowest latency, but harder to update.
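The batch pattern can be sketched in a few lines — the model here is a stand-in function, not a real trained model:

```python
# Batch inference: score a whole table of records on a schedule and
# persist the predictions for later lookup.
def model(features):                  # stand-in for a trained classifier
    return 1 if features["amount"] > 50 else 0

batch = [{"id": 1, "amount": 20}, {"id": 2, "amount": 80}]

predictions = [{"id": r["id"], "score": model(r)} for r in batch]
# In production these rows would be written to a database; the serving
# application then reads precomputed scores instead of calling the model.
assert predictions[1]["score"] == 1
```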

5. How do you provision and manage GPU resources for ML workloads in Kubernetes?

Steps: (1) Install the NVIDIA device plugin DaemonSet (or AMD equivalent) which exposes GPUs as schedulable resources. (2) Use node labels and taints to isolate GPU nodes. (3) Request GPUs in pod specs via `resources.limits` (`nvidia.com/gpu: 1`). (4) Use the NVIDIA GPU Operator to automate driver installation, container toolkit, and monitoring. (5) Enable GPU time-slicing or MIG (Multi-Instance GPU) on A100s for sharing GPUs across workloads.
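Steps (2) and (3) might look like the following pod spec — the image name and toleration key are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-train
spec:
  containers:
    - name: trainer
      image: my-registry/train:latest   # hypothetical training image
      resources:
        limits:
          nvidia.com/gpu: 1             # GPUs are requested as limits only
  tolerations:                          # matches the taint on GPU nodes
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

The scheduler will only place this pod on a node where the device plugin has advertised `nvidia.com/gpu` capacity and the taint is tolerated.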

6. What is a feature store and why is it important for ML systems?

A feature store is a centralized repository for storing, managing, and serving ML features. It solves: (1) Feature reuse — teams share computed features instead of rewriting transformations. (2) Training-serving skew — the same feature computation logic is used for both training and real-time serving. (3) Point-in-time correctness — historical feature values are stored with timestamps to prevent data leakage during training. Key components: an offline store (batch features for training, e.g. Parquet files or a data warehouse) and an online store (low-latency key-value serving, e.g. Redis or DynamoDB).
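Point-in-time correctness can be sketched with a standard-library as-of lookup — the timestamps and values are made up:

```python
from bisect import bisect_right

# For each training event, use the latest feature value observed AT OR
# BEFORE the event's timestamp — never a later one, which would leak
# future information into training.
feature_history = [(1, 0.2), (5, 0.9), (9, 0.4)]   # (timestamp, value)
timestamps = [t for t, _ in feature_history]

def feature_as_of(ts):
    i = bisect_right(timestamps, ts) - 1
    return feature_history[i][1] if i >= 0 else None

assert feature_as_of(6) == 0.9     # sees the value from t=5, not t=9
assert feature_as_of(0) is None    # no feature value existed yet
```

Feature stores run this "as-of join" at scale when materializing training sets from historical feature logs.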

7. What is model drift and how do you detect it in production?

Model drift occurs when a deployed model degrades over time. Two types: (1) Data drift — the distribution of input features changes (e.g., user demographics shift). Detected by comparing incoming feature distributions against training data using statistical tests (KS test, PSI, Jensen-Shannon divergence). (2) Concept drift — the relationship between features and target changes (e.g., fraud patterns evolve). Detected by monitoring prediction accuracy against ground truth labels when available.
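PSI, one of the statistical tests mentioned, is simple enough to implement directly over binned distributions:

```python
import math

# Population Stability Index between two binned distributions
# (expected = training data, actual = production traffic).
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant drift worth investigating.
def psi(expected, actual, eps=1e-6):
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)   # eps guards against empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

assert psi([25, 25, 25, 25], [25, 25, 25, 25]) == 0.0   # identical: no drift
assert psi([25, 25, 25, 25], [70, 10, 10, 10]) > 0.25   # strong shift
```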

🔴 Hard (4)

1. What challenges arise when running Jupyter notebooks in production, and how are they addressed?

Key challenges: (1) Notebooks have hidden state from out-of-order cell execution — use nbstripout or restart-and-run-all before committing. (2) Difficult to version control — use Jupytext to sync .ipynb with .py files. (3) Hard to schedule — use Papermill to parameterize and execute notebooks as batch jobs. (4) Poor testing support — extract logic into importable .py modules with unit tests.
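The version-control fix can be illustrated by what nbstripout conceptually does: a .ipynb file is JSON, and clearing outputs and execution counts makes diffs clean. This is a simplified sketch, not nbstripout itself:

```python
# A pared-down notebook structure (.ipynb files are JSON documents).
notebook = {
    "cells": [
        {"cell_type": "code", "source": ["1 + 1"],
         "execution_count": 3, "outputs": [{"data": {"text/plain": ["2"]}}]},
        {"cell_type": "markdown", "source": ["# Notes"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
}

# Strip the volatile parts: outputs and execution counts churn on every
# run and pollute Git diffs even when the code is unchanged.
for cell in notebook["cells"]:
    if cell["cell_type"] == "code":
        cell["outputs"] = []
        cell["execution_count"] = None

assert notebook["cells"][0]["outputs"] == []
```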

Remember: Jupyter = interactive computing. Cells of code + markdown + visualizations. Name origin: Julia + Python + R. Great for exploration, bad for production code.

2. How do MLflow and Weights & Biases (W&B) help with ML experiment tracking?

Both tools track experiments to ensure reproducibility: (1) MLflow — open-source, tracks parameters, metrics, artifacts, and code versions. Supports a model registry for staging/production promotion. Runs a tracking server (can be self-hosted). Integrates with Spark, sklearn, PyTorch. (2) W&B (Weights & Biases) — SaaS-first, provides richer visualization (loss curves, gradient histograms), hyperparameter sweep orchestration, dataset versioning, and collaborative dashboards.
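The data both tools capture per run can be sketched with a toy tracker — this is not MLflow's or W&B's actual API, just the shape of the record they store:

```python
import time

# Minimal run record: parameters set once, metrics logged over steps.
class Run:
    def __init__(self, name):
        self.record = {"name": name, "params": {},
                       "metrics": [], "start": time.time()}

    def log_param(self, key, value):
        self.record["params"][key] = value

    def log_metric(self, key, value, step):
        self.record["metrics"].append(
            {"key": key, "value": value, "step": step})

run = Run("baseline")
run.log_param("lr", 0.01)
run.log_metric("loss", 0.42, step=1)

assert run.record["params"]["lr"] == 0.01
```

Persisting many such records is what makes runs comparable later: same schema, different parameters and metric curves.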

3. What tools and practices ensure data quality and validation in ML pipelines?

Key tools: (1) Great Expectations — define "expectations" (schema checks, value ranges, null rates, uniqueness) as code. Generates data docs and alerts on violations. (2) Pandera — lightweight schema validation for Pandas DataFrames (column types, value constraints, custom checks). (3) TensorFlow Data Validation (TFDV) — detects anomalies, schema drift, and training-serving skew. (4) dbt tests — built-in and custom SQL tests for data warehouse tables.
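The expectations-as-code idea can be sketched without any of these libraries — the check names and batch are made up for illustration:

```python
# Declare checks once as (name, predicate) pairs, run them against every
# batch, and collect violations instead of failing silently downstream.
expectations = [
    ("age is non-null",
     lambda row: row["age"] is not None),
    ("age in valid range",
     lambda row: row["age"] is not None and 0 <= row["age"] <= 120),
]

batch = [{"age": 34}, {"age": None}, {"age": 200}]

violations = [
    (name, i)
    for name, check in expectations
    for i, row in enumerate(batch)
    if not check(row)
]
# Row 1 fails both checks; row 2 fails only the range check.
assert len(violations) == 3
```

Real tools add the missing operational layer on top of this core: profiling to suggest checks, HTML data docs, and alerting hooks.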

4. How do you design a scalable ML training infrastructure that supports distributed training across multiple GPUs and nodes?

Key components: (1) Distributed training frameworks — use Horovod, PyTorch DistributedDataParallel (DDP), or DeepSpeed for data/model parallelism. Data parallelism splits batches across GPUs; model parallelism splits the model itself for very large models. (2) Storage — use a high-throughput parallel filesystem (Lustre, GPFS, or cloud equivalents like FSx for Lustre) for training data. Object storage (S3) is fine for checkpoints.
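Data parallelism, the more common of the two schemes, can be sketched in pure Python — a toy all-reduce over two simulated workers, not DDP or Horovod code:

```python
# Each worker computes gradients on its own data shard; the gradients
# are then all-reduced (averaged) so every replica applies the same
# update and stays in sync.
def grad_on_shard(w, shard):
    # Gradient of mean squared error for the model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]            # one shard per simulated worker

w = 0.0
local_grads = [grad_on_shard(w, s) for s in shards]   # parallel step
avg_grad = sum(local_grads) / len(local_grads)        # all-reduce
w -= 0.1 * avg_grad                                   # identical update everywhere

assert avg_grad == -30.0 and w > 0.0
```

The averaging step is the communication bottleneck at scale, which is why frameworks implement it with ring or tree all-reduce over NCCL rather than a central parameter server.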
