Hi everyone,
I’m working on a data engineering project that will serve as a capstone to showcase my skills. The project has a strong focus on building and orchestrating data pipelines, storage architectures, and serving data for analytics and machine learning use cases. Here’s a quick summary of what I aim to achieve in this project:
Project Scope
- End-to-End Pipelines: Using infrastructure-as-code tools to automate and monitor pipelines.
- Data Architecture: Designing data lake and lakehouse architectures for analytics and ML use cases.
- Processing & Modeling: Preparing data for BI, analytics, and machine learning.
- BI and Visualization: Creating dashboards (with Power BI) to provide actionable insights for a business intelligence problem.
The Challenge: Finding the Right Dataset
To make this project impactful and realistic, I’m looking for good datasets that are:
- Easily obtainable (preferably open-source).
- Relevant to a real-world business problem, ideally one with a BI focus.
- Large and varied enough to justify modeling into a star schema.
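To make the star schema criterion concrete, here is a minimal sketch of the kind of modeling I have in mind: splitting a flat transactions table into dimension tables and a fact table. The rows, column names, and the `build_dimension` helper are all hypothetical, made up purely for illustration and not taken from any real dataset.

```python
# Hypothetical flat e-commerce rows; every name and value here is illustrative.
flat_rows = [
    {"order_id": 1, "customer": "Ana", "country": "PT",
     "product": "Mug", "category": "Kitchen", "amount": 12.5},
    {"order_id": 2, "customer": "Ben", "country": "FR",
     "product": "Mug", "category": "Kitchen", "amount": 12.5},
    {"order_id": 3, "customer": "Ana", "country": "PT",
     "product": "Lamp", "category": "Home", "amount": 40.0},
]


def build_dimension(rows, columns):
    """Map each distinct combination of `columns` to a surrogate key (1, 2, ...)."""
    dimension = {}
    for row in rows:
        natural_key = tuple(row[c] for c in columns)
        if natural_key not in dimension:
            dimension[natural_key] = len(dimension) + 1
    return dimension


# Dimension tables: one entry per distinct customer and per distinct product.
dim_customer = build_dimension(flat_rows, ["customer", "country"])
dim_product = build_dimension(flat_rows, ["product", "category"])

# Fact table: one row per order, holding surrogate keys plus the measure.
fact_sales = [
    {
        "order_id": row["order_id"],
        "customer_key": dim_customer[(row["customer"], row["country"])],
        "product_key": dim_product[(row["product"], row["category"])],
        "amount": row["amount"],
    }
    for row in flat_rows
]
```

A dataset is a good fit if it naturally yields several dimensions like these (customers, products, dates, locations) around at least one fact table with numeric measures.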
Potential Themes
Here are a few themes I’m exploring for datasets:
- E-commerce: Customer behavior, transactions, product catalogs, etc.
- Public Transportation: Passenger flow, delays, route optimization, etc.
- Social Media or Clickstream Data: User interactions, sentiment analysis, etc.
- IoT Sensors: Weather, traffic, or energy consumption data.
- Finance: Stock prices, transactions, or fraud detection datasets.
What I’ll Do With the Data
- Build batch and streaming processing pipelines in AWS.
- Model data for BI insights, and potentially apply machine learning/deep learning.
- Create a user-friendly dashboard to visualize insights and propose solutions to a business problem.
If you know of any good datasets (or data sources/APIs) that fit this description, I’d love to hear your recommendations! Your input will help shape a hands-on project that demonstrates practical solutions for data challenges.
Thanks in advance for your help!
François