Seeking Dataset Ideas for a Data Engineering Project 🚀

francois_adam · January 13, 2025, 12:04am

Hi everyone,

I’m working on a data engineering project that will serve as a capstone to showcase my skills. The project has a strong focus on building and orchestrating data pipelines, storage architectures, and serving data for analytics and machine learning use cases. Here’s a quick summary of what I aim to achieve in this project:

Project Scope

End-to-End Pipelines:
Using infrastructure-as-code tools to automate and monitor pipelines.
Data Architecture:
Designing data lake and lakehouse architectures for analytics and ML use cases.
Processing & Modeling:
Preparing data for BI, analytics, and machine learning.
BI and Visualization:
Creating dashboards (with Power BI) to provide actionable insights for a business intelligence problem.

The Challenge: Finding the Right Dataset

To make this project impactful and realistic, I’m looking for good datasets that are:

Easily obtainable (preferably open-source).
Relevant to a real-world business problem, ideally one with a BI focus.
Large enough to warrant modeling as a star schema.
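
To make the star-schema criterion concrete, here is a minimal sketch of what that modeling step could look like on a small flat export. The records, column names, and helper functions are all hypothetical and only illustrate the idea; a real dataset would have far more rows and attributes per dimension.

```python
# Hypothetical flat order records, standing in for a raw e-commerce export.
raw_orders = [
    {"order_id": 1, "customer": "alice", "product": "keyboard", "quantity": 2, "unit_price": 30.0},
    {"order_id": 2, "customer": "bob", "product": "mouse", "quantity": 1, "unit_price": 15.0},
    {"order_id": 3, "customer": "alice", "product": "mouse", "quantity": 3, "unit_price": 15.0},
]


def build_dimension(rows, column):
    """Assign an integer surrogate key to each distinct value in `column`."""
    keys = {}
    for row in rows:
        value = row[column]
        if value not in keys:
            keys[value] = len(keys) + 1
    return keys


def build_star_schema(rows):
    """Split flat rows into two dimensions and one fact table (illustrative only)."""
    dim_customer = build_dimension(rows, "customer")
    dim_product = build_dimension(rows, "product")
    fact_sales = [
        {
            "order_id": row["order_id"],
            # Facts reference dimensions by surrogate key, not by raw value.
            "customer_key": dim_customer[row["customer"]],
            "product_key": dim_product[row["product"]],
            "quantity": row["quantity"],
            "revenue": row["quantity"] * row["unit_price"],
        }
        for row in rows
    ]
    return dim_customer, dim_product, fact_sales


dim_customer, dim_product, fact_sales = build_star_schema(raw_orders)
```

A dataset is a good fit when it naturally yields several such dimensions (customers, products, dates, locations) around one or more fact tables.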

Potential Themes

Here are a few themes I’m exploring for datasets:

E-commerce: Customer behavior, transactions, product catalogs, etc.
Public Transportation: Passenger flow, delays, route optimization, etc.
Social Media or Clickstream Data: User interactions, sentiment analysis, etc.
IoT Sensors: Weather, traffic, or energy consumption data.
Finance: Stock prices, transactions, or fraud detection datasets.

What I’ll Do With the Data

Build batch and streaming data pipelines in AWS.
Model data for BI insights, and potentially apply machine learning/deep learning.
Create a user-friendly dashboard to visualize insights and propose solutions to a business problem.

If you know of any good datasets (or data sources/APIs) that fit this description, I’d love to hear your recommendations! Your input will help shape a hands-on project that demonstrates practical solutions for data challenges.

Thanks in advance for your help!

François

pastorsoto · January 16, 2025, 11:31pm

I like finance datasets. They are good preparation for working with streaming data, and there is a lot of openly available data to draw insights from. Building a system that analyzes financial data and produces good insights is no small task, so if you can pull this off, it will be amazing!
