Seeking Dataset Ideas for a Data Engineering Project 🚀

Hi everyone,

I’m working on a data engineering project that will serve as a capstone to showcase my skills. The project has a strong focus on building and orchestrating data pipelines, storage architectures, and serving data for analytics and machine learning use cases. Here’s a quick summary of what I aim to achieve in this project:

Project Scope

  1. End-to-End Pipelines:
    Using infrastructure-as-code tools to automate and monitor pipelines.

  2. Data Architecture:
    Designing data lake and lakehouse architectures for analytics and ML use cases.

  3. Processing & Modeling:
    Preparing data for BI, analytics, and machine learning.

  4. BI and Visualization:
    Creating dashboards (with Power BI) to provide actionable insights for a business intelligence problem.

The Challenge: Finding the Right Dataset

To make this project impactful and realistic, I’m looking for good datasets that are:

  • Easily obtainable (preferably open-source).
  • Relevant to a real-world business problem, ideally one with a BI focus.
  • Large enough to warrant modeling into a star schema.

Potential Themes

Here are a few themes I’m exploring for datasets:

  • E-commerce: Customer behavior, transactions, product catalogs, etc.
  • Public Transportation: Passenger flow, delays, route optimization, etc.
  • Social Media or Clickstream Data: User interactions, sentiment analysis, etc.
  • IoT Sensors: Weather, traffic, or energy consumption data.
  • Finance: Stock prices, transactions, or fraud detection datasets.

What I’ll Do With the Data

  1. Build pipelines (both batch and streaming) to process the data on AWS.
  2. Model data for BI insights, and potentially apply machine learning/deep learning.
  3. Create a user-friendly dashboard to visualize insights and propose solutions to a business problem.

If you know of any good datasets (or data sources/APIs) that fit this description, I’d love to hear your recommendations! Your input will help shape a hands-on project that demonstrates practical solutions for data challenges.

Thanks in advance for your help! 😊

François

I like finance datasets: they are good practice for streaming data, and there is a lot of publicly available data to draw insights from. Building a system that analyzes finance data and produces good insights is no small task, so if you can pull this off, it will be amazing!