Course 1: Real world project

Hi guys! I want to do an exercise to immerse ourselves in a real-world experience, using a practical example that puts us in a position to fill our gaps and understand what we need from customers. At the beginning, especially if you’re working as a freelancer or promoting yourself on platforms like LinkedIn, you’re more likely to get proposals from start-ups or individuals, so let’s play and see how we can develop a plan for a starting project!

Let’s build a project together using the principles of the first course and see how things change with the knowledge we gain.

You are working for a start-up that wants to develop a system that uses LLMs to provide scouts with insights about players; the initial dataset is provided by the Wyscout API.

  • What questions, if any, would you ask the stakeholders?
  • What would your proposed architecture be for this project?

Let’s use the knowledge from Course 1 to apply the principles and provide a plan for this person!


SPOILER ALERT, just in case you want to write your own response before reading others.

These are the questions I would use to structure the initial conversations with stakeholders.

  1. Tell me about the company’s goals (or your team’s goals) for this upcoming year.
  2. How does this project fit into those goals?
  3. What will you do with the data from the Wyscout API?
  4. What systems are currently in place to work with this data?
  5. What are the biggest pain points for you with the current system?
  6. Who else at your company will work with this data?

What do you all think? What would you ask during these initial conversations?


Requirements Gathering

I will try to get a better understanding of the business needs:

1. Business Goals

  • What is the ultimate objective of this project? Are we aiming to identify promising players, evaluate player performance, or something else?
  • What insights do scouts need most? Are they interested in predictive metrics, player comparisons, or historical trends?

2. Stakeholder Needs

  • Who are the primary users of this system? Are they only scouts, or will coaches and data analysts also use it? Will I have the chance to meet some of them?
  • What information do scouts currently lack but wish they had?
  • How often will users need to access updated insights? Daily, weekly, real time? This will guide data refresh and processing frequency.

3. System Requirements

a. Functional Requirements

  • What specific data should we gather from the Wyscout API? Player statistics, performance metrics, injury history, scouting reports, etc.?
  • How should the data be processed? Are there transformations we need to apply before delivering it to ML pipelines?
  • In which format will the Machine Learning Engineer need the data?

b. Non-functional Requirements

  • What are the performance expectations for data retrieval and processing? Is a monthly update acceptable, or do we need daily?
  • How scalable does the system need to be?
  • What are the data privacy and security requirements?

Proposed Architecture

1. Data Ingestion Layer

  • API Ingestion
  • Data Lake: Store raw data in an Amazon S3 bucket.

2. Data Transformation Layer

  • ETL/ELT: Perform data cleaning, transformations, and aggregations.
  • Data Warehouse: Store processed data in Amazon Redshift.

3. Monitoring and Logging

  • Implement monitoring with tools like Amazon CloudWatch to track data pipeline performance, API usage, and system health.
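To make the ingestion layer concrete, here is a minimal Python sketch of the API-to-S3 step. The endpoint path, bucket name, and basic-auth credentials are placeholders I am assuming, not taken from the Wyscout docs, so treat it as a starting point rather than a working integration.

```python
# Hypothetical ingestion step: pull raw player data from the Wyscout API
# and land it unmodified in the S3 data lake. Endpoint path, bucket name,
# and basic-auth credentials are placeholders to verify against the docs.
import datetime
import json

import boto3
import requests

API_BASE = "https://apirest.wyscout.com/v3"  # verify against the API docs
BUCKET = "scouting-data-lake-raw"            # hypothetical bucket name


def ingest_players(competition_id: str, user: str, password: str) -> str:
    """Fetch players for one competition and store the raw JSON in S3."""
    resp = requests.get(
        f"{API_BASE}/competitions/{competition_id}/players",
        auth=(user, password),  # HTTP basic auth (assumption)
        timeout=30,
    )
    resp.raise_for_status()

    # Partition by ingestion date so downstream ETL can do incremental loads.
    key = (
        f"wyscout/players/competition={competition_id}/"
        f"dt={datetime.date.today().isoformat()}/players.json"
    )
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(resp.json()))
    return key
```

A CloudWatch alarm on this job’s error metrics would then cover the monitoring bullet above without much extra plumbing.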

Hi @WyattB, amazing questions! I will take on the persona of the stakeholder so we can do the exercise:

  1. The goal for the upcoming year is to have a chatbot live and running so potential customers can evaluate the use of AI in scouting analysis
  2. This is the first project; the idea is to have the model running
  3. The idea is to get the data from Wyscout to feed the LLMs, so the user can ask questions about players and the model provides answers
  4. We don’t have a system running yet; this will be our first iteration
  5. As this is our first approach to the data, we don’t have an answer for this
  6. Currently just you, to prepare the system so I can hire a data scientist to do the rest of the job

Based on my answers, what would be your next step?


As we can see, this is at an early stage; later we will try to mimic a project with a system already in place.

Great questions!!

Business goals:

  1. To provide a platform for scouts to have conversations with the LLM and make better decisions
  2. It depends on the user; most of them use comparisons and predictive metrics

Stakeholder needs
3. Scouts
4. Currently, scouts don’t have a system that allows them to have conversations about player skills the way we can with LLMs
5. Usually scouts do this at the end of the season, but they might check weekly updates based on the results of their last match; it can even be up to twice per week

Based on my information, what would be your next step?


Hi Pastor, thanks for engaging with this thread. It really is a valuable experience that helps me get the most out of the material!

This is the outline for requirements gathering from the first course. Based on the conversation so far, this is what I have recorded. How does this look to you?

REQUIREMENTS

Business Goals

  1. Train an LLM for scouting analysis
  2. Deploy a customer-facing chatbot that responds to live requests based on this LLM

Stakeholder Needs

For a future data science hire

  1. A system to access data from the Wyscout API for LLM model training
  2. A system to serve Wyscout data to the LLM for live model inference

System Requirements

Functional

  1. The system needs to serve Wyscout data to the LLM for model training and live model inference

Nonfunctional

  1. Maintainability - The system will be easy to adapt to changes in the data schema
  2. Reliability - The system will be available even when the Wyscout API is not
  3. Reliability - The system will serve accurate and up-to-date data (within 12 hours for inference)
  4. Scalability - For inference, the system will scale up to serve the data volume expected at the maximum level of user activity

@pastorsoto @francois_adam
This would be my starting point for an architecture proposal. This is all pretty new to me, so any and all feedback is super helpful. Thanks!

ARCHITECTURE

My first proposal would be the simple data pipeline outlined below. I used a batch architecture for both the training and the inference pipelines, because the source system is an API and real-time streaming is not possible.

LLM Training Pipeline

  1. Wyscout API
  2. AWS Lambda
  3. Amazon DynamoDB

LLM Inference Pipeline

  1. Wyscout API
  2. AWS Glue ETL
  3. Amazon DynamoDB

Both pipelines would run Python code, via AWS Lambda for training and via AWS Glue ETL for inference, to read in player data from the Wyscout API and save it in an Amazon DynamoDB database. The training pipeline would run as needed, whenever the data scientist wants fresh data. The inference pipeline, on the other hand, will update more frequently and needs more robust data quality checks and monitoring capabilities, since it will run in production.

I used Amazon DynamoDB as the database because it can store the Wyscout data as key-value pairs, where the values are JSON-format records. This matches the raw format of the data and thus gives the data scientist maximal flexibility by minimally processing the raw data. It also allows the data scientist to query the data with SQL-like syntax, which will be useful for exploring the data and eventually integrating it into the production LLM code.
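Since the pipelines above describe this load step in prose, here is a hedged boto3 sketch of it. The table name and key schema are assumptions; the records are stored as JSON strings because DynamoDB’s boto3 serializer rejects Python floats.

```python
# Hypothetical load step for the pipelines above: write raw Wyscout player
# records into DynamoDB, keyed by their Wyscout ID. Table name and key
# schema ("playerId" as partition key) are assumptions.
import json

import boto3

table = boto3.resource("dynamodb").Table("wyscout_players")  # hypothetical table


def load_players(players: list[dict]) -> None:
    """Batch-write player records with minimal processing."""
    with table.batch_writer() as batch:
        for player in players:
            batch.put_item(
                Item={
                    "playerId": str(player["wyId"]),  # partition key
                    # Stored as a JSON string: the boto3 serializer rejects
                    # floats, and a string keeps the raw record intact.
                    "raw": json.dumps(player),
                }
            )
```

One trade-off worth noting: storing the value as a string keeps ingestion simple, but queries that filter on nested fields would need the record parsed, or selected attributes promoted to top-level keys.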


I like your approach. As you can see, a lot of things from the course can be applied to real-world scenarios; moreover, having the intuition to navigate these concepts and understand what the client needs is a skill that will help you succeed.
Often the client’s idea is not as straightforward as it seems, and we need to evaluate the best way to get the data into a useful format.

The API documentation lets us understand the work we need to do to obtain the data. The complexity of the problem lies in the structure of the data: there are hundreds of clubs and leagues from all over the world, and the API does not provide a connection between a name and a player. Each player has an ID, but player names change from system to system, because some systems might store Messi as Lionel Messi, others as Messi or L. Messi, and there is no built-in way to reconcile them.

I like this data because it allows you to navigate through the different problems you might face in a real-world job.

How would you architect this API and data to provide the client with the data in a useful way?

This is the API documentation
Wyscout API

This is a sample of the data
Download Free Samples - Wyscout FootballData

@pastorsoto @WyattB

This would be my updated architecture, feel free to correct me:

1. Data Ingestion Layer:

  • Since real-time updates are not necessary, leverage a batch-oriented data pipeline (AWS Glue ETL jobs) that pulls data from the Wyscout API at set intervals (weekly or seasonal).
  • Data Lake Storage: Raw data from the Wyscout API is stored in Amazon S3 as JSON files. This storage allows for schema evolution over time, as Wyscout’s data schema may change.

2. Data Transformation Layer:

  • AWS Glue ETL jobs to clean and standardize data, resolving player name discrepancies across systems (see the sketch after this list).
  • Data Validation: Include schema validation in the ETL job to ensure incoming data conforms to expected formats.
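For the name-discrepancy point flagged above, here is a minimal standard-library sketch of how that resolution step might work, mapping variants like “L. Messi” onto a canonical player record. The normalization rules, the 0.6 cutoff, and the example wyId are all assumptions to tune against real data.

```python
# Hypothetical name-resolution step for the ETL job: match free-text name
# variants against canonical Wyscout names and return the player's wyId.
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and drop abbreviation dots."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return ascii_name.lower().replace(".", "").strip()


def resolve_player(raw_name: str, players: dict[int, str]) -> int | None:
    """Return the wyId of the closest canonical match, or None if no match."""
    canonical = {normalize(n): wy_id for wy_id, n in players.items()}
    matches = difflib.get_close_matches(normalize(raw_name), canonical, n=1, cutoff=0.6)
    return canonical[matches[0]] if matches else None


catalog = {3359: "Lionel Messi"}  # hypothetical wyId, for illustration only
assert resolve_player("L. Messi", catalog) == 3359
```

In production this would run inside the Glue job, with the canonical catalog built from the Players endpoint, so every variant is resolved to a wyID before loading.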

3. Data Storage Layer

  • Data Lake: Amazon S3 for raw data storage, retaining historical data for potential reprocessing.
  • Data Warehouse: Amazon Redshift to support analytical queries on structured data for the LLM’s ML model and scouts’ reports. This allows flexible querying across multiple seasons and player attributes.

4. Machine Learning Layer

  • Data Versioning: Maintain distinct snapshots of training data (in Redshift) to track changes over time, ensuring model stability across different seasons.
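As a concrete reading of the versioning bullet, here is a hedged sketch that freezes a dated snapshot table inside Redshift. The connection parameters and table names are placeholders, and psycopg2 is just one common way to talk to Redshift.

```python
# Hypothetical data-versioning step: freeze today's curated training data
# into an immutable snapshot table so models can be retrained reproducibly.
import datetime

import psycopg2  # Redshift speaks the PostgreSQL wire protocol


def snapshot_training_data(conn_params: dict) -> str:
    """Copy the curated player stats into a dated snapshot table."""
    snapshot = f"player_stats_snapshot_{datetime.date.today():%Y%m%d}"
    with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
        # CREATE TABLE AS freezes the current state; the base table keeps evolving.
        cur.execute(f"CREATE TABLE {snapshot} AS SELECT * FROM curated.player_stats;")
    return snapshot
```

Training runs would then reference a snapshot by name, which makes it possible to reproduce any past model even after the base table has moved on.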

@pastorsoto

Hoping this post is still active!!!

I understand that you basically want to create an LLM for the scouts, who will be choosing players based on the answers the LLM provides, right?

As it’s phase 1 of the project, we need to start with gathering data so that, once all the data is collected in one place, we can make sense of it, create data models, check whether those data models/relations help us answer our questions, and then try to apply LLM techniques to the data.

So, let’s first fetch the data from the APIs related to players. For now, let’s keep the referee and coach API data out of the picture.

Using the API documentation you provided (Nov 5 reply post):
https://apidocs.wyscout.com/
we can fetch data from the APIs below and their endpoints:
Areas | Competitions | Players | Season | Rounds | Teams

I went through the API documentation briefly and believe these APIs can get us a lot of data regarding players, competitions, and events.

For your question (how would you architect this API and data to provide the client with the data in a useful way?):
Answer: For a basic data model, get the area data, then the competition data based on area. The competition data will give you the competition wyID. Using the competition wyID, get the players data, which consists of the player name and player wyID.
As we add season, round, and team data, the data model will grow, but we will keep players as the main dataset; a sketch of these chained lookups follows below.
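Here is a rough Python sketch of those chained lookups. The endpoint paths and response field names are read loosely from the docs and should be verified; authentication is assumed to be HTTP basic.

```python
# Hypothetical walk of the data model: area -> competitions -> players.
# Endpoint paths and field names are assumptions to verify against the docs.
import requests

API_BASE = "https://apirest.wyscout.com/v3"  # verify against the API docs


def fetch(path: str, auth: tuple[str, str]) -> dict:
    """GET one endpoint with HTTP basic auth (assumed) and return the JSON."""
    resp = requests.get(f"{API_BASE}/{path}", auth=auth, timeout=30)
    resp.raise_for_status()
    return resp.json()


def players_by_area(area_id: str, auth: tuple[str, str]) -> list[dict]:
    """Keep players as the main dataset, built up competition by competition."""
    players: list[dict] = []
    for comp in fetch(f"competitions?areaId={area_id}", auth).get("competitions", []):
        comp_players = fetch(f"competitions/{comp['wyId']}/players", auth)
        players.extend(comp_players.get("players", []))
    return players
```

Run weekly (via cron, or a scheduled AWS job if you go the route mentioned below), this gives the client a refreshed players dataset keyed by wyID.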

Architecture-wise:
We can create Python scripts to fetch this data and run them weekly/as required to get the latest data,
OR,
as fellow colleagues said in the comments above, we can use AWS services to do the same task.

Great plan! As you can see, real-world problems can be solved, or at least architected, with just the first course of the specialization. The first course was oriented toward the building blocks of data engineering, or the first two days of the conversation in a job. Now that we have the plan, we can dive deep into the next courses to build our knowledge and implement the solutions!!

Thanks for engaging!

Yes! That’s an amazing plan! The main idea of this exercise was to engage with the real-world scenarios that you might face in real projects and jobs. Many companies would love to see this type of project in your portfolio, even if it is a draft with just a basic implementation.

Check this job post: it uses data similar (or identical) to what I am providing, and it is likely that your first task would be doing something similar for them, even as a data scientist!

I am glad you engaged in the conversation!!

Lead Data Scientist | 25 October, 2024 | Jobs and careers with Liverpool Football Club

About the role

Lead Data Scientist:

As the Lead Data Scientist, you will be an integral part of our Research Department. Your focus will be on expanding our understanding of football through innovative modelling and analysis of football data. You will work with stakeholders across all of Liverpool’s teams (women’s, men’s, U23, and academy) to provide insights related to match analysis, player performance, youth development, and talent identification.

Role Overview:

  • Collaboration:
    • Work with staff members across all of Liverpool’s teams (women’s, men’s, U23, and academy) to provide relevant models, visualisations, and reports.
  • Model Development:
    • Design and develop cutting edge statistical models to analyse a broad set of football data including: event data, tracking data, and performance data.
    • Test and validate the accuracy and predictiveness of data models.
    • Productionise models and monitor their performance over time.
  • Platform Management:
    • Build and maintain self-service platforms to surface models and visualisations to various departments across the club.
  • Continuous Learning:
    • Stay up-to-date with literature, tools, and technologies relevant to football analytics.

Qualifications:

  • Education: Master’s or Ph.D. in mathematics, physics, statistics, or other STEM subject.
  • Experience: Previous work with football analytics. Minimum of 5 years of experience in data science or related field.

Technical Skills:

  • Programming: Proficiency in Python or R.
  • Database Management: Understanding and experience with database technologies. Experience with modern data lake and data warehousing tools.
  • Modelling and Analysis: Mastery of statistical modelling, predictive analysis, machine learning, and deep learning.
  • Big Data: Experience with large datasets and technologies for dealing with big data.
  • Cloud Computing: Familiarity with cloud computing technologies and services for productionising machine learning models.
  • Reporting and Visualisation: Competency in producing high quality reports and data visualisations.

Soft Skills:

  • Project Management: Ability to break down complex problems into manageable tasks to ensure timely and successful project completion.
  • Collegiality: Ability to collaborate effectively with colleagues of various skill sets to solve problems and build maintainable solutions.
  • Self-learning: Ability to up-skill flexibly in response to new problem spaces and technical challenges.
  • Scientific Mindset: Scientific approach to problem solving with an ability to quantify confidence in the results with an understanding of the statistical and systematic uncertainties inherent in any model.
  • Communication: Ability to communicate results effectively and efficiently in a time-sensitive environment with a diverse array of stakeholders.

Why should you apply?

This is a full-time role working 35 hours per week. This role can either be hybrid, based at the AXA Training Centre, or fully remote.

To reward your hard work and commitment we offer a competitive salary, 25 days holiday (plus 8 bank holidays and the option to purchase up to an additional 5 days) and a contributory pension scheme.

You will have access to our benefits kit bag where you can get high street discounts, and a selection of benefit schemes you can join. There are opportunities to get involved with volunteering through our LFC Foundation to give back to the local community.

At Liverpool Football Club, we have an unwavering commitment to equality, diversity and inclusion and are always looking to make a positive difference in the communities that we operate within. We are proud of our achievements in this area; maintaining the Premier League Equality Standard Advanced Level, becoming a founding signatory of the Football Association’s Football Leadership Diversity Code and being recognised as a leader in this important area on and off the pitch. We take our responsibilities in this area seriously and through the work being done across the club, we are committed to increasing the diversity of our people and becoming an increasingly inclusive workplace for all. We are committed to hiring great people representative of diverse backgrounds, perspectives, and skills across our entire business. If you share our enthusiasm and passion for inclusivity, then we want to hear from you.

Liverpool FC is committed to safeguarding and promoting the welfare of children and vulnerable adults and expects all Colleagues and Volunteers to share this commitment.