Getting Real

Hi community,

I want to say hello to my fellow learners.

Just finishing up the ML Certification. The fun part usually starts when I move beyond the courses and start to apply learnings to solve real business problems.

I am currently building out basic infrastructure to support collaborative development (GitHub, etc.). I also run Linux infrastructure for model building and services hosting. I have 25+ years of technology experience building and launching systems and managing data. It's the AI piece that's a new addition to my toolbox. I'm deeply grateful for Andrew Ng's gentle introduction to the AI space.

I am in product development mode on the projects below, validating product hypotheses with real customers.

If anyone is interested in discussing these projects or is looking to collaborate or just talk over a coffee, feel free to reach out.

— Greg

Project 1: ROI Navigator

For businesses that need to steer their technology investment strategy but often struggle to financially analyze the various options, ROI Navigator provides robust analysis tools that help them understand their current IT costs, the total cost of the potential future state, and the price of getting there.

I’ve previously done a lot of work in this domain, but now I’m applying AI to reduce points of friction. Features include:

  • Estimate current-state TCO based on sparse inputs
  • Estimate one-time costs and schedules to implement a change

I am currently building with PyTorch and Spark, using React and Next.js for my front end. My working approach centers on effective neural network development, but I may start leveraging RAG / Ollama for web searches to compile an easy-to-maintain global cost database for things like hardware, labor, and power.

Project 2: Specialty Insurance Market Customer Services

There is an emerging area in insurance called “specialty markets”. It’s known by other names as well, but it essentially boils down to policy areas involving high risk due to wildfire and flooding. Insurers are increasingly being required to cover these risk areas and write policies, and states like California are capping what the insurers can charge. This drives a need to comply with market regulations while keeping costs down.

My feature set for this solution centers on enabling the insurance business processes required to service customers, from qualification, to policy writing, to claims processes.

My technology strategy at this moment is centered on Ollama and LangChain, along with several integrations up and down the value chain.

I recommend you continue your studies with the Deep Learning Specialization. It allows you to create much more complex models.

The MLS course only scratches the surface of many topics.


TMosh, thanks for the pointer. My approach has been to rip out as much code as possible from the MLS series, build my own libraries based on that code, and start applying it to solve problems.

There is of course the process of moving from the Python-coded implementations to libraries that do much of the heavy lifting - that's part of the fun.

In the process, I have been able to set up a basic SDLC. I also made the decision to migrate code to PyTorch. It's been good to be up to speed enough to compare libraries. In terms of continuing learning, do you have a specific course you recommend? Thanks.

DLS courses 2, 4, and 5 use TensorFlow. PyTorch is similar.

All of the DLS courses are useful. But starting in the middle of the sequence can be difficult.

DLS courses 4 and 5 discuss convolutional networks (CNN) and sequence models (i.e. LSTMs and Transformers).

CNNs are the key to most of the computer vision work.

Sequence models are the technology behind the current flock of natural language processing and chat tools.


Hi all, using this space to communicate & share learnings with the community here.

It’s always good to come up for air every so often and talk to humans. :wink:

This pertains to the ROI Navigator project I mentioned up top.

I’m building a model that predicts cloud rates based on service consumption records.

I’m building a dirt-simple nn to start, with the features and targets below. I’m not even doing validation yet. I want to start basic, ask questions, and evolve iteratively.

You can also see from the features list how I'm vectorizing the data: if it's a categorical string, it gets encoded; if it's a numeric value, it gets normalized. (A preprocessing sketch follows the feature/target lists below.)

To give a sense of the data at a deeper level:

  • productCode has 26 unique values
  • usageType has ~600 unique values ← this has not lent itself well to primitive encoding, and I fear it may be too big for one-hot encoding.

features & targets

features = [
      ["lineItem/ProductCode", 'categorize'],
      ["product/region", 'categorize'],
      ["lineItem/UsageType", 'categorize'],
      ["lineItem/UsageAmount", 'normalize'],
      ["lineItem/Operation", 'categorize'],
      ["pricing/unit", 'categorize'],
      ["pricing/term", 'categorize'],
      ["product/transferType", 'categorize'],
      ["product/productFamily", 'categorize'],
      ["product/servicecode", 'categorize'],
      ["product/fromLocation", 'categorize'],
      ["product/toLocation", 'categorize'],
      ["product/toLocationType", 'categorize'],
      ["product/fromRegionCode", 'categorize'],
]

targets = ["lineItem/UnblendedRate"]
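For reference, a minimal sketch of how that spec could drive the encode/normalize dispatch, using sklearn's LabelEncoder and MinMaxScaler (which is what I'm actually using; the vectorize helper and the df DataFrame are illustrative names, not my production code):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def vectorize(df, features):
    """Encode categorical columns as integers; scale numeric columns to [0, 1]."""
    X = pd.DataFrame(index=df.index)
    for column, method in features:
        if method == 'categorize':
            X[column] = LabelEncoder().fit_transform(df[column].astype(str))
        elif method == 'normalize':
            X[column] = MinMaxScaler().fit_transform(df[[column]]).ravel()
    return X

(Fitting encoders inline like this is fine for experimentation; once I add a train/test split, the encoders should be fit on the training set and reused.)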

nn design

model = torch.nn.Linear(len(features), 1)
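For context, a minimal loop in the shape of what I'm running (a sketch only: the Adam optimizer, the learning rate, and the X_tensor / y_tensor names are placeholders, not necessarily what I used):

import torch

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer and lr assumed

for epoch in range(300000):
    optimizer.zero_grad()
    predictions = model(X_tensor)          # shape: (n_samples, 1)
    loss = loss_fn(predictions, y_tensor)  # y_tensor shape: (n_samples, 1)
    loss.backward()
    optimizer.step()
    if epoch % 10000 == 1:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")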

My train statistics

Epoch 290001, Loss: 0.0001

Learned parameters:

Weight: Parameter containing:
tensor([[-1.8386e-03, -1.9079e-05, -5.7763e-01, -6.2332e-01, -1.9230e-02,
         -2.4756e-01,  5.4674e-01,  1.7575e-03, -2.0491e-04,  2.1780e-02,
         -3.5508e-03, -7.6988e-02]], requires_grad=True)

Bias: Parameter containing:
tensor([0.0512], requires_grad=True)

My prediction accuracy - in some areas, surprisingly OK, in others, really bad.

Product Code            Target      Prediction     Error
--------------------------------------------------------
Shell               0.02000000      0.07133483  -256.67%
QueueService        1.84000000      1.85449263    -0.79%
Instances        3651.63000000   3662.53707219    -0.30%
LongTermStorage    29.15300000     15.54393589    46.68%
Storage           194.82900000    202.10360763    -3.73%
SNS                 1.31500000      1.17508578    10.64%
VPC               116.08500000    114.49740121     1.37%

As I take a step back, I’m glad for these things:

  • More comfortable with pandas, NumPy, and PyTorch
  • Getting a model to work with my data
  • Starting to understand the right shapes you should see in and out of model initialization, training, testing, etc

I have these open questions to explore:

  • What is my overall model design strategy?
    • Does this data lend itself better to a decision tree?
  • What is my data strategy?
    • How could I better manage the categorical data in the nn? I hear a lot about Embeddings, but most of it is in the context of NLP - how should I think about managing highly categorical data to predict a continuous scalar in an nn?
  • How do I develop a model when I’m not sure what the sensitivity is to any given feature?
    • It was interesting to add random features from my dataset and see some predictions increase or decrease in accuracy. What's the systematic way to gauge the importance of a feature in a prediction when you have so many? (One candidate is sketched after this list.)
  • Products & Libraries
    • What libraries are people using to solve which problems? It's hard to know whether you should go with pure PyTorch or use scikit-learn for certain things.
    • I'm hearing PyTorch is somewhat favored in academic and research settings, and TensorFlow has a strong base in the enterprise. For someone looking to apply AI/ML in the business world, is this accurate to any extent, or important?
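One candidate for the feature-importance question above is permutation importance: shuffle a single feature's column, re-score the model, and see how much the loss degrades. A minimal sketch against the linear model above (X_tensor / y_tensor are placeholder names for the vectorized features and targets):

import torch

def permutation_importance(model, X, y, loss_fn=torch.nn.MSELoss()):
    """Per-feature loss increase when that feature's column is shuffled."""
    with torch.no_grad():
        baseline = loss_fn(model(X), y).item()
        importances = []
        for j in range(X.shape[1]):
            X_shuffled = X.clone()
            X_shuffled[:, j] = X[torch.randperm(X.shape[0]), j]
            importances.append(loss_fn(model(X_shuffled), y).item() - baseline)
    return importances  # a bigger loss increase means the feature matters more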

I have these basics to build into my code:

  • Data splitting & validation

Thanks!

How are you doing the encoding? Is it an enumerated set of integers, or are you using one-hot coding?

Enumerated integers do not work very well for categories, because a list of integers implies a similarity between adjacent values. For example, labels ‘1’ and ‘3’ appear to be more different (3 - 1 = 2) than labels ‘1’ and ‘2’.

Using one-hot coding addresses this issue.
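A tiny illustration of the difference, using scikit-learn (the labels are made up):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(['Storage', 'VPC', 'SNS']).reshape(-1, 1)

# Enumerated integers (SNS=0, Storage=1, VPC=2) imply that SNS is "closer"
# to Storage than to VPC, which is meaningless for these categories.
print(LabelEncoder().fit_transform(labels.ravel()))     # [1 2 0]

# One-hot: every pair of categories is equally distant.
print(OneHotEncoder().fit_transform(labels).toarray())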

I'm using

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

respectively for encoding and normalization, based on need.

I know I could do one-hot for the product code that has ~26 values, but I assumed that would not work well for usageType, which has ~600 uniques.

thoughts?

I agree.

So I see a lot of discussion about using an Embedding layer for this type of scenario, but most of the guidance is in the context of NLP solutions. Is this an appropriate design approach when you want to incorporate a categorical feature with this many unique values? Note that I also suspect that only a handful of the 600 actually factor in, mostly in the case of network transport. So it feels like there's a role for some form of regularization to play here too, but my understanding is that L1 regularization adjusts for the entire feature, not just elements within it. So maybe not the right tool for the job…

OK - my reading reveals this:

Embeddings

  • Reduce an explosion of dimensions, which complicates the model
  • Provide a “dense vector” with good information density that can be utilized for training (as opposed to one-hot, which can produce massive arrays with few nonzero values, i.e. sparse)

Visualization

  • You can utilize techniques like t-SNE or PCA to project embeddings into 2D or 3D space to better understand what's going on (a quick sketch follows below)
  • The general guideline is to tweak the embedding output dimension and measure performance
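For example, a quick sketch of the PCA route, assuming a trained Keras model whose embedding layer was built with the name usage_type_embedding (the layer name and the model variable are placeholders):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Embedding weight matrix: shape (num_categories, output_dim)
embedding_weights = model.get_layer('usage_type_embedding').get_weights()[0]

# Project each category's embedding vector down to 2D
points = PCA(n_components=2).fit_transform(embedding_weights)

plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title('usage_type embeddings (PCA projection)')
plt.show()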

Model Design with Embeddings

  • This example utilizes Keras. I'm using PyTorch, but the same rules apply.
from tensorflow import keras
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten

# Assuming 'categorical_feature' is an input with ~600 unique values,
# integer-encoded as 0..599
input_layer = Input(shape=(1,))  # input shape for a single categorical feature

# Embedding layer (adjust output_dim as needed)
embedding_layer = Embedding(input_dim=600, output_dim=64)(input_layer)

# Flatten the embedding output
flatten_layer = Flatten()(embedding_layer)

# other hidden layers would go here

# Output layer for a basic numeric scalar prediction
# (connect to the last hidden layer if any were added above)
output_layer = Dense(1)(flatten_layer)

model = keras.Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='mse', optimizer='adam')  # for example
model.fit(X_train, y_train)  # plus epochs, batch_size, validation data, etc.

Going to try. Will also just one-hot the other feature in question for some fun.

Seems promising.


Muuuuuuuch better model performance.

Product Code              Target      Prediction     Error
----------------------------------------------------------
Cloud Instances    8133.91659038   8072.04064769     0.76%
k8s                   0.20000000      0.19369605     3.15%
Long Term Storage    29.15300000     28.98991119     0.56%
Storage             207.80831000    209.69635525    -0.91%
SNS                   1.31500000      1.49027397   -13.33%
SWF                  13.62500000     13.51745847     0.79%
StateMachine         14.81800000     14.81933378    -0.01%
VPC                 129.15000000    129.62079291    -0.36%

The model design - still validating and testing, but it runs. Keras is much friendlier than PyTorch.

      product_code_num_uniques = X[xn['product_code']].nunique()
      usage_type_num_uniques = X[xn['usage_type']].nunique()

      # Define embedding dimensions
      product_code_output_dims = math.ceil(product_code_num_uniques * 0.5)
      usage_type_output_dims = math.ceil(usage_type_num_uniques * 0.5)

      print(f"product_code_num_uniques {product_code_num_uniques}")
      print(f"usage_type_num_uniques {usage_type_num_uniques}")
      print(f"product_code_output_dims {product_code_output_dims}")
      print(f"usage_type_output_dims {usage_type_output_dims}")

      # debug: inspect the raw array shape (custom helper)
      shape("product_code", X[xn['product_code']].values)

      # Input layers for categorical features
      # https://www.youtube.com/watch?v=oXMEeGrAuk0
      product_code_input = keras.Input(shape=(1,), name="product_code_input")
      usage_type_input = keras.Input(shape=(1,), name="usage_type_input")

      # Embedding layers
      product_code_embedding = layers.Embedding(
          input_dim=product_code_num_uniques,
          output_dim=product_code_output_dims
      )(product_code_input)

      usage_type_embedding = layers.Embedding(
          input_dim=usage_type_num_uniques,
          output_dim=usage_type_output_dims
      )(usage_type_input)

      # Flatten the embeddings
      product_code_flat = layers.Flatten()(product_code_embedding)
      usage_type_flat = layers.Flatten()(usage_type_embedding)

      # Concatenate the flattened embeddings
      concatenated = layers.Concatenate()([product_code_flat, usage_type_flat])

      # Output layer (a single neuron for a numeric scalar output)
      output = layers.Dense(1, activation='linear')(concatenated)

      # https://www.tensorflow.org/api_docs/python/tf/keras/Model
      model = keras.Model(
          inputs=[product_code_input, usage_type_input],
          outputs=output)

      # Compile the model
      model.compile(optimizer='adam', loss='mse')

Confusing warning: keras/src/models/functional.py:225: UserWarning: The structure of `inputs` doesn't match the expected structure: ['product_code_input', 'usage_type_input']. Received: the structure of inputs={'product_code_input': '*', 'usage_type_input': '*'}
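My reading of it: the model was built with a list of inputs (inputs=[product_code_input, usage_type_input]), but fit received a dict keyed by input names. Passing the arrays as a list in the same order as the model's inputs should match the expected structure. A sketch, with the encoded array names assumed:

model.fit(
    [product_code_encoded, usage_type_encoded],  # same order as the model's inputs list
    y_train,
    epochs=100,
    batch_size=1024,
)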

100 epochs with 1024 batch size (!)

Product Code              Target      Prediction     Error
----------------------------------------------------------
Cloud Instances       0.04680206      0.04677766     0.05%
k8s                   0.10000000      0.09938477     0.62%
Long Term Storage     0.05449159      0.05449689    -0.01%
Storage               0.02196473      0.02192950     0.16%
SNS                   0.01878571      0.01831898     2.48%
SWF                   0.06550481      0.06573361    -0.35%
StateMachine          0.05163066      0.05149092     0.27%
VPC                   0.03037394      0.03026936     0.34%