When starting a new NN model for image classification, how does one begin building the model in terms of the number of layers, the number of units in each layer, and the respective activation functions?
Starting a Neural Network model from scratch can feel a bit overwhelming due to the sheer number of tunable parameters. For image classification, the standard approach is to lean on Convolutional Neural Networks (CNNs) rather than standard dense networks, as they are specifically designed to understand spatial relationships.
1. "Standard" Starting Architecture for Image Classification
A typical baseline architecture for image classification is composed of two main stages: feature extraction and classification.
A. Convolutional Layers (The Feature Extractors)
These layers scan the image to identify edges, textures, and shapes.
- Number of Layers: Start with 2 to 4 convolutional blocks. Each block typically consists of a Convolution layer, an Activation layer, and a Pooling layer.
- Filters (Units): Start small and double the number as you go deeper (e.g., 32 → 64 → 128).
- Reasoning: As the network deepens, the image resolution shrinks due to pooling, but the complexity of the features increases. Adding more filters allows the model to capture this higher-level complexity within the input images.
- Kernel Size: 3 × 3 is a good default. It is computationally efficient while still capturing local patterns effectively.
B. Dense Layers (The Classifier)
These layers take the features extracted by the convolutional layers and make the final classification decision.
- Number of Layers: Usually 1 or 2 hidden dense layers, placed after flattening the output of the final convolutional block.
- Units: Common choices are 128, 256, or 512.
- Note: If this number is too high, the model is likely to memorize the training data rather than learning generalizable patterns (overfitting).
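To make the shape arithmetic behind these defaults concrete, here is a minimal pure-Python sketch (not from the thread; it assumes a hypothetical 128 × 128 RGB input, 3 × 3 "valid" convolutions, and 2 × 2 pooling) that tracks how the feature map shrinks while the filter count doubles:

```python
# Track feature-map shape through a small CNN: 3x3 "valid" convs, 2x2 max pools.
def conv_output(size, kernel=3, padding=0, stride=1):
    # Standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

def pool_output(size, window=2):
    # Non-overlapping pooling halves the spatial size (integer division)
    return size // window

size, channels = 128, 3          # hypothetical 128x128 RGB input
for filters in [32, 64, 128]:    # double the filters each block
    size = conv_output(size)     # 3x3 convolution, no padding
    size = pool_output(size)     # 2x2 max pooling
    channels = filters
    print(f"after block with {filters} filters: {size}x{size}x{channels}")
```

With these assumptions the shapes go 128 → 63 → 30 → 14, so the flattened output of the last block (14 × 14 × 128 = 25,088 values) is what the dense classifier layers would receive.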
2. Choosing Activation Functions
Activation functions introduce non-linearity, enabling the model to learn complex data patterns.
| Layer Type | Recommended Activation | Why? |
|---|---|---|
| Hidden Layers | ReLU (Rectified Linear Unit) | The industry standard for hidden (intermediate) layers. It outputs the input value if the input is positive and zero otherwise. |
| Output (Binary Classification) | Sigmoid | Squashes output strictly between 0 and 1. |
| Output (Multi-class Classification) | Softmax | Turns the outputs for multiple classes into a probability distribution that sums to 1. |
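As a quick illustration of the three activations in the table, here is a small pure-Python sketch (scalar and list versions for readability; real frameworks apply these element-wise to tensors):

```python
import math

def relu(x):
    # Outputs the input if positive, zero otherwise
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    # Exponentiate (shifted by the max for numerical stability),
    # then normalise so the outputs sum to 1
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

print(relu(-2.0), relu(3.0))       # 0.0 3.0
print(sigmoid(0.0))                # 0.5
print(sum(softmax([2.0, 1.0, 0.1])))  # 1.0 (a valid probability distribution)
```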
Let me know if you need any further explanation or clarification.
Thanks. Lots of great advice there!
Please see keras_tuner as well.
You might also get ideas from reviewing architectures listed on the web at places like this…
Your choice will also be influenced by non-functional requirements such as where the model needs to run, what throughput you need to achieve, and the cost/benefit of your truth table.
Thank you!
What other things do I need to consider when building a model from scratch, developing it and testing it as a ML Engineer in industry? For example, I have heard of JAX, TF-Serving and XLA/MLIR in one job description.
You don't have to always build a model from scratch; depending on your use-case, sometimes it may make sense to fine-tune an existing off-the-shelf model which has a use case similar to yours. Building a model generally requires a fair bit of experimentation and testing with varied datasets, architectures and hyper-parameters to get it to work. You have to try different combinations and see what works for you.
Talking about JAX at a high level, it is a deep learning library like PyTorch and TensorFlow. TF-Serving is a system for deploying pre-trained models (primarily TensorFlow) in production for inference. XLA, at a high level, optimizes machine learning (ML) models, accelerating training and inference.
OK, thanks.
Are there any other tools or libraries I should be aware of when building and testing a production NN model for an employer as an employee ML Engineer who has never worked for an employer in this field before?
Pick up any one of the deep learning libraries to master, these days PyTorch is the one that is en vogue. Other options are TensorFlow and JAX.
For managing the ML lifecycle, including experiment tracking, model versioning, and deployment, you can consider MLflow, an open-source platform built for exactly that.
For model serving and deployment you can consider one of the cloud-based managed services like AWS SageMaker / Bedrock (LLMs), Google's Vertex AI, or Azure ML / Azure OpenAI Service (LLMs).
There is also a plethora of other options and tools out there for managing the entire ML lifecycle.
You can also consider KubeFlow / Seldon Core, if you have some Kubernetes background.
Thanks. Have you any experience building, developing and testing a NN model in a working environment for an American employer?
I do have experience building, developing and testing a NN model in a working environment but not for an American employer.
Similarly, I would try to find a model which has a similar application to yours; no need to reinvent the wheel!
What were key takeaways for you from building your first model for an employer?
Please explain what these different layer types are:
- a Convolution layer
- an Activation layer
- a Pooling layer.
I will suggest that the most important feature of your first model is not a feature of your first model. It's a feature of the business question you are being asked to address in your first model. If you are being asked to colonize Mars, or end cancer, your project will fail. Even if you are being asked something that seems specific and achievable in a reasonable time, say, reduce hospital readmissions, you cannot succeed. That is not something a machine learning model can accomplish. What you might be able to do is predict likelihood of hospital readmission for a given patient, or the likelihood of sepsis onset, or structural failure of a part. It presumes that sufficient training data exists and you can quantify what success means.
When I was doing machine learning projects for real US employers with US and International customers, the biggest risk to project success was unrealistic expectations. Don't start out trying to solve an extraordinarily hard problem at 100% accuracy, something exceeding expert human performance under the best conditions. Rather, be willing to achieve a provable, modest success with a modest amount of resources, then iterate. There is typically more than one way to attack a well-defined business problem, but there are no technical solutions to one that is poorly defined. So pick a technology or platform, or work with the one your customer is already building on top of. Then, as suggested above, building incrementally off of a proven success is likely a faster path to value than big bang invention of a totally novel approach to a gnarly problem.
Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. By mimicking the way the human visual cortex processes information, they break down complex images into manageable, hierarchical features.
Below is a brief explanation of what these different layer types are:
1. Convolutional Layer
The Convolutional Layer is the engine of the network. Instead of processing an entire image as one flat list of pixels, it focuses on small, local regions to preserve spatial relationships.
- The Mechanism: A small matrix, known as a filter or kernel (e.g., 3 × 3), slides across the input image. At each stop, it performs a mathematical operation (element-wise multiplication and summation) to create a Feature Map.
- Hierarchical Learning:
  - Early Layers: Detect simple patterns like horizontal edges or color gradients.
  - Deeper Layers: Combine simple patterns to recognize complex shapes, such as eyes, wheels, or entire faces.
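A toy illustration of the sliding-kernel mechanism, in pure Python (the vertical-edge kernel and tiny image below are hypothetical examples; frameworks do the same computation, just vectorized):

```python
def convolve2d(image, kernel):
    # "Valid" cross-correlation: slide the kernel over the image,
    # multiplying element-wise and summing at each position.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

# A tiny "image": bright left half, dark right half
image = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
# A classic vertical-edge detector kernel
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
print(convolve2d(image, kernel))  # [[0, 3, 3]]
```

The feature map responds strongly (value 3) only where the kernel straddles the bright-to-dark boundary, which is exactly the "edge detection" behaviour described for early layers.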
2. Activation Layer (ReLU)
The Activation Layer acts as a gatekeeper, deciding which information is important enough to pass forward.
- The "Why": Real-world data is messy and non-linear. Without an activation layer, the entire network would behave like one giant linear equation, making it unable to learn complex patterns.
- ReLU (Rectified Linear Unit): This is the industry standard. It follows a simple rule:
  - If the input is negative, it becomes 0.
  - If the input is positive, it stays the same.
Other Types: * Sigmoid/Tanh: Often used in specific layers for probability.
- Softmax: Typically the final layer used to output multiclass probabilities.
3. Pooling Layer
The Pooling Layer is responsible for "downsampling." It shrinks the image dimensions to make the data more manageable.
- Max Pooling: The most common method. It looks at a window (e.g., 2 × 2) and retains only the highest value, discarding the rest.
- Key Benefits:
  - Efficiency: Reduces the number of parameters and computation time.
  - Translation Invariance: Helps the network recognize an object even if it is slightly tilted or shifted.
  - Prevents Overfitting: By simplifying the data, the model focuses on the most prominent features rather than noise.
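Max pooling can be sketched in a few lines of pure Python (a 2 × 2 window with stride equal to the window size, the common default):

```python
def max_pool(feature_map, window=2):
    # Keep only the largest value in each non-overlapping 2x2 window
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - window + 1, window):
        row = []
        for j in range(0, w - window + 1, window):
            row.append(max(
                feature_map[i + a][j + b]
                for a in range(window) for b in range(window)
            ))
        pooled.append(row)
    return pooled

fmap = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
print(max_pool(fmap))  # [[6, 5], [7, 9]] -- 4x4 shrinks to 2x2
```

Note how a small shift of the strongest value within a window would leave the output unchanged; that is the translation invariance mentioned above.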
The CNN Workflow
A typical CNN is built by stacking these layers in a repetitive cycle:
| Step | Layer | Purpose |
|---|---|---|
| 1 | Convolution | Feature Extraction (Finding the patterns). |
| 2 | Activation | Introducing Non-linearity (Allowing for complexity). |
| 3 | Pooling | Spatial Reduction (Simplifying the data). |
By repeating this "sandwich" many times, the network evolves from "seeing" pixels to "understanding" objects.
Thanks thatâs really helpful.
I am a complete beginner to this but I have completed the MLS with 100% in all grades exercises and I am part through the DLS.
My aim is to complete the DLS, then demonstrate my knowledge from these courses by building an image classifier NN trained on a small input dataset of 2000 plant leaf pictures for 5 different plant species, see if I can train it to predict one of those plant leaf species, and use this exercise to demonstrate to employers a real-world example of multi-class image classification.
Is 2000 images as an input training dataset large enough? I will actually be using 1000 original different plant leaf images but doubling the size of the total input training dataset by applying data augmentation by flipping each image from left to right.
Should I also perform z-score normalisation on each input pixel feature for every image?
Always hard to apply a simple rule to this question. My thought is it is enough to do a simple model with decent results. However, donât overlook that you need to split your data between train and test and that you have 5 classes. So you are really talking about a few hundred of each class to train on. That is very small for a real world example. It also means you can likely hold the entire training set data structure in memory at runtime, which simplifies your life but avoids mastering another real world challenge of handling large data. Following my suggestion above, maybe start with 2K but then try to scale.
Also, when you get to that point, donât overlook class imbalance. 400 of each class trains differently than 1800 of one class and 50 each of the rest.
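One common remedy for class imbalance is to weight the loss by inverse class frequency. Here is a pure-Python sketch of the idea (the function name is my own; the formula mirrors the widely used "balanced" heuristic, n_samples / (n_classes × class_count)):

```python
def class_weights(labels):
    # Inverse-frequency weights: rare classes get larger weights,
    # so the loss penalises mistakes on them more heavily.
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Balanced classes: every weight is 1.0
print(class_weights(["oak"] * 400 + ["elm"] * 400))

# Imbalanced classes: the rare class is up-weighted
print(class_weights(["oak"] * 1800 + ["elm"] * 200))  # oak ~0.56, elm 5.0
```

Frameworks typically accept such a dictionary directly (e.g., a per-class weight passed to the training loop), so mistakes on the 200-image class count roughly nine times as much as mistakes on the 1800-image class.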
EDIT
Here's a related thread with some thoughts about starting from scratch vs starting with a known baseline…
In the world of DL, 2000 images is considered a small dataset. However, "small" doesn't mean "impossible". With 5 species, you have 400 images per class (200 original + 200 flipped). If you were training a massive architecture from scratch, this wouldn't be enough; the model would likely just memorize your training set (overfit). Flipping is a great start for augmenting your dataset, but don't stop there! Since leaves can be at any angle or lighting, try adding random rotations, brightness adjustments, etc. In short, try to generate more synthetic data.
Regarding whether you should also perform z-score normalization on each input pixel feature for every image: the short answer is not usually for images. For images, we typically use a simpler approach such as min-max scaling or mean subtraction. While you can use z-score, simply scaling pixels to the 0–1 range is computationally faster and usually sufficient for the activation functions (like ReLU) used in CNNs.
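Both suggestions, horizontal flipping and simple 0–1 scaling, can be sketched in a few lines of pure Python (frameworks provide preprocessing layers for this, but the underlying math is just):

```python
def flip_horizontal(image):
    # Left-right flip: reverse each row of pixels
    return [row[::-1] for row in image]

def min_max_scale(image):
    # Map 0-255 pixel values into the 0-1 range
    return [[pixel / 255.0 for pixel in row] for row in image]

image = [
    [0, 128, 255],
    [64, 32, 16],
]
print(flip_horizontal(image))    # [[255, 128, 0], [16, 32, 64]]
print(min_max_scale([[0, 255]])) # [[0.0, 1.0]]
```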
You can also showcase other real-world elements in your project, like a validation split, a confusion matrix, etc.