In the C2W2 labs Softmax and Multiclass, we create a dataset using make_blobs.
Until now, we’ve been defining the dataset by giving X, y value pairs or, e.g., pictures representing zeroes and ones. Here we give ‘centers’ instead and generate a dataset. What are ‘centers’, and how is the dataset generated?
What does [[-5, 2], [-2, -2], [1, 2], [5, -2]] represent?
Thank you!
Hi @Delacraft
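This is for creating toy datasets. In the lab, make_blobs generates 2000 samples in total, split evenly across the provided centers, with a standard deviation of 1.0 (and a random state of 30). Each entry in the centers list is the coordinate of one cluster's center, so [[-5, 2], [-2, -2], [1, 2], [5, -2]] defines four clusters in a two-feature space. Each point in the dataset is generated by sampling from a Gaussian (normal) distribution centered at one of these coordinates, and its label y is the index of that center (0 to 3).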
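To make this concrete, here is a minimal sketch of the call using the values mentioned above (2000 samples, standard deviation 1.0, random state 30). The variable names are my own and may differ from the lab's. It prints the shape of X, the number of points per class, and the mean of each cluster, which should land close to its center.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Each inner list is the (x0, x1) coordinate of one cluster center.
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]

# n_samples is the TOTAL number of points; it is split evenly across the centers.
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

print(X.shape)         # (2000, 2): 2000 points, 2 features each
print(np.bincount(y))  # [500 500 500 500]: 500 points per center

# The mean of the points labeled k should sit close to centers[k].
for k, center in enumerate(centers):
    cluster_mean = X[y == k].mean(axis=0)
    print(k, center, cluster_mean.round(2))
```

So X holds the generated coordinates and y holds the class labels, which is the same X, y pair format used in the earlier labs.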
The image below shows an example with 3 centers (e.g. [-10, 8], [6, 8], and [8, -3]) and a smaller standard deviation.
Hope it helps! Feel free to ask if you need further assistance.
Thanks @Alireza_Saei ! What is a random state? How random is 30?
Here we generate this example set because we are learning about ML, and I assume it’s an easy way of getting a dataset without having to collect the data, label it, etc. Outside of the classroom, in what cases do we generate this kind of artificial dataset? What are they used for?
You’re welcome! Happy to help.
Random state is just a seed number that controls the randomness. The value itself is not special: using 30, 42, or any other fixed integer ensures reproducibility (you get the same random data every time).
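A quick way to see this for yourself is the sketch below (variable names are my own). It calls make_blobs twice with the same seed and once with a different one, then compares the results.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]

# Same seed twice, then a different seed.
X_a, _ = make_blobs(n_samples=10, centers=centers, random_state=30)
X_b, _ = make_blobs(n_samples=10, centers=centers, random_state=30)
X_c, _ = make_blobs(n_samples=10, centers=centers, random_state=42)

same_seed_match = np.array_equal(X_a, X_b)
different_seed_match = np.array_equal(X_a, X_c)

print(same_seed_match)       # True: identical data on every run
print(different_seed_match)  # False: a different seed gives different points
```

The data for seed 30 is no more or less random than for seed 42. The seed only fixes which sequence of random numbers you get.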
Exactly.
Outside class, we use synthetic datasets to:
- Test/debug models quickly
- Benchmark algorithms
- Simulate rare or controlled scenarios when real data is hard to get
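As an illustration of the first point, here is a minimal sketch of a sanity check on synthetic blobs. It uses scikit-learn's LogisticRegression as a stand-in classifier, which is my assumption and not the neural network from the lab. Because the clusters are well separated, a working training pipeline should reach high accuracy. A low score would point to a bug in the code and not to a problem with the data.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

# Hold out a quarter of the points to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in model (assumption): any multiclass classifier would do here.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```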
Hope it helps!