Data for Machine Learning algorithms

If our data contains important features that are not numbers, how can we input them to the machine? For example, location in house pricing, or risk measures in finance that are expressed in words rather than numbers.

Generally you create a dictionary of all of the labels, and create a “one-hot” logical vector from them. This means for each example, one of the labels will be ‘1’ and all the others will be ‘0’.

Geography is slightly different. You can encode a location by using the centroid of the geographic region, encoded as floating point values for the latitude and longitude.
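Here is a minimal sketch of the centroid idea in Python. The district names, the coordinates, and the `encode_location` helper are all made up for illustration; in practice you would compute the centroids from your own region boundaries.

```python
# Hypothetical centroids as (latitude, longitude) in decimal degrees.
# These values are invented for illustration only.
DISTRICT_CENTROIDS = {
    "north": (40.80, -73.95),
    "central": (40.75, -73.98),
    "south": (40.70, -74.01),
}


def encode_location(district):
    """Return [latitude, longitude] as floating point features for a district."""
    lat, lon = DISTRICT_CENTROIDS[district]
    return [float(lat), float(lon)]


# Each example's district name becomes two numeric feature columns.
features = [encode_location(d) for d in ["south", "north"]]
print(features)
```

Each location then contributes two numeric columns (latitude and longitude) to the data set.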

So, for example, if we have 10 districts, must I rate them from 0-10? Also, can you explain what a “one-hot” logical vector means? :sweat_smile:

Let’s say you want to classify fruit.
In your universe, there are only three types: apples, grapes, and dates.

So you have three labels “apple”, “grape”, “date”. You have to convert these to numbers, so your model can do some math on them.

Do not encode them as a sequence of integers. So, do not use fruit = 1 for “apple”, fruit = 2 for “grape”, and fruit = 3 for “date”.

The reason this doesn’t work is that it implies an order and a distance between the labels that do not really exist. Using an enumerated encoding would make it appear that apple and grape have a difference of 1, but apple and date have a difference of 2. That’s misleading information, so avoid using that method.
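To make the problem concrete, here is a small sketch that compares the gaps between labels under the integer encoding with the gaps under one-hot vectors. The helper names `int_gap` and `one_hot_gap` are just for illustration, and it assumes Python 3.8 or later for `math.dist`.

```python
import math

# The integer encoding to avoid, and the one-hot alternative.
integer_codes = {"apple": 1, "grape": 2, "date": 3}
one_hot = {
    "apple": (1, 0, 0),
    "grape": (0, 1, 0),
    "date": (0, 0, 1),
}


def int_gap(a, b):
    """Absolute difference between two labels under the integer encoding."""
    return abs(integer_codes[a] - integer_codes[b])


def one_hot_gap(a, b):
    """Euclidean distance between two labels under the one-hot encoding."""
    return math.dist(one_hot[a], one_hot[b])


# Integer codes suggest that date is "further" from apple than grape is.
print(int_gap("apple", "grape"), int_gap("apple", "date"))

# One-hot vectors keep every pair of labels the same distance apart.
print(one_hot_gap("apple", "grape"), one_hot_gap("apple", "date"))
```

With the integer codes the two gaps differ (1 versus 2), while with one-hot vectors every pair of distinct labels is equally far apart, so no false ordering is introduced.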

Instead, you use a one-hot encoding, where only one value is set to 1 (for ‘true’), and the others are 0 (for ‘false’).

If you have an image of an apple, its labels would be:
apple = 1
grape = 0
date = 0

When you encode this in the data set for training, you would have three columns - one for each of your logical variables.
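As a rough sketch, the three columns could be built like this in plain Python. The `one_hot_row` helper and the example list are hypothetical; libraries such as pandas or scikit-learn offer their own one-hot utilities, which are not shown here.

```python
# The full set of labels; the order fixes the order of the columns.
LABELS = ["apple", "grape", "date"]


def one_hot_row(label):
    """Return one row with a 1 in the column matching `label`, else 0."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return [1 if label == name else 0 for name in LABELS]


# A made-up list of labelled examples.
examples = ["apple", "date", "grape", "apple"]
rows = [one_hot_row(label) for label in examples]

print(LABELS)
for row in rows:
    print(row)
```

The same approach applies to the district question: with 10 districts you would have 10 columns, and each example would have a 1 in exactly one of them.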