Feature Engine

An introduction to feature engineering, including some methods in Python.

Feature

Reasons to do feature selection

  1. A large number of features can lead to long training times.
  2. Increasing the number of features increases the risk of overfitting.

Feature selection also helps reduce the dimension of the dataset without losing the main information.

Feature Importance

Feature importance assigns an importance score to each feature, and can usually be computed directly on the raw data.
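As a quick illustration (a minimal sketch, assuming a feature DataFrame x and a target y), a tree ensemble such as a random forest exposes one importance score per feature:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Fit a forest and read off one importance score per feature
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)
importances = pd.Series(forest.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))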

P-value

The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in the data. We normally take p-value = 0.05 as the significance level.

import pandas as pd
import numpy as np
import statsmodels.api as sm

x = pd.DataFrame()  # feature df
y = pd.DataFrame()  # target df
# Adding a constant column of ones, mandatory for the sm.OLS model
x_1 = sm.add_constant(x)
# Fitting the sm.OLS model
model = sm.OLS(np.array(y), np.array(x_1)).fit()
# Prints the p-value of each feature in this model as an array
model.pvalues

Write a function to select features based on their p-values:

# x is the feature df
# y is the target df
# sl is the significance level
def backwardElimination(x, y, sl):
    cols = list(x.columns)
    pmax = 1
    while len(cols) > 0:
        X_1 = x[cols]
        X_1 = sm.add_constant(X_1)
        model = sm.OLS(y, X_1).fit()
        p = pd.Series(model.pvalues.values[1:], index=cols)
        pmax = max(p)
        # .idxmax returns the index of the maximum
        feature_with_p_max = p.idxmax()
        if pmax > sl:
            cols.remove(feature_with_p_max)
        else:
            break
    selected_features_BE = cols
    print(selected_features_BE)

backwardElimination(x, y, 0.05)

F-test

The F-test is a statistical test of whether there is a significant difference between two models. The least-squares error is calculated for each model and compared.

Here we introduce the scikit-learn package and use the F-test to find the K best features:

For continuous targets (regression):

import sklearn.feature_selection
# x is the feature df (n_samples * n_features), y is the target (n_samples)
sklearn.feature_selection.f_regression(x, y)
# output is a set of F-scores and a p-value for each F-score

For classification targets:

sklearn.feature_selection.f_classif(x, y)  # same usage as f_regression
sklearn.feature_selection.chi2(x, y)  # if x is sparse, only chi2 keeps it sparse

The F-score works well for linear relationships.

Mutual information

If x and y are independent, the MI is 0; higher values mean stronger dependency, for example when x is a deterministic function of y.
More detail is in the sklearn mutual information documentation.
MI works well for non-linear relationships.

# x is the feature df, y is the target df
# For discrete_features: if 'auto', it is set to False for dense x and True for sparse x
# Higher n_neighbors values reduce the variance of the estimation
sklearn.feature_selection.mutual_info_regression(x, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)
sklearn.feature_selection.mutual_info_classif(x, y)

# output is the estimated MI between each feature and the target

Variance threshold

This method only looks at the feature itself: if a feature does not vary much, it has poor predictive power.

sklearn.feature_selection.VarianceThreshold
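A minimal usage sketch (the 0.01 threshold is just an assumed example value):

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)  # drop features whose variance is below 0.01
x_reduced = selector.fit_transform(x)
selector.get_support()  # boolean mask of the features that were kept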

Some tips to check

First, check whether the magnitude of the data matters, or whether we only need to know its sign (positive/negative).

For data that spans several orders of magnitude, or when the output depends on the scale of the input, such as a linear function like $2x - 1$, k-means clustering, nearest neighbors (kNN), the radial basis function (RBF) kernel, and other methods that use Euclidean distance, it is often a good idea to normalize the features so that the output stays at an expected scale.

Models based on logical functions are not sensitive to input scale; examples are decision trees, gradient boosted machines, and random forests. However, if the input scale grows over time, features may drift outside the range the tree was trained on; then we may need to rescale the input or use the bin-counting method.

It is also important to check the distribution of the numeric features. For linear regression we assume the prediction errors follow a Gaussian distribution; when the prediction target spans several orders of magnitude, this assumption may be violated. In that case we may need a log transformation (a power transform).

Power transformation

A simple generalization of both the square root transform and the log transform is the Box-Cox transform:

$$\hat{x} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$

To use the Box-Cox transform the data must be positive; if it is not, we can add a fixed positive constant. We also need to choose $\lambda$, which can be found by maximum likelihood (the $\lambda$ that maximizes the Gaussian likelihood) or by Bayesian methods. The SciPy package includes these:

from scipy import stats
# log transform (Box-Cox with lambda fixed at 0)
df_log = stats.boxcox(df, lmbda=0)

# let SciPy estimate lambda by maximum likelihood
df_boxcox, boxcox_lambda = stats.boxcox(df)

Feature Scaling

If the model is sensitive to the scale of the input, we need to scale the features.

Min-Max Scaling

$$\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Standardization (variance scaling)

$$\hat{x} = \frac{x - \text{mean}(x)}{\sqrt{\text{var}(x)}}$$

Note that applying these two methods to sparse features can make them dense, which dramatically increases the computation.

$\ell^2$ normalization

$$\begin{aligned} \hat{x} &= \frac{x}{||x||_2} \\ ||x||_2 &= \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} \end{aligned}$$
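All three scalings are available in scikit-learn; a short sketch (assuming a numeric feature DataFrame x):

from sklearn.preprocessing import MinMaxScaler, StandardScaler, normalize

x_minmax = MinMaxScaler().fit_transform(x)   # min-max scaling to [0, 1]
x_std = StandardScaler().fit_transform(x)    # standardization (zero mean, unit variance)
x_l2 = normalize(x, norm='l2')               # l2 normalization (row-wise by default)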

Interaction Features

An easy way to extend a linear model with interaction features is to add the product of every pair of features. But this is computationally expensive: the number of features grows from $O(n)$ to $O(n^2)$.

There are ways to deal with this: we can perform feature selection to keep only the most informative interaction features, or handcraft a few complex features.
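For example, scikit-learn's PolynomialFeatures can generate all pairwise interaction terms (a sketch, assuming a feature DataFrame x):

from sklearn.preprocessing import PolynomialFeatures

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
x_interact = interaction.fit_transform(x)  # original features plus all pairwise products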

Feature Selection

Feature selection tries to select the useful features in order to reduce the complexity of the model.

Filtering

We can compute a statistic for each feature, such as a p-value, the correlation, or the mutual information between the feature and the response variable. The disadvantage is that this method does not take the model into account, so the selected features may not suit the model.
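A minimal filtering sketch with scikit-learn, keeping the 10 features with the highest F-scores (the value of k is just an assumption):

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)
x_filtered = selector.fit_transform(x, y)  # keeps the 10 features with the highest F-scores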

Wrapper methods

This method scores each subset of features using the model itself. The advantage is that it will not delete features that are uninformative on their own but useful in combination. The disadvantage is that it is computationally expensive.
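Recursive feature elimination (RFE) in scikit-learn is one wrapper-style method; a sketch, assuming a linear model and a target of 10 features:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
x_wrapped = rfe.fit_transform(x, y)
rfe.support_  # boolean mask of the selected features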

Embedded methods

This method is part of the training process itself; for example, a decision tree must pick a feature to split on, and the $\ell_1$ regularizer adds a sparsity constraint to the model. Hence this method is specific to the model.
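For example, an $\ell_1$-regularized model can drive embedded selection through SelectFromModel (the alpha value is just an assumed example):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

sfm = SelectFromModel(Lasso(alpha=0.1))
x_embedded = sfm.fit_transform(x, y)  # keeps features whose Lasso coefficients survive the threshold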

Text Data

Bag of words

Bag of words (BoW) represents a document as a vector of word counts; each word becomes a feature.

The problem is that it may not preserve the meaning of the text. For example, “not bad” means “good”, but in BoW these two words are separate features.

Bag of n Grams

This is an extension of BoW: it slides a window of n words from the beginning to the end of the text and builds a count vector over these n-grams.

For example, take “I like cute cat”. With BoW (1-grams, or unigrams) the features are “I”, “like”, “cute”, “cat”. With 2-grams (bigrams) the features are “I like”, “like cute”, “cute cat”.

Bag-of-n-grams representations are usually sparser and larger, which means more computation and storage. A larger $n$ captures more information but costs more.

import pandas as pd
import json
from sklearn.feature_extraction.text import CountVectorizer

# Load the first 10000 records from the "data" JSON-lines file
js = []
with open('data.json') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))

review_df = pd.DataFrame(js)

# Create feature transformers for unigrams, bigrams, and trigrams.
# The default token pattern ignores single-character words, which is useful
# in practice because it trims uninformative words, but we explicitly include
# them in this example for illustration purposes.
bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')

# Fit the transformers and look at the vocabulary sizes
bow_converter.fit(review_df['text'])
words = bow_converter.get_feature_names()
bigram_converter.fit(review_df['text'])
bigrams = bigram_converter.get_feature_names()
trigram_converter.fit(review_df['text'])
trigrams = trigram_converter.get_feature_names()
print(len(words), len(bigrams), len(trigrams))
# 26047 346301 847545

Filtering the Text Features

Text data contains capitalized words, grammar, phrases, and stop words, so we need several methods to filter the text features.

Stopwords

Words such as “on”, “and”, and “a” do not affect the meaning of a sentence but still occupy positions in the feature space. NLP packages in Python such as NLTK include stop-word lists for many languages.

nltk.download()

In the stop-word list it is better to keep “didn’t” as a single token; otherwise it will be counted as the two words “didn” and “t”.
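A minimal sketch of removing stop words with NLTK (the token list is a made-up example):

import nltk
nltk.download('stopwords')  # only needed once
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["this", "movie", "is", "not", "bad"]
filtered = [t for t in tokens if t not in stop_words]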

Stemming

Words such as “dog” and “dogs”, or “run”, “ran”, and “running”, may be counted as different words even though they carry the same meaning for many analyses. Stemming is an NLP task that maps these different forms to the same token. Stemmers are language specific, so a stemmer built for English is not a universal method for all languages. For example, in NLTK:

import nltk

stemmer = nltk.stem.porter.PorterStemmer()
## try "running" and "ran"
stemmer.stem("running")
# 'run'
stemmer.stem("ran")
# 'ran' -- the Porter stemmer only strips suffixes, so irregular forms stay unchanged

However, sometimes distinct words such as “new” and “news” have different meanings yet can be mapped to the same stem; this is a disadvantage of the method.

Parsing and Tokenization

Parsing takes the structure or form of the original text into account. For example, if the text is an email, we should pay more attention to the header; without parsing, the words in the header would be treated as ordinary words and we might lose useful information.

Tokenization turns the string, a sequence of characters, into a sequence of tokens, each treated as a word. The tokenizer needs to know which characters mark the end of one token and the start of the next; for example, space characters or “#” delimiters can be good separators.
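A small sketch using NLTK's tokenizers (the sample text is made up):

import nltk
nltk.download('punkt')  # tokenizer models, only needed once
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization splits text into tokens. Each token is treated as a word."
sentences = sent_tokenize(text)  # split into sentences
tokens = word_tokenize(text)     # split into word-level tokens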

When working with sentences or paragraphs rather than whole documents, methods such as word2vec are a better fit.

Note that most strings are encoded as ASCII or Unicode.

Collocation and Phrase

Tokens give lists of words or n-grams, which capture less meaning than phrases. A phrase, or collocation, is a combination of two or more words that is conventionally used together.

A collocation carries more meaning than its separate words: in “strong tea”, “strong” no longer refers to physical strength. In contrast, “cute cat” means exactly the sum of its parts, so we do not consider it a collocation. Also, a collocation does not require the words to be adjacent: “play with the cat” contains the collocation “play cat”.

Collocation Extraction

One idea is to run a hypothesis test where

$H_0$: word A appears independently of word B $\Rightarrow p(B \mid A) = p(B \mid \neg A)$

$H_1$: word A changes the likelihood of seeing word B $\Rightarrow p(B \mid A) \neq p(B \mid \neg A)$

The statistic is the log-likelihood ratio:

$$\log\lambda = \log\frac{L(\text{Data}; H_0)}{L(\text{Data}; H_1)}$$

There is an assumption on the data: word generation follows a binomial distribution. For each position we toss a coin; if it lands heads we place this word, otherwise we insert some other word.

The process can be (a sketch using NLTK follows this list):

  • Compute the occurrence probability of each single word from its frequency
  • Compute the conditional probability $p(B \mid A)$ for all unique bigrams
  • Compute the log-likelihood ratio $\log\lambda$ for all unique bigrams
  • Sort the bigrams by likelihood ratio and pick those with the smallest values as features
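Rather than computing the likelihood ratio by hand, NLTK ships a collocation finder that scores bigrams with this statistic; a sketch, assuming `tokens` is a list of word tokens from the corpus:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore bigrams that appear fewer than 3 times
top_collocations = finder.nbest(bigram_measures.likelihood_ratio, 20)  # 20 best-scoring bigrams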

Tf-Idf

Common words like “is” and “the” appear many times but carry little meaning, whereas words like “dramatic” and “magnificently” help us understand the sentence, which makes them valuable. Next we discuss how to find them.

Tf-idf stands for term frequency–inverse document frequency. Rather than the raw count of each word in each document, tf-idf uses a normalized count: each word count is divided by the number of documents in which that word appears.

$$\begin{aligned} \text{bow}(w, d) &= \text{frequency of word } w \text{ in document } d \\ \text{tf\_idf}(w, d) &= \text{bow}(w, d) \cdot N / (\text{number of documents containing } w) \end{aligned}$$

Here bow stands for bag-of-words and $N$ is the total number of documents in the dataset. The factor $N / (\text{number of documents containing } w)$ is called the inverse document frequency. If the word $w$ appears in many documents, this factor is close to 1; if it appears in only a few documents, it is much higher.

Note that we can also take the $\log$ of this factor, which is just an alternative form: for words that appear in many documents it is close to 0, and for words that appear in few documents it is much larger.

from sklearn.feature_extraction import text
from sklearn import preprocessing as preproc

bow_transform = text.CountVectorizer()
X_train_bow = bow_transform.fit_transform(training_data["text"])
X_test_bow = bow_transform.transform(test_data["text"])
### show the size of the bag-of-words vocabulary
len(bow_transform.vocabulary_)

y_train = training_data["target"]
y_test = test_data["target"]

## Then the tf-idf features are
tfidf_transformer = text.TfidfTransformer(norm=None)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_bow)
X_test_tfidf = tfidf_transformer.transform(X_test_bow)

## And the l2 normalization of the bag-of-words features (column-wise)
X_train_l2 = preproc.normalize(X_train_bow, axis=0)
X_test_l2 = preproc.normalize(X_test_bow, axis=0)

Using Logistic Regression

Here we use logistic regression to build the model and compare the results for the bag-of-words, tf-idf, and $\ell_2$-normalized features.

from sklearn.linear_model import LogisticRegression

def logistic(X_train, y_train, X_test, y_test, description):
    model = LogisticRegression().fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print("Test score with", description, "features:", score)
    return model

Using this function we can compare the three methods; the feature set with the highest score works best. Keep in mind that an untuned logistic regression can be a weak classifier here: especially when the number of features exceeds the number of observations, we need regularization to handle the high-dimensional problem.
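A usage sketch comparing the three feature sets built above (names follow the earlier code blocks):

model_bow = logistic(X_train_bow, y_train, X_test_bow, y_test, 'bag-of-words')
model_l2 = logistic(X_train_l2, y_train, X_test_l2, y_test, 'l2-normalized')
model_tfidf = logistic(X_train_tfidf, y_train, X_test_tfidf, y_test, 'tf-idf')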

Feature Hashing

A hash function maps a potentially unbounded integer to a finite integer range $[1, m]$. Since many inputs may map to the same bin, this is called a collision. A uniform hash function ensures that roughly the same number of inputs lands in each of the $m$ bins.

# Map a list of words into m hashed count features
def hash_features(word_list, m):
    output = [0] * m
    for word in word_list:
        index = hash(word) % m  # any hash function can be used here
        output[index] += 1
    return output

Or use the FeatureHasher in scikit-learn and check the resulting size. Note that hash maps have a built-in data structure in Python, the dictionary: dic = {"key": value, "key2": value2}.

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=m, input_type="string")
f = h.transform(dataframe)

## check the storage size
from sys import getsizeof
print("Pandas Series", getsizeof(dataframe))
print("Hashed numpy array", getsizeof(f))

Bin Counting

Bin counting builds features for a categorical variable from the conditional probability of the target given each category value, rather than from the category value itself.

Example in “Big Learning Made Easy-with Counts” is

| user | number of clicks | number of nonclicks | probability of clicks |
|------|------------------|---------------------|-----------------------|
| A | 5 | 120 | 0.04 |
| B | 20 | 230 | 0.08 |

Here we use the probability of clicks to replace the two original count features. Alternatives such as the odds ratio or the log-odds ratio can be used instead of the probability.

The advantage of this method is that it turns a large, sparse, binary representation of the categorical variable into a small, dense one by replacing the original values with statistics.
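A minimal pandas sketch of bin counting, assuming a hypothetical DataFrame clicks_df with one row per impression and columns "user" and "click" (0/1):

import pandas as pd

grouped = clicks_df.groupby("user")["click"]
stats = pd.DataFrame({"clicks": grouped.sum(), "impressions": grouped.count()})
stats["click_prob"] = stats["clicks"] / stats["impressions"]
# replace the raw user category with its click probability
clicks_df = clicks_df.join(stats["click_prob"], on="user")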

Rare Categories

One way to deal with them is called back-off: create a single bin that accumulates the counts of all rare categories. Rare categories are defined by a count threshold.
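A pandas sketch of back-off, assuming a DataFrame df with a hypothetical column "category" and an assumed count threshold of 20:

threshold = 20  # assumed cutoff for "rare"
counts = df["category"].value_counts()
rare = counts[counts < threshold].index
# move all rare categories into a single back-off bin
df["category_backoff"] = df["category"].where(~df["category"].isin(rare), other="__rare__")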

PCA

PCA stands for principal component analysis, which works well when the data lies in a linear subspace.
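A minimal scikit-learn sketch, projecting the features onto the first two principal components (the number of components is just an assumption):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
x_pca = pca.fit_transform(x)       # project onto the top 2 principal components
pca.explained_variance_ratio_      # fraction of variance captured by each component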

K-means

When the data has a more complicated shape, k-means is useful; its cluster assignments can serve as new features. Other methods such as logistic regression, kNN, random forests, and RBF SVMs can handle such data as well, but with different efficiency.
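A sketch of using k-means cluster membership as a new feature (the number of clusters is an assumption):

from sklearn.cluster import KMeans
import pandas as pd

km = KMeans(n_clusters=10, random_state=0).fit(x)
clusters = km.predict(x)                                       # one cluster id per sample
cluster_features = pd.get_dummies(clusters, prefix="cluster")  # one-hot cluster membership features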

Image Feature

Image Gradients

An image gradient is the difference between neighboring pixels. The problem is that individual pixels do not carry enough semantic information about the image, so they are poor atomic units for analysis.

A simple approach is to compute the differences along the horizontal (x) and vertical (y) axes of the image. The mask $[1, 0, -1]$ takes the difference between the left and right neighbors, or between the up and down neighbors. This is a convolution, which is common in signal processing:

import matplotlib.pyplot as plt 
import numpy as np
from skimage import data, color
### Load the example image and turn it into grayscale
image = color.rgb2gray(data.chelsea())
### Compute the horizontal gradient using the centered 1D filter.
### This is equivalent to replacing each non-border pixel with the
### difference between its right and left neighbors. The leftmost
### and rightmost edges have a gradient of 0.
gx = np.empty(image.shape, dtype=np.double)
gx[:, 0] = 0
gx[:, -1] = 0
gx[:, 1:-1] = image[:, :-2] - image[:, 2:]
### Same deal for the vertical gradient
gy = np.empty(image.shape, dtype=np.double)
gy[0, :] = 0
gy[-1, :] = 0
gy[1:-1, :] = image[:-2, :] - image[2:, :]
### Matplotlib incantations
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(5, 9), sharex=True, sharey=True)

ax1.axis('off')
ax1.imshow(image, cmap=plt.cm.gray)
ax1.set_title('Original image')
ax1.set_adjustable('box')
ax2.axis('off')
ax2.imshow(gx, cmap=plt.cm.gray)
ax2.set_title('Horizontal gradients')
ax2.set_adjustable('box')
ax3.axis('off')
ax3.imshow(gy, cmap=plt.cm.gray)
ax3.set_title('Vertical gradients')
ax3.set_adjustable('box')

Image feature extractors such as SIFT and HOG are better.

In 1999, computer vision researchers figured out a better way to represent images using statistics of image patches: the Scale Invariant Feature Transform (SIFT) [Lowe, 1999].

SIFT was originally developed for the task of object recognition, which involves not only correctly tagging the image as containing an object, but pinpointing its location in the image. The process involves analyzing the image at a pyramid of possible scales, detecting interest points that could indicate the presence of the object, extracting features (commonly called image descriptors in computer vision) about the interest points, and determining the pose of the object.

Over the years, the usage of SIFT expanded to extract features not only for interest points but across the entire image. The SIFT feature extraction procedure is very similar to another technique, called the Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005]. Both of them essentially compute histograms of gradient orientations. We now describe this process in detail.

Fully Connected Layers

A fully connected neural network layer is simply a set of linear functions of all the input features. Recall that a linear function can be written as an inner product between the input feature vector and a weight vector, plus an optional constant term: $y = Wx + b$. It is called fully connected because every input can contribute to every output and there is no restriction on the weight matrix $W$. (A convolutional layer, by contrast, uses only a subset of the inputs for each output.)

Convolutional Layers

Convolutional layers use only a subset of the inputs for each output. The transformation, called a convolution kernel or filter, moves across the input.

Example: applying a simple Gaussian filter to an image.

import numpy as np
from skimage import data, color
from scipy import signal
import matplotlib.pyplot as plt

# First create X, Y meshgrids of size 5x5 on which we compute the Gaussian
ind = [-1., -0.5, 0., 0.5, 1.]
X, Y = np.meshgrid(ind, ind)
X
# array([[-1. , -0.5,  0. ,  0.5,  1. ],
#        [-1. , -0.5,  0. ,  0.5,  1. ],
#        [-1. , -0.5,  0. ,  0.5,  1. ],
#        [-1. , -0.5,  0. ,  0.5,  1. ],
#        [-1. , -0.5,  0. ,  0.5,  1. ]])

# G is a simple, unnormalized Gaussian kernel where the value at (0,0) is 1.0
G = np.exp(-(np.multiply(X, X) + np.multiply(Y, Y)) / 2)
G
# array([[0.36787944, 0.53526143, 0.60653066, 0.53526143, 0.36787944],
#        [0.53526143, 0.77880078, 0.8824969 , 0.77880078, 0.53526143],
#        [0.60653066, 0.8824969 , 1.        , 0.8824969 , 0.60653066],
#        [0.53526143, 0.77880078, 0.8824969 , 0.77880078, 0.53526143],
#        [0.36787944, 0.53526143, 0.60653066, 0.53526143, 0.36787944]])

cat = color.rgb2gray(data.chelsea())
blurred_cat = signal.convolve2d(cat, G, mode='valid')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax1.axis('off')
ax1.imshow(cat, cmap=plt.cm.gray)
ax1.set_title('Input image')
ax1.set_adjustable('box')
ax2.axis('off')
ax2.imshow(blurred_cat, cmap=plt.cm.gray)
ax2.set_title('After convolving with a Gaussian filter')
ax2.set_adjustable('box')

## To read our own image file instead of the sample image
import imageio
ws_image = imageio.imread("wushuang.png")  # ws_image is an RGB ndarray
ws_image = color.rgb2gray(ws_image)
# Alternatively, load a grayscale array saved as text
ws_image = np.loadtxt("wushuang.txt")

Rectified Linear Unit (ReLU) Transformation

The activation function between the input and the output is usually a nonlinear transformation such as the tanh function (a smooth nonlinear function bounded between -1 and 1), the sigmoid function (a smooth nonlinear function bounded between 0 and 1), or the rectified linear unit (ReLU). The ReLU is a linear function whose negative part is zeroed out, so its range is $[0, \infty)$.

Common Activation Function:

$$\text{ReLU}(x) = \max(0, x)$$

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

[Figure: plots of the common activation functions]
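For reference, these activations are straightforward to write with NumPy (a minimal sketch):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh is available directly as np.tanh(x)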

Pooling Layers

A pooling layer combines each neighborhood of inputs into a single output. This reduces the number of outputs in the hidden layers of a deep network, which effectively reduces the probability of overfitting the network to the training data.

Common ways to pool the inputs are averaging, summing, and taking the maximum value. AlexNet uses overlapping max pooling, moving through the image in strides of two pixels (or outputs) and pooling over three neighbors.

[Figure: max pooling]
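A minimal NumPy sketch of non-overlapping 2x2 max pooling (AlexNet's overlapping scheme with stride 2 and size 3 is slightly different):

import numpy as np

def max_pool_2x2(x):
    # non-overlapping 2x2 max pooling on a 2D array; odd edges are trimmed
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))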

An Item-Based Recommender

  1. Generalize information about the item.
  2. Score all other items and find similar ones.
  3. Return the ranked items and their scores.

Import Data, Cleaning and Feature Selection

First, we need to define what kind of papers are most useful to a user. Features such as published date and fields of study can be assumed to be useful. We then compute a similarity score between papers; cosine similarity usually provides a reasonable comparison between two non-zero vectors (other similarity measures can be viewed as distances).

import pandas as pd

df = pd.read_json("filepath/filename.json", lines=True)
## df.shape, df.columns

# Filter rows, drop duplicates, and drop unneeded columns
# ("feature", "value", "feature_1", "feature_2" are placeholders)
model_df = (df[df["feature"] == "value"]
            .drop_duplicates(subset="feature", keep="first")
            .drop(["feature_1", "feature_2"], axis=1))

## model_df.shape

unique_fos = sorted(list({feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row}))
unique_year = sorted(model_df['year'].astype('str').unique())

def feature_array(x, var, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
            else:
                if unique_array[j] == str(x[i]):
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
        row_dict.update({i: var_dict})
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    return feature_df

###
year_features = feature_array(model_df['year'], 'year', unique_year)
fos_features = feature_array(model_df['fos'], 'fos', unique_fos)
first_features = fos_features.join(year_features).T

from sys import getsizeof
print('Size of first feature array: ', getsizeof(first_features))

## Define a "good" recommendation as a paper that looks similar to the input
from scipy.spatial.distance import cosine

def item_collab_filter(features_df):
    item_similarities = pd.DataFrame(index=features_df.columns,
                                     columns=features_df.columns)
    for i in features_df.columns:
        for j in features_df.columns:
            item_similarities.loc[i][j] = 1 - cosine(features_df[i],
                                                     features_df[j])
    return item_similarities

first_items = item_collab_filter(first_features.loc[:, 0:1000])

## Heatmap to show the similarities
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
sns.set()
ax = sns.heatmap(first_items.fillna(0), vmin=0, vmax=1, cmap="YlGnBu",
                 xticklabels=250, yticklabels=250)
ax.tick_params(labelsize=12)

Binning

bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))
temp_df = pd.DataFrame(index=model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)
X_yrs = pd.get_dummies(temp_df['yearBinned'])
X_yrs.columns.categories
# IntervalIndex([(1831.0, 1841.0], (1841.0, 1851.0], (1851.0, 1860.0],
#                (1860.0, 1870.0], (1870.0, 1880.0] ... (1968.0, 1978.0],
#                (1978.0, 1988.0], (1988.0, 1997.0], (1997.0, 2007.0],
#                (2007.0, 2017.0]],
#               closed='right',
#               dtype='interval[float64]')

# plot the new distribution
fig, ax = plt.subplots()
X_yrs.sum().plot.bar(ax=ax)
ax.tick_params(labelsize=8)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)

Reference

Alice Zheng and Amanda Casari, Feature Engineering for Machine Learning, O'Reilly Media, 2018

Author: shixuan liu
Link: http://tedlsx.github.io/2019/09/18/feature-engine/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.