Introduction
With the recent surge in Pokemon enthusiasts, courtesy of Pokemon Go, many people have gained a new or renewed interest in Pokemon. This tutorial shows some simple ways to collect data on the first six generations of Pokemon. Afterward we will process the data and output it so we can visualize it clearly, analyze some patterns, and draw conclusions from them. By the end we hope to gain some insight into how to hunt Pokemon effectively in the various generations, as well as determine which Pokemon to look for as the franchise continues.

We will do this using a variety of graphs that analyze the data in several ways, along with a bit of machine learning to estimate future trends. By the end of this you won't need to "catch 'em all". The majority of this project will be observing trends in the Pokemon and attempting to form a good hypothesis based on our observations.
Note: The code in this tutorial is written in a long-winded fashion to show clearly what each cell is meant to do. Please bear in mind while reading that there are much more efficient ways to accomplish some of these tasks.
Tutorial content
In this tutorial, we will show how to plot some graphs and apply linear regression using Matplotlib, Pandas, and Sklearn.
We will be using data from a Pokemon database. We explain how to use various Python commands to select and edit the dataset, then show how to display the data in various formats and fit trend lines, from which we shall draw several conclusions. You can re-create this tutorial with any of your favorite '90s Saturday morning cartoons: Digimon, Yu-Gi-Oh!, etc.
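We will be covering the following topics in this tutorial: installing the libraries, loading and plotting the data, analysis and machine learning, and conclusions and insight.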
Installing the Libraries
First you need to have Python installed (preferably Python 3). Next you'll want to have some of the base Python libraries installed.
$ pip install numpy
$ pip install scipy
$ pip install scikit-learn
$ pip install pandas
$ pip install matplotlib
Note: Most of these steps can be skipped by installing Anaconda Navigator, since the Anaconda distribution already includes these libraries.
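Before moving on, it can help to confirm that the libraries are visible to the Python you are actually running. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the libraries used in this tutorial.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["numpy", "scipy", "sklearn", "pandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All libraries are available.")
```

If anything is reported as missing, install it with pip (or through Anaconda) and run the check again.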
Loading and Plotting the Data
Now that everything is installed it's time to download the necessary files. First you'll need a spreadsheet of the first six generations of Pokemon along with their 'mega' forms. You can find one by going to Kaggle and searching for a dataset. The dataset that I'm using came from here.
The first step is to extract the CSV file and place it wherever your Python environment reads from (for example, the notebook's working directory). Since my dataset is a file of comma-separated values, I read it using pandas' read_csv function after importing the necessary libraries. I use Matplotlib, Pandas, and NumPy as my defaults for most Python projects.
# Importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# Reading in my dataset and displaying the lead rows
data = pd.read_csv("Pokemon.csv")
data.head()
My dataset includes a 'Total' column, which holds the sum of that particular Pokemon's stats, and I can edit and arrange the data as I wish, such as sorting it by that 'Total' column. As you can see from the Mega Venusaur above, the dataset gives a mega form the same generation as its base Pokemon (which is in line with the official designation). I consider all mega and primal evolutions to be generation 6 variations, so I edit the data below to match. To do this I use regular expressions, since their searches are case sensitive.
# Importing the regular expression library
import re
# Sorting dataframe by total score
data = data.sort_values('Total', ascending=False)
# Function applied to every row: if the name marks a mega or primal form, set the generation to 6.
# The trailing space (assuming names of the form 'Mega Venusaur') keeps 'Meganium' from matching.
def my_func(row):
    result1 = re.search('Mega ', row['Name'])
    result2 = re.search('Primal ', row['Name'])
    if result1 is not None or result2 is not None:
        row['Generation'] = 6
    return row
data = data.apply(my_func, axis=1)
data
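The row-by-row function above can also be written as a single vectorized step. The following is a sketch on a four-row sample frame built by hand (it is not the real CSV, and it assumes mega and primal names look like "VenusaurMega Venusaur" and "KyogrePrimal Kyogre"). One thing to watch: a bare 'Mega' pattern would also match Meganium, an ordinary generation 2 Pokemon, so the pattern here includes the trailing space.

```python
import pandas as pd

# Hand-built sample rows, only for illustration.
sample = pd.DataFrame({
    "Name": ["Venusaur", "VenusaurMega Venusaur", "Meganium", "KyogrePrimal Kyogre"],
    "Generation": [1, 1, 2, 3],
})

# Boolean mask: True where the name contains "Mega " or "Primal " (case sensitive).
is_special = sample["Name"].str.contains(r"Mega |Primal ")

# Set the generation to 6 only on the matching rows.
sample.loc[is_special, "Generation"] = 6
print(sample)
```

Meganium keeps its original generation here, while both special forms move to generation 6.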
Using various pandas commands you can select Pokemon based on a multitude of criteria. The one below is selected by its Pokedex number.
data[data['#'] == 48]
Alternatively you can select them explicitly by name. This Haunter is selected by its name.
data[data['Name'] == 'Haunter']
The following rows are selected by their primary type. I shall be using this as my main form of comparison later on.
data[data['Type 1'] == 'Dragon']
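The selections above each use one condition, but conditions can be combined as well. The following sketch uses a small hand-built frame (not the real dataset) to show two common patterns: joining masks with & and matching several values with isin. The variable names are only illustrative.

```python
import pandas as pd

# Hand-built sample rows, only for illustration.
sample = pd.DataFrame({
    "Name": ["Haunter", "Dratini", "Dragonite", "Bagon"],
    "Type 1": ["Ghost", "Dragon", "Dragon", "Dragon"],
    "Generation": [1, 1, 1, 3],
})

# Each condition needs its own parentheses when combined with & (and) or | (or).
gen1_dragons = sample[(sample["Type 1"] == "Dragon") & (sample["Generation"] == 1)]

# isin() selects rows whose value is any member of the given list.
later_generations = sample[sample["Generation"].isin([3, 4, 5, 6])]

print(gen1_dragons)
print(later_generations)
```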
To plot the data above you can use something similar to the code below. I partition the dataset into the six generations, then take the mean of each of the six stats across that generation. I print the results as a way to check them.
# Creating a subframe of the full file from columns 6-12: the six stats plus 'Generation'.
# (5:12 in the code since iloc is zero-indexed and the end of the slice is excluded)
stats = data.iloc[:, 5:12]
groups = 6
labels = ('HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed')
# Creating and converting datasets of individual generations to list then removing the 'Generation' value.
gen1_means = (stats[stats['Generation'] == 1].mean()).tolist()
gen1_means = gen1_means[0:6]
gen2_means = (stats[stats['Generation'] == 2].mean()).tolist()
gen2_means = gen2_means[0:6]
gen3_means = (stats[stats['Generation'] == 3].mean()).tolist()
gen3_means = gen3_means[0:6]
gen4_means = (stats[stats['Generation'] == 4].mean()).tolist()
gen4_means = gen4_means[0:6]
gen5_means = (stats[stats['Generation'] == 5].mean()).tolist()
gen5_means = gen5_means[0:6]
gen6_means = (stats[stats['Generation'] == 6].mean()).tolist()
gen6_means = gen6_means[0:6]
print(gen1_means)
print(gen2_means)
print(gen3_means)
print(gen4_means)
print(gen5_means)
print(gen6_means)
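The six near-identical blocks above are deliberately explicit. As a shorter alternative, pandas can compute every generation's means in one call with groupby. This is a sketch on a toy frame whose numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy frame with placeholder numbers, standing in for the 'stats' subframe.
toy_stats = pd.DataFrame({
    "HP": [40, 60, 50, 70],
    "Attack": [50, 70, 80, 100],
    "Generation": [1, 1, 2, 2],
})

# One row per generation, one column per stat; 'Generation' becomes the index,
# so there is no extra value to slice off afterwards.
gen_means = toy_stats.groupby("Generation").mean()
print(gen_means)

# A single generation's means as a plain list, like gen1_means above.
gen1_list = gen_means.loc[1].tolist()
print(gen1_list)
```

On the real data the same call would be applied to the stats subframe, and gen_means.loc[k].tolist() would play the role of each genk_means list.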
Since the output above is just lists of numbers, I shall change the format. I want to compare the change in power over the generations, and a bar graph gives the clearest side-by-side comparison.
# Values to plot bar graph side by side
index = np.arange(groups)
bar_width = 0.1
opacity = 0.8
# Single-letter color codes are written in lowercase, as recent Matplotlib versions require.
gen_1 = plt.bar(index, gen1_means, bar_width,
                alpha=opacity,
                color='r',
                label='Generation 1')
gen_2 = plt.bar(index+bar_width, gen2_means, bar_width,
                alpha=opacity,
                color='orange',
                label='Generation 2')
gen_3 = plt.bar(index+(bar_width*2), gen3_means, bar_width,
                alpha=opacity,
                color='y',
                label='Generation 3')
gen_4 = plt.bar(index+(bar_width*3), gen4_means, bar_width,
                alpha=opacity,
                color='g',
                label='Generation 4')
gen_5 = plt.bar(index+(bar_width*4), gen5_means, bar_width,
                alpha=opacity,
                color='b',
                label='Generation 5')
gen_6 = plt.bar(index+(bar_width*5), gen6_means, bar_width,
                alpha=opacity,
                color='purple',
                label='Generation 6')
plt.xlabel('Stats over the Generations')
plt.ylabel('Average')
plt.title('Average Stats Per Generation')
plt.xticks(index +(bar_width*3), labels)
plt.legend(loc=2,prop={'size':5})
plt.tight_layout()
plt.show()
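The side-by-side layout comes down to simple arithmetic: each generation's bars are shifted right by a multiple of bar_width, and the tick label belongs at the middle of each group. The sketch below works through those positions with NumPy only; the variable names are illustrative. Since plt.bar centers each bar on its x value by default, the middle of a six-bar group is 2.5 bar widths to the right of the first bar.

```python
import numpy as np

n_stats = 6          # HP, Attack, Defense, Sp. Atk, Sp. Def, Speed
n_generations = 6
bar_width = 0.1

index = np.arange(n_stats)                        # left-most bar of each group
offsets = bar_width * np.arange(n_generations)    # shift for each generation

# Row k holds the x positions of generation k+1's bars across all six stats.
bar_positions = index[np.newaxis, :] + offsets[:, np.newaxis]

# The centre of each group is the average of its bar positions.
tick_positions = bar_positions.mean(axis=0)
print(tick_positions)
```

Passing tick_positions to plt.xticks would place each stat label exactly under the middle of its group.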
The following code is needed because not every type appears in every generation of Pokemon. To place each generation accurately, I use the code below to determine which generations each type in my dataset belongs to.
types = set(data['Type 1'])
for i in types:
    print(i + " types appear in generations:")
    print(set(data[data['Type 1'] == i]['Generation']))
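Rather than reading the printed sets by eye, the types that skip a generation can be picked out programmatically. This is a sketch on a cut-down, hand-built frame with only three types and three generations; the names gens_by_type and partial_types are illustrative.

```python
import pandas as pd

# Cut-down sample, only for illustration: three types, three generations.
sample = pd.DataFrame({
    "Type 1": ["Dragon", "Dragon", "Steel", "Steel", "Water", "Water", "Water"],
    "Generation": [1, 3, 2, 3, 1, 2, 3],
})

all_generations = set(sample["Generation"])

# Map each type to the set of generations it appears in.
gens_by_type = {
    poketype: set(group["Generation"])
    for poketype, group in sample.groupby("Type 1")
}

# Types that are missing from at least one generation.
partial_types = sorted(
    poketype for poketype, gens in gens_by_type.items() if gens != all_generations
)
print(partial_types)
```

Run against the full dataset, the same idea would list the types that need special handling when plotting.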
Using the output above I pull out the types that are not in every generation (Flying, Fairy, Dragon, Steel, and Dark) and map their graphs manually. For the other types you can plot the generations much as in the code for the overall average graph. For both the code above and below I use a for-loop to iterate through the Pokemon types, which is more efficient than manually going through all 18 types.
for poketype in set(data['Type 1']):
    curr_poketype = data[data['Type 1'] == poketype]
    # numeric_only=True skips text columns such as 'Name', which newer pandas versions refuse to average
    poketype_avg = curr_poketype.groupby("Generation").mean(numeric_only=True)
    poketype_avg = poketype_avg.reset_index()
    poketype_avg = poketype_avg.drop(poketype_avg.columns[[0, 1, 2, 9]], axis=1)
    fig, ax = plt.subplots()
    print(poketype)
    if poketype == 'Flying':
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='b', label='Generation 5')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='purple', label='Generation 6')
    elif poketype == 'Fairy':
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='r', label='Generation 1')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='orange', label='Generation 2')
        plt.bar(index+(bar_width*2), poketype_avg.loc[2], bar_width, alpha = opacity, color='g', label='Generation 4')
        plt.bar(index+(bar_width*3), poketype_avg.loc[3], bar_width, alpha = opacity, color='purple', label='Generation 6')
    elif poketype == 'Dragon':
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='r', label='Generation 1')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='y', label='Generation 3')
        plt.bar(index+(bar_width*2), poketype_avg.loc[2], bar_width, alpha = opacity, color='g', label='Generation 4')
        plt.bar(index+(bar_width*3), poketype_avg.loc[3], bar_width, alpha = opacity, color='b', label='Generation 5')
        plt.bar(index+(bar_width*4), poketype_avg.loc[4], bar_width, alpha = opacity, color='purple', label='Generation 6')
    elif poketype == 'Steel':
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='orange', label='Generation 2')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='y', label='Generation 3')
        plt.bar(index+(bar_width*2), poketype_avg.loc[2], bar_width, alpha = opacity, color='g', label='Generation 4')
        plt.bar(index+(bar_width*3), poketype_avg.loc[3], bar_width, alpha = opacity, color='b', label='Generation 5')
        plt.bar(index+(bar_width*4), poketype_avg.loc[4], bar_width, alpha = opacity, color='purple', label='Generation 6')
    elif poketype == 'Dark':
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='orange', label='Generation 2')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='y', label='Generation 3')
        plt.bar(index+(bar_width*2), poketype_avg.loc[2], bar_width, alpha = opacity, color='g', label='Generation 4')
        plt.bar(index+(bar_width*3), poketype_avg.loc[3], bar_width, alpha = opacity, color='b', label='Generation 5')
        plt.bar(index+(bar_width*4), poketype_avg.loc[4], bar_width, alpha = opacity, color='purple', label='Generation 6')
    else:
        plt.bar(index, poketype_avg.loc[0], bar_width, alpha = opacity, color='r', label='Generation 1')
        plt.bar(index+bar_width, poketype_avg.loc[1], bar_width, alpha = opacity, color='orange', label='Generation 2')
        plt.bar(index+(bar_width*2), poketype_avg.loc[2], bar_width, alpha = opacity, color='y', label='Generation 3')
        plt.bar(index+(bar_width*3), poketype_avg.loc[3], bar_width, alpha = opacity, color='g', label='Generation 4')
        plt.bar(index+(bar_width*4), poketype_avg.loc[4], bar_width, alpha = opacity, color='b', label='Generation 5')
        plt.bar(index+(bar_width*5), poketype_avg.loc[5], bar_width, alpha = opacity, color='purple', label='Generation 6')
    plt.xlabel('Stats over the Generations')
    plt.ylabel('Average')
    plt.title('Average Stats Per Generation')
    plt.xticks(index +(bar_width*3), labels)
    plt.legend(loc=2,prop={'size':5})
    plt.tight_layout()
    plt.show()
Analysis and Machine Learning
The following code is a proper analysis of the first six generations of the Pokemon franchise. My null hypothesis is that there is no significant change between generations; specifically, that for an arbitrary generation k, generation k-1 is equally powerful. I expect to reject this notion, so when I fit linear regression lines to the data I should see a positive slope. If instead the null hypothesis holds, the regression line will be roughly flat.
The code below iterates through the Pokemon types and plots each individual Pokemon's total power against its generation. On top of each scatter plot I draw the regression line for that type.
for i in set(data['Type 1']):
    curr_type = data[data['Type 1'] == i]
    curr_type.plot.scatter(x = 'Generation', y = 'Total')
    x = curr_type['Generation']
    y = curr_type['Total']
    m, b = np.polyfit(x, y, 1)
    plt.plot(x, m*x + b, '-')
    plt.suptitle(i)
    plt.show()
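Reading the slope off a plot shows the direction of a trend but not whether it could be chance. One way to put a number on that is scipy.stats.linregress, which fits the same straight line as np.polyfit and also reports a p-value for the null hypothesis that the slope is zero. The sketch below uses placeholder totals, not values from the dataset, and the 0.05 threshold is a conventional choice rather than something prescribed by this tutorial.

```python
import numpy as np
from scipy import stats

# Placeholder data: two made-up totals per generation, only for illustration.
generation = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
total = np.array([300, 320, 330, 310, 350, 370, 360, 380, 400, 390, 420, 410])

# Least-squares line; pvalue tests the null hypothesis that the slope is zero.
result = stats.linregress(generation, total)
print(f"slope = {result.slope:.2f}, p-value = {result.pvalue:.4f}")

# Reject the null only if the slope is positive and unlikely to be chance.
reject_null = bool(result.slope > 0 and result.pvalue < 0.05)
print("Reject null hypothesis:", reject_null)
```

For types with only a handful of Pokemon, the p-value will often be large even when the fitted line tilts upward or downward, which is a reason to be cautious about those plots.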
By observing the regressions we can see that for 17 of the 18 types the regression line increases over the generations. For the single type that goes down (Flying), this can be attributed to the small sample of Pokemon that have Flying as their first type, and even so there is still a Pokemon with a total score comparable to the previous generation. Thus, according to these plots, nearly every type of Pokemon grew stronger as the generations passed.
Since nearly every type has become more powerful, I now go through every stat and check how it has changed over the generations. First I draw a box plot of the total power, then iterate through the six stats and make a box plot and regression line for each of them.
data.boxplot('Total', by = 'Generation')
x = data['Generation']
y = data['Total']
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, '-')
plt.show()
for i in labels:
    data.boxplot(i, by = 'Generation')
    x = data['Generation']
    y = data[i]
    m, b = np.polyfit(x, y, 1)
    plt.plot(x, m*x + b, '-')
    plt.show()
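The same straight-line fit can be done with scikit-learn, which also makes it easy to extend the line to a generation that is not in the data. This is a sketch with placeholder averages, not values from the dataset, and the generation 7 figure it produces is an extrapolation of the fitted line, not a prediction to rely on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder averages per generation, only for illustration.
generation = np.array([1, 2, 3, 4, 5, 6])
mean_total = np.array([400.0, 405.0, 410.0, 415.0, 420.0, 425.0])

# scikit-learn expects a 2-D feature array: one row per sample, one column per feature.
model = LinearRegression()
model.fit(generation.reshape(-1, 1), mean_total)

slope = model.coef_[0]
intercept = model.intercept_
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# Extend the fitted line one generation past the data.
predicted_gen7 = model.predict(np.array([[7]]))[0]
print(f"extrapolated generation 7 average: {predicted_gen7:.1f}")
```

The slope and intercept match what np.polyfit(generation, mean_total, 1) returns, so either tool can be used for the regression lines in this section.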
Now we can see from the regression lines that every single stat becomes larger as the generations increase. While some slopes are larger than others, I feel we can state that Pokemon do become more powerful as the generations go on, thus rejecting the null hypothesis.
Conclusion & Insight
Given our null hypothesis, our actual hypothesis, and the results achieved, it seems that in general new Pokemon are stronger than their predecessors. In addition to this we can make some other observations about the dataset, and more specifically about which Pokemon we would want to pick in the following generations. These observations lead to two opposing opinions, both of which are plausible.
The first idea we can draw is that in a new Pokemon game, picking the newer Pokemon will most likely give you an edge over the older Pokemon. This trend was supported no matter how we looked at the data: more recent generations had better stats overall.
Based on the type graphs seen above we can conclude one of two things. The first is that, if the trends continue, the types whose graphs have the steepest slopes will continue to be the most powerful in the game. Looking at them you can tell that Ground and Dragon types have the advantage here.
Conversely, if the developers seek to balance the game, there will be fewer of the aforementioned types in the coming generations, or they will not be as strong. Instead we would see a rise in types that have few Pokemon and whose flatter power graphs mark them as generally weaker than their counterparts. As such, we may see more Flying types with higher overall stats, as well as more Ghost types.
Closing & Sources
This tutorial covered some basic ways to interact with, edit, and observe data from the Pokemon franchise. There are many more elaborate and extensive ways to display and process the data; links to some of them, as well as all resources required to re-create this tutorial, will be posted below. I hope you have enjoyed my Python tutorial on Pokemon. Thank you.