FIFA 2021 Player Analysis

By: Nathan Chung

Outline

Introduction

This data science tutorial will guide you through the data science pipeline. In this project we analyze soccer players' data from FIFA 2021. More specifically, we look for relationships and trends in the dataset among the top 5 players in the world, and later create and test a model.

Background Information

FIFA is the International Federation of Association Football. The organization governs world football and stages the World Cup, which is held every four years. Over the past two decades, EA Sports has developed FIFA soccer games that are released annually. Each game features curated real-world teams and players, and each individual player is given a rating out of 100 based on player statistics such as shooting, passing, and dribbling.

Set Up

In this project, we use several useful Python libraries. Below are all of the packages that we will be using in this tutorial.

More information on python libraries:

Data Collection/Data Management

What is the data?

The datasets used in this project are made up of soccer players' data from the Career Mode of FIFA 2021.

The data is taken from Kaggle, an online public database. The link to the repository for the data is below. All of the files are labeled players_XX.csv, where XX denotes which FIFA year the data was taken from.

Getting Data

In order to get the data, you will go to the link https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset and download the file 'players_21.csv'.

Notice that each file ends with .csv. These are 'Comma Separated Values' files, which are extremely common in data science and are used for exchanging data between different applications. To make these files usable for this project, we read them into a pandas DataFrame with read_csv.
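As a minimal sketch of this step, the snippet below reads CSV text into a DataFrame and previews it. The inline rows are made-up stand-ins so the example runs without the download; with the real file, the call would presumably be pd.read_csv("players_21.csv").

```python
import io

import pandas as pd

# Toy stand-in for players_21.csv; these rows and values are invented for illustration.
csv_text = """sofifa_id,short_name,age,overall
1,Player A,30,90
2,Player B,25,85
3,Player C,22,78
"""

# read_csv accepts a file path or any file-like object.
players = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (5 by default) for a quick look at the table.
print(players.head())
```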

Once we have the datasets loaded into dataframes, we use .head() to display the data in table format to get a first glimpse of what we are looking at.

Data Cleaning

As we can see, there are 106 columns in each dataset. Many of these columns are not practical for this project and will not be used, such as rs, lw, lf, cf, etc. (columns that give a player a value for each and every position).

In this tutorial, our focus is to develop a model that is accurate enough to find trends and relationships in the data. The extra columns are not needed, so in this step we will clean the data.

We use iloc from pandas, an indexer that selects rows and columns of a DataFrame by integer position. Writing iloc[:, 0:39] uses slicing to keep every row and only the first 39 columns (positions 0 through 38, since the stop index is excluded), dropping the rest. Slicing is efficient here because it spares us from writing out each column name, which in this case would mean more than 50 names.
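The slice bounds are easy to misread, so here is a small sketch on a toy frame with 106 columns. The column names col_0 through col_105 are hypothetical placeholders, not the real dataset's names.

```python
import numpy as np
import pandas as pd

# Toy frame with 106 columns named col_0 ... col_105 (hypothetical names).
wide = pd.DataFrame(np.zeros((3, 106)), columns=[f"col_{i}" for i in range(106)])

# Keep every row (:) and the columns at positions 0..38; the stop index 39 is excluded.
trimmed = wide.iloc[:, 0:39]

print(trimmed.shape)        # (3, 39)
print(trimmed.columns[-1])  # col_38
```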

As you can see, we now have a friendlier dataset that can be analyzed without the extraneous information.

More Data Cleaning

The main goal of this project is to determine what makes a top-tier soccer player. Which components/attributes best reflect a highly skilled soccer player? To answer this, the best approach is to condense the dataset. The FIFA 2021 roster contains at least 18,000 players in total. Since we want to home in on the individuals with higher rankings, we can condense the dataset and track the top 50 players from each year.

We can also kill two birds with one stone. In the FIFA 2021 roster, every player for every position is recorded, including goalkeepers. Goalkeepers have entirely different recorded statistics than field players like strikers and midfielders. In fact, for field-player statistics such as shooting, passing, and dribbling, almost every goalkeeper has a missing (NaN) value, which will cause issues later when applying regression if we keep it in the data.

In this project, we are mainly focusing on field players like midfielders or strikers, so we can eliminate all goalkeepers listed in the dataset.
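The two cleaning steps described above can be sketched as follows. This assumes the position column is named player_positions, the rating column is named overall, and goalkeepers are marked with the value "GK"; the rows are made up, and the example keeps the top 2 rather than the top 50 to stay short.

```python
import pandas as pd

nan = float("nan")

# Toy roster; names and numbers are invented for illustration.
players = pd.DataFrame({
    "short_name": ["A", "B", "C", "D", "E"],
    "player_positions": ["RW, ST", "GK", "CM", "GK", "ST, LW"],
    "overall": [93, 91, 88, 90, 85],
    "shooting": [92.0, nan, 80.0, nan, 86.0],  # goalkeepers lack field-player stats
})

# Step 1: drop goalkeepers, which also removes their NaN field-player stats.
field_players = players[players["player_positions"] != "GK"]

# Step 2: keep the highest-rated players (n=2 here; the tutorial uses 50).
top_n = field_players.nlargest(2, "overall")

print(top_n[["short_name", "overall"]])
```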

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process where you do your initial analysis and breakdown of the data to look for trends, patterns, and anomalies, and to summarize the different characteristics of the data. EDA is done before any assumptions are made about the dataset.

This process helps a data scientist find out how to best manipulate the data to get the answers of their desired interest.

Although this is only a brief overview on what Exploratory Data Analysis consists of, a more detailed source of information can be found below

These were the top 50 ranked players from FIFA 2021. From here, we can use each player's sofifa_id to check whether they were also in the top 50 of the previous FIFA rosters and, if so, keep them. Since we want consistency across the datasets, a player who was in the top 50 for one year but not for another is simply removed from the new dataset we are about to make.

Understanding the Data

As we can see, there are several different columns in our newly cleaned and processed dataset. What's important is that we know what each of these components is and understand its purpose. This is crucial before starting any EDA because you should have a good understanding of what your data consists of.

Our main focus is to understand the data values we will most focus on:

Attributes we will focus on are:

Now that we have the data, we can start by ordering the players from best to worst using different attributes of the dataset. This can help identify players that are consistently near the top of the rankings for each attribute.

As we can see from the display above, the highest rated players are L. Messi, Cristiano Ronaldo, R. Lewandowski, Neymar Jr, and K. De Bruyne. From this we can show a correlation matrix in which darker blue cells mark pairs of attributes that are highly correlated with each other, and lighter blue cells mark pairs that are less correlated.

Looking at all of the rankings can only do so much. Our main goal is to find relationships within the data. Constantly comparing every variable to every other one can prove to be a challenging and time-consuming task.

A more efficient approach is to use a correlation matrix. A correlation matrix shows the correlation between each pair of variables, and is commonly used to summarize data before the next step of analysis. Luckily, Python has a library called seaborn that makes it easy to visualize a correlation matrix.

Seaborn is a data visualization library in Python that provides both attractive and informative statistical graphs. More information can be found in the link below.

To interpret a correlation map, look at the score in each cell and see which pairs of attributes score highly together. In this case, our correlation matrix shows that players' shooting scores are related to their dribbling scores, and that there is a relationship between mentality_vision scores and mentality_positioning scores.
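A minimal sketch of computing such a matrix with pandas is shown below. The numbers are invented for illustration and are not taken from the FIFA data; only the column names follow the attributes discussed above.

```python
import pandas as pd

# Made-up attribute scores for five hypothetical players.
stats = pd.DataFrame({
    "shooting": [92, 93, 91, 85, 86],
    "dribbling": [95, 89, 85, 94, 88],
    "mentality_vision": [95, 82, 79, 90, 94],
    "mentality_positioning": [93, 95, 94, 87, 88],
})

# Pairwise Pearson correlations; each cell lies between -1 and 1.
corr = stats.corr()
print(corr.round(2))

# With seaborn installed, sns.heatmap(corr, annot=True, cmap="Blues")
# would draw this table as a heatmap with darker cells for higher values.
```

Each diagonal cell is 1 because every attribute is perfectly correlated with itself, and the matrix is symmetric, so only one triangle needs to be read.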

Looking at these relationships, it makes sense that a player's mentality_vision will heavily influence their passing and shooting skills as well as their positioning.

From this, we can use this information to further explore those relationships through visualizations and graphs.

Creating Useful Visualizations

Data visualization creates a visual summary, which makes it much easier to find relationships and trends in data than trying to process several columns and rows of values.

More insight is achieved with visual depictions because the data we are working with becomes more valuable when it is processed into a visualization.

One of the main goals of this entire project is to find out what makes the world's best soccer player the best. In this case, Lionel Messi is the highest ranked player in FIFA 2021, but what makes him so special?

To start, it helps to create visuals that show how Messi differs from other players in various player stats. Since we are dealing with many players of different caliber, a scatter plot is ideal, and adding a trend line shows whether a player exceeds the average performance in a specific skill or falls below it.
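The idea behind reading a trend line can be sketched numerically: fit a straight line, then check which points sit above or below it. The values here are invented, and NumPy's polyfit stands in for whatever plotting helper draws the line in the figures.

```python
import numpy as np

# Made-up scores for five hypothetical players.
vision = np.array([70.0, 75.0, 80.0, 85.0, 90.0])
shooting = np.array([72.0, 74.0, 83.0, 84.0, 93.0])

# Least-squares straight line: shooting ~ slope * vision + intercept.
slope, intercept = np.polyfit(vision, shooting, deg=1)

# A positive residual means the player sits above the trend line.
predicted = slope * vision + intercept
residuals = shooting - predicted
above_trend = residuals > 0

print(round(slope, 2), round(intercept, 2))
print(above_trend)
```

On a scatter plot, the same line would be drawn through the points, and the players with positive residuals are the ones that appear above it.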

From our correlation matrix, we found that players' rankings in passing, shooting, mentality_vision, and mentality_positioning were correlated, so we will focus on those 4 areas when visualizing.

We will visualize the relationship between a player's shooting ranking and their mentality_vision ranking. A crucial skill to have on the soccer field is having the visual prowess to read where everyone is on the field, and with that knowledge being able to attack the opposing team's defense in the best logical way.

As we can see, Messi has the highest performance in mentality vision.

We will visualize the relationship between a player's passing ranking and their mentality_vision ranking.

Here, we are starting to see bigger differences among the top 5 players that were labeled. Whether Ronaldo or Messi is better is a subject of ongoing debate. In this visualization we can see that Messi outshines Ronaldo in passing, which suggests that the difference in this skill plays a big role in the overall ranking.

In this visualization, we are observing the relationship between a player's skill in passing and shooting.

Here we are taking a look at the relationship between a player's skill in dribbling and mentality_vision.

Here, we see that Lewandowski, Ronaldo, and Neymar are all below the trend line for the average dribbling skill. This shows these players' weaknesses.

Another form of visualization

Scatter plots and trend lines are useful, but sometimes more than one type of visualization is needed to answer your question. In the next set of graphs, we create joint line graphs: we take two players and plot their skill rankings as lines. The two players get different colored lines so that we can compare them side by side.
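As a rough tabular companion to such a line plot, the sketch below places two players' skill scores side by side and computes the gap per skill. "Player A" and "Player B" and their scores are hypothetical; only the skill column names follow the dataset.

```python
import pandas as pd

skills = ["attacking_short_passing", "skill_dribbling", "skill_curve"]

# Made-up scores for two hypothetical players, indexed by skill.
compare = pd.DataFrame(
    {"Player A": [91, 96, 93], "Player B": [82, 88, 81]},
    index=skills,
)

# Positive values mean Player A scores higher on that skill.
compare["difference"] = compare["Player A"] - compare["Player B"]
print(compare)

# With matplotlib installed, compare[["Player A", "Player B"]].plot()
# would draw one line per player across the skills.
```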

Here, we will compare the statistics of Messi and Ronaldo, two of the most well-known and coveted soccer players in the world. Finding any major differences in their skillsets will help us identify the minute differences in these top-tier players' games that make one better than the other.

Looking at each player's line (Messi-Blue, Ronaldo-Green), we see that both players are fairly similar in most skillsets. One thing to note, however, is that toward the middle of each line plot, where the data shows the players' attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control, Lionel Messi has notably higher rankings than Ronaldo.

This observation may be what we are looking for. It may be that Messi's higher skill in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control is what sets him apart from Ronaldo.

Here, we will compare Messi with Neymar.

We see a trend similar to the one from comparing Ronaldo with Messi. In this visualization (Messi-Blue, Neymar-Green), we see that both players are again fairly similar in most skillsets.

But in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control, Lionel Messi has generally higher rankings than Neymar.

Taking a step back, this is nearly the same observation as in the Ronaldo comparison: Messi has generally higher rankings in those skillsets. This only reinforces our initial findings.

Finally, for one more comparison, we will compare Messi and Lewandowski.

Once again, we see the same trend as when we compared Messi with Ronaldo and Neymar. In this visualization (Messi-Blue, Lewandowski-Red), we see that both players are again fairly similar in most skillsets.

But in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control, Lionel Messi has generally higher rankings than Lewandowski.

Now what have these findings told us? To recap our purpose: we compared Messi (the top ranked player) with other elite players such as Ronaldo and Neymar because comparing a highly skilled player with another player of similar skill is much more meaningful. Comparing Messi, who is ranked 90+, with a player who is ranked 60 in FIFA would not surface any differences that help our case, because there will always be huge differences in the statistics. That is why we pulled the top 5 ranked players and compared them with each other.

We wanted to find which parts of Messi's game allow him to outperform other players of the same caliber. We saw the same thing in each comparison: Messi is more skilled in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control, which leads us to believe that these skillsets are what matter when looking at a group of players that are ranked higher in the game.

Machine Learning/Linear Regression

In data science, machine learning (ML) is a method of data analysis that uses algorithms to find patterns in data. ML applies artificial intelligence so that a system/model can learn and improve its calculations/predictions from observation and experience, without needing a lot of explicit programming.

When given the task of predicting a trend in some data, it is very hard to come up with a function or write code that is accurate all the time, because the variables in the data are always changing. This is where machine learning comes into play. ML uses algorithms to take data into a system, learn from it, and then forecast and predict trends.

More information can be found in the link below:

Linear Regression

Linear Regression is a type of predictive analysis with two goals

It is one of the most commonly used forms of regression/machine learning.

Linear regression is a part of machine learning in which an algorithm calculates a predicted outcome from input parameters/variables that we find have a large influence on the outcome we are trying to predict.

More information can be found below:

Applying Linear Regression to our sets that we have made.

To perform linear regression in Python, we first have to prepare what are called training and test sets.
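A minimal sketch of the split-then-fit idea follows. It uses plain NumPy on synthetic data rather than the regression library and player data used in this tutorial, so the 80/20 split ratio, the features, and the coefficients are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 rows, 2 features, target is a linear mix plus small noise.
X = rng.uniform(50, 99, size=(100, 2))
y = 10 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, size=100)

# Shuffle the row indices and hold out 20% of the rows as the test set.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

# Add an intercept column and fit ordinary least squares on the training rows only.
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Score the fitted line on the rows the model has not seen.
A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
pred = A_test @ coef
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2_test = 1 - ss_res / ss_tot

print(coef.round(2), round(r2_test, 3))
```

Fitting on one part of the data and scoring on the other gives a check on how well the model generalizes, which a score computed on the training rows alone cannot provide.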

How to Interpret the results from running linear regression?

Reading the results from linear regression

As we can see, the independent variables age, pace, dribbling, physic, attacking_heading_accuracy, skill_dribbling, skill_ball_control, and mentality_positioning all have P-values less than 0.05, which means that those variables are statistically significant and should be kept in the model.

Our adjusted R-squared value is 0.999, which is very good: it means that our model explains 99.9% of the variance in the outcome within the dataset.
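For reference, the adjusted R-squared statistic is commonly defined as shown below, where n is the number of observations and p is the number of predictors (not counting the intercept). The symbols n and p are introduced here for illustration and do not appear in the regression output itself.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Because of the factor (n - 1)/(n - p - 1), adding a predictor raises the adjusted value only if it improves the fit by more than chance alone would be expected to.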

This also confirms our findings from the EDA stage, where we saw that among top-tier players, those with higher skills in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control, such as Messi, were the ones able to stand out and outperform other players of the same caliber. The finding from the linear regression model that similar skillsets have P-values less than 0.05 shows that these are components of a player's game that play a large role in their ranking.

Conclusion

In this tutorial, we went through the data science pipeline, where we learned about collecting data, reading from CSV files and transferring data from a website/online database into your own workspace, cleaning the data (unstructured to structured data), and starting the process of Exploratory Data Analysis.

Our main focus was to analyze soccer players' statistics from FIFA 2021 and find out which parts of a player's game allow them to truly outshine others in their respective skill set, setting aside the obvious surface-level differences (such as an overall rating of 60 versus 90) and very broad statistics like height or weight. Our goal was to dig deep into the analysis and find out which specific components and variables came into play. In our EDA, we found trends and relationships by first using a correlation matrix and then, building on those findings, visualizing the data in graphs and charts. Our main visualizations were scatter plots with regression/trend lines as well as line plots to compare two players. We homed in on the higher rated players in the data because it only made sense to compare a player of class A (similar skill) to another player of class A (similar skill).

Our most effective visualizations were the correlation matrix and the line plots, where we observed a consistent trend: Messi outperformed players of the same caliber in attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, and skill_ball_control. In our machine learning phase, we further validated our findings by looking at the P-values of related skillsets such as skill_dribbling and skill_ball_control, which were less than 0.05, meaning they were statistically significant.