I do not always come up with new ideas for my blog, but rather get inspired by the great work of others.
In this case, it was a reddit post by u/Cheapo_Sam, who charted world footballs goal scorers in a marvelous way.
According to the post, the was gathered manually which I thought is too tedious (Ain’t nobody got time for that!).
So I decided to automatize this step.

This post exclusively deals with acquisition and cleaning of the data. The next (one or two)
posts will handle the analysis.

library(tidyverse)  # for data wrangling
library(rvest)      # for web scraping
library(lubridate)  # for date formats 

As a data source, I found transfermarkt.co.uk to be a good place.
It has very detailed records of goals scored by players, dating back to the 1960’s.
As an example, we will get the data of Zlatan Ibrahimovic.

url 

The URL contains a very detailed table of all goals scored by Zlatan.
read_html() reads the html content of the page and html_table() puts all
html tables (

...) into a list of data frames. The fill = TRUE is necessary, since
there are some empty cells in columns. Note that there are several tables on the
page. The second one is what we want.
tab 
## Observations: 444
## Variables: 18
## $ Competition      "Season 01/02", "", "", "", "", "", "", "", ""...
## $ Competition      "Season 01/02", "Eredivisie", "Eredivisie", "U...
## $ Day              "Season 01/02", "3", "4", "First Round", "Firs...
## $ Date             "Season 01/02", "8/26/01", "9/8/01", "9/20/01"...
## $ Venue            "Season 01/02", "A", "A", "H", "A", "H", "", "...
## $ For              "Season 01/02", "", "", "", "", "", "", "", ""...
## $ For              "Season 01/02", "(15.)", "(10.)", "", "", "(1....
## $ vs.              "Season 01/02", "", "", "", "", "", "", "", ""...
## $ vs.              "Season 01/02", "Feyenoord  (1.)", "FC Twente ...
## $ Result           "Season 01/02", "1:2", "0:2", "2:0", "0:3", "4...
## $ Pos.             "Season 01/02", "", "", "CF", "CF", "", "", ""...
## $ Minute           "Season 01/02", "64'", "19'", "4'", "64'", "68...
## $ `At this point`  "Season 01/02", "0:1", "0:1", "1:0", "0:2", "3...
## $ `Type of goal`   "Season 01/02", "Tap-in", "", "", "Header", ""...
## $ Provider         "Season 01/02", "Hatem Trabelsi", "", "Nikolao...
## $ `NA`             "Season 01/02", NA, NA, NA, NA, NA, NA, NA, NA...
## $ `NA`             "Season 01/02", NA, NA, NA, NA, NA, NA, NA, NA...
## $ `NA`             "Season 01/02", NA, NA, NA, NA, NA, NA, NA, NA...

The table is rather Messi messy, so we need to do some serious cleaning.
We start by extracting the columns we want and give them better names.

tab 
## Observations: 444
## Variables: 9
## $ competition  "Season 01/02", "Eredivisie", "Eredivisie", "UEFA ...
## $ day          "Season 01/02", "3", "4", "First Round", "First Ro...
## $ date         "Season 01/02", "8/26/01", "9/8/01", "9/20/01", "9...
## $ venue        "Season 01/02", "A", "A", "H", "A", "H", "", "A", ...
## $ against      "Season 01/02", "Feyenoord  (1.)", "FC Twente  (2....
## $ minute       "Season 01/02", "64'", "19'", "4'", "64'", "68'", ...
## $ standing     "Season 01/02", "0:1", "0:1", "1:0", "0:2", "3:0",...
## $ type         "Season 01/02", "Tap-in", "", "", "Header", "", "T...
## $ provider     "Season 01/02", "Hatem Trabelsi", "", "Nikolaos Ma...

Next we deal with the actual data. Notice that the first row only contains the
entry “Season 01/02”. This row occurs at the beginning of each season, so we need
to delete these rows. The against columns contains some additional information
(the current position), which we do not need. We remove these using a “simple” regex
(if such a thing really exists…). Finally, we turn the minute column into a
numeric value.

tab 
## Observations: 427
## Variables: 9
## $ competition  "Eredivisie", "Eredivisie", "UEFA Cup", "UEFA Cup"...
## $ day          "3", "4", "First Round", "First Round", "10", "", ...
## $ date         "8/26/01", "9/8/01", "9/20/01", "9/27/01", "12/4/0...
## $ venue        "A", "A", "H", "A", "H", "", "A", "A", "A", "H", "...
## $ against      "Feyenoord", "FC Twente", "Apol. Limassol", "Apol....
## $ minute       64, 19, 4, 64, 68, 79, 68, 78, 95, 47, 11, 34, 58,...
## $ standing     "0:1", "0:1", "1:0", "0:2", "3:0", "4:0", "1:1", "...
## $ type         "Tap-in", "", "", "Header", "", "Tap-in", "", "", ...
## $ provider     "Hatem Trabelsi", "", "Nikolaos Machlas", "Wambert...

I just recently stumbled upon the readr::parse_number() function and really like it.
Note though that we lose some information.

readr::parse_number("90'+4")
## [1] 90

So, goals scored in stoppage time will count towards the 90th minute.

Yet another problem is that we are actually missing an important piece of information
in the data frame. The club Zlatan was scoring for. Look at how the
information is given on transfermarkt.

The club is only given as an image. html_table() only reads text, so it misses that information and
returns an empty string (see the For column in the first code chunk). We need to get this information independently.
While we are at it, we also get the link to the club crests.

club %
    html_nodes(xpath = "//td/a/img") %>%
    html_attr("alt")
club %
  html_nodes(xpath = "//td/a/img") %>%
  html_attr("src")
club_crest 

So what is happening here? html_nodes() looks for all table cells (

) which have a link ()
and within the link an image (). The name of the club is stored as the alt attribute of the
image, which we grab with html_attr(). The problem is, that the vs. column above also contains the
same structure. Since the table is read rowwise, we can skip every other entry in the club vector.
The same principle can be used for the crests, yet here we need the src attribute which points to the
image file.

You may have noticed some empty entries in several columns. We can’t do anything about
the missing entries in the provider and type column, since this information is, in fact, missing.
If you look at the actual webpage, you notice that if the player scores two or more goals in a game
all information is omitted in that row. We can get that information simply by copying it from the row
above. We do the same with the club (and crest) vector, though here we use the append() function which basically
duplicates the ((i-1))th entry.

for(i in 1:nrow(tab)){
  if(tab$competition[i]==""){
    tab$competition[i]=tab$competition[i-1]
    tab$date[i]=tab$date[i-1]
    tab$day[i]=tab$day[i-1]
    tab$venue[i]=tab$venue[i-1]
    tab$against[i]=tab$against[i-1]
    club 

We can also delete the last row of the data frame because it only contains the totals.
Further, we can transform the date column to an actual date and append the club and crest vector
to the data.

tab 2020)+0
tab$date 

You may wonder about the last two rows. This is a dirty hack to get around the following issue.

as.Date("10/02/65",format = "%m/%d/%y")
## [1] "2065-10-02"

transfermarkt displays the year with two digits. So as soon as we get the data from a
player who was active before 1970, we run into this problem (relevant xkcd).

We finish the data cleaning process by adding and tweaking some variables.
First, we create a column that has the current goal count, which
is just the row number. Second, the crest we saved is too tiny to use. Luckily, by
altering the link a bit, we get a bigger version. Third, we turn the club variable into
a factor, which is ordered by the players career trajectory.

# add a goal count column
tab % mutate(goals=row_number())

# get bigger crests
tab$club_crest % 
  str_remove("_.*") %>%
  paste0(".png") %>% 
  str_replace("tiny","head")

#convert club to factor
tab$club 
glimpse(tab)
## Observations: 426
## Variables: 12
## $ competition  "Eredivisie", "Eredivisie", "UEFA Cup", "UEFA Cup"...
## $ day          "3", "4", "First Round", "First Round", "10", "10"...
## $ date         2001-08-26, 2001-09-08, 2001-09-20, 2001-09-27, 2...
## $ venue        "A", "A", "H", "A", "H", "H", "A", "A", "A", "H", ...
## $ against      "Feyenoord", "FC Twente", "Apol. Limassol", "Apol....
## $ minute       64, 19, 4, 64, 68, 79, 68, 78, 95, 47, 11, 34, 58,...
## $ standing     "0:1", "0:1", "1:0", "0:2", "3:0", "4:0", "1:1", "...
## $ type         "Tap-in", "", "", "Header", "", "Tap-in", "", "", ...
## $ provider     "Hatem Trabelsi", "", "Nikolaos Machlas", "Wambert...
## $ club         Ajax Amsterdam, Ajax Amsterdam, Ajax Amsterdam, Aj...
## $ club_crest   "https://tmssl.akamaized.net//images/wappen/head/6...
## $ goals        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...

That’s it! We now have a clean data frame of all goals scored by Zlatan Ibrahimovic with a lot
of detailed information.

You can wrap all the steps above into a funtion to easily get the data of other players.
The skeleton is shown below.

get_goals 

All you need to do is navigate to the page of the player you want and get the player name
and id of the link. Lionel Messi, for example, would be player = "lionel-messi" and id = "28003".

You may also consider to grab some additional data. For instance, the players birthday.
The birthday is good to have when we want to calculate the age of the
player at any given goal scored.

birthdate % 
  html_text(birthdate) %>% 
  str_squish() %>% 
  as.Date(format="%b %d, %Y")

This is where you need some basic knowledge about css again. If you look at the
html code of the webpage, you will find the birthdate being located in a

. The html_node() in conjunction with html_text() allows
us to extract the birthday as plain text by navigating along these css classes.
The str_squish() eliminates all the excessive whitespace.

This allows us now to add yet another column to the data frame which holds the exact age of
Zlatan for each goal.

tab$age  

In a similar manner, you can get the clear name of the player and a link to his profile
picture.

  player % html_text()
  portrait % html_attr("src")

You can also put these three variables into the function and return everything as a list.
The complete code for the function get_goals() is given at the bottom.

In the next post, we will use this data to explore the career goals of (not only!)
Zlatan and create some visualizations of the data.

get_goals %
    rvest::html_nodes(xpath = "//td/a/img") %>%
    rvest::html_attr("alt")
  club %
    rvest::html_nodes(xpath = "//td/a/img") %>%
    rvest::html_attr("src")
  club_crest 2020)+0
  tab$date % dplyr::mutate(goals=dplyr::row_number())

  tab$club_crest % 
    stringr::str_remove("_.*") %>%
    paste0(".png") %>% 
    stringr::str_replace("tiny","head")

  tab$club 

To leave a comment for the author, please follow the link and comment on their blog: schochastics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...























Source link https://www.r-bloggers.com/analyzing-the-greatest-strikers-in-football-i-getting-data/

LEAVE A REPLY

Please enter your comment!
Please enter your name here