
See how easily you can create plots from the data available in this package

library(amazonasdatahub)

Datasets tutorials

For each dataset, you can check the documentation using ? before the dataset name.

?agriculture_idam
?aids_amazonas
?humidity_manaus
?malaria_amazonas
?pib_trimestral
?rionegro_amazonas
?school_read_levels

Scatter plot of planted area and harvested area from cassava production

The dataset agriculture_idam provides crop production data from Amazonas. We can filter it by crop and draw a scatter plot of planted area against harvested area. In this example, we will use cassava production.

To draw a simple scatter plot in base R, without external packages, we will use the plot function.

# Filtering data
mandioca_prod <- agriculture_idam[agriculture_idam$cultivation == "Mandioca", ]

# Scatter Plot
plot(
  mandioca_prod$planted,
  mandioca_prod$harvested,
  xlab = "Planted area (hectare)",
  ylab = "Harvested area (hectare)",
  main = "Cassava production in Amazonas",
  sub = "Planted area x Harvested Area"
)
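
As a complement, here is a minimal sketch of adding an identity reference line to this kind of plot, so that points below the line mark cases where less area was harvested than planted. The data frame toy_prod and all its values are made up for illustration; only the column names planted and harvested come from the dataset.

```r
# Made-up values; only the column names mirror agriculture_idam
toy_prod <- data.frame(
  planted   = c(10, 25, 40, 60),
  harvested = c(10, 20, 38, 45)
)

# Share of the planted area that was actually harvested
toy_prod$harvest_ratio <- toy_prod$harvested / toy_prod$planted

plot(
  toy_prod$planted,
  toy_prod$harvested,
  xlab = "Planted area (hectare)",
  ylab = "Harvested area (hectare)"
)

# Identity line (harvested == planted), drawn dashed
abline(a = 0, b = 1, lty = 2)
```

The same abline call could be added after the cassava plot, assuming both areas are recorded in the same unit.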

Time Series of AIDS case counts in Manaus

The dataset aids_amazonas contains AIDS case counts for each municipality in Amazonas.

One possible analysis is to visualize the time series of case counts for a single municipality, with one line per sex/gender. To do this, we will use the dplyr package to structure the data and the ggplot2 package to create and customize the chart.

# Loading dplyr to structure the data and ggplot2 to plot it
require(dplyr)
require(ggplot2)

# Filtering by municipality and plotting case count by gender
aids_amazonas %>%
  filter(name_muni == "Manaus") %>%
  group_by(gender) %>%
  ggplot(aes(x = year, y = cases, group = gender, color = gender)) +
  geom_line() +
  scale_color_manual(values = c("blue", "red")) +
  theme_minimal() +
  labs(
    title = "AIDS occurrences in Manaus (2011-2023)",
    x = "Year",
    y = "Case count",
    color = "Gender"
  )

Time Series of relative humidity from Manaus (2010 - 2020)

The humidity_manaus dataset contains the minimum relative humidity observed in the city of Manaus from January 2009 to December 2020. We can visualize the time series of the relative humidity during this time interval.

Using dplyr, we can create a date column composed of the year and month. With ggplot2, we can then draw the time series chart.

# Loading dplyr to structure the data and ggplot2 to plot it
require(dplyr)
require(ggplot2)

# Creating date column and plotting the time series
humidity_manaus %>%
  mutate(date = as.Date(paste0(year, "-", month, "-", "01"))) %>%
  ggplot(aes(x = date, y = rh)) +
  geom_line() +
  theme_minimal() +
  labs(
    title = "Relative Humidity in Manaus (2010 - 2020)",
    x = "Date",
    y = "Relative Humidity"
  )
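
The date-building step can also be tried on its own. The sketch below uses base R and a made-up data frame toy_humidity; it assumes that year and month are stored as whole numbers, which may not match the actual column types in humidity_manaus. Zero-padding the month with sprintf keeps every string in YYYY-MM-DD form before as.Date parses it.

```r
# Made-up rows; assumes `year` and `month` are stored as whole numbers
toy_humidity <- data.frame(
  year  = c(2010L, 2011L, 2010L),
  month = c(12L, 3L, 1L),
  rh    = c(64, 75, 71)
)

# Build "YYYY-MM-01" strings with a zero-padded month, then parse them
toy_humidity$date <- as.Date(
  sprintf("%d-%02d-01", toy_humidity$year, toy_humidity$month)
)

# Sort chronologically so the rows are in time order
toy_humidity <- toy_humidity[order(toy_humidity$date), ]
```

Each observation is pinned to the first day of its month; this is only a plotting convention, since the data are monthly.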

Time series of Quarterly GDP of Amazonas

With the data from pib_trimestral, we can analyze how the quarterly figures behave over the observed interval (2010 to 2021).

Using dplyr and ggplot2, we can select the variables of interest and create a line chart. This example demonstrates that, along with more advanced customizations such as colors, title font formatting, and line types.

# Loading dplyr and ggplot2
require(dplyr)
require(ggplot2)

# Selecting only year and taxes and plotting
pib_trimestral %>%
  select(year, taxes) %>%
  ggplot(., aes(x = year, y = taxes)) +
  geom_line(linewidth = 1, colour = "#cb181d") +
  # Dashed reference line at the mean of taxes
  geom_hline(
    yintercept = mean(pib_trimestral$taxes),
    linetype = "dashed",
    linewidth = 1
  ) +
  theme_light() +
  theme(
    plot.title = element_text(face = "bold", size = 16)
  ) +
  labs(
    x = "Year",
    y = "Taxes",
    title = "Quarterly GDP of Amazonas",
    subtitle = "2010 - 2021"
  )

Boxplots of water level (in meters) of Rio Negro (Amazonas)

With the data provided by rionegro_amazonas, one possible analysis is to draw one boxplot of the water level per year.

We will be using dplyr and ggplot2.

# Loading dplyr (for the %>% pipe) and ggplot2
require(dplyr)
require(ggplot2)

rionegro_amazonas %>%
  ggplot(aes(x = year, y = level_m, group = year)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot() +
  theme_minimal() +
  labs(
    x = "Year",
    y = "Water level (m)"
  )
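
To look at the numbers behind each box, a per-year summary can be computed directly. The following is a minimal base R sketch on made-up readings (toy_river is hypothetical; only the column names year and level_m come from the dataset). Note that fivenum returns hinges, which are close to but not always identical to the quartiles.

```r
# Made-up readings; column names mirror rionegro_amazonas
toy_river <- data.frame(
  year    = rep(c(2009, 2010), each = 5),
  level_m = c(18, 20, 22, 24, 26, 15, 17, 19, 21, 29)
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per year
year_summary <- tapply(toy_river$level_m, toy_river$year, fivenum)

# Base R counterpart of the grouped boxplot
boxplot(
  level_m ~ year,
  data = toy_river,
  xlab = "Year",
  ylab = "Water level (m)"
)
```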

require(dplyr)
require(ggplot2)

# Filtering dates for the second half of 2010
rionegro_amazonas_2010_02 <- rionegro_amazonas %>%
  filter(date >= "2010-06-01" & date <= "2010-12-31")

# Graphical Visualization
rionegro_amazonas_2010_02 %>%
  ggplot(., aes(x = date, y = level_m)) +
  geom_line(linewidth = 1, colour = "#006994") +
  # Horizontal reference lines for the mean, minimum and maximum level
  geom_hline(
    aes(
      yintercept = mean(rionegro_amazonas_2010_02$level_m),
      color = "Mean"
    ),
    linetype = "dashed",
    linewidth = 1
  ) +
  geom_hline(
    aes(
      yintercept = min(rionegro_amazonas_2010_02$level_m),
      color = "Min"
    ),
    linetype = "dotted",
    linewidth = 1
  ) +
  geom_hline(
    aes(
      yintercept = max(rionegro_amazonas_2010_02$level_m),
      color = "Max"
    ),
    linetype = "dotted",
    linewidth = 1
  ) +
  scale_color_manual(
    name = "Statistics",
    values = c(
      "Mean" = "orange",
      "Min" = "red",
      "Max" = "green"
    )
  ) +
  scale_x_date(
    date_breaks = "1 month"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(face = "bold", size = 16)
  ) +
  labs(
    x = "Date",
    y = "Water level (m)",
    title = "Rio Negro: water level (second half of 2010)"
  )

Missing data and outliers

Part of a statistician's job is to find errors and inconsistencies in the data. The graph above shows the level of the Rio Negro dropping to 0 meters. That is implausible, as it would mean the river had dried up completely.

These zeros most likely correspond to missing data (NAs) that were recorded as zero, so we will replace them with NAs.

# mutate() and case_when() come from dplyr
require(dplyr)

rionegro_amazonas_2010_02 <- rionegro_amazonas_2010_02 %>%
  mutate(
    level_m = case_when(
      date == "2010-08-07" ~ NA_real_,
      date == "2010-12-22" ~ NA_real_,
      TRUE ~ as.numeric(level_m)
    ),
    increase_decrease_cm = case_when(
      date == "2010-08-07" ~ NA_real_,
      date == "2010-12-22" ~ NA_real_,
      TRUE ~ as.numeric(increase_decrease_cm)
    )
  )
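
The two dates hard-coded above can be found by filtering for zero readings. Below is a minimal base R sketch on made-up data (toy_river and its values are hypothetical) that lists the suspect dates and then recodes every zero as NA without naming the dates by hand. It assumes that a level of exactly zero never occurs as a genuine measurement.

```r
# Made-up readings; column names mirror rionegro_amazonas
toy_river <- data.frame(
  date    = as.Date("2010-08-05") + 0:4,
  level_m = c(20.1, 20.0, 0, 19.8, 19.7)
)

# Dates on which the recorded level is exactly zero
suspect_dates <- toy_river$date[toy_river$level_m == 0]

# Recode every zero reading as missing
toy_river$level_m[toy_river$level_m == 0] <- NA_real_
```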

Handling Missing Values

Now that we have marked the missing values, we can choose a method to handle them. In this example, we will use forward fill, which replaces each missing value with the last observed value before it. We encourage you to research and try other methods to learn different ways of handling missing values.

require(tidyr)

# Forward fill: carry the last observed value down into each NA
rionegro_amazonas_2010_02 <- rionegro_amazonas_2010_02 %>%
  fill(level_m, increase_decrease_cm, .direction = "down")
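
As one example of another method, the sketch below compares forward fill with linear interpolation in base R. The data frame toy_river and its values are made up, and forward_fill is a hypothetical helper written for this illustration, not part of tidyr or of this package.

```r
# Made-up readings with one missing value; column names mirror rionegro_amazonas
toy_river <- data.frame(
  date    = as.Date("2010-08-05") + 0:4,
  level_m = c(20.0, 19.8, NA, 19.4, 19.2)
)

# Hypothetical helper: carry the last observed value forward
forward_fill <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the last non-missing value so far
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs (idx == 0) stay NA
}
toy_river$level_ffill <- forward_fill(toy_river$level_m)

# Linear interpolation between the neighbouring observed values
ok <- !is.na(toy_river$level_m)
toy_river$level_interp <- approx(
  x    = as.numeric(toy_river$date[ok]),
  y    = toy_river$level_m[ok],
  xout = as.numeric(toy_river$date)
)$y
```

Forward fill repeats the previous day's reading, while interpolation places the missing value on the straight line between its two neighbours. Which one is more appropriate depends on how the series behaves around the gap.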

With the processed data, we can recreate the plot and visualize the level of the Rio Negro.

require(dplyr)
require(ggplot2)

# Graphical Visualization
rionegro_amazonas_2010_02 %>%
  ggplot(., aes(x = date, y = level_m)) +
  geom_line(linewidth = 1, colour = "#006994") +
  # Horizontal reference lines for the mean, minimum and maximum level
  geom_hline(
    aes(
      yintercept = mean(rionegro_amazonas_2010_02$level_m),
      color = "Mean"
    ),
    linetype = "dashed",
    linewidth = 1
  ) +
  geom_hline(
    aes(
      yintercept = min(rionegro_amazonas_2010_02$level_m),
      color = "Min"
    ),
    linetype = "dotted",
    linewidth = 1
  ) +
  geom_hline(
    aes(
      yintercept = max(rionegro_amazonas_2010_02$level_m),
      color = "Max"
    ),
    linetype = "dotted",
    linewidth = 1
  ) +
  scale_color_manual(
    name = "Statistics",
    values = c(
      "Mean" = "orange",
      "Min" = "red",
      "Max" = "green"
    )
  ) +
  scale_x_date(
    date_breaks = "1 month"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(face = "bold", size = 16)
  ) +
  labs(
    x = "Date",
    y = "Water level (m)",
    title = "Level of the Rio Negro (processed) in the second half of 2010"
  )

The treatment of these outlying values, which we considered missing, therefore made all the difference to the conclusions drawn about the level of the Rio Negro.
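
To make the effect concrete, here is a minimal sketch with made-up numbers (not taken from rionegro_amazonas) showing how a single zero standing in for a missing reading distorts the minimum and the mean.

```r
# Made-up water levels in meters; the 0 stands in for an unrecorded reading
levels_raw <- c(20.1, 20.0, 0, 19.8, 19.7)

# Same series with the zero recoded as missing
levels_na <- replace(levels_raw, levels_raw == 0, NA)

# Summary statistics before and after the recoding
stats_raw <- c(min = min(levels_raw), mean = mean(levels_raw))
stats_clean <- c(
  min  = min(levels_na, na.rm = TRUE),
  mean = mean(levels_na, na.rm = TRUE)
)

print(rbind(raw = stats_raw, clean = stats_clean))
```

In this toy series the zero pulls the minimum down to 0 and the mean from 19.9 to 15.92, which is the kind of distortion visible in the reference lines of the first plot.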