Machine Learning with Golang: Data Preprocessing and Visualization

This blog teaches you how to use Golang packages and tools for data preprocessing and visualization. You will learn how to use Gota for data manipulation and Gonum/plot for data visualization.

1. Introduction

Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions. Machine learning applications are widely used in various domains, such as natural language processing, computer vision, recommender systems, and more.

However, before applying any machine learning algorithm to a dataset, it is important to perform some data preprocessing and visualization steps. Data preprocessing is the process of transforming raw data into a suitable format for machine learning. Data visualization is the process of creating graphical representations of data to explore its characteristics and patterns.

In this blog, you will learn how to use Golang packages and tools for data preprocessing and visualization. Golang, or Go, is a fast, simple, and reliable programming language that is gaining popularity among developers. Golang has a rich set of libraries and frameworks for various tasks, including machine learning.

Specifically, you will learn how to use the following Golang packages and tools:

Gota: A data frame library for data manipulation and analysis.
Gonum: A set of numeric libraries for scientific computing and data analysis.
Gonum/plot: A plotting library for creating and customizing various types of charts.

By the end of this blog, you will be able to perform data preprocessing and visualization with Golang and prepare your data for machine learning.

Are you ready to get started? Let’s go!

2. Data Preprocessing with Gota

Data preprocessing is an essential step in machine learning, as it prepares the data for the algorithm. Data preprocessing involves tasks such as loading, exploring, cleaning, and transforming the data. In this section, you will learn how to use Gota, a data frame library for Golang, to perform data preprocessing with ease and efficiency.

Gota is a library that provides a convenient and fast way to work with tabular data in Golang. It supports common operations such as reading and writing data from various sources, filtering, sorting, grouping, aggregating, joining, and reshaping data. Gota also provides methods to access and modify the data elements, such as rows, columns, cells, and values.

To use Gota, you need to import the github.com/go-gota/gota/dataframe package in your Golang code. You can also import the github.com/go-gota/gota/series package to work with individual data series. Here is an example of how to import Gota in your code:

package main

import (
    "fmt"
    "github.com/go-gota/gota/dataframe"
    "github.com/go-gota/gota/series"
)

In the next two subsections, you will learn how to use Gota to load and explore data, and how to clean and transform data for machine learning. You will use a sample dataset of iris flowers, which contains 150 rows and 5 columns. The columns are sepal length, sepal width, petal length, petal width, and species. You can download the dataset from here.

2.1. Loading and Exploring Data

The first step in data preprocessing is to load the data from a source and explore its basic properties. In this subsection, you will learn how to use Gota to load the iris dataset from a CSV file and perform some exploratory analysis on it.

To load the data from a CSV file, you can use the ReadCSV function from the dataframe package. This function takes an io.Reader as an argument and returns a dataframe object. You can use the os.Open function from the standard library to open the CSV file and pass it to the ReadCSV function. Here is an example of how to load the iris dataset from a CSV file:

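// Open the CSV file (requires the "os" and "log" packages from the standard library)
f, err := os.Open("iris.csv")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

// Load the data into a dataframe
df := dataframe.ReadCSV(f)

// ReadCSV has no error return value; problems are reported in the Err field
if df.Err != nil {
    log.Fatal(df.Err)
}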

Once you have loaded the data into a dataframe, you can explore its basic properties, such as the number of rows and columns, the column names, the data types, and the summary statistics. You can use the following methods from the dataframe package to do so:

Dims: Returns the dimensions of the dataframe as two integers (rows, columns).
Names: Returns the names of the columns as a slice of strings.
Types: Returns the data types of the columns as a slice of series.Type constants.
Describe: Returns a new dataframe with summary statistics for each column, such as the mean, median, standard deviation, min, max, and quartiles.

Here is an example of how to use these methods to explore the iris dataset:

// Print the dimensions of the dataframe (rows, then columns)
fmt.Println(df.Dims())

// Print the names of the columns
fmt.Println(df.Names())

// Print the data types of the columns
fmt.Println(df.Types())

// Print the summary statistics of the dataframe
fmt.Println(df.Describe())

The output of the above code looks roughly as follows (the exact spacing and row labels of the Describe table can differ between Gota versions):

150 5
[sepal_length sepal_width petal_length petal_width species]
[float float float float string]
[8x6] DataFrame

    column   sepal_length sepal_width petal_length petal_width species
 0: mean     5.843333     3.054000    3.758667     1.198667    -
 1: median   5.800000     3.000000    4.350000     1.300000    -
 2: stddev   0.828066     0.433594    1.764420     0.763161    -
 3: min      4.300000     2.000000    1.000000     0.100000    setosa
 4: 25%      5.100000     2.800000    1.600000     0.300000    -
 5: 50%      5.800000     3.000000    4.350000     1.300000    -
 6: 75%      6.400000     3.300000    5.100000     1.800000    -
 7: max      7.900000     4.400000    6.900000     2.500000    virginica
    <string> <float>      <float>     <float>      <float>     <string>

As you can see, the iris dataset has 150 rows and 5 columns. The column names are sepal length, sepal width, petal length, petal width, and species. The data types are float for the numeric columns and string for the categorical column. The summary statistics show the basic descriptive measures for each column, such as the mean, standard deviation, and quartiles.
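
If you want to check what such a summary means for a single column, the statistics are easy to compute by hand. The sketch below is a plain-Go approximation, not Gota's code: the Summary type and the summarize function are hypothetical names, the input values are invented, and the standard deviation uses the sample formula (divisor n-1).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Summary holds a few descriptive statistics for one numeric column.
type Summary struct {
	Mean, StdDev, Min, Median, Max float64
}

// summarize computes the statistics for a non-empty slice.
// StdDev is the sample standard deviation (divisor n-1).
func summarize(vals []float64) Summary {
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	n := len(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	mean := sum / float64(n)

	// Sum of squared deviations from the mean.
	ss := 0.0
	for _, v := range sorted {
		ss += (v - mean) * (v - mean)
	}
	std := 0.0
	if n > 1 {
		std = math.Sqrt(ss / float64(n-1))
	}

	// Median: middle value, or the average of the two middle values.
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return Summary{Mean: mean, StdDev: std, Min: sorted[0], Median: median, Max: sorted[n-1]}
}

func main() {
	// Invented sepal lengths, not taken from the real dataset.
	s := summarize([]float64{5.1, 4.9, 6.3, 5.8, 7.0})
	fmt.Printf("mean=%.3f std=%.3f min=%.1f median=%.1f max=%.1f\n",
		s.Mean, s.StdDev, s.Min, s.Median, s.Max)
}
```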

By loading and exploring the data, you can get a sense of its structure and distribution. This can help you identify any potential issues or anomalies in the data, such as missing values, outliers, or incorrect data types. In the next subsection, you will learn how to use Gota to clean and transform the data for machine learning.

2.2. Cleaning and Transforming Data

After loading and exploring the data, the next step in data preprocessing is to clean and transform the data for machine learning. Cleaning the data involves removing or replacing any missing, incorrect, or irrelevant values. Transforming the data involves changing the format, scale, or distribution of the data to make it more suitable for the machine learning algorithm. In this subsection, you will learn how to use Gota to perform some common data cleaning and transforming tasks on the iris dataset.

One of the data cleaning tasks that you may need to perform is to handle missing values. Missing values are values that are not present in the data, either because they were not recorded or because they were corrupted. Missing values can cause problems for machine learning algorithms, as they may reduce the accuracy or lead to errors. Therefore, you need to either remove or replace the missing values before applying the machine learning algorithm.

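As of this writing, Gota's dataframe type has no dedicated DropNA or FillNA methods, but you can build both from methods it does provide. The IsNaN method of a series returns a []bool that marks the missing elements, the Subset method of a dataframe keeps only the rows at the given indexes, and the Mutate method replaces the column that has the same name as the given series. Here is an example of how to use these methods to handle missing values in the iris dataset:

// Assume that the sepal_length column has some missing values (NaN)
col := df.Col("sepal_length")
missing := col.IsNaN()

// Option 1: keep only the rows whose sepal_length is present
var keep []int
for i, isMissing := range missing {
    if !isMissing {
        keep = append(keep, i)
    }
}
dropped := df.Subset(keep)
fmt.Println(dropped)

// Option 2: replace each missing value with the mean of the observed values
// (the mean is taken from the rows without NaN, because NaN would propagate)
mean := dropped.Col("sepal_length").Mean()
vals := col.Float()
for i, isMissing := range missing {
    if isMissing {
        vals[i] = mean
    }
}
filled := df.Mutate(series.New(vals, series.Float, "sepal_length"))
fmt.Println(filled)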
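
Independent of any dataframe library, mean imputation comes down to two passes over a column: one to average the values that are present, and one to fill the gaps. The following stdlib-only sketch assumes that missing values are represented as NaN; the function name imputeMean is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// imputeMean returns a copy of vals in which every NaN is replaced by the
// mean of the non-NaN values. If every value is NaN, the copy is returned
// unchanged because there is nothing to average.
func imputeMean(vals []float64) []float64 {
	// First pass: average the observed values.
	sum, n := 0.0, 0
	for _, v := range vals {
		if !math.IsNaN(v) {
			sum += v
			n++
		}
	}
	out := append([]float64(nil), vals...)
	if n == 0 {
		return out
	}
	mean := sum / float64(n)
	// Second pass: fill the gaps.
	for i, v := range out {
		if math.IsNaN(v) {
			out[i] = mean
		}
	}
	return out
}

func main() {
	nan := math.NaN()
	fmt.Println(imputeMean([]float64{1, nan, 3, nan, 5})) // prints [1 3 3 3 5]
}
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is best suited to columns with only a few gaps.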

Another data cleaning task that you may need to perform is to handle outliers. Outliers are values that are significantly different from the rest of the data, either because they are extreme or because they are erroneous. Outliers can also cause problems for machine learning algorithms, as they may skew the distribution or affect the performance. Therefore, you need to either remove or adjust the outliers before applying the machine learning algorithm.

To handle outliers with Gota, you can use the Filter method from the dataframe package. The Filter method returns a new dataframe with only the rows that satisfy a given condition. You can use this method to filter out any rows that have values that are beyond a certain threshold or range. Here is an example of how to use this method to handle outliers in the iris dataset:

// Assume that the iris dataset has some outliers in the sepal length column
// Print the original dataframe
fmt.Println(df)

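// Keep only the rows whose sepal length lies between 4 and 8 (inclusive).
// Filters passed to one Filter call are combined with OR, so the two bounds
// are applied in two chained calls, which combines them with AND.
df = df.
    Filter(dataframe.F{Colname: "sepal_length", Comparator: series.GreaterEq, Comparando: 4.0}).
    Filter(dataframe.F{Colname: "sepal_length", Comparator: series.LessEq, Comparando: 8.0})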

// Print the new dataframe
fmt.Println(df)
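
The text above leaves open how to pick the threshold. One common convention, which is an assumption here and not something the iris dataset requires, is Tukey's rule: treat values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile as outliers. The sketch below computes these fences in plain Go. The function names are hypothetical, and the quartiles use linear interpolation, so other quantile definitions give slightly different fences.

```go
package main

import (
	"fmt"
	"sort"
)

// quantile returns the p-quantile (0 <= p <= 1) of an ascending-sorted,
// non-empty slice, using linear interpolation between neighboring values.
func quantile(sorted []float64, p float64) float64 {
	pos := p * float64(len(sorted)-1)
	lo := int(pos)
	if lo >= len(sorted)-1 {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(lo)
	return sorted[lo] + frac*(sorted[lo+1]-sorted[lo])
}

// iqrBounds returns the fences q1 - 1.5*IQR and q3 + 1.5*IQR for a
// non-empty slice.
func iqrBounds(vals []float64) (lower, upper float64) {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	q1, q3 := quantile(sorted, 0.25), quantile(sorted, 0.75)
	iqr := q3 - q1
	return q1 - 1.5*iqr, q3 + 1.5*iqr
}

// withinBounds keeps the values v with lower <= v <= upper.
func withinBounds(vals []float64, lower, upper float64) []float64 {
	var kept []float64
	for _, v := range vals {
		if v >= lower && v <= upper {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	// Invented measurements with one obviously erroneous entry (58.0).
	vals := []float64{5.1, 4.9, 5.8, 6.3, 5.5, 58.0, 6.0, 5.2}
	lower, upper := iqrBounds(vals)
	fmt.Printf("fences: [%.3f, %.3f]\n", lower, upper)
	fmt.Println(withinBounds(vals, lower, upper))
}
```

The two fences can then serve as the Comparando values of a lower-bound and an upper-bound filter on the dataframe.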

One of the data transforming tasks that you may need to perform is to normalize or standardize the data. Normalizing the data involves scaling the values of each column to a range between 0 and 1. Standardizing the data involves scaling the values of each column to have a mean of 0 and a standard deviation of 1. Normalizing or standardizing the data can help improve the performance of machine learning algorithms, especially those that are sensitive to the scale of the features, such as k-nearest neighbors, support vector machines, or models trained with gradient descent (including logistic regression and neural networks).

As of this writing, Gota has no built-in normalization or standardization method, but the series package provides the statistics you need: Min, Max, Mean, and StdDev. You can read a column as a []float64 with the Float method, rescale the values, and write the result back with the Mutate method of the dataframe, which replaces the column that has the same name. Mutate accepts one series per call, so you process the columns in a loop. Here is an example of how to normalize or standardize the numeric columns of the iris dataset:

// Print the original dataframe
fmt.Println(df)

numericCols := []string{"sepal_length", "sepal_width", "petal_length", "petal_width"}

// Normalize the numeric columns to the range [0, 1] (min-max scaling)
normalized := df
for _, name := range numericCols {
    col := df.Col(name)
    lo, hi := col.Min(), col.Max()
    vals := col.Float()
    for i := range vals {
        vals[i] = (vals[i] - lo) / (hi - lo)
    }
    normalized = normalized.Mutate(series.New(vals, series.Float, name))
}

// Print the normalized dataframe
fmt.Println(normalized)

// Standardize the numeric columns to mean 0 and standard deviation 1 (z-scores)
standardized := df
for _, name := range numericCols {
    col := df.Col(name)
    mean, std := col.Mean(), col.StdDev()
    vals := col.Float()
    for i := range vals {
        vals[i] = (vals[i] - mean) / std
    }
    standardized = standardized.Mutate(series.New(vals, series.Float, name))
}

// Print the standardized dataframe
fmt.Println(standardized)
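
Written out in plain Go, the two transformations are short formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below is a library-independent illustration. The helper names minMaxScale and zScore are hypothetical, the standard deviation is the sample version (divisor n-1), and mapping a constant column to zeros is a choice made for this sketch to avoid dividing by zero.

```go
package main

import (
	"fmt"
	"math"
)

// minMaxScale maps vals linearly onto [0, 1].
// A constant column has no spread, so it is mapped to all zeros.
func minMaxScale(vals []float64) []float64 {
	out := make([]float64, len(vals))
	if len(vals) == 0 {
		return out
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		lo = math.Min(lo, v)
		hi = math.Max(hi, v)
	}
	if hi == lo {
		return out
	}
	for i, v := range vals {
		out[i] = (v - lo) / (hi - lo)
	}
	return out
}

// zScore rescales vals to mean 0 and sample standard deviation 1.
// Columns with fewer than two values or zero spread are mapped to all zeros.
func zScore(vals []float64) []float64 {
	out := make([]float64, len(vals))
	n := float64(len(vals))
	if n < 2 {
		return out
	}
	sum := 0.0
	for _, v := range vals {
		sum += v
	}
	mean := sum / n
	ss := 0.0
	for _, v := range vals {
		ss += (v - mean) * (v - mean)
	}
	std := math.Sqrt(ss / (n - 1))
	if std == 0 {
		return out
	}
	for i, v := range vals {
		out[i] = (v - mean) / std
	}
	return out
}

func main() {
	vals := []float64{2, 4, 6}
	fmt.Println(minMaxScale(vals)) // prints [0 0.5 1]
	fmt.Println(zScore(vals))      // prints [-1 0 1]
}
```

When you later split the data into training and test sets, compute the minimum, maximum, mean, and standard deviation on the training set only and reuse them for the test set.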

Another data transforming task that you may need to perform is to encode the categorical data. Encoding the categorical data involves converting the values of a categorical column into numeric values that can be used by the machine learning algorithm. Most algorithms operate on numbers only, so string labels such as the species names have to be mapped to numeric codes before training.

As of this writing, Gota has no built-in encoder for categorical columns, but you can build one with a map from labels to integer codes. The Records method of a series returns the column values as strings, and the Mutate method of the dataframe writes the encoded column back. Keep in mind that integer codes imply an order between the categories, which some algorithms treat as meaningful. Here is an example of how to encode the species column of the iris dataset:

// Print the original dataframe
fmt.Println(df)

// Encode the species column as integer codes, numbered in order of first appearance
labels := df.Col("species").Records()
codes := make([]int, len(labels))
codeOf := make(map[string]int)
for i, label := range labels {
    if _, seen := codeOf[label]; !seen {
        codeOf[label] = len(codeOf)
    }
    codes[i] = codeOf[label]
}
df = df.Mutate(series.New(codes, series.Int, "species"))

// Print the new dataframe
fmt.Println(df)
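
Integer codes are compact, but they suggest an order between the categories. One-hot encoding is a common alternative that this blog does not otherwise cover: each category becomes its own 0/1 column. The following plain-Go sketch shows the idea under that assumption; the function name oneHot is hypothetical, and the categories are sorted alphabetically so that the column order is reproducible.

```go
package main

import (
	"fmt"
	"sort"
)

// oneHot returns the sorted distinct categories and, for each input label,
// a row with a 1 in the position of its category and 0 elsewhere.
func oneHot(labels []string) ([]string, [][]float64) {
	// Collect the distinct categories.
	index := make(map[string]int)
	for _, l := range labels {
		index[l] = 0
	}
	cats := make([]string, 0, len(index))
	for c := range index {
		cats = append(cats, c)
	}
	sort.Strings(cats)
	// Remember the column position of every category.
	for i, c := range cats {
		index[c] = i
	}
	// Build one indicator row per input label.
	rows := make([][]float64, len(labels))
	for i, l := range labels {
		rows[i] = make([]float64, len(cats))
		rows[i][index[l]] = 1
	}
	return cats, rows
}

func main() {
	cats, rows := oneHot([]string{"setosa", "virginica", "setosa", "versicolor"})
	fmt.Println(cats) // prints [setosa versicolor virginica]
	for _, r := range rows {
		fmt.Println(r)
	}
}
```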

By cleaning and transforming the data, you can make it more suitable and ready for the machine learning algorithm. You can also improve the quality and accuracy of the data, and avoid any potential errors or issues. In the next section, you will learn how to use Gonum/plot to create and customize various types of charts for data visualization.

3. Data Visualization with Gonum/plot

Data visualization is the process of creating graphical representations of data to communicate its insights and patterns. Data visualization can help you explore, analyze, and present your data in an effective and engaging way. In this section, you will learn how to use Gonum/plot, a plotting library for Golang, to create and customize various types of charts for data visualization.

Gonum/plot is a library that provides a simple and flexible API for creating and manipulating plots in Golang. It supports common plot types such as scatter plots, line plots, bar charts, histograms, and box plots. It also allows you to customize the appearance and behavior of the plots, such as the title, labels, legend, axes, colors, markers, and fonts.

To use Gonum/plot, you need to import the gonum.org/v1/plot package in your Golang code. You also need to import the subpackages for the specific plot types that you want to use, such as gonum.org/v1/plot/plotter, gonum.org/v1/plot/plotutil, or gonum.org/v1/plot/vg. Here is an example of how to import Gonum/plot in your code:

package main

import (
    "fmt"
    "gonum.org/v1/plot"
    "gonum.org/v1/plot/plotter"
    "gonum.org/v1/plot/plotutil"
    "gonum.org/v1/plot/vg"
)

In the next two subsections, you will learn how to use Gonum/plot to create and customize plots, and how to plot different types of charts. You will use the same iris dataset that you used in the previous section, after cleaning and transforming it with Gota. You can assume that the dataframe is stored in a variable called df.

3.1. Creating and Customizing Plots

To create and customize plots with Gonum/plot, you need to follow three main steps:

Create a plot object that holds the general settings and options for the plot, such as the title, labels, legend, and axes.
Add one or more plotter objects that represent the specific plot types and data that you want to display, such as scatter plots, line plots, bar charts, histograms, and box plots.
Save the plot object to an image file with its Save method, using the vg package to specify the width and height.

In this subsection, you will learn how to perform each of these steps using Gonum/plot. You will also learn how to customize the appearance and behavior of the plots using various methods and options from the plot and plotter packages.

The first step is to create a plot object using the plot.New function from the plot package. This function returns a pointer to a plot.Plot struct, which holds the general settings and options for the plot. You can access and modify these settings and options using the fields and methods of the plot.Plot struct. Here is an example of how to create a plot object and set its title and labels:

// Create a new plot object
// (plot.New returns only *plot.Plot in gonum/plot v0.9.0 and later;
// older releases also returned an error that had to be checked)
p := plot.New()

// Set the title and labels of the plot
p.Title.Text = "Iris Dataset"
p.X.Label.Text = "Sepal Length"
p.Y.Label.Text = "Sepal Width"

The second step is to add one or more plotter objects to the plot object using the Add method of the plot.Plot struct. A plotter object is any value that implements the plot.Plotter interface, whose Plot method draws the data on a canvas. The plotter package provides various types of plotter objects, such as scatter plots, line plots, bar charts, histograms, and box plots. You can create these plotter objects using the constructors and functions from the plotter package. Here is an example of how to create and add a scatter plot object to the plot object:

// Create a slice of XY values from the dataframe
pts := make(plotter.XYs, df.Nrow())
for i := 0; i < df.Nrow(); i++ {
    pts[i].X = df.Elem(i, 0).Float()
    pts[i].Y = df.Elem(i, 1).Float()
}

// Create a new scatter plot object
s, err := plotter.NewScatter(pts)
if err != nil {
    log.Fatal(err)
}

// Add the scatter plot object to the plot object
p.Add(s)

The third step is to save the plot object to a file. You can use the Save method of the plot.Plot struct, which takes the width and height as vg.Length values and chooses the output format from the file extension, such as .png, .jpg, .tif, .pdf, .svg, or .eps. Gonum/plot does not include a function for displaying a plot on the screen, so the usual workflow is to save the file and open it in an image viewer. Here is an example of how to save the plot object:

// Save the plot object to a PNG file with a width of 10 cm and a height of 8 cm
if err := p.Save(10*vg.Centimeter, 8*vg.Centimeter, "iris.png"); err != nil {
    log.Fatal(err)
}

By following these three steps, you can create and customize plots with Gonum/plot. You can also use various methods and options from the plot and plotter packages to further customize the appearance and behavior of the plots, such as the colors, markers, fonts, legends, axes, and grids. In the next subsection, you will learn how to plot different types of charts with Gonum/plot and see some examples of these customizations.

3.2. Plotting Different Types of Charts

In the previous subsection, you learned how to create and customize plots with Gonum/plot. In this subsection, you will learn how to plot different types of charts with Gonum/plot and see some examples of these customizations. You will use the same iris dataset that you used in the previous section, after cleaning and transforming it with Gota. You can assume that the dataframe is stored in a variable called df.

One of the most common types of charts that you may want to plot is a scatter plot. A scatter plot shows the relationship between two variables by plotting them as points on a Cartesian plane, which helps you spot correlations, trends, and outliers. To plot a scatter plot with Gonum/plot, you can use the plotter.NewScatter function from the plotter package. This function takes a slice of XY values as an argument and returns a *plotter.Scatter, which implements the plot.Plotter interface. You can customize its appearance through the Color, Shape, and Radius fields of its embedded draw.GlyphStyle. The example needs two extra imports: image/color from the standard library and gonum.org/v1/plot/vg/draw. Here is an example of how to plot a scatter plot of the sepal length and sepal width columns of the iris dataset, and how to customize its color and shape:

// Create a slice of XY values from the dataframe
pts := make(plotter.XYs, df.Nrow())
for i := 0; i < df.Nrow(); i++ {
    pts[i].X = df.Elem(i, 0).Float()
    pts[i].Y = df.Elem(i, 1).Float()
}

// Create a new scatter plot object
s, err := plotter.NewScatter(pts)
if err != nil {
    log.Fatal(err)
}

// Set the color and shape of the scatter plot
s.Color = color.RGBA{R: 255, G: 0, B: 0, A: 255} // red color
s.Shape = draw.CircleGlyph{} // circle shape

// Add the scatter plot object to the plot object
p.Add(s)
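
A natural next step is to draw one scatter series per species, each in its own color. Before any plotting call, that requires splitting the points by label. The sketch below shows only this grouping step in plain Go; the XY type is a local stand-in with the same two fields as a plotting point, and the function name groupByLabel and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// XY is a local stand-in for a single plotting point (an X and a Y value).
type XY struct{ X, Y float64 }

// groupByLabel splits the parallel slices xs, ys, and labels into one point
// list per label. It returns an error if the slices differ in length.
func groupByLabel(xs, ys []float64, labels []string) (map[string][]XY, error) {
	if len(xs) != len(ys) || len(xs) != len(labels) {
		return nil, fmt.Errorf("length mismatch: %d, %d, %d", len(xs), len(ys), len(labels))
	}
	groups := make(map[string][]XY)
	for i, l := range labels {
		groups[l] = append(groups[l], XY{X: xs[i], Y: ys[i]})
	}
	return groups, nil
}

func main() {
	// Invented values standing in for the sepal length, sepal width,
	// and species columns of the dataframe.
	xs := []float64{5.1, 7.0, 4.9, 6.3}
	ys := []float64{3.5, 3.2, 3.0, 3.3}
	labels := []string{"setosa", "versicolor", "setosa", "virginica"}

	groups, err := groupByLabel(xs, ys, labels)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Iterate in sorted order so that the output is stable.
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, groups[name])
	}
}
```

Each group can then be copied into its own plotter.XYs value, passed to plotter.NewScatter, and given a distinct color before it is added to the plot.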

Another common type of chart that you may want to plot is a line plot. A line plot shows how one variable changes over time or with another variable by connecting the data points with a line, which helps you see trends, patterns, and variation. To plot a line plot with Gonum/plot, you can use the plotter.NewLine function from the plotter package. This function takes a slice of XY values as an argument and returns a *plotter.Line, which implements the plot.Plotter interface. You can customize its appearance through the Color, Width, and Dashes fields of its embedded draw.LineStyle. Here is an example of how to plot a line plot of the petal length and petal width columns of the iris dataset, and how to customize its color and width:

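// Create a slice of XY values from the dataframe
pts := make(plotter.XYs, df.Nrow())
for i := 0; i < df.Nrow(); i++ {
    pts[i].X = df.Elem(i, 2).Float()
    pts[i].Y = df.Elem(i, 3).Float()
}

// A line connects the points in slice order, so sort them by X first
// (requires the "sort" package from the standard library)
sort.Slice(pts, func(a, b int) bool { return pts[a].X < pts[b].X })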

// Create a new line plot object
l, err := plotter.NewLine(pts)
if err != nil {
    log.Fatal(err)
}

// Set the color and width of the line plot
l.Color = color.RGBA{R: 0, G: 0, B: 255, A: 255} // blue color
l.Width = vg.Points(2) // 2 points width

// Add the line plot object to the plot object
p.Add(l)

A third common type of chart that you may want to plot is a bar chart. A bar chart shows the frequency or proportion of a variable by using rectangular bars of different heights or lengths, which helps you compare categories. To plot a bar chart with Gonum/plot, you can use the plotter.NewBarChart function from the plotter package. This function takes a plotter.Valuer (for example a plotter.Values slice with one value per bar) and a bar width of type vg.Length as arguments and returns a *plotter.BarChart, which implements the plot.Plotter interface. You can customize the appearance and behavior of the bar chart using the fields of the plotter.BarChart struct, such as the Color, LineStyle, and Horizontal fields. Here is an example of how to plot a bar chart of the species column of the iris dataset, and how to customize its color and orientation:

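// Count the rows for each species (one bar per species)
counts := make(map[string]float64)
for _, label := range df.Col("species").Records() {
    counts[label]++
}

// Sort the labels so that the bars appear in a stable order
// (requires the "sort" package from the standard library)
labels := make([]string, 0, len(counts))
for label := range counts {
    labels = append(labels, label)
}
sort.Strings(labels)

// Collect the counts in label order
vals := make(plotter.Values, len(labels))
for i, label := range labels {
    vals[i] = counts[label]
}

// Create a new bar chart object with bars 0.5 cm wide
b, err := plotter.NewBarChart(vals, 0.5*vg.Centimeter)
if err != nil {
    log.Fatal(err)
}

// Set the color and orientation of the bar chart
b.Color = color.RGBA{R: 0, G: 255, B: 0, A: 255} // green color
b.Horizontal = true // horizontal orientation

// Add the bar chart object to the plot object and label the bars
p.Add(b)
p.NominalY(labels...)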

By plotting different types of charts with Gonum/plot, you can create and customize various graphical representations of your data. You can also use other types of charts from the plotter package, such as histograms and box plots, depending on your data and analysis. The next section summarizes the main points and takeaways.
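
For a numeric column, a histogram plays the role that the bar chart plays for categories: the value range is cut into bins, and each bar shows how many values fall into one bin. The plotter package has its own histogram constructor that does the binning for you, so the plain-Go sketch below is only meant to show the idea, not that implementation. The function name binCounts, the equal-width binning, and the sample values are assumptions made for this example.

```go
package main

import "fmt"

// binCounts splits the range [min, max] of vals into n equal-width bins and
// returns how many values fall into each bin. The maximum value is counted
// in the last bin. It returns nil if n is not positive.
func binCounts(vals []float64, n int) []int {
	if n <= 0 {
		return nil
	}
	counts := make([]int, n)
	if len(vals) == 0 {
		return counts
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	// A constant column has zero width; put everything into the first bin.
	if hi == lo {
		counts[0] = len(vals)
		return counts
	}
	width := (hi - lo) / float64(n)
	for _, v := range vals {
		i := int((v - lo) / width)
		if i >= n {
			i = n - 1 // the maximum belongs to the last bin
		}
		counts[i]++
	}
	return counts
}

func main() {
	// Invented petal lengths, not taken from the real dataset.
	vals := []float64{1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0}
	fmt.Println(binCounts(vals, 4)) // prints [2 2 2 2]
}
```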

4. Conclusion

In this blog, you learned how to use Golang packages and tools for data preprocessing and visualization. You learned how to use Gota, a data frame library for Golang, to perform data manipulation and analysis. You learned how to use Gonum/plot, a plotting library for Golang, to create and customize various types of charts. You also learned how to apply these packages and tools to a sample dataset of iris flowers, and prepare your data for machine learning.

By following this blog, you gained some practical skills and knowledge on how to use Golang for data science tasks. You also saw how Golang can offer a fast, simple, and reliable way to work with data and create graphical representations. You can use these skills and knowledge to explore, analyze, and present your own data in an effective and engaging way.

We hope you enjoyed this blog and learned something new and useful. If you have any questions, comments, or feedback, please feel free to leave them in the comment section below. We would love to hear from you and improve our content. Thank you for reading and happy coding!