#### Introduction

For beginners in data science and machine learning, a common problem is getting hold of a good, clean data set for quick practice. Regression and classification are the two most common supervised machine learning tasks that a data science practitioner has to deal with. It is not always possible to find a well-structured data set for practicing the various algorithms one learns.

It would be great to have a convenient function to quickly generate synthetic examples of regression and classification problems with controllable size, complexity, and randomness.

Now, Scikit-Learn, the leading machine learning library in Python, does provide random data set generation capability for regression and classification problems. However, the user has no easy control over the underlying mechanics of the data generation, and the regression outputs are not generated from a function the user specifies; they come from a randomly drawn linear model. While this may be sufficient for many problems, one may often require a controllable way to generate these problems based on a well-defined function (involving linear, nonlinear, rational, or even transcendental terms).

In this article, we go over sample code that accomplishes this using SymPy, the great symbolic computation package from the Python ecosystem. You can download the Jupyter notebooks from my GitHub repository here and modify the functions further as per your need.

#### Scikit-Learn’s dataset generation functions and limitations

Scikit-learn provides two utility functions for generating random regression and classification problems. They are listed under the sklearn.datasets API and are as follows:

sklearn.datasets.make_regression: Generate a random regression problem. The output is generated by applying a (potentially biased) random linear regression model with a definite number of nonzero regressors to the previously generated input and some Gaussian centered noise with some adjustable scale.

sklearn.datasets.make_classification: Generate a random n-class classification problem. This initially creates clusters of points normally distributed about vertices of an n-dimensional hypercube and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

These are very good random problem generation utilities, but they do not allow the user to create data based on an underlying deterministic function. However, one may want to generate datasets for controllable analytics or machine learning experiments. For example, we may want to evaluate the efficacy of various kernelized SVM classifiers on datasets with increasingly complex separators (linear to non-linear), or demonstrate the limitations of linear models on regression datasets generated by rational or transcendental functions. It is difficult to do so with these scikit-learn functions. Moreover, the user may want to simply input a symbolic expression as the generating function (or as the logical separator for a classification task). There is no easy way to do so using only scikit-learn’s utilities, and one has to write a new function for each new instance of the experiment.
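For reference, the two scikit-learn generators described above can be called as in the minimal sketch below. The sample sizes, feature counts, and noise levels are illustrative choices, not values from the original notebook.

```python
from sklearn.datasets import make_classification, make_regression

# Random linear regression problem: 200 samples, 3 features, 2 informative.
# The coefficients of the linear model are drawn at random by scikit-learn.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=3, n_informative=2, noise=5.0, random_state=0
)

# Random binary classification problem with 4 features, 2 of them informative.
# flip_y randomly reassigns a fraction of the labels to add noise.
X_clf, y_clf = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    n_classes=2, flip_y=0.05, random_state=0
)

print(X_reg.shape, y_reg.shape)
print(X_clf.shape, y_clf.shape)
```

Note that neither call accepts a formula: the relationship between inputs and outputs is chosen by the library, which is the limitation discussed above.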

#### SymPy to the rescue!

To solve the problem of symbolic expression input, one can take advantage of the amazing Python package SymPy, which supports parsing, rendering, and evaluation of symbolic mathematical expressions up to a fairly high level of sophistication. More details can be found on its website. Here are a few basic examples:
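A minimal sketch of the SymPy basics that matter for data generation: building an expression, parsing one from a string, and evaluating it. The expression and the numeric values are illustrative only.

```python
import numpy as np
from sympy import lambdify, latex, sin, symbols, sympify

x1, x2 = symbols("x1 x2")

# Build an expression directly from symbols ...
expr = x1**2 + 3*x2 - sin(x1)

# ... or parse the same expression from a plain Python string.
parsed = sympify("x1**2 + 3*x2 - sin(x1)")
same = (expr == parsed)

# Evaluate at a single point by substituting values for the symbols.
value = expr.subs({x1: 2, x2: 1}).evalf()

# Convert the expression into a NumPy function for fast array evaluation.
f = lambdify((x1, x2), expr, modules="numpy")
values = f(np.array([0.0, 2.0]), np.array([1.0, 1.0]))

# Render the expression as a LaTeX string.
print(latex(expr))
```

`subs` is convenient for single points, while `lambdify` is usually the better choice when the expression has to be evaluated on thousands of samples.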

#### Random regression and classification dataset generation using symbolic expression supplied by user

The details of the code can be found in my GitHub repo, but the idea is simple. We have a symbolize function, which converts a Python input string into a SymPy symbol object, and an eval_multinomial() function, which takes a SymPy symbolic expression and a vals argument (a list, dictionary, or tuple) and internally creates (symbol, value) pairs to evaluate the mathematical expression. The main utility functions are as follows:
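The two helpers could look roughly like the sketch below. This is not the repository's code: the names `symbolize_sketch` and `eval_multinomial_sketch`, and the rule for ordering symbols, are assumptions made for illustration.

```python
from sympy import Symbol, sympify


def symbolize_sketch(s):
    """Parse a string such as 'x1**2 + 3*x2' into a SymPy expression.

    sympify evaluates the string, so it should not be fed untrusted input.
    """
    return sympify(s)


def eval_multinomial_sketch(expr, vals):
    """Evaluate expr at one point and return a float.

    vals is either a dict keyed by symbol name (or Symbol), or a list/tuple
    of numbers matched to the symbols in the order x1, x2, ..., x10, ...
    """
    # Sort by (length, name) so that x10 comes after x9 rather than after x1.
    syms = sorted(expr.free_symbols, key=lambda s: (len(s.name), s.name))
    if isinstance(vals, dict):
        pairs = [(Symbol(k) if isinstance(k, str) else k, v)
                 for k, v in vals.items()]
    else:
        if len(vals) != len(syms):
            raise ValueError("expected %d values, got %d" % (len(syms), len(vals)))
        pairs = list(zip(syms, vals))
    return float(expr.subs(pairs))


expr = symbolize_sketch("x1**2 + 3*x2")
print(eval_multinomial_sketch(expr, [2, 1]))
print(eval_multinomial_sketch(expr, {"x1": 2, "x2": 1}))
```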

The gen_classification_symbolic() function generates classification samples based on a symbolic expression. It calculates the output of the symbolic expression at randomly generated (Gaussian distribution) points and assigns a binary class label based on the sign.
m: The symbolic expression. It must use x1, x2, etc. as variables and regular Python arithmetic symbols.
n_samples: Number of samples to be generated.
n_features: Number of independent variables. This is automatically inferred from the symbolic expression, so this input is ignored when a symbolic expression is supplied. If no symbolic expression is supplied, a default simple polynomial can be invoked to generate classification samples with n_features.
flip_y: Probability of flipping the classification labels randomly. A higher value introduces more noise and makes the classification problem harder.
Return: Returns a NumPy ndarray with dimension (n_samples, n_features+1). The last column is the response vector.
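The behaviour described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the repository's function: the name `gen_classification_sketch`, the `seed` parameter, the use of `lambdify`, and the convention that a positive value maps to class 1 are all assumptions.

```python
import numpy as np
from sympy import lambdify, sympify


def gen_classification_sketch(m, n_samples=100, flip_y=0.0, seed=None):
    """Label Gaussian random points by the sign of a symbolic expression."""
    rng = np.random.default_rng(seed)
    expr = sympify(m)
    # Infer the features from the expression, ordered x1, x2, ...
    syms = sorted(expr.free_symbols, key=lambda s: (len(s.name), s.name))
    f = lambdify(syms, expr, modules="numpy")

    # Standard normal inputs, one column per symbol.
    X = rng.normal(size=(n_samples, len(syms)))
    scores = np.asarray(f(*X.T), dtype=float)

    # Assumed convention: positive expression value -> class 1, else class 0.
    y = (scores > 0).astype(int)

    # Flip each label independently with probability flip_y.
    flip = rng.random(n_samples) < flip_y
    y[flip] = 1 - y[flip]

    # Features first, response in the last column.
    return np.column_stack([X, y])


# Points outside the unit circle are class 1, points inside are class 0.
data = gen_classification_sketch("x1**2 + x2**2 - 1", n_samples=50, seed=0)
print(data.shape)
```

With `flip_y=0.0` the classes are perfectly separated by the curve where the expression equals zero, which makes the shape of the separator directly controllable.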

The gen_regression_symbolic() function generates regression samples based on a symbolic expression. It calculates the output of the symbolic expression at randomly generated (Gaussian distribution) points.
m: The symbolic expression. It must use x1, x2, etc. as variables and regular Python arithmetic symbols.
n_samples: Number of samples to be generated.
n_features: Number of variables. This is automatically inferred from the symbolic expression, so this input is ignored when a symbolic expression is supplied. If no symbolic expression is supplied, a default simple polynomial can be invoked to generate regression samples with n_features.
noise: Magnitude of noise (default Gaussian) to be introduced (added to the output).
noise_dist: Type of the probability distribution of the noise signal. Currently supports: Normal, Uniform, Beta, Gamma, Poisson, Laplace.
Return: Returns a NumPy ndarray with dimension (n_samples, n_features+1). The last column is the response vector.
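A comparable sketch for the regression case is shown below. Again, this is not the repository's code: the name `gen_regression_sketch`, the `seed` parameter, the restriction to three noise distributions, and the parameters chosen for each distribution are assumptions made to keep the example short.

```python
import numpy as np
from sympy import lambdify, sympify


def gen_regression_sketch(m, n_samples=100, noise=0.0,
                          noise_dist="normal", seed=None):
    """Evaluate a symbolic expression at Gaussian points and add noise."""
    rng = np.random.default_rng(seed)
    expr = sympify(m)
    syms = sorted(expr.free_symbols, key=lambda s: (len(s.name), s.name))
    f = lambdify(syms, expr, modules="numpy")

    X = rng.normal(size=(n_samples, len(syms)))
    y = np.asarray(f(*X.T), dtype=float)

    # Only a subset of distributions is sketched here; the parameters of
    # each one are illustrative choices.
    samplers = {
        "normal": lambda: rng.normal(0.0, 1.0, n_samples),
        "uniform": lambda: rng.uniform(-1.0, 1.0, n_samples),
        "laplace": lambda: rng.laplace(0.0, 1.0, n_samples),
    }
    if noise_dist not in samplers:
        raise ValueError("unsupported noise_dist: %r" % noise_dist)

    # noise scales the magnitude of the additive noise term.
    y = y + noise * samplers[noise_dist]()
    return np.column_stack([X, y])


clean = gen_regression_sketch("x1**2 + 3*sin(x2)", n_samples=20, seed=1)
noisy = gen_regression_sketch("x1**2 + 3*sin(x2)", n_samples=20,
                              noise=0.5, noise_dist="laplace", seed=1)
print(clean.shape, noisy.shape)
```

With `noise=0.0` the response is an exact function of the inputs, so any error a model makes on such data comes from the model itself rather than from the data.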

#### Examples

Here are a few code snippets, with the resulting data sets visualized.

Classification Samples

Regression Samples

#### Not limited to a single symbolic variable

Although the above examples are shown using one or two variables, the functions are not limited by the number of variables. In fact, the internal methods are coded to automatically infer the number of independent variables from your symbolic expression input and set up the problem accordingly. Here is an example where n_features is not even given by the user, but the function infers the number of features to be 3 from the symbolic expression.
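One way to infer the feature count is through SymPy's `free_symbols` attribute, as in the sketch below. Whether the repository does it exactly this way is an assumption, and the expression is illustrative.

```python
from sympy import sympify

# A three-variable expression mixing polynomial, transcendental,
# and rational terms.
expr = sympify("x1**2 + 10*sin(x2) - x3/(1 + x1**2)")

# free_symbols is an unordered set, so sort the names to fix column order.
feature_names = sorted((s.name for s in expr.free_symbols),
                       key=lambda n: (len(n), n))
n_features = len(feature_names)

print(feature_names, n_features)
```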

#### Summary and future expansions

The basic code is set up to mimic scikit-learn’s dataset generation utility functions as closely as possible. One can easily extend it to provide a Pandas DataFrame output, or a CSV file output for use in any other programming environment and for saving the data to the local disk. Up to a certain degree of complexity, it is also possible to provide the user with a string representation of the LaTeX formula for the symbolic expression. Readers are encouraged to send their comments or raise them in the GitHub repo.
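The extensions mentioned above could be sketched as follows. The small array is a hand-made stand-in for the generators' output, the column naming is an assumption, and the CSV is written to an in-memory buffer (a file path could be passed to `to_csv` instead).

```python
import io

import numpy as np
import pandas as pd
from sympy import latex, sympify

expr = sympify("x1**2 + 3*x2")

# Stand-in for generator output: two feature columns plus the response.
data = np.array([[1.0, 2.0, 7.0],
                 [0.0, 1.0, 3.0]])

# Name the feature columns after the symbols and the last column "y".
names = sorted((s.name for s in expr.free_symbols),
               key=lambda n: (len(n), n))
df = pd.DataFrame(data, columns=names + ["y"])

# Write CSV text to a buffer; replace buf with a path to save to disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()

# LaTeX string for the generating formula.
formula = latex(expr)

print(csv_text)
print(formula)
```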

If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. You can also check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and for machine learning resources. If you are, like me, passionate about machine learning and data science, please feel free to add me on LinkedIn or follow me on Twitter.