In-depth exploration of data collection processes
Some of my most popular repositories on GitHub have been about data collection, either through web scraping or through an Application Programming Interface (API). My approach had always been to find a source for the data and start fetching it right away. After collecting the data, I would save it, draw insights, and that would be it.
But what if you want to share the data? What if someone is looking for this dataset and doesn’t know where to find it? What if they have the dataset but don’t know what each column means or where to look for more information? These questions arise because data sharing and usability are important, yet few people make the effort to keep their datasets reproducible and easily accessible.
This is where the best practices of data collection come in. The metadata that accompanies your data is almost as important as the data itself, because without it your data might be useless. Let’s explore in depth what this means and what everyone should do to get the process of data collection right!
Start by figuring out what to collect
The first step, as always, is to look for data that already exists. Someone might have already collected the same or similar data for their own problem. If you find such data, use it (if they have made it available) and properly cite your source wherever and whenever you use that data for any analysis. That’s it!
However, if you don’t find the data you need, you’ll have to collect it yourself. It could be a list of Wikipedia pages that you scrape from the website, repository information that you fetch for your GitHub account through the GitHub API, or data collected from a sensor. The things that you can collect are almost limitless.
Collect the data
Whatever you decided to collect, start collecting your data. You can use BeautifulSoup to extract information from HTML pages, access APIs as needed using their documentation or maybe create an Android application that reads data from a sensor and saves it to a CSV file.
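To make the scraping case concrete, here is a minimal sketch that pulls link titles out of an HTML snippet and writes them as CSV. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without installing anything; BeautifulSoup would do the same job with less code. The HTML snippet, the LinkCollector class and the links_to_csv function are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML standing in for a page you downloaded.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Metadata">Metadata</a></li>
  <li><a href="/wiki/Provenance">Provenance</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_to_csv(html):
    """Parse the HTML and return the collected links as CSV text."""
    parser = LinkCollector()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "href"])  # header row, so columns are named
    writer.writerows(parser.links)
    return buffer.getvalue()


csv_text = links_to_csv(SAMPLE_HTML)
print(csv_text)
```

In a real project you would write the CSV to a file instead of an in-memory buffer, but the shape of the work stays the same: fetch, extract, save.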
Once you have the data you want, you might want to share your work with others. You would want others to understand what you collected and why you collected it, and maybe to use your data while properly citing your work. It then becomes essential to have the data in a proper format that others can understand and use.
Data about your data — Metadata
Now, I’ll tell you about something that we always use but often overlook as an essential part of the data. Yes, I’m talking about the metadata: the information that tells you what each column means, what the units of measurement are, when the data was collected, and a lot more.
Let’s understand the importance of metadata using an example. The UCI Machine Learning Repository includes a long list of datasets that you can use for your analysis and prediction. Let’s pick the Breast Cancer Data Set. This is how the dataset looks:
Just by looking at the data with no additional information, we cannot figure out what each column means, let alone do any analysis on it. But once I show the picture below with the column descriptions, we can use the dataset, extract information, perform exploratory analysis and make predictions.
This is why information about the data is really important. This one essential step can make or break your dataset.
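One lightweight way to ship this information with your data is a data dictionary: one entry per column with a name, a unit and a description. The sketch below is only an illustration of the idea. The sensor readings, the column names and the load_with_dictionary helper are hypothetical and do not come from the UCI dataset.

```python
import csv
import io

# Hypothetical raw file with no header row: the values alone are ambiguous.
RAW_CSV = (
    "2020-05-17T14:30:00+00:00,21.4,48\n"
    "2020-05-17T14:35:00+00:00,21.9,47\n"
)

# Hypothetical data dictionary: one entry per column, in file order.
DATA_DICTIONARY = [
    {"name": "recorded_at", "unit": "ISO 8601 timestamp",
     "description": "When the reading was taken"},
    {"name": "temperature", "unit": "degrees Celsius",
     "description": "Air temperature at the sensor"},
    {"name": "humidity", "unit": "percent",
     "description": "Relative humidity at the sensor"},
]


def load_with_dictionary(raw_csv, dictionary):
    """Return the rows as dicts keyed by the column names in the dictionary."""
    names = [column["name"] for column in dictionary]
    rows = []
    for values in csv.reader(io.StringIO(raw_csv)):
        if len(values) != len(names):
            raise ValueError(
                f"expected {len(names)} columns, got {len(values)}"
            )
        rows.append(dict(zip(names, values)))
    return rows


rows = load_with_dictionary(RAW_CSV, DATA_DICTIONARY)
for column in DATA_DICTIONARY:
    print(f"{column['name']} ({column['unit']}): {column['description']}")
print(rows[0])
```

The column check is deliberate: if the file and its description ever drift apart, you want to find out while loading, not halfway through an analysis.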
But what should we collect?
If you think about it, you’ll find that there are a lot of things you can collect as metadata, such as the date of collection, location, column descriptions, and more. This is why metadata standards exist: you can pick one so that others get complete information in a familiar form. Some of the common ones are as follows:
Dublin Core
The Dublin Core defines a set of elements that you specify about the data, such as Date Created, Creator, and other information.
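As a rough sketch, a Dublin Core description of a dataset can be written out as XML with a few lines of Python. The element names used here (title, creator, date, description, format, rights) belong to the Dublin Core element set and the namespace is the published one, but the values, the wrapping metadata element and the to_dublin_core_xml helper are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical values describing a made-up dataset.
record = {
    "title": "Room temperature readings",
    "creator": "Jane Doe",
    "date": "2020-05-17",
    "description": "Temperature and humidity sampled every five minutes.",
    "format": "text/csv",
    "rights": "CC BY 4.0",
}


def to_dublin_core_xml(fields):
    """Serialize a dict of Dublin Core elements as an XML string."""
    # The wrapper element name is arbitrary; Dublin Core does not fix one.
    root = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


dc_xml = to_dublin_core_xml(record)
print(dc_xml)
```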
Metadata Encoding and Transmission Standard
The Metadata Encoding and Transmission Standard (METS) is a standard for encoding descriptive, administrative and structural metadata about an object, expressed in eXtensible Markup Language (XML).
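To give a feel for what such a document looks like, here is a bare-bones sketch that builds a METS skeleton with a descriptive section, a file section and a structural map. Treat it as an outline only: I have not validated it against the official METS schema, and the IDs, title, file URL and build_mets helper are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS_NS), ("xlink", XLINK_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)


def m(tag):
    """Return the tag name qualified with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(title, file_url):
    """Build a minimal METS tree describing a single CSV file."""
    root = ET.Element(m("mets"))

    # Descriptive metadata: a Dublin Core title wrapped inside the document.
    dmd = ET.SubElement(root, m("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, m("xmlData"))
    ET.SubElement(xml_data, f"{{{DC_NS}}}title").text = title

    # File section: where the actual data file lives.
    file_sec = ET.SubElement(root, m("fileSec"))
    group = ET.SubElement(file_sec, m("fileGrp"))
    file_el = ET.SubElement(
        group, m("file"), {"ID": "FILE1", "MIMETYPE": "text/csv"}
    )
    ET.SubElement(
        file_el,
        m("FLocat"),
        {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": file_url},
    )

    # Structural map: ties the file to its descriptive metadata.
    struct = ET.SubElement(root, m("structMap"))
    div = ET.SubElement(struct, m("div"), {"TYPE": "dataset", "DMDID": "DMD1"})
    ET.SubElement(div, m("fptr"), {"FILEID": "FILE1"})
    return root


# Hypothetical title and location.
mets_root = build_mets("Room temperature readings", "https://example.org/readings.csv")
print(ET.tostring(mets_root, encoding="unicode"))
```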
International Organization for Standardization (ISO)
The ISO publishes standards that are followed worldwide. The standards vary by usage and area. For example, ISO 8601 is the standard way to represent time: it specifies how to write a date and time in a commonly understood pattern.
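In Python, you rarely need to format such timestamps by hand. A small sketch with the standard datetime module, using a made-up collection time:

```python
from datetime import datetime, timezone

# Hypothetical moment at which a reading was collected, in UTC.
collected_at = datetime(2020, 5, 17, 14, 30, 0, tzinfo=timezone.utc)

# isoformat() writes the timestamp in ISO 8601 form, including the UTC offset.
stamp = collected_at.isoformat()
print(stamp)  # 2020-05-17T14:30:00+00:00

# fromisoformat() reads the same string back into an equivalent datetime.
parsed = datetime.fromisoformat(stamp)
print(parsed == collected_at)  # True

# Dates without a time component follow the same pattern.
print(collected_at.date().isoformat())  # 2020-05-17
```

Storing the offset along with the time matters: a timestamp without a time zone leaves the reader guessing where and when the data was really recorded.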
There are other standards as well, and which one to use depends on the data you’re collecting. The general point when collecting metadata is that if someone, today or sometime in the future, decides to work on your data, the data and metadata together should be sufficient to describe everything.
However, to achieve this, there is another essential piece of information that goes along with metadata: provenance.
Provenance includes information about the process of data collection and any transformations that were made to the data. While collecting data, we keep track of when and how the data was collected, the measuring devices, the process, the data collector, any limitations, and every data processing step (if any was done).
The complete package of data, metadata and provenance keeps the data usable well into the future.
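A simple way to keep track of all this is a small provenance record saved next to the data file. The sketch below is one possible layout, not a standard: the field names, the sample values and the build_provenance helper are assumptions I made up for illustration. The checksum lets anyone confirm that the file they have is the one the record describes.

```python
import hashlib
import json

# Hypothetical contents of the data file being described.
DATA = b"2020-05-17T14:30:00+00:00,21.4,48\n"


def build_provenance(data_bytes, collected_at, collector, method,
                     device, limitations, transformations):
    """Return a provenance record for the given data as a plain dict."""
    return {
        "collected_at": collected_at,        # ISO 8601 timestamp
        "collector": collector,              # who collected the data
        "method": method,                    # how it was collected
        "device": device,                    # measuring device, if any
        "limitations": limitations,          # known caveats
        "transformations": transformations,  # processing steps, in order
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


provenance = build_provenance(
    DATA,
    collected_at="2020-05-17T14:30:00+00:00",
    collector="Jane Doe",
    method="Android application polling a sensor every five minutes",
    device="Hypothetical temperature and humidity sensor",
    limitations=["Indoor readings only"],
    transformations=["Dropped rows with missing humidity values"],
)

# Serialize to JSON so the record can be stored alongside the data file.
provenance_json = json.dumps(provenance, indent=2)
print(provenance_json)
```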
Feel free to share your thoughts, ideas and suggestions. I’d love to hear from you.