When learning data science a lot of people will use sanitized datasets they downloaded from somewhere on the internet, or the data provided as part of a class or book. This is all well and good, but working with “perfect” datasets that are ideally suited to the task prevents them from getting into the habit of checking data for completeness and accuracy.
Out in the real world, while working with data for an employer or client, you will undoubtedly run into issues with data that you will need to check for and fix before being able to do any useful analysis. Here are some of the more common problems I’ve seen:
- Apostrophes – I absolutely hate apostrophes, also know as “single quotes”, because they are part of so many company names (or last names, if you’re Irish) yet so many databases and anayltics programs choke on them. In a CSV you can just search and destroy, but other cases aren’t so easy. And what if the dataset really does include quotes for some reason? You’ll have to find and replace by column rather than en masse.
- Misspellings or multiple spellings – God help the data scientist whose dataset includes both “Lowe’s” (the home improvement company) and “Loews” (the hotel company). You’ll have “lowe’s,” “Lowe’s,” “Lows,” “Loew’s,” “loews” and probably some I’m not even listing. Which is which? The best way to fix is by address, if that’s included in the dataset. If not, good luck.
- Not converting currency – Ever had a client who assumed that dollars were dollars, whether they came from Singapore or the USA? And if you’re forced to convert after the fact, which exchange rate should you use? The one for the date of the transaction, the one for the date it cleared, or something else?
- Different currency formats – Some use a comma to signify thousands, some use periods.
- Different date formats – Is it Month/Date/Year, or is it Date/Month/Year? Depends on who you ask. As with many things this is different outside the US versus inside.
- Using zero for null values – Sometimes a problem, sometimes not. But you have to know the difference. Applying the fix is easy enough, knowing when to do it is the key.
- Assuming a number is really a number - In most analytics software you should treat certain numbers (ZIP codes, for example) as text. Why? Because the number doesn’t represent a count of something, it represents a person, place, or selection. Rule of thumb: if it’s not a quantity, it’s probably not a number.
- Analytics software that only accepts numbers – In RapidMiner, for example, you have to convert binary options (“yes” and “no,” or “male” and “female”) to 1 and 0.