Hidden Treasures of Python

Rarely used libraries and how to use them


Python has so many thousands of libraries that the title of this article could apply to almost all of them, apart from a couple of hundred. Describing them all would probably take a real book library. In this article, though, we’ll get a taste of just a few, designed to solve specific tasks or simply to have fun.

To try out these libraries, we’ll use a dataset from Kaggle: Animal Care and Control Adopted Animals.

import pandas as pd
df = pd.read_csv('animal-data-1.csv')
print('Number of pets:', len(df), '\n')
print(df.columns.tolist())
Output:
Number of pets: 10290 
['id', 'intakedate', 'intakereason', 'istransfer', 'sheltercode', 'identichipnumber', 'animalname', 'breedname', 'basecolour', 'speciesname', 'animalage', 'sexname', 'location', 'movementdate', 'movementtype', 'istrial', 'returndate', 'returnedreason', 'deceaseddate', 'deceasedreason', 'diedoffshelter', 'puttosleep', 'isdoa']

1. Missingno

Library installation: pip install missingno

Missingno is a library dedicated to visualizing missing values in a dataframe. Of course, we could use a seaborn heatmap or a bar plot from any visualization library for this purpose. In that case, however, we would first have to compute the missing values ourselves, for example with df.isnull().sum() for the count in each column, while missingno does everything under the hood. The library offers several types of charts:

  • matrix displays density patterns in data completion for up to 50 columns of a dataframe, and is analogous to the seaborn missing value heatmap. The sparkline on the right also shows the general shape of data completeness by row, highlighting the rows with the maximum and minimum nullity.
  • bar chart shows the nullity of each column as a bar.
  • heatmap measures nullity correlation, which ranges from -1 to 1. Essentially, it shows how strongly the presence or absence of one variable affects the presence of another. Columns that have no missing values or, on the contrary, are completely empty are excluded from the visualization, since they have no meaningful correlation.
  • dendrogram, like the heatmap, measures nullity relationships between columns, not pairwise but between groups of columns, detecting clusters of missing data. Variables located closer together on the chart show a stronger nullity correlation. For dataframes with up to 50 columns the dendrogram is vertical; otherwise, it flips to horizontal.
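
To make the terms "nullity" and "nullity correlation" more concrete, here is a minimal pandas sketch of the quantities that the bar chart and the heatmap summarize. The toy dataframe and its values are hypothetical, and the snippet is only an approximation of the idea, not missingno's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four pets with some missing fields.
toy = pd.DataFrame({
    "animalname": ["Jadzia", "Gonzo", "Maggie", "Leo"],
    "identichipnumber": ["981", np.nan, "982", np.nan],
    "returndate": [np.nan, "2018-05-14", np.nan, "2019-01-02"],
    "deceaseddate": [np.nan, np.nan, np.nan, np.nan],
})

# Roughly what the bar chart summarizes: missing values per column.
missing_per_column = toy.isnull().sum()

# Roughly what the heatmap summarizes: the correlation between the
# "is missing" indicators of each pair of columns. Columns that are
# completely full or completely empty never vary, so they are dropped.
nullity = toy.isnull()
varying = nullity.loc[:, nullity.nunique() > 1]
nullity_corr = varying.astype(int).corr()

print(missing_per_column)
print(nullity_corr)
```

In this toy example, a chip number is missing exactly where a return date is present, so the two columns have a nullity correlation of -1, while animalname and deceaseddate drop out of the correlation table.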

Let’s try all these charts with their default settings on our pet dataset:

import missingno as msno
%matplotlib inline
msno.matrix(df)
msno.bar(df)
msno.heatmap(df)
msno.dendrogram(df)

We can make the following observations about the dataset:

  1. In general, there are rather few missing values.
  2. The emptiest columns are deceaseddate and returndate.
  3. The majority of pets are chipped.
  4. Nullity correlation:
  • slightly negative between being chipped and being dead,
  • slightly positive — being chipped vs. being returned, being returned vs. being dead.

There are a few options to customize missingno charts: figsize, fontsize, sort (sorts the rows by completeness, in either ascending or descending order), labels (can be True or False, meaning whether to show or not the column labels). Some parameters are chart-specific: color for matrix and bar charts, sparkline (whether to draw it or not) and width_ratios (matrix width to sparkline width) for matrix, log (logarithmic scale) for bar charts, cmap colormap for heatmap, orientation for dendrogram. Let’s apply some of them to one of our charts above:

msno.matrix(
    df,
    figsize=(25,7),
    fontsize=30,
    sort='descending',
    color=(0.494, 0.184, 0.556),
    width_ratios=(10, 1)
)

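Finally, if there is still something we would like to tune, we can add any matplotlib functionality to the missingno graphs. In the missingno version used here, this is done by passing inline=False, so that the figure is not displayed immediately. More recent releases removed this parameter and return the matplotlib axes instead, in which case the argument can simply be omitted. Let’s add a title to our matrix chart: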

import matplotlib.pyplot as plt
msno.matrix(
    df,
    figsize=(25,7),
    fontsize=30,
    sort='descending',
    color=(0.494, 0.184, 0.556),
    width_ratios=(10, 1),
    inline=False  # older missingno versions only; drop it if your release rejects it
)
plt.title('Missing Values Pet Dataset', fontsize=55)
plt.show()

Missingno Documentation

2. Tabulate

Library installation: pip install tabulate

This library pretty-prints tabular data in Python. It offers smart and customizable column alignment, number and text formatting, and alignment by decimal point.

The tabulate() function takes a tabular data type (dataframe, list of lists or dictionaries, dictionary, NumPy array), some other optional parameters, and outputs a nicely formatted table. Let’s practice it on a fragment of our pet dataset, starting with the most basic pretty-printed table:

from tabulate import tabulate
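# Selecting the columns by name, so the result matches the output below
df_pretty_printed = df[['animalname', 'breedname', 'sexname']].head()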
print(tabulate(df_pretty_printed))
Output:
-  -----------  -----------------------  ------
0  Jadzia       Domestic Short Hair      Female
1  Gonzo        German Shepherd Dog/Mix  Male
2  Maggie       Shep Mix/Siberian Husky  Female
3  Pretty Girl  Domestic Short Hair      Female
4  Pretty Girl  Domestic Short Hair      Female
-  -----------  -----------------------  ------

We can add a headers parameter to our table. With headers='firstrow', the first row of data is used as the header; with headers='keys', the keys of a dataframe or dictionary are used. For table formatting, there is a tablefmt parameter, which takes one of numerous options (passed as a string): simple, github, grid, fancy_grid, pipe, orgtbl, jira, presto, pretty, etc.

By default, tabulate aligns columns of floats by the decimal point, integers to the right, and text to the left. This can be overridden with the numalign and stralign parameters (right, center, left, decimal for numbers, or None). For text columns, it’s also possible to disable the default removal of leading and trailing whitespace.

Let’s customize our table:

print(tabulate(
    df_pretty_printed,
    headers='keys',
    tablefmt='fancy_grid',
    stralign='center'
))
Output:
╒════╤══════════════╤═════════════════════════╤═══════════╕
│    │  animalname  │        breedname        │  sexname  │
╞════╪══════════════╪═════════════════════════╪═══════════╡
│  0 │    Jadzia    │   Domestic Short Hair   │  Female   │
├────┼──────────────┼─────────────────────────┼───────────┤
│  1 │    Gonzo     │ German Shepherd Dog/Mix │   Male    │
├────┼──────────────┼─────────────────────────┼───────────┤
│  2 │    Maggie    │ Shep Mix/Siberian Husky │  Female   │
├────┼──────────────┼─────────────────────────┼───────────┤
│  3 │ Pretty Girl  │   Domestic Short Hair   │  Female   │
├────┼──────────────┼─────────────────────────┼───────────┤
│  4 │ Pretty Girl  │   Domestic Short Hair   │  Female   │
╘════╧══════════════╧═════════════════════════╧═══════════╛

The only thing to keep in mind is that pretty-printed tables display best on laptops and desktop computers and can sometimes look broken on smaller screens such as smartphones.

Tabulate Documentation

3. Wikipedia

Library installation: pip install wikipedia

The wikipedia library, as its name suggests, makes it easy to access and fetch information from Wikipedia. Some of the tasks that can be accomplished with it include:

  • searching Wikipedia — search(),
  • getting article summaries — summary(),
  • getting full page contents, including images, links, and any other metadata of a Wikipedia page — page(),
  • selecting the language of a page — set_lang().

In the pretty-printed table above, we saw a dog breed called Siberian Husky. As an exercise, we’ll set the language to Russian (my native language 🙂) and search for some suggestions of the corresponding Wikipedia pages:

import wikipedia 
wikipedia.set_lang('ru')
print(wikipedia.search('Siberian Husky'))
Output:
['Сибирский хаски', 'Древние породы собак', 'Породы собак по классификации кинологических организаций', 'Маккензи Ривер Хаски', 'Ричардсон, Кевин Майкл']

Let’s take the first suggestion and fetch the first sentence of that page’s summary:

print(wikipedia.summary('Сибирский хаски', sentences=1))
Output:
Сибирский хаски — заводская специализированная порода собак, выведенная чукчами северо-восточной части Сибири и зарегистрированная американскими кинологами в 1930-х годах как ездовая собака, полученная от аборигенных собак Дальнего Востока России, в основном из Анадыря, Колымы, Камчатки у местных оседлых приморских племён — юкагиров, кереков, азиатских эскимосов и приморских чукчей — анкальын (приморские, поморы — от анкы (море)).

Now, we’re going to get a link to a picture of a husky from this page:

print(wikipedia.page('Сибирский хаски').images[0])
Output:
https://upload.wikimedia.org/wikipedia/commons/a/a3/Black-Magic-Big-Boy.jpg

and visualize this beautiful creature:

From Wikipedia

Wikipedia Documentation

4. Wget

Library installation: pip install wget

The wget library allows downloading files in Python without having to open them. We can also pass, as a second argument, the path where the file should be saved.

Let’s download the husky picture above:

import wget
wget.download('https://upload.wikimedia.org/wikipedia/commons/a/a3/Black-Magic-Big-Boy.jpg')
Output:
'Black-Magic-Big-Boy.jpg'

Now we can find the picture in the same folder as this notebook, because we didn’t specify a path where to save it.
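
The name of the saved file comes from the URL itself. The sketch below reproduces that idea with the standard library only; filename_from_url is a hypothetical helper written for illustration, and wget's real logic may differ (for example, it may also consult the server's response headers):

```python
import posixpath
from urllib.parse import unquote, urlparse

def filename_from_url(url, default="download"):
    """Return the last path component of a URL, or `default` if there is none."""
    path = urlparse(url).path
    name = posixpath.basename(unquote(path))
    return name or default

husky_url = (
    "https://upload.wikimedia.org/wikipedia/commons/a/a3/Black-Magic-Big-Boy.jpg"
)
print(filename_from_url(husky_url))
```

A URL whose path ends with a slash has no last component, which is why the helper falls back to a default name in that case.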

Since any webpage on the Internet is actually an HTML file, another very useful application of this library is downloading a whole webpage, with all its elements. Let’s download the Kaggle webpage where our dataset is located:

wget.download('https://www.kaggle.com/jinbonnie/animal-data')
Output:
'animal-data'

The resulting animal-data file looks like the following (we’ll display only the first few lines):

<!DOCTYPE html>
<html lang="en">
<head>
<title>Animal Care and Control Adopted Animals | Kaggle</title>
<meta charset="utf-8" />
<meta name="robots" content="index, follow" />
<meta name="description" content="animal situation in Bloomington Animal Shelter from 2017-2020" />
<meta name="turbolinks-cache-control" content="no-cache" />

Wget Documentation

5. Faker

Library installation: pip install Faker

This module is used to generate fake data, including names, addresses, emails, phone numbers, jobs, texts, sentences, colors, currencies, etc. The faker generator can take a locale as an argument (the default is en_US locale), to return localized data. For generating a piece of text or a sentence, we can use the default lorem ipsum; alternatively, we can provide our own set of words. To ensure that all the created values are unique for some specific instance (for example, when we want to create a long list of unique fake names), the unique property is applied. If instead, it’s necessary to produce the same value or data set, the seed() method is used.

Let’s look at some examples:

from faker import Faker
fake = Faker()
print(
    'Fake color:', fake.color(), '\n'
    'Fake job:', fake.job(), '\n'
    'Fake email:', fake.email(), '\n'
)
# Printing a list of fake Korean and Portuguese addresses
fake = Faker(['ko_KR', 'pt_BR'])
for _ in range(5):
    print(fake.unique.address()) # using the `unique` property
print('\n')
# Assigning a seed number to print always the same value / data set
fake = Faker()
Faker.seed(3920)
print('This English fake name is always the same:', fake.name())
Output:
Fake color: #bde2f9 
Fake job: Transport planner
Fake email: chad52@yahoo.com

Rua Marcos Vinicius Costela, 66
Vila Nova Gameleira 2ª Seção
86025006 da Costa / MG
충청남도 평창군 언주1거리 (우진장읍)
Núcleo de Peixoto, 87
Havaí
90291-013 Campos / MS
Lago da Luz
Minas Brasil
85538436 Porto da Mata / TO
인천광역시 중랑구 서초중앙0로


This English fake name is always the same: Kim Lopez

Returning to our dataset, we find at least two unlucky pets with rather unflattering names:

df_bad_names = df[df['animalname'].str.contains('Stink|Pooh')]
print(df_bad_names)
Output:
     identichipnumber animalname            breedname speciesname
1692              NaN    Stinker  Domestic Short Hair         Cat
3336  981020023417175       Pooh  German Shepherd Dog         Dog
3337  981020023417175       Pooh  German Shepherd Dog         Dog

     sexname           returndate                     returnedreason
1692    Male                  NaN                              Stray
3336  Female  2018-05-14 00:00:00  Incompatible with owner lifestyle
3337  Female                  NaN                              Stray

The dog in the last 2 rows is actually the same one, returned to the shelter for being incompatible with the owner’s lifestyle. With our new skills, let’s save the reputation of both animals and give them more decent names. Since the dog is a German Shepherd, we’ll select a German name for her. As for the cat, according to this Wikipedia page, Domestic Short Hair is the most common breed in the US, so for him, we’ll select an English name.

# Defining a function to rename the unlucky pets
def rename_pets(name):
    if name == 'Stinker':
        fake = Faker()
        Faker.seed(162)
        name = fake.name()
    elif name == 'Pooh':
        fake = Faker(['de_DE'])
        Faker.seed(20387)
        name = fake.name()
    return name
# Renaming the pets
df['animalname'] = df['animalname'].apply(rename_pets)
# Checking the results (selecting the rows by their index labels)
print(df.loc[df_bad_names.index])
Output:
     identichipnumber            animalname            breedname speciesname
1692              NaN         Steven Harris  Domestic Short Hair         Cat
3336  981020023417175  Helena Fliegner-Karz  German Shepherd Dog         Dog
3337  981020023417175  Helena Fliegner-Karz  German Shepherd Dog         Dog

     sexname           returndate                     returnedreason
1692    Male                  NaN                              Stray
3336  Female  2018-05-14 00:00:00  Incompatible with owner lifestyle
3337  Female                  NaN                              Stray

Steven Harris and Helena Fliegner-Karz sound a little bit too bombastic for a cat and a dog, but definitely much better than their previous names!

Faker Documentation

6. Numerizer

Library installation: pip install numerizer

This small Python package is used for converting natural language numerics into numbers (integers and floats) and consists of only one function — numerize().

Let’s try it right now on our dataset. Some pets’ names contain numbers:

df_numerized_names = df.loc[
    df['animalname'].str.contains('Two|Seven|Fifty'),
    ['identichipnumber', 'animalname', 'speciesname']
]
print(df_numerized_names)
Output:
     identichipnumber animalname speciesname
2127              NaN      Seven         Dog
4040  981020025503945  Fifty Lee         Cat
6519  981020021481875   Two Toes         Cat
6520  981020021481875   Two Toes         Cat
7757  981020029737857    Mew Two         Cat
7758  981020029737857    Mew Two         Cat
7759  981020029737857    Mew Two         Cat

We’re going to convert the numeric part of these names into real numbers:

from numerizer import numerize
df['animalname'] = df['animalname'].apply(numerize)
print(df[['identichipnumber', 'animalname', 'speciesname']]
      .loc[df_numerized_names.index])
Output:
     identichipnumber animalname speciesname
2127              NaN          7         Dog
4040  981020025503945     50 Lee         Cat
6519  981020021481875     2 Toes         Cat
6520  981020021481875     2 Toes         Cat
7757  981020029737857      Mew 2         Cat
7758  981020029737857      Mew 2         Cat
7759  981020029737857      Mew 2         Cat

Numerizer Documentation

7. Emoji

Library installation: pip install emoji

With this library, we can convert strings to emoji according to the emoji codes defined by the Unicode Consortium and, if use_aliases=True is specified, their aliases as well. In the version used here, the emoji package has only two functions: emojize() and demojize(). The default English language (language='en') can be changed to Spanish (es), Portuguese (pt), or Italian (it). Note that emoji 2.0 and later dropped the use_aliases flag in favor of language='alias'.

import emoji
print(emoji.emojize(':koala:'))
print(emoji.demojize('🐨'))
print(emoji.emojize(':rana:', language='it'))
Output:
🐨
:koala:
🐸

Let’s emojize our animals. First, we’ll check their unique species names:

print(df['speciesname'].unique())
Output:
['Cat' 'Dog' 'House Rabbit' 'Rat' 'Bird' 'Opossum' 'Chicken' 'Wildlife' 'Ferret' 'Tortoise' 'Pig' 'Hamster' 'Guinea Pig' 'Gerbil' 'Lizard' 'Hedgehog' 'Chinchilla' 'Goat' 'Snake' 'Squirrel' 'Sugar Glider' 'Turtle' 'Tarantula' 'Mouse' 'Raccoon' 'Livestock' 'Fish']

We have to convert these names into lower case, add leading and trailing colons to each, and then apply emojize() to the result:

df['speciesname'] = df['speciesname']\
    .apply(lambda x: emoji.emojize(f':{x.lower()}:', use_aliases=True))
print(df['speciesname'].unique())
Output:
['🐱' '🐶' ':house rabbit:' '🐀' '🐦' ':opossum:' '🐔' ':wildlife:' ':ferret:' ':tortoise:' '🐷' '🐹' ':guinea pig:' ':gerbil:' '🦎' '🦔' ':chinchilla:' '🐐' '🐍' ':squirrel:' ':sugar glider:' '🐢' ':tarantula:' '🐭' '🦝' ':livestock:' '🐟']

Let’s replace house rabbit, tortoise, and squirrel with synonyms that the emoji library understands, and try emojizing them again:

df['speciesname'] = df['speciesname']\
    .str.replace(':house rabbit:', ':rabbit:')\
    .str.replace(':tortoise:', ':turtle:')\
    .str.replace(':squirrel:', ':chipmunk:')
df['speciesname'] = df['speciesname']\
    .apply(lambda x: emoji.emojize(x, variant='emoji_type'))
print(df['speciesname'].unique())
Output:
['🐱' '🐶' '🐇️' '🐀' '🐦' ':opossum:️' '🐔' ':wildlife:️' ':ferret:️' '🐢️' '🐷' '🐹' ':guinea pig:' ':gerbil:️' '🦎' '🦔' ':chinchilla:️' '🐐' '🐍' '🐿️' ':sugar glider:' '🐢' ':tarantula:️' '🐭' '🦝' ':livestock:️' '🐟']

The remaining species either are collective names (wildlife and livestock) or don’t have an emoji equivalent, at least not yet. We’ll leave them as they are, removing only the colons and converting them back into title case:

df['speciesname'] = df['speciesname'].str.replace(':', '')\
    .apply(lambda x: x.title())
print(df['speciesname'].unique(), '\n')
print(df[['animalname', 'speciesname', 'breedname']].head(3))
Output:
['🐱' '🐶' '🐇️' '🐀' '🐦' 'Opossum️' '🐔' 'Wildlife️' 'Ferret️' '🐢️' '🐷' '🐹' 'Guinea Pig' 'Gerbil️' '🦎' '🦔' 'Chinchilla️' '🐐' '🐍' '🐿️' 'Sugar Glider' '🐢' 'Tarantula️' '🐭' '🦝' 'Livestock️' '🐟'] 

  animalname speciesname                breedname
0     Jadzia           🐱      Domestic Short Hair
1      Gonzo           🐶  German Shepherd Dog/Mix
2     Maggie           🐶  Shep Mix/Siberian Husky

Emoji Documentation

8. PyAztro

Library installation: pip install pyaztro

PyAztro seems to be designed more for fun than for work. This library provides a horoscope for each zodiac sign. The prediction includes the description of a sign for that day, date range of that sign, mood, lucky number, lucky time, lucky color, compatibility with other signs. For example:

import pyaztro
pyaztro.Aztro(sign='taurus', day='tomorrow').description
Output:
"If the big picture is getting you down, narrow your focus a bit and try to enjoy the smaller aspects of life. It's a good day to remember what you're truly thankful for in life and to spread the word."

Good advice! Indeed, I won’t wait until tomorrow: let’s narrow the focus to our dataset right now and search it for some relevant information 😀

There are a cat and a dog called Aries:

print(df[['animalname', 'speciesname']][(df['animalname'] == 'Aries')])
Output:
     animalname speciesname
3036      Aries           🐱
9255      Aries           🐶

and plenty of pets called Leo:

print('Leo:', df['animalname'][(df['animalname'] == 'Leo')].count())
Output:
Leo: 18

Let’s assume that those are their corresponding zodiac signs 😉 With PyAztro, we can check what the stars have prepared for these animals for today:

aries = pyaztro.Aztro(sign='aries')
leo = pyaztro.Aztro(sign='leo')
print('ARIES: \n',
      'Sign:', aries.sign, '\n',
      'Current date:', aries.current_date, '\n',
      'Date range:', aries.date_range, '\n',
      'Sign description:', aries.description, '\n',
      'Mood:', aries.mood, '\n',
      'Compatibility:', aries.compatibility, '\n',
      'Lucky number:', aries.lucky_number, '\n',
      'Lucky time:', aries.lucky_time, '\n',
      'Lucky color:', aries.color, 2*'\n',

      'LEO: \n',
      'Sign:', leo.sign, '\n',
      'Current date:', leo.current_date, '\n',
      'Date range:', leo.date_range, '\n',
      'Sign description:', leo.description, '\n',
      'Mood:', leo.mood, '\n',
      'Compatibility:', leo.compatibility, '\n',
      'Lucky number:', leo.lucky_number, '\n',
      'Lucky time:', leo.lucky_time, '\n',
      'Lucky color:', leo.color)
Output:
ARIES: 
Sign: aries
Current date: 2021-02-22
Date range: [datetime.datetime(2021, 3, 21, 0, 0), datetime.datetime(2021, 4, 20, 0, 0)]
Sign description: Throw away your old to-do list and start over. There may be some stuff on it that just holds you back because you know you'll never do it and you might pop off some cool new projects while you're at it.
Mood: Refreshed
Compatibility: Scorpio
Lucky number: 67
Lucky time: 1am
Lucky color: Sky Blue

LEO:
Sign: leo
Current date: 2021-02-22
Date range: [datetime.datetime(2021, 7, 23, 0, 0), datetime.datetime(2021, 8, 22, 0, 0)]
Sign description: Try something new and different today - eat out somewhere you've never been, experiment with new techniques to clean your house or just pick an activity at random and give it a go!
Mood: Curious
Compatibility: Taurus
Lucky number: 75
Lucky time: 11am
Lucky color: Teal

These forecasts are valid for 22.02.2021, so if you want to check our pets’ horoscope (or maybe your own) for the current day, you have to re-run the code above. All the properties, apart from sign and date_range, evidently, change every day for each zodiac sign at midnight GMT.

PyAztro Documentation

Certainly, there are many other fun Python libraries like PyAztro, including:

  • Art — for converting text to ASCII art, like this: ʕ •`ᴥ•´ʔ
  • Turtle — for drawing,
  • Chess — for playing chess,
  • Santa — for randomly pairing Secret Santa gifters and recipients,

and even

  • Pynder — for using Tinder.

We can be sure that with Python we’ll never get bored!

Conclusion

To sum up, I hope all the pets in the dataset find loving and caring owners, and that Python users discover more amazing libraries and apply them to their projects.


Hidden Treasures of Python was originally published in Towards Data Science on Medium.