How to construct a knowledge graph from text?

From the previous stories, we know what a knowledge graph is and which techniques we need to extract the information for building one. In this story, we will combine these two pieces of knowledge and create our own knowledge graph!

Introduction

Even folks who are not interested in geography or history have heard of the Balkans. Here is the Wikipedia page:

As you can see, there is a lot of information there, not only in the form of text but also in hyperlinks and pictures.

Most of this information is relevant and useful for research about the Balkans. However, we cannot use this data source directly in our programs. To make the data readable by machines and still interpretable by us, we will transform it into a knowledge graph!

Before we start building our knowledge graph, let’s see how information is embedded in such graphs. As in most graphs, entities are represented as nodes, and the connections between them are represented as edges.

If we directly map the opening of the first sentence on the Balkans Wikipedia page, “The Balkans, also known as the Balkan Peninsula”, into our graph, we get the following simple graph:
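As a rough sketch in code, the same single-fact graph can be written down with nothing more than a set of nodes and a list of labelled edges. The names nodes, edges, and adjacency are illustrative choices for this example, not part of any library used later in the story.

```python
# A tiny hand-built graph for the single fact taken from the sentence above.
# Nothing here is extracted automatically; the values are typed in by hand.
nodes = {"The Balkans", "the Balkan Peninsula"}

# Each edge is (source node, relation label, target node).
edges = [("The Balkans", "known as", "the Balkan Peninsula")]

# Adjacency view: for every node, the labelled edges that leave it.
adjacency = {node: [] for node in nodes}
for source, relation, target in edges:
    adjacency[source].append((relation, target))

print(adjacency["The Balkans"])
```

The edge is directed: it leaves “The Balkans” and points at “the Balkan Peninsula”, so the second node has no outgoing edges in this sketch.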

We built this example by hand. However, building a whole knowledge graph manually is neither feasible nor scalable, so we need machines to extract the entities and relations for us. Here comes the challenge: machines cannot interpret natural language on their own. To help them understand our texts, we will use Natural Language Processing (NLP) techniques such as sentence segmentation, dependency parsing, part-of-speech tagging, and entity recognition. We discussed and tried out these techniques in the previous story. Let’s use them here!

Knowledge Graph Creation

A knowledge graph consists of facts, each based on a relationship that connects two entities. The facts take the form of triples: subject-predicate-object. For example:

“The Balkans is known as the Balkan Peninsula.”

As a triple, the above fact can be represented as isKnownAs(The Balkans, the Balkan Peninsula), where:

  • Subject: The Balkans
  • Predicate: isKnownAs
  • Object: the Balkan Peninsula.
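One possible way to hold such a triple in Python is a small named tuple. This is only a sketch: the class name Triple and the field name obj are assumptions made for this example (obj is used because object is a Python builtin).

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A single fact: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str  # named "obj" so it does not shadow the builtin "object"

    def __str__(self) -> str:
        # Render the fact in the predicate(subject, object) notation used above.
        return f"{self.predicate}({self.subject}, {self.obj})"


fact = Triple("The Balkans", "isKnownAs", "the Balkan Peninsula")
print(fact)  # isKnownAs(The Balkans, the Balkan Peninsula)
```

A list of such triples is already a simple knowledge graph: the subjects and objects are the nodes, and the predicates are the edge labels.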

There are several possible ways of extracting triples from text. One can create their own set of rules for a specific data source. In this story, we will use an existing library, Krzysiekfonal’s textpipeliner, which was created for advanced text mining. Let’s start by creating a document with spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The Balkans, also known as the Balkan Peninsula, \
is a geographic area in Southeast Europe with \
various definitions and meanings, including \
geopolitical and historical. The region takes its name \
from the Balkan Mountains that stretch throughout the \
whole of Bulgaria from the Serbian–Bulgarian border \
to the Black Sea coast...")

Now we can pass the document produced by spaCy to textpipeliner, which provides an easy way of extracting parts of sentences from unstructured text in the form of structured tuples. textpipeliner has two main parts: Pipes and the PipelineEngine. With pipes, you define a structure that is used to extract parts from every sentence in the document. The engine applies this pipes structure to every sentence in the provided document and returns a list of extracted tuples.

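# Imports as shown in the textpipeliner README
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

pipes_structure = [
    # Subject: named entities found among the verb's subject tokens
    SequencePipe([
        FindTokensPipe("VERB/nsubj/*"),
        NamedEntityFilterPipe(),
        NamedEntityExtractorPipe()]),
    # Predicate: the verb itself
    FindTokensPipe("VERB"),
    # Object: try the direct object first, then the object of a preposition
    AnyPipe([
        SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                      AggregatePipe([NamedEntityFilterPipe("GPE"),
                                     NamedEntityFilterPipe("PERSON")]),
                      NamedEntityExtractorPipe()]),
        SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                      AggregatePipe([NamedEntityFilterPipe("LOC"),
                                     NamedEntityFilterPipe("PERSON")]),
                      NamedEntityExtractorPipe()])])]

# [0, 1, 2] selects which pipes form each output tuple: subject, verb, object
engine = PipelineEngine(pipes_structure, Context(doc), [0, 1, 2])
process = engine.process()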

The extracted tuples are:

[([The, Achaemenid, Persian, Empire], [incorporated], [Macedonia]),
([Romans], [considered], [[the, Peninsula, of, Haemus], [Greek]]),
([Bulgars], [arrived], [the, Bulgarian, Empire])]

We can change the parameters according to the entity types listed in spaCy. Let’s use these extracted tuples to create our knowledge graph. To do so, we first need to identify the source nodes, the target nodes, and the relations. Using simple Python operations, you can get the following lists and store them in a DataFrame:

import pandas as pd

source = ['The Achaemenid Persian Empire', 'Romans', 'Bulgars']
target = ['Macedonia', 'the Peninsula of Haemus Greek', 'the Bulgarian Empire']
edge = ['incorporated', 'considered', 'arrived']
kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': edge})
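The lists above can also be derived from the extracted tuples instead of being typed by hand. The following is a minimal sketch of one way to do that; the helpers flatten and to_text are hypothetical names introduced here, and plain strings stand in for the spaCy tokens so that the example runs on its own. It assumes that calling str() on a spaCy token returns its text.

```python
def flatten(item):
    """Yield the text of every token in an arbitrarily nested list."""
    if isinstance(item, (list, tuple)):
        for child in item:
            yield from flatten(child)
    else:
        # Assumption: str() gives the token text; for a plain str it is a no-op.
        yield str(item)


def to_text(item):
    """Join all token texts of one tuple element into a single string."""
    return " ".join(flatten(item))


# Plain strings stand in for the spaCy tokens of the extracted tuples.
extracted = [
    (["The", "Achaemenid", "Persian", "Empire"], ["incorporated"], ["Macedonia"]),
    (["Romans"], ["considered"], [["the", "Peninsula", "of", "Haemus"], ["Greek"]]),
    (["Bulgars"], ["arrived"], ["the", "Bulgarian", "Empire"]),
]

source = [to_text(subj) for subj, _, _ in extracted]
edge = [to_text(verb) for _, verb, _ in extracted]
target = [to_text(obj) for _, _, obj in extracted]

print(source)
print(edge)
print(target)
```

Flattening recursively matters for the second tuple, where the object is a list of two entity lists; joining them produces the single target string “the Peninsula of Haemus Greek”.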

After extracting the lists and creating a DataFrame, we can use this DataFrame in the NetworkX package to draw our knowledge graph:

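import matplotlib.pyplot as plt
import networkx as nx

# Build a directed multigraph; the remaining 'edge' column becomes an edge attribute
G = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
# Label each edge with its relation only, not the whole attribute dictionary
edge_labels = {(u, v): d['edge'] for u, v, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos=pos, edge_labels=edge_labels)
plt.show()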

And in the end, we can see the final graph:

Here, the resulting graph is quite small, since we used only one pipeline, which considers only named entities such as locations and people. If you enrich your pipeline as well as the raw text, you will get a larger graph on which you could also perform inference!

All in all…

In this series of stories, we learned how to use NLP techniques to extract information from a given text in the form of triples and then build a knowledge graph from it. Even though we used only a small dataset and created a very limited knowledge graph, with what we have learned we can now build quite informative knowledge graphs. Knowledge graphs are one of the most fascinating concepts in data science. I encourage you to explore this field of information extraction further.

References