from myth (-1.00) to power (+1.00), a poster series and linguistic mirror reflecting on the subject of certainty in text mining.

About

Text mining: an effective technology that brings power, as it has definitely proven to absolutely always verify truth? Or is this technology rather a ridiculous myth, nonsense and a lie?

The poster series from myth (-1.00) to power (+1.00) is the product of a poetic translation exercise based on the script modality.py, written by the developers of the text mining software package Pattern. The script detects the degree of certainty in a text as a value between -1.00 and +1.00, where values > +0.50 represent facts [1].

Rule-based

Modality.py is a rule-based program, one of the older types of text mining technique. The calculations in a rule-based program are determined by a set of rules, written after a long, intensive period of linguistic research on a specific subject. Rule-based programs are precise and effective, but also static and specific. This makes them an expensive type of text mining technique in terms of time and labour, and makes them difficult to reuse on different types of text.

To overcome these expenses, rule-based programs have largely been replaced by pattern recognition techniques such as supervised learning and neural networks, where the rules of a program are based on patterns in large datasets.

Modality.py

The program on which these posters are based, modality.py, is written to calculate a degree of certainty in academic papers and to express this degree as a value between -1.00 and +1.00. The sources used for modality.py are academic papers from a dataset called 'BioScope' and Wikipedia training data from the CoNLL-2010 Shared Task 1 [2]. This data includes 'weasel words' [3], words that are annotated as 'vague' by the Wikipedia community. Examples of weasel words are: some people say, many scholars state, it is believed/regarded, scientists claim, it is often said [4].
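To give a feel for what a rule built on weasel words could look like, here is a minimal sketch in Python. It is not taken from modality.py: the list `WEASEL_PHRASES` and the function `find_weasel_phrases` are hypothetical names, and the list contains only the example phrases quoted above.

```python
# Hypothetical list: only the example phrases quoted in this text,
# not the full set of weasel words annotated in the Wikipedia data.
WEASEL_PHRASES = [
    "some people say",
    "many scholars state",
    "it is believed",
    "it is regarded",
    "scientists claim",
    "it is often said",
]

def find_weasel_phrases(text):
    """Return the listed phrases that occur in the text, in list order."""
    # Lowercase and collapse whitespace so that line breaks and
    # double spaces do not hide a phrase.
    lowered = " ".join(text.lower().split())
    return [phrase for phrase in WEASEL_PHRASES if phrase in lowered]
```

The sketch uses plain substring matching, so "scientists claim" is also found inside "scientists claimed". A real rule-based program would more likely work on parsed words than on raw strings.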

The script modality.py is an example of a rule-based program, full of pre-defined values. The words fact (+1.00), evidence (+0.75) and (even) data (+0.75) indicate a high level of certainty, as opposed to words like fiction (-1.00) and belief (-0.25).
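The idea of a program "full of pre-defined values" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of modality.py: the dictionary `CERTAINTY` holds only the words and values mentioned in this text (including myth and power from the title), the names `certainty` and `is_fact` are hypothetical, and averaging the word values is a simplification chosen for this sketch.

```python
# Hypothetical mini-lexicon: only the words and values quoted in this text.
CERTAINTY = {
    "fact": 1.00,
    "power": 1.00,
    "evidence": 0.75,
    "data": 0.75,
    "belief": -0.25,
    "fiction": -1.00,
    "myth": -1.00,
}

def certainty(sentence):
    """Average the values of the known words; None if no word is known."""
    # Split on whitespace, strip surrounding punctuation, lowercase.
    words = [w.strip(".,;:!?()'\"").lower() for w in sentence.split()]
    scores = [CERTAINTY[w] for w in words if w in CERTAINTY]
    if not scores:
        return None
    # Assumption of this sketch: a plain average of the word values.
    return sum(scores) / len(scores)

def is_fact(sentence, threshold=0.50):
    """Apply the 'values > +0.50 represent facts' reading to a sentence."""
    score = certainty(sentence)
    return score is not None and score > threshold
```

For example, `certainty("The evidence is a fact.")` averages +0.75 and +1.00, and `is_fact` then compares the result with the +0.50 threshold. Every number in the outcome traces back to a value that a person once typed into the lexicon.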

In the script, the concept of being certain is divided into 9 categories, after which a set of words is connected to each category, for example a set of nouns.

Linguistic mirror

A poetic translation exercise,

from an interest in a numerical perception of human language,

while bending structuralistic categories,

to reflect on the human-machine collaboration in text-based machine learning processes.

References

  1. Description of modality.py in the online documentation of Pattern, https://www.clips.uantwerpen.be/pages/pattern-en#modality, September 2017
  2. Note in modality.py on line #482 https://github.com/clips/pattern/blob/master/pattern/text/en/modality.py#L482, September 2017
  3. More about the specific meaning of weasel words in Wikipedia https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch#Unsupported_attributions, September 2017
  4. Idem.