
From Story to Data: Understanding Natural Language Processing

About this Graphic

This is a hand-illustrated graphic recording, created live by Drawing Change during a keynote presentation at the Maternal Health Equity Workshop: From Story to Data to Action, convened by the Association of American Medical Colleges Center for Health Justice on May 18, 2023. The effect is dynamic, with large flowing swoops that look like neural pathways in the brain. The piece has limited illustrations, and the text is separated by background shading and different lettering sizes.

From Story to Data: Understanding Natural Language Processing

With Maria Antoniak, Allen Institute for Artificial Intelligence

What is NLP?

In the top-left opening section, a Black woman reads the title aloud in a speech bubble. A small emoji asks, "What is NLP?" The answer: using computational methods to study human language. This branches into two subpoints: analyzing human language (e.g., Google) and generating human language (e.g., Alexa). The emoji next asks, "Why is it exciting?" Three clouds highlight the answers: it can do so much (answer questions, measure biases, extract information); language is fascinating; and there are now large language models. The emoji then asks, "Why are people wary?" Encoded biases are hard to detect; training data are poorly documented; there are climate concerns due to the energy needed for data processing; and the results are hard to interpret.

From Story to Data

This main section addresses how we move from words to numbers. A small cartoon male figure says, "You shall know a word by the company it keeps," a quote attributed to Firth (1957). Which words often appear together? We can deduce meaning from usage. There are three points:

  • A co-occurrence matrix creates a table in which each word becomes a vector (a minimal sketch follows this list)
  • There can be a lot of pre-processing here (e.g., tokenization), and these steps bring assumptions into huge datasets
  • We can graph this! Word vectors can be static (e.g., LSA or Word2Vec) or contextualized (e.g., ChatGPT), and we can measure and learn a lot from the graphing
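
To make the matrix idea concrete, here is a minimal Python sketch of a co-occurrence matrix; the toy corpus and everything in it are invented for illustration and are not part of the graphic.

    from itertools import combinations

    # Toy corpus, invented for illustration.
    corpus = [
        "the doctor listens to the patient",
        "the nurse listens to the patient",
        "the doctor treats the patient",
    ]

    # Tokenization: the pre-processing step noted above. A plain whitespace
    # split here; real pipelines make many more assumptions at this stage.
    docs = [sentence.split() for sentence in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}

    # Count which words keep each other's company within a sentence.
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for w1, w2 in combinations(doc, 2):
            matrix[index[w1]][index[w2]] += 1
            matrix[index[w2]][index[w1]] += 1

    # Each row is now a vector for one word; words used in similar ways
    # ("doctor" and "nurse" here) end up with similar rows.
    print(vocab)
    print(matrix[index["doctor"]])
    print(matrix[index["nurse"]])

Static methods such as LSA and Word2Vec refine this counting idea into one fixed vector per word; contextualized models, like those behind ChatGPT, instead give each occurrence of a word its own vector.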

From Data to Model

A Black woman feeds dots representing information into a computer. The caption says "supervised," with two points: like a child, we say "do this, don't do that"; there is text with labels. We need to balance oversimplifying against fitting the noise too closely, and there is a drawing of a balance beam with the words "bias" and "variance" labeled on either side. Next to this is a drawing of the same woman turning away from the same computer, which is processing the same information without her touching it. The caption says "unsupervised," and the main points are that the computer finds codes, patterns, and relationships on its own; there is text but no labels. Instead, there are clusters (drawn as dots and stars).
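
A minimal sketch of the contrast between the two settings, using scikit-learn; the tiny texts and sentiment labels below are invented for illustration, not drawn from the graphic.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["great care team", "felt ignored", "kind staff", "was ignored again"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)  # turn text into numbers, as above

    # Supervised: text plus labels ("do this, don't do that").
    labels = ["positive", "negative", "positive", "negative"]
    classifier = MultinomialNB().fit(X, labels)
    print(classifier.predict(vectorizer.transform(["they ignored me"])))

    # Unsupervised: the same text with no labels; the model finds the
    # clusters (the dots and stars) on its own.
    clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(clustering.labels_)

The balance beam maps onto model choice here: a model that is too simple underfits (bias), while one that is too flexible memorizes noise (variance).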

Language Modeling

The main caption in this section is Language Modeling: Given a sequence of words, you can predict the next sequence of words.

  • Large Language Models, e.g., a "pre-trained model" built from web scrapes, Wikipedia, publications, and books.
  • Ethics: lack of interpretability, datasets that are difficult to document, a focus on English, biases baked in.
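
To ground the caption, here is a minimal sketch of next-word prediction using a toy bigram model, an assumption for illustration; large language models learn far richer, contextual patterns from sources like those listed above.

    from collections import Counter, defaultdict

    # Toy training text, invented for illustration.
    words = "from story to data from data to model from model to action".split()

    # A bigram model: count which word follows which.
    following = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        following[w1][w2] += 1

    def predict_next(word):
        # Return the continuation seen most often during training.
        return following[word].most_common(1)[0][0]

    print(predict_next("to"))    # "data" (ties break by first occurrence)
    print(predict_next("from"))  # "story"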

From Model to Action

In the final section, there are five dots with the word "downstream" above them. From left to right, the dots are labeled thinking, writing, dataset, model, classifier. Underneath this is written "upstream (learn about biases)," and an arrow moves from right to left to complete the cycle. Two overlapping speech bubbles, labeled "Qual" and "Quan," have connecting arrows labeled "stories" and "patterns." In the overlap is written "methods working together."