Using NLP to quantify racism
- Dr. Candace Makeda Moore
- Aug 1, 2019
- 2 min read
I got into natural language processing (NLP) in order to look at radiology reports. It's a mild obsession.
But NLP works on texts, any old texts. In fact, I have a friend who helped me with some of my radiology work who actually teaches the youngsters of Israel how to program for AI, and he works on texts on websites in his industry job.
He often points out to me fruitful approaches to looking at texts that I never would have thought of programming. Once upon a time, I encoded radiology reports as simple one-dimensional matrices. You can encode any text as a one-dimensional matrix and run algorithms like Latent Dirichlet allocation on it. Not fully satisfied, I kept running different algorithms on these one-dimensional matrices. One could conceptualize them as X-rays of the texts. But then my friend pointed me towards the way to get a CT, if you will extend the metaphor.
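A minimal sketch of what "a text as a one-dimensional matrix" can mean, assuming a simple bag-of-words representation (my illustration, not the code I actually used on reports; the vocabulary and sample report are invented):

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Turn a text into a one-dimensional vector of word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and report, just to show the shape of the output.
vocab = ["lung", "normal", "fracture", "heart"]
report = "Normal heart size. Lung fields normal. No fracture seen."

print(bag_of_words(report, vocab))  # [1, 2, 1, 1]
```

Vectors like this are exactly the kind of flat input that topic-modeling algorithms such as Latent Dirichlet allocation consume.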
Yes, texts can be multidimensional matrices. But what does this have to do with racism?
Well, as I mentioned, my friend uses texts to quantify stuff on the internet. From tweets to online news articles, it's all text. So if you look at things as multidimensional matrices, you can see how two words might be close in one matrix, yet far in another. A concrete example: the words 'king' and 'princess' both denote royalty, so they might often appear together in one dimension, but they denote different genders and ages, so they might be far apart if our matrix had a gender or age dimension.
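The king/princess idea can be sketched with toy vectors. The numbers below are invented for illustration, not learned from any corpus; real embeddings have hundreds of dimensions without such neat labels:

```python
# Each word is a point in several dimensions; here the dimensions are
# hand-labeled (royalty, gender, age) purely for illustration.
word_vectors = {
    #            royalty  gender  age
    "king":     [0.95,    0.90,   0.80],
    "princess": [0.95,    0.10,   0.15],
}

def axis_distance(w1, w2, axis):
    """Absolute difference between two words along one dimension."""
    return abs(word_vectors[w1][axis] - word_vectors[w2][axis])

ROYALTY, GENDER = 0, 1
print(axis_distance("king", "princess", ROYALTY))  # 0.0 -> close on royalty
print(axis_distance("king", "princess", GENDER))   # 0.8 -> far on gender
```

Close along one axis, far along another: that is the "CT scan" view of a text that a flat word-count vector cannot give you.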
So guess what happens when you run 'White-sounding' names and 'African-American-sounding' names through statistical programming when they appear in the texts you have turned into matrices. Well, James and Tyrone are both names, so you would expect some of their statistics, and their places in the matrices, to be similar. But it turns out that the distance in the matrix from a name like Tyrone to negative words like 'idiot' is different from the distance between the name James and similar words. You can ask statistically about associations as well. And, unless you have been in a coma, you know Tyrone pops up more negative associations. The texts on the internet are, after all, written by people, among whom certain demographics are over-represented.
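One hedged sketch of how such an association might be scored: compare each name's cosine similarity to a 'pleasant' word against its similarity to an 'unpleasant' word. The vectors and placeholder names below are entirely invented to show the mechanics; they are not measurements of any real corpus or real names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 2-D vectors for two hypothetical names and two attribute
# words -- illustrative stand-ins only, not data about anyone.
vecs = {
    "name_a":     [0.8, 0.1],
    "name_b":     [0.2, 0.7],
    "pleasant":   [0.9, 0.1],
    "unpleasant": [0.1, 0.9],
}

def association(name):
    """Positive = closer to 'pleasant'; negative = closer to 'unpleasant'."""
    return cosine(vecs[name], vecs["pleasant"]) - cosine(vecs[name], vecs["unpleasant"])

print(association("name_a") > 0)  # True: leans toward 'pleasant'
print(association("name_b") < 0)  # True: leans toward 'unpleasant'
```

Score the same comparison over many names and many attribute words and you have, in miniature, the kind of bias measurement the text describes.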
I used to explain that my parents named me Candace Makeda because in the past, having a first name like Makeda (it means Queen of Sheba in an ancient Ethiopian language and is more common among Africans and African-Americans) was a problem on documents like resumes. Now I can explain to people that either it's still a problem in the present, or the internet is a dumpster fire of racism that in no way represents the thoughts of many people. The truth is probably somewhere in between.
I wonder if I can make a matrix with parameters that can help me quantify that... just kidding. The problem with this kind of statistics is that at some point you go down a rabbit hole of endless models, none of which precisely captures reality. So I dedicate this post to George Box, whose words are as true today as ever. See picture...
