This talk will provide a brief introduction to core concepts of analyzing text with computational tools.
We will demonstrate how standard calculations can be scaled to very large data sets through simple parallelization strategies that are easy to deploy in an HPC environment using job arrays.
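As a minimal sketch of this deployment pattern (assuming a SLURM scheduler; `process_block.py` is a hypothetical worker script that handles one partition of the data), a job array launches many independent tasks, each identified by its array index:

```shell
#!/bin/bash
#SBATCH --array=0-99          # 100 independent tasks, one per data partition
#SBATCH --time=01:00:00

# Each task receives its own index and processes only its partition,
# so no inter-task communication is needed.
python process_block.py --block-id "$SLURM_ARRAY_TASK_ID"
```

Because the tasks share no state, the same script scales from a handful of partitions to thousands simply by changing the `--array` range.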
These ideas will be illustrated by a concrete example implemented in Python using the pandas, re, and nltk libraries. The example comes from social science research: multiple data sets refer to the same individuals and must be merged while accounting for variations in how those individuals are named or described.
To illustrate a typical solution, we will demonstrate three key steps:
- text parsing and cleaning with data frames and regular expressions
- a parallelization strategy using blocking keys
- approximate text matching, string similarity measures, and reduction to a well-defined machine learning problem
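The three steps above can be sketched as follows. This is a minimal, self-contained illustration with made-up records; it uses `difflib` from the Python standard library as a stand-in string-similarity measure (nltk offers alternatives such as `nltk.edit_distance`), and the 0.8 threshold is an arbitrary choice for the example:

```python
import difflib
import pandas as pd

# Hypothetical records from two data sets referring to the same individuals.
a = pd.DataFrame({"name": ["Smith, John A.", "Doe, Jane "]})
b = pd.DataFrame({"name": ["smith john", "doe  jane"]})

def clean(s: pd.Series) -> pd.Series:
    """Step 1: lower-case, strip punctuation, collapse whitespace."""
    return (s.str.lower()
             .str.replace(r"[^\w\s]", " ", regex=True)
             .str.replace(r"\s+", " ", regex=True)
             .str.strip())

a["clean"] = clean(a["name"])
b["clean"] = clean(b["name"])

# Step 2: a blocking key (here, the first letter of the cleaned name).
# Only pairs sharing a key are compared, and each block can be handled
# by a separate job-array task.
a["block"] = a["clean"].str[0]
b["block"] = b["clean"].str[0]
candidates = a.merge(b, on="block", suffixes=("_a", "_b"))

# Step 3: a string-similarity score in [0, 1] for each candidate pair.
# Thresholding the score (or feeding such scores as features to a
# classifier) reduces matching to a standard machine learning problem.
candidates["sim"] = [
    difflib.SequenceMatcher(None, x, y).ratio()
    for x, y in zip(candidates["clean_a"], candidates["clean_b"])
]
matches = candidates[candidates["sim"] > 0.8]
```

Blocking keeps the comparison count tractable: instead of scoring every cross-product pair, each task scores only the pairs within its own block.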
This problem and its solution are representative of a very large class of data analysis problems that involve text comparison.
We will close by indicating some powerful extensions to the presented solution that can be used to apply this overall strategy to more complex problems of text analysis.