Natural Language Processing

How computers and AI models understand, interpret, and generate human language

What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language. While developers communicate with computers through code, NLP allows us to interact with technology using human language, making it more accessible and user-friendly.

NLP is an evolving field that sits at the intersection of computer science, artificial intelligence, and linguistics, enabling machines to interpret, generate, and respond to human language in a way that is both meaningful and useful.

How does NLP work?

NLP combines computational linguistics (rule-based language modeling) with statistical learning, machine learning, and deep learning models. Together, these technologies enable computers to process human language as text or voice data and to decipher its full meaning, including the speaker or writer’s intent and sentiment.

NLP requires some understanding of the various components of linguistics. Here are a few of those components:

  1. Text preprocessing: This step involves cleaning and preparing the text for analysis. It includes several tasks (see the preprocessing sketch after this list):
    • Tokenization is the process of breaking text into smaller units, or tokens. Tokens could be words or phrases.
    • Stop-word removal is the process of removing common words like "and," "the," and "is," which don’t add significant meaning to the text.
    • Stemming involves reducing words to their base or root form by removing suffixes or prefixes. For example, stemming would convert "sleeping" to "sleep."
    • Lemmatization is similar to stemming as it also focuses on reducing words to their root or base form. However, lemmatization is more sophisticated, using vocabulary and part-of-speech context, for example, to convert "better" to "good." Both lemmatization and stemming can be considered morphological analyses.
  2. Syntax and Parsing: Syntax examines the grammatical structure and arrangement of words and phrases to create correct sentences. Parsing analyzes sentence structure to represent the relationships between words and phrases. A parse tree can show the hierarchical structure of a sentence, identifying the nouns, verbs, and their relationships.

    In the sentence "The dog chased the ball," a parse tree would deconstruct the sentence into its grammatical components: the noun phrase "the dog" and the verb phrase "chased the ball." The parse tree would continue to break down these components, so "The" and "the" are categorized as determiners, "ball" is another noun, and "chased" is a past-tense verb. Parsing helps NLP systems understand the structure of sentences, which is essential for tasks like translating languages or answering questions accurately (see the spaCy sketch after this list).

  3. Semantic analysis: The focus of this analysis is on understanding the meaning of the text based on the words used. This involves tasks like Word Sense Disambiguation (determining which meaning of a word is used in a sentence based on its context) and Named Entity Recognition (identifying entities like names and locations, also illustrated in the spaCy sketch after this list).

    In the sentence, "George Washington went to a bank in Virginia to deposit money," Word Sense Disambiguation would identify "bank" as a financial institution rather than the side of a river. Named Entity Recognition would identify "George Washington" as a person’s name and "Virginia" as a location.

  4. Pragmatics and Discourse: Pragmatics considers the context of the text: the intention behind the words, implied meanings, conversational cues, and the nuances of human language. Discourse studies the relationship of sentences or segments to each other, and how they form coherent narratives or arguments.

    Consider a conversation where Person A says, "I’m freezing," and Person B says, "Let me close the window." Pragmatics deals with understanding that "I’m freezing" isn’t simply a statement of fact that Person A is cold, but likely a polite request for some action to warm the room. Discourse identifies that this is a conversation, and Person B is providing a relevant and logical response to Person A. Discourse analysis ensures that the conversation remains coherent and connected, with each participant responding appropriately to the context.

    Pragmatics and Discourse allow NLP systems to go beyond literal meanings to understand implied requests and maintain a logical flow in conversations.

  5. Sentiment analysis: This evaluates the emotional tone of the text, categorizing segments as positive, negative, or neutral. This may require disambiguating words with multiple meanings to recognize the sentiment and intent of the text. Sentiment analysis is important for understanding the opinions or attitudes conveyed in a body of text (see the sentiment sketch after this list).
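
To make step 1 concrete, here is a minimal preprocessing sketch in Python using NLTK, one common open-source library for these tasks (other libraries, such as spaCy, offer equivalent functionality). The sample sentence and outputs are illustrative:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the tokenizer model and word lists NLTK needs.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The children were sleeping while the dogs chased the balls."

# 1. Tokenization: break the text into word tokens.
tokens = word_tokenize(text.lower())

# 2. Stop-word removal: drop common words that add little meaning.
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# 3. Stemming: crudely strip affixes ("sleeping" -> "sleep", "chased" -> "chase").
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])

# 4. Lemmatization: map words to dictionary base forms ("children" -> "child").
# NLTK's lemmatizer treats words as nouns unless a part of speech is given,
# so verbs like "chased" pass through unchanged here.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in content])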
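
For steps 2 and 3, the sketch below uses spaCy, another widely used library, assuming its small English model has been installed with "python -m spacy download en_core_web_sm". Note that spaCy produces a dependency parse (word-to-word grammatical relations) rather than the constituency-style parse tree described above, and the entities it finds depend on the model:

import spacy

# Load spaCy's small English pipeline (assumes it has been installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("George Washington went to a bank in Virginia to deposit money.")

# Dependency parse: each token's part of speech, grammatical role,
# and the head word it attaches to.
for token in doc:
    print(f"{token.text:<12} {token.pos_:<6} {token.dep_:<10} head: {token.head.text}")

# Named Entity Recognition: spans the model recognizes as entities.
# With this model, "George Washington" is typically tagged PERSON
# and "Virginia" GPE (geopolitical entity).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)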
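
And for step 5, a minimal sentiment analysis sketch using NLTK's built-in VADER scorer, a simple lexicon-based approach (production systems often use trained classifiers instead); the example sentences and cutoff values are illustrative:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon

analyzer = SentimentIntensityAnalyzer()
for sentence in [
    "The service was fantastic and the staff were friendly!",
    "The package arrived on Tuesday.",
    "This was a frustrating waste of money.",
]:
    scores = analyzer.polarity_scores(sentence)
    # "compound" ranges from -1 (most negative) to +1 (most positive);
    # a common convention treats values near zero as neutral.
    compound = scores["compound"]
    label = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"
    print(f"{label:>8}  {sentence}")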

Challenges in NLP

Of course, Natural Language Processing, despite its remarkable advancements, still faces notable challenges. These challenges arise from the nature of human language, which is inherently ambiguous, context-dependent, and culturally nuanced.

Ambiguity: Human language is often ambiguous. There is lexical ambiguity, where words have multiple meanings that must be decoded based on context. Consider the word "bank" again, which may refer to a financial institution or the side of a river. Syntactic ambiguity arises when a sentence's structure permits multiple interpretations. For example, the sentence "Visiting relatives can be boring" could either mean that the act of visiting relatives is a boring experience or that relatives who visit can be boring. Referential ambiguity is another category, referring to uncertainty about which entities pronouns like "he," "she," and "it" point to within a text.
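
As a small illustration of resolving lexical ambiguity, here is a sketch using NLTK's implementation of the classic Lesk algorithm. Modern systems typically rely on contextual embeddings instead, and Lesk's simple overlap heuristic can pick a surprising sense, but it shows the idea compactly:

import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download("punkt")
nltk.download("wordnet")

for sentence in [
    "He deposited the money at the bank.",
    "They had a picnic on the grassy bank of the river.",
]:
    # lesk() picks the WordNet sense whose dictionary gloss overlaps
    # most with the words surrounding "bank" in the sentence.
    sense = lesk(word_tokenize(sentence), "bank", pos="n")
    print(sentence)
    print("  ->", sense, ":", sense.definition())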

Cultural and linguistic diversity: Language is heavily influenced by culture and constantly evolving, which makes it difficult for NLP models to interpret cultural references or colloquial expressions accurately. Different languages and cultures have unique idiomatic expressions, such as "when pigs fly," that are difficult to translate. Regional dialects or slang expressions can also be difficult for NLP models, especially if training data is limited in representing these variations.

Data scarcity and imbalance: Some languages or dialects may lack sufficient data to train effective NLP models, which may create disparities in NLP capabilities across languages. NLP models may also struggle to perform well in specific domains, such as legal, medical, or scientific texts, due to the specialized vocabulary and structures in these fields.

Bias and fairness: NLP models can inadvertently learn biases present in training data. There are methods to mitigate these biases, from collecting a diverse and representative set of training data to developing debiasing algorithms. There are also bias detection methods, applied to surface biases based on demographic factors like race, gender, age, or others. Data preprocessing may be one of the most important ways to mitigate bias before training. It includes more technical methods like debiasing word embeddings (for example, ensuring words like "doctor" and "nurse" are not unfairly gendered in the embedding space) and balancing class distributions (for example, oversampling underrepresented classes or undersampling dominant ones so that no class skews the model).
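
To illustrate one of those techniques, the sketch below shows hard neutralization of a word embedding, in the spirit of Bolukbasi et al. (2016): it projects out the component of a word vector that lies along a learned bias direction. The three-dimensional vectors here are made up purely for illustration; real embeddings have hundreds of dimensions, and the bias direction is estimated from many word pairs:

import numpy as np

def neutralize(vector: np.ndarray, bias_direction: np.ndarray) -> np.ndarray:
    """Remove the component of `vector` that lies along `bias_direction`."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return vector - np.dot(vector, b) * b

# Toy 3-d "embeddings" with made-up values, purely for illustration.
he = np.array([1.0, 0.2, 0.1])
she = np.array([-1.0, 0.2, 0.1])
doctor = np.array([0.4, 0.9, 0.3])  # leans toward "he" in this toy space

gender_direction = he - she
debiased = neutralize(doctor, gender_direction)
print(debiased)  # [0.  0.9 0.3] -- the gendered component is gone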

The Future of NLP

The future of NLP is very promising, with ongoing research focused on improving the robustness and generalizability of NLP models. The development of powerful language models like GPT-4, Gemini, and BERT has directly shaped the trajectory of NLP’s advancement. Even today, companies are investing in NLP to mine volumes of unstructured data for insights. Expert.ai’s 2023 Expert NL Survey of current NLP practitioners reported that 77% of organizations surveyed expect to spend more on NLP projects in the next 12-18 months, and 80% already have NLP models in production. In fact, Fortune Business Insights predicts that the NLP market will grow from $21 billion in 2021 to $127 billion by 2028. As NLP technology advances, it holds the potential to transform industries and improve the way we interact with technology.

Author

Keerti Hariharan

Keerti Hariharan joined Alkymi in early 2022. As one of our Product Managers, she uses her deep expertise in the Alkymi platform and investment workflows to create value for our customers.

Schedule a demo of Alkymi

Interested in learning how Alkymi can help you go from unstructured data to instantly actionable insights? Schedule a personalized product demo with our team today!