Parsing keywords from unstructured text data with dbt and regex

Which support messages mention “Feature A”? Which keywords appear in our most effective email campaigns? To answer text-based questions like these, your gut reaction may be to pull out a Python notebook, but Snowflake (and other cloud data platforms) actually have some powerful text processing functions that you can combine with dbt to do some cool text mining without ever leaving SQL.

Method:

  1. Define keywords with a regex classifier for each

  2. Detect each keyword in each sentence, producing one column of match details per keyword

  3. Unpivot the results into a table with one row per keyword detection

As an example, I collected text data from the subreddit r/dataengineering and used string matching techniques to identify the tools mentioned in posts. A GitHub repo containing the complete model is here.

Example:

Import text and keyword data

Posts were extracted using the Pushshift Reddit API. The most recent 500 posts were collected, and the id, author, url, title, selftext, and domain columns were saved as a .csv file.

Data engineering keywords were collected from the following GitHub repos:

  • gunnarmorling/awesome-opensource-data-engineering
  • igorbarinov/awesome-data-engineering
  • meirwah/awesome-workflow-engines

Keywords were saved in a .csv file. To account for synonyms, nicknames, acronyms, and common typographical errors, each keyword was provided in its expected form together with any alternative spellings:

| keyword | additional_regex | category |
| --- | --- | --- |
| apache spark | spark | analytics |
| trino | | analytics |
| hue | hadoop user experience \| hadoop user interface | business intelligence |
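
In dbt, this file can be loaded as a seed. As raw csv, the sample rows above would look like this (alternatives inside additional_regex separated by |):

keyword,additional_regex,category
apache spark,spark,analytics
trino,,analytics
hue,hadoop user experience|hadoop user interface,business intelligence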

Format data

To build the regular expression used for matching, keyword_regex, the keywords.csv file was passed through the model shown below:

with source as (
  select * from {{ ref('keywords') }}
),

renamed as (
  select
    -- keys
    keyword :: text as keyword,
    -- details
    additional_regex :: text as additional_regex,
    category :: text as keyword_category,
    case when additional_regex is not null
      then keyword || '|' || additional_regex
      else keyword
    end as keyword_regex
  from
    source
)

select * from renamed
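
As a quick check against the sample rows above, the compiled keyword_regex values come out as follows:

select keyword, keyword_regex from {{ ref('base_keywords') }}

-- apache spark -> apache spark|spark
-- trino        -> trino
-- hue          -> hue|hadoop user experience|hadoop user interface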

Reddit posts were split into sentences, then flattened so that the keyword search could be carried out sentence by sentence. The regexp functions used here return only one match per string, so searching each post sentence by sentence extracts more matches than searching the complete text block at once.

with posts as (
    select * from {{ ref("posts") }}
),

sentences as (
    select
        *,
        -- remove double quotes from the post body
        replace(selftext, '"', '') as self_text,
        -- replace sentence-ending punctuation followed by whitespace with a
        -- marker, then split on the marker to get a list of sentences
        split(regexp_replace(self_text, '[!.?)]+\\s', 'xxx'), 'xxx') as sentence_text_list
    from posts
),

flatten as (
    select
        id as post_id,
        author,
        url,
        title,
        s.value :: text as sentence_text,
        -- index comes from lateral flatten: the sentence's position within the post
        {{ dbt_utils.surrogate_key(['post_id', 'index', 'sentence_text']) }} as sentence_id
    from sentences,
        lateral flatten(input => sentence_text_list) as s
)

select * from flatten
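
To illustrate the splitting step, here is how the regexp_replace()/split() pair behaves on an invented post body:

select split(
    regexp_replace('We use dbt daily. It works well! Any tips?', '[!.?)]+\\s', 'xxx'),
    'xxx'
) as sentence_text_list;

-- returns [ "We use dbt daily", "It works well", "Any tips?" ]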

Match keyword to text

To find keyword matches within the text, I used a regular expression of the form “dbt|data build tool|d b t”, which matches any one of the alternatives exactly. Additional regular expression characters can be added to further tailor the search. For example, “\sorc\s” or “\borc\b” might be used where the bare string “orc” would also match words such as “orchid” or “orchard”.
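
The difference is easy to verify with a couple of ad-hoc queries (the sample sentences are invented):

select regexp_substr('an orchid grower', 'orc', 1, 1, 'i');        -- 'orc' (false positive)
select regexp_substr('an orchid grower', '\\borc\\b', 1, 1, 'i');  -- NULL (no whole-word match)
select regexp_substr('orc vs parquet', '\\borc\\b', 1, 1, 'i');    -- 'orc'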

First, the query_results variable is set to a dictionary containing the keyword, keyword_regex, and keyword_category columns from the keyword base model. A Jinja for loop then iterates over each keyword in query_results. For each keyword, the regex variable is set to that keyword's regex and the category variable to its category. The regex variable is combined with the regexp_substr() and regexp_instr() functions to return, respectively, the substring that matches the regular expression and its position within the string. Note that only the first match found in each string is returned. The category variable is passed directly to the output and labeled category.

The function object_construct_keep_null() gathers each group of results into a dictionary, with null values included. Finally, to ensure the generated column name is SQL compatible, spaces and | characters in the loop variable k are replaced with underscores and the name is cast to lowercase.

{%- set query_results = dbt_utils.get_query_results_as_dict('select keyword, keyword_regex, keyword_category from ' ~ ref('base_keywords')) -%}

with base_posts as (
    select * from {{ ref('base_post_sentences') }}
),

mentions as (
    select
        sentence_id,
        post_id,
        author,
        url,
        title,
        sentence_text,

        {% for k in query_results['KEYWORD'] | list %}
        {%- set regex = query_results['KEYWORD_REGEX'][loop.index-1] %}
        {%- set category = query_results['KEYWORD_CATEGORY'][loop.index-1] %}
        object_construct_keep_null(
            'substring', regexp_substr( sentence_text, '{{ regex }}', 1, 1, 'i'),
            'category', '{{category}}',
            'start_position', regexp_instr( sentence_text, '{{ regex }}', 1, 1, 0, 'i'),
            'end_position', regexp_instr( sentence_text, '{{ regex }}', 1, 1, 1, 'i')
        ) as {{ k | replace(" ","_") | replace("|","_") | lower }} {{ "," if not loop.last}}
       {%- endfor %}

    from base_posts
)

select * from mentions
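
For the apache spark keyword from the sample table, the loop compiles to SQL along these lines, with one such expression per keyword:

object_construct_keep_null(
    'substring', regexp_substr( sentence_text, 'apache spark|spark', 1, 1, 'i'),
    'category', 'analytics',
    'start_position', regexp_instr( sentence_text, 'apache spark|spark', 1, 1, 0, 'i'),
    'end_position', regexp_instr( sentence_text, 'apache spark|spark', 1, 1, 1, 'i')
) as apache_spark,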

A similar result can be achieved with the dbt_utils.get_column_values() macro. However, for data sets where multiple column values are required (here, keyword regex and keyword category), dbt_utils.get_query_results_as_dict() is preferable. The model shown below works well for matching single keywords with no additional regular expression information:

{% set keywords = dbt_utils.get_column_values(table=ref('base_keywords'), column='keyword_regex') %}

with base_posts as (
  select * from {{ ref("'base_post_sentences'") }}
),

mentions as (
    select
        sentence_id,
        post_id,
        author,
        url,
        title,
        sentence_text,

        {% for keyword in keywords %}
        object_construct_keep_null(
            'substring', regexp_substr( sentence_text, '{{keyword}}', 1, 1, 'i'),
            'start_position', regexp_instr( sentence_text, '{{keyword}}', 1, 1, 0, 'i'),
            'end_position', regexp_instr( sentence_text, '{{keyword}}', 1, 1, 1, 'i')
        ) as {{keyword | replace(" ","_") | lower }} {{ "," if not loop.last}}
       {%- endfor %}

    from base_posts
)

select * from mentions

Unpivot match, keyword, start and end position

At this point, the data is structured as below:

| sentence_id | post_id | sentence_text | apache_spark | dbt |
| --- | --- | --- | --- | --- |
| f2d7a5b388d753ceee69707bde2c98be | pscnk0 | Main topics are: … and the Apache Spark… | {“category”: “analytics”, “end_position”: 213, “start_position”: 201, “substring”: “Apache Spark”} | {“category”: “data transformation”, “end_position”: 0, “start_position”: 0, “substring”: null} |

To extract useful information, such as a count of keywords or the proportion of posts containing keywords, I needed to flatten the data contained in the keyword columns. I did so using the dbt_utils.unpivot() macro, which rotates the keyword values from columns into rows:

with unpivoted as (
  {{ 
    dbt_utils.unpivot(
      ref('stg_keyword_extraction'),
      cast_to='variant',
      field_name='keyword',
      value_name='details',
      exclude=[
        'post_id',
        'author',
        'url',
        'title',
        'sentence_text'
      ],
      remove=[]
    ) 
  }}
)
select

  {{ dbt_utils.surrogate_key(['post_id', 'sentence_text', 'keyword']) }}
    as keyword_sentence_id,
  post_id,
  author,
  url,
  title,
  keyword,
  details:category :: text as keyword_category,
  sentence_text,
  details:substring :: text as match_substring,
  details:start_position :: integer as start_position,
  details:end_position :: integer as end_position

from unpivoted
where match_substring is not null
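
Under the hood, dbt_utils.unpivot() compiles to a union all with one branch per keyword column, roughly like this (sketched for the apache_spark column; not verbatim compiled output):

select
    post_id,
    author,
    url,
    title,
    sentence_text,
    cast('APACHE_SPARK' as varchar) as keyword,
    cast(apache_spark as variant) as details
from {{ ref('stg_keyword_extraction') }}

union all

-- ...one branch for each remaining keyword column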

Analyze keyword data

The data in its final form is structured as follows:

| keyword_sentence_id | post_id | keyword | keyword_category | sentence_text | match_substring | start_position | end_position |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fbbff5eeecea63afab562c4411324b96 | pscnk0 | APACHE_SPARK | analytics | Main topics are: … and the Apache Spark… | Apache Spark | 201 | 213 |

It can now be used however you see fit. I started with visualizations showing the proportion of posts that contain data engineering tool keywords, the count of posts containing each keyword, and the proportion of posts containing keywords from each category. The match start and end positions can be used to produce highlighted and labeled text outputs with tools such as displaCy or prodi.gy. Word clouds can also provide a quick summary of the resulting keyword matches.
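
For instance, the proportion of posts mentioning each keyword could be computed with a query along these lines (keyword_mentions is a hypothetical name for the unpivoted model above):

with mentions as (
    select * from {{ ref('keyword_mentions') }}  -- hypothetical model name
),

posts as (
    select * from {{ ref('posts') }}
)

select
    mentions.keyword,
    count(distinct mentions.post_id) as posts_with_keyword,
    count(distinct mentions.post_id) / (select count(*) from posts) as proportion_of_posts
from mentions
group by 1
order by posts_with_keyword desc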

Final Thoughts:

With smaller datasets, dbt and regular expressions can be used directly in the warehouse to efficiently identify and locate keywords. I hope this model serves as food for thought and/or inspiration for your next text model built with dbt. Credit to data guru stkbailey, whose insight provided the framework for this model.