Which support messages mention “Feature A”? Which keywords are found in our most effective email campaigns? To answer text-based questions, your gut reaction may be to pull out your Python notebook, but Snowflake (and other cloud vendors) actually have some powerful text processing functions that you can leverage, along with dbt, to do some cool text mining without ever leaving SQL.
Method:
- Define keywords, with a regex classifier for each
- Detect each keyword in each sentence, building one column of match details per keyword (a pivoted layout)
- Unpivot the results into a table with one row per keyword detection
As an example, I collected text data from the subreddit r/dataengineering and used string-matching techniques to identify the tools mentioned in posts. A GitHub repo containing the complete model is here.
Example:
Import text and keyword data
Posts were extracted using the Pushshift Reddit API. The last 500 posts were collected and data from the columns id, author, url, title, selftext, and domain were saved as a .csv file.
Data engineering keywords were collected from the following GitHub repos:
- GitHub - gunnarmorling/awesome-opensource-data-engineering
- GitHub - igorbarinov/awesome-data-engineering
- GitHub - meirwah/awesome-workflow-engines
Keywords were saved in a .csv file. To account for synonyms, nicknames, and acronyms, each keyword was provided in its expected format along with any synonyms and common typographical errors:
keyword | additional_regex | category |
---|---|---|
apache spark | spark | analytics |
trino | | analytics |
hue | hadoop user experience\|hadoop user interface | business intelligence |
Format data
To obtain the regular expressions required for matching, a keyword_regex column was created. The keywords.csv file was passed to the model shown below:
with source as (

    select * from {{ ref('keywords') }}

),

renamed as (

    select
        -- keys
        keyword :: text as keyword,

        -- details
        additional_regex :: text as additional_regex,
        category :: text as keyword_category,
        case
            when additional_regex is not null
                then keyword || '|' || additional_regex
            else keyword
        end as keyword_regex

    from source

)

select * from renamed
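Given the sample keyword rows shown earlier, the compiled keyword_regex values would look like this:

keyword | keyword_regex | keyword_category |
---|---|---|
apache spark | apache spark\|spark | analytics |
trino | trino | analytics |

The concatenation simply joins the keyword and its additional_regex with a | so the pattern matches either form.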
Reddit posts were split into sentences, then flattened so that the keyword search could be carried out sentence by sentence. As used here, the SQL regexp functions match only the first instance of a pattern in a string, so more matches can be extracted by searching a post sentence by sentence rather than as one complete block of text.
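As a quick standalone illustration (with a made-up input string), note how only the first occurrence is found when the occurrence argument is 1:

select
    regexp_substr('I use dbt. I really like dbt.', 'dbt', 1, 1, 'i') as first_match, -- returns 'dbt'
    regexp_instr('I use dbt. I really like dbt.', 'dbt', 1, 1, 0, 'i') as first_pos; -- returns 7; the second 'dbt' (position 26) is ignored

The splitting model itself is shown below: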
with posts as (

    select * from {{ ref("posts") }}

),

sentences as (

    select
        *,
        replace(selftext, '"', '') as self_text,
        split(regexp_replace(self_text, '[!.?)]+\\s', 'xxx'), 'xxx') as sentence_text_list

    from posts

),

flatten as (

    select
        id as post_id,
        author,
        url,
        title,
        s.value :: text as sentence_text,
        {{ dbt_utils.surrogate_key(['post_id', 'index', 'sentence_text']) }} as sentence_id

    from sentences,
        lateral flatten(input => sentence_text_list) as s

)

select * from flatten
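As a standalone illustration of the splitting expression (again with a made-up post body), sentence-ending punctuation followed by whitespace becomes the delimiter 'xxx', which split() then breaks on:

select split(
    regexp_replace('I use dbt. It works well! Try it', '[!.?)]+\\s', 'xxx'),
    'xxx'
) as sentence_text_list;
-- returns [ 'I use dbt', 'It works well', 'Try it' ]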
Match keyword to text
To find keyword matches within the text, I used a regular expression of the form “dbt|data build tool|d b t”. This searches for exact matches to each of the strings in the list. Additional regular expression characters can be added to further tailor the search. For example, “\sorc\s” or “\borc\b” might be used where the bare string “orc” would also match words such as “orchid” or “orchard”.
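The difference can be checked directly (a hypothetical query, assuming word-boundary support in Snowflake's regex dialect):

select
    regexp_substr('growing orchids', 'orc', 1, 1, 'i') as naive_match,        -- 'orc' (false positive inside 'orchids')
    regexp_substr('growing orchids', '\\borc\\b', 1, 1, 'i') as bounded_match; -- null (no standalone 'orc')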
First, the query_results variable is set to a dictionary containing the keyword, keyword_regex, and keyword_category columns from the keyword base model. A for loop then iterates over each keyword in query_results. For each keyword, the regex variable is set to that keyword's regex and the category variable is set to its category. The regex variable is combined with the regexp_substr() and regexp_instr() functions to return, respectively, the substring that matches the regular expression and its position within the string. Note that only the first match found in the string will be returned. The category variable is passed directly to the output function and labeled category. The output function object_construct_keep_null() returns each group of results as a dictionary, with null values included. Finally, to ensure the generated column name is SQL compatible, spaces and | characters in the loop variable k are replaced with underscores and the result is converted to lowercase.
{%- set query_results = dbt_utils.get_query_results_as_dict('select keyword, keyword_regex, keyword_category from ' ~ ref('base_keywords')) -%}

with base_posts as (

    select * from {{ ref('base_post_sentences') }}

),

mentions as (

    select
        sentence_id,
        post_id,
        author,
        url,
        title,
        sentence_text,
        {% for k in query_results['KEYWORD'] | list %}
        {%- set regex = query_results['KEYWORD_REGEX'][loop.index-1] %}
        {%- set category = query_results['KEYWORD_CATEGORY'][loop.index-1] %}
        object_construct_keep_null(
            'substring', regexp_substr( sentence_text, '{{ regex }}', 1, 1, 'i'),
            'category', '{{ category }}',
            'start_position', regexp_instr( sentence_text, '{{ regex }}', 1, 1, 0, 'i'),
            'end_position', regexp_instr( sentence_text, '{{ regex }}', 1, 1, 1, 'i')
        ) as {{ k | replace(" ", "_") | replace("|", "_") | lower }} {{ "," if not loop.last }}
        {%- endfor %}

    from base_posts

)

select * from mentions
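To see what dbt compiles this to, each loop iteration renders to one plain SQL column. For the apache spark row from the sample keywords, the generated column would look roughly like:

object_construct_keep_null(
    'substring', regexp_substr( sentence_text, 'apache spark|spark', 1, 1, 'i'),
    'category', 'analytics',
    'start_position', regexp_instr( sentence_text, 'apache spark|spark', 1, 1, 0, 'i'),
    'end_position', regexp_instr( sentence_text, 'apache spark|spark', 1, 1, 1, 'i')
) as apache_spark,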
A similar result can be achieved using the dbt_utils.get_column_values() macro. However, for datasets where multiple column values are required (here, the keyword regex and the keyword category), dbt_utils.get_query_results_as_dict() is preferred. The model shown below works well for matching single keywords with no additional regular expression information:
{% set keywords = dbt_utils.get_column_values(table=ref('base_keywords'), column='keyword_regex') %}

with base_posts as (

    select * from {{ ref('base_post_sentences') }}

),

mentions as (

    select
        sentence_id,
        post_id,
        author,
        url,
        title,
        sentence_text,
        {% for keyword in keywords %}
        object_construct_keep_null(
            'substring', regexp_substr( sentence_text, '{{ keyword }}', 1, 1, 'i'),
            'start_position', regexp_instr( sentence_text, '{{ keyword }}', 1, 1, 0, 'i'),
            'end_position', regexp_instr( sentence_text, '{{ keyword }}', 1, 1, 1, 'i')
        ) as {{ keyword | replace(" ", "_") | lower }} {{ "," if not loop.last }}
        {%- endfor %}

    from base_posts

)

select * from mentions
Pivot match, keyword, start and end position
At this point, the data is structured as below:
sentence_id | post_id | … | sentence_text | apache_spark | … | dbt |
---|---|---|---|---|---|---|
f2d7a5b388d753ceee69707bde2c98be | pscnk0 | … | Main topics are: … and the Apache Spark… | {"category": "analytics", "end_position": 213, "start_position": 201, "substring": "Apache Spark"} | … | {"category": "data transformation", "end_position": 0, "start_position": 0, "substring": null} |
In order to extract useful information, such as a count of keywords or the proportion of posts containing keywords, I needed to flatten the data contained in the keyword columns. In this case, I did so using the dbt_utils.unpivot() macro, which rotates the keyword values from columns into rows. This is done using the code:
with unpivoted as (

    {{
        dbt_utils.unpivot(
            ref('stg_keyword_extraction'),
            cast_to='variant',
            field_name='keyword',
            value_name='details',
            exclude=[
                'post_id',
                'author',
                'url',
                'title',
                'sentence_text'
            ],
            remove=[]
        )
    }}

)

select
    {{ dbt_utils.surrogate_key(['post_id', 'sentence_text', 'keyword']) }}
        as keyword_sentence_id,
    post_id,
    author,
    url,
    title,
    keyword,
    details:category :: text as keyword_category,
    sentence_text,
    details:substring :: text as match_substring,
    details:start_position :: integer as start_position,
    details:end_position :: integer as end_position

from unpivoted
where match_substring is not null
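Under the hood, the unpivot macro compiles to one select per non-excluded column, unioned together; every such column (including sentence_id here, whose rows are later discarded by the where clause) becomes a set of rows. Abbreviated to two keyword columns, the compiled SQL looks roughly like:

select post_id, author, url, title, sentence_text,
    'APACHE_SPARK' as keyword, cast(apache_spark as variant) as details
from stg_keyword_extraction
union all
select post_id, author, url, title, sentence_text,
    'DBT' as keyword, cast(dbt as variant) as details
from stg_keyword_extraction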
Analyze keyword data
The data in its final form is structured as follows:
keyword_sentence_id | post_id | … | keyword | keyword_category | sentence_text | match_substring | start_position | end_position |
---|---|---|---|---|---|---|---|---|
fbbff5eeecea63afab562c4411324b96 | pscnk0 | … | APACHE_SPARK | analytics | Main topics are: … and the Apache Spark… | Apache Spark | 201 | 213 |
It can now be used however you see fit. I started with visualizations showing the proportion of posts that contain data engineering tool keywords, the count of posts containing each keyword, and the proportion of posts containing keywords from each category. The match start and end position can be used to produce highlighted and labeled text outputs using tools such as displaCy or prodi.gy. Word Clouds like the one shown below can also provide a quick summary of the resulting keyword matches.
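As a starting point, a simple aggregate over the final model gives counts per keyword (a minimal sketch; keyword_mentions is a placeholder for whatever you name the final model):

select
    keyword,
    keyword_category,
    count(distinct post_id) as posts_mentioning, -- how many posts mention the tool at least once
    count(*) as total_mentions                   -- total sentence-level matches
from {{ ref('keyword_mentions') }}
group by keyword, keyword_category
order by posts_mentioning desc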
Final Thoughts:
With smaller datasets, dbt and regular expressions can be used directly to efficiently identify and locate keywords. I hope this model serves as food for thought and/or inspiration for your next text model built with dbt. Credit to data guru stkbailey, whose insight provided the framework for this model.