Inverted Index for full-text searches or common words detection

Sometimes a document has properties that hold unstructured text, such as newspaper articles, blog posts, or book abstracts. An inverted index over that text is easy to build and is similar to the data structures search engines use.

Such index structures support various complex search patterns, like common word detection, full-text search, or document similarity search using Hamming distance or L2 distance. Inverted indexes are useful when the number of keywords is not too large and when the existing data is either immutable or rarely changed, but frequently searched.
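As a rough illustration of the similarity idea, assume each document is reduced to the set of keywords it contains (the document IDs and keywords below are made up for the example). Treating each keyword as a 0/1 flag, the Hamming distance between two documents is the number of keywords present in one but not the other, and the L2 distance over the same 0/1 vectors is its square root:

```python
import math

# Hypothetical keyword sets per document (illustrative data only).
doc_keywords = {
    "DocId1": {"keyword1", "keyword2", "keyword3"},
    "DocId2": {"keyword1", "keyword4"},
}


def hamming_distance(a, b):
    """Count keywords that appear in exactly one of the two keyword sets."""
    return len(a ^ b)  # symmetric difference of the two sets


def l2_distance(a, b):
    """Euclidean distance between the 0/1 keyword-presence vectors."""
    # Each differing keyword contributes (1 - 0)^2 = 1 to the squared distance.
    return math.sqrt(hamming_distance(a, b))


hamming = hamming_distance(doc_keywords["DocId1"], doc_keywords["DocId2"])
l2 = l2_distance(doc_keywords["DocId1"], doc_keywords["DocId2"])
```

A smaller distance means the two documents share more of their keywords. This is only one simple way to define similarity; real systems often weight keywords by frequency.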

Usually, the documents are "parents," and the words inside the document are "children." To build an inverted index, we invert this relation to make the words "parents" and documents "children":
  • Take all or a subset of the keywords from each document and pair each keyword with the document ID:
    • DocId1: keyword1
    • DocId1: keyword2
    • DocId1: keyword3
    • DocId2: keyword
    • DocId2: keyword1
  • Reverse the order: take each unique keyword and list the documents in which it appears:
    • keyword: DocId2
    • keyword1: DocId1, DocId2
    • keyword2: DocId1
    • keyword3: DocId1
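The two steps can be sketched in a few lines of Python. This is a minimal in-memory sketch, not a production implementation: the pairs reuse the document IDs and keywords from the list, while the function names build_inverted_index and search_all are my own hypothetical helpers.

```python
from collections import defaultdict

# Step 1: (document ID, keyword) pairs, using the same example data as the list.
pairs = [
    ("DocId1", "keyword1"),
    ("DocId1", "keyword2"),
    ("DocId1", "keyword3"),
    ("DocId2", "keyword"),
    ("DocId2", "keyword1"),
]


def build_inverted_index(pairs):
    """Step 2: map each unique keyword to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, keyword in pairs:
        index[keyword].add(doc_id)
    return {keyword: sorted(doc_ids) for keyword, doc_ids in index.items()}


def search_all(index, keywords):
    """Return the documents that contain every one of the given keywords."""
    result = None
    for keyword in keywords:
        doc_ids = set(index.get(keyword, []))
        result = doc_ids if result is None else result & doc_ids
    return sorted(result or [])


inverted_index = build_inverted_index(pairs)
both = search_all(inverted_index, ["keyword1", "keyword2"])
```

Looking up a single keyword is now a dictionary access, and a multi-keyword search is an intersection of short document lists, which is why this layout suits data that is searched often and changed rarely.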


Read more on the Inverted index and other data modeling structures in my blog here.

Yours,
Maria
