April 30, 2023, 10:14 p.m. | Sagnik Bandyopadhyay

DEV Community dev.to




Problem statement





Context



  • Lets assume that we have data pipeline(s) dumping messages into Google BigQuery tables (lets call them raw tables).

  • There maybe duplicate messages being stored in the raw table due to reasons like:


    • Duplicate messages sent from source

    • Message inserted multiple times due to network issues and retries between the data pipeline and big query (although this can be addressed to some extent by using unique request ids while loading the data into BQ)



  • BigQuery doesn't have unique …

big big query bigquery call context data data pipeline deduplication duplicate google googlecloud messages multiple network pipeline query raw sql strategies table tables

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Data Analytics & Insight Specialist, Customer Success

@ Fortinet | Ottawa, ON, Canada

Account Director, ChatGPT Enterprise - Majors

@ OpenAI | Remote - Paris