Aug. 11, 2022, 3:36 a.m. | /u/Particular-Turn35


I'm using a [pretrained BERT](https://huggingface.co/tftransformers/bert-base-cased) for the first time and found something weird: the word 'demonstrators' gets **split into 3 tokens that, on their own, have different meanings**.

```
original  = "Thousands of demonstrators"
tokenized = ["Thousands", "of", "demons", "##tra", "##tors"]
```
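For reference, here's a minimal repro sketch. I'm assuming the standard `bert-base-cased` checkpoint and the Hugging Face `transformers` API; the tftransformers repack linked above should behave the same, but I haven't confirmed it:

```python
from transformers import AutoTokenizer

# NOTE: using the plain "bert-base-cased" checkpoint here as an assumption,
# not the tftransformers repack from the link.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("Thousands of demonstrators")
print(tokens)
# -> ['Thousands', 'of', 'demons', '##tra', '##tors']

# The '##' pieces glue back onto the preceding token, so the original
# string is recoverable from the token sequence:
print(tokenizer.convert_tokens_to_string(tokens))
# -> 'Thousands of demonstrators'
```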

1. Will this affect the model's performance?
2. What's the function of the '##' prefix here?

