Web: https://www.reddit.com/r/datascience/comments/xil2c2/i_need_help_creating_a_dataset_how_do_you/

Sept. 19, 2022, 6:50 p.m. | /u/asking_for_a_friend0

Data Science reddit.com

I have taken on a task to:

1. archive an entire website (in webarchive format)
2. scrape all its text (article-wise, with timestamps and tags)
3. prepare a dataset and release it

I found out about archiving; it sounds simple, and way better than saving .html files.

I can more or less handle a single article: clean it up and extract the text and metadata.

My question is: How do I discover all the links? Like all the damn links on that website. All articles that were ever published, still …
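One common starting point for the link-discovery question is a breadth-first crawl that stays on the site's own host. The sketch below is only an illustration using the Python standard library; the names `LinkParser`, `extract_links`, `fetch_html`, and `discover_links`, and the `max_pages` cap, are hypothetical and not from the post. It assumes pages are reachable by plain `<a href>` links and decode as UTF-8, so it will miss articles that are no longer linked from anywhere and links generated by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs found in the given HTML."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def fetch_html(url):
    """Default fetcher (assumption: pages decode as UTF-8 text)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def discover_links(start_url, fetch=fetch_html, max_pages=1000):
    """Breadth-first crawl that only follows links on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        for link in extract_links(html, url):
            parsed = urlparse(link)
            same_site = parsed.scheme in ("http", "https") and parsed.netloc == host
            if same_site and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Calling `discover_links("https://example.com/")` would return the sorted list of same-host URLs reached from that page (the URL here is a placeholder). The `fetch` parameter is injectable so a rate-limited or cached fetcher can be swapped in. For articles that are no longer linked, the site's `sitemap.xml` (if it publishes one) or an existing web archive's index may be worth checking as additional sources; whether either exists for a given site is an open question.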

datascience dataset
