For a side project, I turned to Wikipedia tables as a data source. Despite their inconsistencies, they proved quite useful. I explored three methods for extracting this data:
- Google Sheets: scrape tables with the built-in IMPORTHTML function, e.g. `=IMPORTHTML("https://en.wikipedia.org/wiki/Page_title", "table", 1)` pulls the first table on a page straight into a sheet.
- Pandas and Python: use pd.read_html to load every table on a page into a list of dataframes (first sketch after this list).
- Beautiful Soup and Python: handle more complex scraping, such as pairing each table with the heading that precedes it (second sketch after this list).
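A minimal sketch of the pandas route, assuming lxml (or html5lib) is installed as the parser; the URL is just an example article, and the right table index will vary from page to page:

```python
import pandas as pd

# Example article; any Wikipedia page with tables works.
URL = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

# read_html fetches the page and returns one DataFrame per <table> it finds.
tables = pd.read_html(URL)
print(f"Found {len(tables)} tables")

df = tables[0]              # pick the table you want by position
df = df.dropna(how="all")   # light cleanup: drop fully empty rows
print(df.head())
```

read_html grabs every table on the page indiscriminately, so expect to inspect the list and tidy column names before doing any analysis.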
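When a positional index isn't enough, Beautiful Soup lets you anchor each table to its section. A sketch under the same assumptions (requests and bs4 installed; most Wikipedia data tables carry the wikitable class):

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

resp = requests.get(URL, headers={"User-Agent": "wiki-table-demo"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

for table in soup.find_all("table", class_="wikitable"):
    # Walk backwards in document order to the nearest section heading.
    heading = table.find_previous(["h2", "h3"])
    title = heading.get_text(strip=True) if heading else "(no heading)"

    # Header cells give a quick fingerprint of what the table contains.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    print(title, "->", headers[:5])
```

Pairing tables with their headings this way makes it possible to select a table by section name rather than by a brittle positional index.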
These methods make extraction straightforward, though the tables' inconsistencies mean some cleanup is always part of the job. Overall, leveraging Wikipedia as a free and accessible resource made data collection surprisingly easy; with a little effort to clean and organize the tables, the data can yield valuable insights for almost any project.