
What would the process be for aggregating data from hundreds of websites into a master database? I would then run calculations on the master database and resell the results.

Thank you.


3 Answers

  1. Get legal permission to use each website's data
  2. Hire a team of script writers to scrape the data from each website
  3. Collect the data
  4. Verify that the data was collected correctly (each time you scrape a site)
  5. Store it in the master database
  6. Update the collection scripts as needed, for example when a site changes its layout
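
Steps 3 and 4 above could look roughly like the following. This is only a minimal sketch: the sample HTML, the two-column item/price layout, and the names `CellParser`, `scrape`, and `verify` are assumptions made up for illustration, and a real site would need its own parsing rules. It uses Python's standard `html.parser` module on an inline string rather than fetching a live page.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for one scraped site.
SAMPLE_HTML = (
    "<table>"
    "<tr><td>widget</td><td>9.50</td></tr>"
    "<tr><td>gadget</td><td>12.00</td></tr>"
    "</table>"
)


class CellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape(html):
    """Step 3: collect the raw rows from one page."""
    parser = CellParser()
    parser.feed(html)
    return parser.rows


def verify(rows, expected_columns=2):
    """Step 4: fail loudly if the scrape no longer matches the expected shape."""
    if not rows:
        raise ValueError("no rows scraped; the page layout may have changed")
    clean = []
    for row in rows:
        if len(row) != expected_columns or not row[0]:
            raise ValueError(f"unexpected row shape: {row!r}")
        # float() raises ValueError if the price cell is not numeric.
        clean.append((row[0], float(row[1])))
    return clean


records = verify(scrape(SAMPLE_HTML))
```

Running the check on every scrape means a silent layout change on the source site surfaces as an error instead of as bad rows in the master database.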

It depends on the data you are getting. If it is data that the websites you are targeting readily share, it will likely be accessible through some kind of data feed that they provide. If it's not readily available, you may need to 'scrape' it, as Gary mentions.

Once you have the data, depending on how you got it, you may need to 'map' it into your database. If you scraped it, you (or whoever did the scraping) may have been able to extract each site's data set in the same layout, so it will already be arranged in a table the same way. If you got it from a feed, such as a .csv or .xml file, you may need to write a script to 'map' each data column onto the arrangement of your database.
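
As a rough illustration of that mapping step, here is a small Python sketch. The feed contents, the field names (`Product Name`, `Cost`, `title`, `amount`), and the target columns (`item`, `price`) are all hypothetical; the point is only that each source gets its own small mapping table onto one shared layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical per-site mappings: source field name -> master column name.
SITE_A_MAP = {"Product Name": "item", "Cost": "price"}
SITE_B_MAP = {"title": "item", "amount": "price"}

# Hypothetical feed contents standing in for downloaded files.
CSV_FEED = "Product Name,Cost\nwidget,9.50\n"
XML_FEED = "<feed><entry><title>gadget</title><amount>12.00</amount></entry></feed>"


def map_record(raw, mapping):
    """Rename the fields of one raw record and keep only the mapped columns."""
    return {target: raw[source].strip() for source, target in mapping.items()}


def from_csv(text, mapping):
    """Read a CSV feed with a header row into master-layout records."""
    return [map_record(row, mapping) for row in csv.DictReader(io.StringIO(text))]


def from_xml(text, mapping, record_tag):
    """Read an XML feed where each record is one element named record_tag."""
    root = ET.fromstring(text)
    records = []
    for node in root.iter(record_tag):
        raw = {child.tag: (child.text or "") for child in node}
        records.append(map_record(raw, mapping))
    return records


site_a = from_csv(CSV_FEED, SITE_A_MAP)
site_b = from_xml(XML_FEED, SITE_B_MAP, "entry")
```

Both feeds end up as lists of dictionaries with the same keys, so everything downstream only has to understand one layout. A missing source field raises a `KeyError`, which is a cheap signal that a feed has changed.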

Specific methods would depend on what type of data you're getting, what kind of database you want to use, and what processing you would like to do with the data. In my experience, I have used PHP to fetch XML, CSV, and TXT feeds and aggregate them into MySQL databases, but I have not done much 'scraping'.
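
For a feel of the aggregation and calculation side, here is a sketch in Python rather than PHP, using the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server. The table name `master`, its columns, and the sample records are assumptions for illustration only.

```python
import sqlite3


def open_master(path=":memory:"):
    """Open the master database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS master (
               site  TEXT NOT NULL,
               item  TEXT NOT NULL,
               price REAL NOT NULL,
               PRIMARY KEY (site, item)
           )"""
    )
    return conn


def load(conn, site, records):
    """Insert one site's records; re-running replaces rows instead of duplicating them."""
    conn.executemany(
        "INSERT OR REPLACE INTO master (site, item, price) VALUES (?, ?, ?)",
        [(site, r["item"], float(r["price"])) for r in records],
    )
    conn.commit()


def average_price_by_item(conn):
    """Example calculation: average price and number of sources per item."""
    cursor = conn.execute(
        "SELECT item, AVG(price), COUNT(*) FROM master GROUP BY item ORDER BY item"
    )
    return cursor.fetchall()


conn = open_master()
load(conn, "site_a", [{"item": "widget", "price": "9.50"}])
load(conn, "site_b", [{"item": "widget", "price": "10.50"},
                      {"item": "gadget", "price": "12.00"}])
load(conn, "site_a", [{"item": "widget", "price": "9.50"}])  # repeat run, no duplicate
summary = average_price_by_item(conn)
```

Keying the table on `(site, item)` is one possible design choice: it lets the same import run on a schedule without piling up duplicate rows, at the cost of keeping only the latest value per site.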


Have a look at scraperwiki.com. It is the Mechanical Turk for scraping websites.

