I need a small web crawler app built. It needs to run in either C or Java on a Linux AWS cloud server. It needs to pull data from a MySQL database. The table will have 6 fields.
id            int
domain        varchar(255)
page          varchar(255)
date_crawled  datetime(4)
page_content  ntext
status_code   int

The app needs to read from the table, get a record, fetch the content at the URL formed from the domain + page fields using a specific http_user_agent, and store the data in the page_content field of the database. The app needs to be multi-threaded and able to process multiple pages at the same time. I need to be able to set the number of concurrent threads/pages/downloads so that I can upgrade or downgrade the cloud server depending on available resources. I also need to be able to set the default timeout limit for a single HTTP request.
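A rough sketch of the fetch step described above, in Java 11+, might look like the following. It is only an illustration under stated assumptions: the class and method names are hypothetical, the "http://" scheme and the handling of a missing leading slash in page are guesses about how domain + page should be joined, and the database read/update is left as a comment rather than implemented.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CrawlerSketch {

    // Builds one GET request from a row's domain and page values.
    // Assumption: plain "http://" and a "/" separator between domain and page.
    static HttpRequest buildRequest(String domain, String page,
                                    String userAgent, int timeoutSeconds) {
        String path = page.startsWith("/") ? page : "/" + page;
        return HttpRequest.newBuilder(URI.create("http://" + domain + path))
                .header("User-Agent", userAgent)             // the configurable user agent
                .timeout(Duration.ofSeconds(timeoutSeconds)) // per-request timeout
                .GET()
                .build();
    }

    // Fetches every row using a fixed-size pool; "threads" is the concurrency setting.
    // Each row is a hypothetical {domain, page} pair already read from the table.
    static void crawl(List<String[]> rows, int threads,
                      String userAgent, int timeoutSeconds) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(timeoutSeconds))
                .build();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String[] row : rows) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> resp = client.send(
                            buildRequest(row[0], row[1], userAgent, timeoutSeconds),
                            HttpResponse.BodyHandlers.ofString());
                    // Here the real app would update status_code, page_content, date_crawled.
                    System.out.println(row[0] + row[1] + " -> " + resp.statusCode());
                } catch (Exception e) {
                    // Timeouts and connection errors land here; the row would be marked failed.
                    System.out.println(row[0] + row[1] + " failed: " + e);
                }
            });
        }
        pool.shutdown();                           // accept no new work
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for in-flight fetches
    }
}
```

The thread count, user agent, and timeout are passed in as parameters here so they could come from a config file or command-line arguments.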
After each HTTP request succeeds or fails, the app needs to update status_code, page_content, and date_crawled. If you need to add another field to the table to handle processing a record, we can do that as well. The app needs to remove all non-printable characters from the page source before saving it in the database. Lastly, the app needs to write its progress to a log file that I can monitor with tail -f logname.log. If this project works out, I will have a lot more work in the future.
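The cleanup and logging requirements above could be sketched in Java roughly as follows. This is an illustration, not a finished design: the class and method names are hypothetical, and "non-printable" is assumed here to mean ASCII control characters other than tab, line feed, and carriage return, which may be narrower or wider than what is actually wanted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDateTime;

class CleanAndLog {

    // Removes ASCII control characters (0x00-0x1F and 0x7F) but keeps
    // tab, line feed, and carriage return. Keeping those three is an assumption.
    static String stripNonPrintable(String source) {
        return source.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    // Appends one timestamped line to the log file, creating the file if needed.
    // synchronized so lines from different worker threads do not interleave.
    static synchronized void log(Path logFile, String message) throws IOException {
        String line = LocalDateTime.now() + " " + message + System.lineSeparator();
        Files.writeString(logFile, line, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Because each call appends a complete line, the file can be followed with tail -f while the crawler runs.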