Web scraping, web harvesting, and web data extraction are closely related terms for collecting information from websites. Web scraping bots access the World Wide Web either directly over HTTP (Hypertext Transfer Protocol) or through a web browser. While web scraping was once done manually, today the term generally refers to an automated process that uses a bot or a web crawler. You can think of it as downloading specific, openly available data from the web into personal or business storage, where it is either processed immediately or kept for later use.
Web scraping generally consists of two parts: (1) fetching the HTML page, and (2) extracting specific data points from that page. Businesses now leverage web scraping as a major source of data, which is then used in numerous processes, from customer acquisition to e-commerce pricing strategies.
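The two parts can be sketched with nothing but the Python standard library. This is a minimal illustration, not how any particular scraping service works: the `fetch_html` and `extract_prices` helpers, the `example-scraper/0.1` user agent, and the `class="price"` markup are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Part 1: fetch the raw HTML of a page over HTTP."""
    # The user agent string is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PriceExtractor(HTMLParser):
    """Part 2: collect the text of every element with class="price"."""

    def __init__(self) -> None:
        super().__init__()
        self._in_price = False
        self.prices: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Assumes price elements contain only text, so any end tag closes them.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html: str) -> list[str]:
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


# Hypothetical page content, used here instead of a live request.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""

prices = extract_prices(SAMPLE_HTML)
print(prices)
```

On a real site you would pass the result of `fetch_html(url)` to `extract_prices`, and the extraction rule would have to match that site's actual markup.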
How does it work?
- Requirements submission – To start the cycle, you submit certain information to PromptCloud, such as the sites to be crawled, the fields to be extracted, and the frequency of crawls. The frequency matters because many factors change depending on whether you need the data updated every hour or every month.
- Feasibility study – PromptCloud’s team then analyzes your requirements and runs a feasibility check. This is important for assessing the complexity of the process, the man-hours required, the technologies to be used, and more. Not all websites allow web crawling, and the team makes sure that no rules or terms of service are violated by scraping a particular website. Adherence to the target site’s robots.txt file is the most important aspect in this regard.
- Payment – There are different pricing options that we discuss in detail below.
- Crawler setup – The feasibility study produces a plan, and this step sets up the scraping infrastructure based on the data requirements. Once the crawlers are set up, web scraping can start.
- Data cleansing – According to PromptCloud, their USP lies in end-to-end web scraping solutions, which is why they take data cleansing just as seriously as data scraping.
- Data delivery – Once all the steps are done, PromptCloud delivers the data in the format you agreed upon earlier, so that you can plug it directly into your existing system.
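The robots.txt check mentioned in the feasibility step can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch of the general idea, not PromptCloud's actual process; the robots.txt content, the `example-bot` user agent, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download the file
# from the target site (for example with RobotFileParser.set_url + read).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("example-bot", "https://example.com/products/1")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/report")

# Requested pause between requests, in seconds (None if not specified).
delay = parser.crawl_delay("example-bot")

print(allowed_public, allowed_private, delay)
```

A check like this covers only robots.txt; a site's terms of service still have to be reviewed separately.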
2. CrawlBoard – Self-service tool
CrawlBoard is a requirement-gathering tool that makes the web scraping workflow even simpler. Anyone can sign up, and once you do, all you need to fill in is your organization name, email, and country.
To specify our needs, we had to enter a group name for the scraping requirement. It could be anything, such as e-commerce or stock-info. In another text box, we had to specify the fields we needed. Next, in a numbered list, we had to provide the websites we needed to extract data from.
Once the requirement was set, only the frequency and format of the data delivery were left to fill in. We could submit all this along with any additional information in a separate text area.
Once this is done, PromptCloud’s team gets back to you, and you can see the result of their feasibility study on the information you provided, as well as the pricing. The look and feel were minimalistic but intuitive. We had never seen a semi-automated web scraping engine before, and thought it was quite cool that the entire process could happen without hundreds of phone calls or emails.
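To make the shape of such a requirement concrete, here is a rough sketch of the information the form collects, written as a small Python data structure. The tool itself is a web form, so the `CrawlRequirement` class, its field names, and the validation rules are purely hypothetical and only mirror the inputs described above.

```python
from dataclasses import dataclass

# Delivery formats mentioned in this review.
ALLOWED_FORMATS = {"xml", "json", "csv"}


@dataclass
class CrawlRequirement:
    """Hypothetical model of one scraping requirement."""

    group_name: str            # e.g. "e-commerce" or "stock-info"
    fields: list[str]          # data fields to extract
    sites: list[str]           # websites to crawl
    frequency: str             # e.g. "daily", "weekly", "monthly"
    data_format: str           # one of ALLOWED_FORMATS
    additional_info: str = ""  # free-text notes

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty if none found."""
        issues = []
        if not self.group_name.strip():
            issues.append("group name is empty")
        if not self.fields:
            issues.append("no fields specified")
        if not self.sites:
            issues.append("no sites specified")
        if not self.frequency.strip():
            issues.append("frequency is empty")
        if self.data_format.lower() not in ALLOWED_FORMATS:
            issues.append(f"unsupported format: {self.data_format}")
        return issues


requirement = CrawlRequirement(
    group_name="e-commerce",
    fields=["product name", "price", "availability"],
    sites=["https://example.com", "https://example.org"],
    frequency="weekly",
    data_format="json",
)
print(requirement.problems())
```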
3. Data delivery methods
PromptCloud can deliver data to you in various ways. While the PromptCloud delivery API comes at no extra cost, S3, Dropbox, Box, or FTP deliveries all cost an extra $30 per month and come with additional benefits. You can ask for the data in XML, JSON, or CSV format. Data can also be downloaded directly with a button click in CrawlBoard. However, automated data download via the API is far more efficient when dealing with large volumes.
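Whichever format you choose, your own code has to turn the delivered file into records. The sketch below loads the same record from JSON, CSV, and XML using only the Python standard library. The actual layout of PromptCloud's deliveries is not described here, so the sample payloads, the `record` element name, and the field names are assumptions for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def load_json(text: str) -> list[dict]:
    """Assumes the payload is a JSON array of flat objects."""
    return [dict(item) for item in json.loads(text)]


def load_csv(text: str) -> list[dict]:
    """Assumes the first row holds the column names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def load_xml(text: str) -> list[dict]:
    """Assumes <records><record><field>value</field>...</record></records>."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


# Hypothetical payloads describing the same single record.
JSON_SAMPLE = '[{"name": "Widget", "price": "9.99"}]'
CSV_SAMPLE = "name,price\nWidget,9.99\n"
XML_SAMPLE = (
    "<records><record><name>Widget</name><price>9.99</price></record></records>"
)

from_json = load_json(JSON_SAMPLE)
from_csv = load_csv(CSV_SAMPLE)
from_xml = load_xml(XML_SAMPLE)
print(from_json, from_csv, from_xml)
```

Normalizing every format into a plain list of dictionaries keeps the rest of the pipeline independent of the delivery format you picked.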
4. Pros and cons:
While everything does have pros and cons, we found more pros than cons in this fully-managed web scraping service.
- The large number of customization options is one of the most important factors that caught our eye. Not only do you have full freedom to specify which data fields you want and which format you want your data in, but you can also change the requirements later, depending on changes in your own business workflow.
- The flexibility of the crawler is crucial as well, since it can be programmed to fetch and aggregate data from pages with complex workflows. For example, the data fields that make up one record might be scattered over several pages.
- One of the most obvious advantages is how intuitive CrawlBoard makes data acquisition. Getting scraped data from the web has become as simple as buying goods on Amazon.
- The ability to scale up is another beauty of the system. You could very well start with scraping data from 5 target sites and end up with 25 at the end of the year as you find your business booming. This way, you will be able to scale up (or even down) as and when required.
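The case of one record being scattered over several pages can be pictured as a merge on a shared key. The following is a minimal sketch of that idea, not a description of PromptCloud's crawler; the `product_id` key, the page contents, and the `merge_partial_records` helper are hypothetical.

```python
from collections import defaultdict


def merge_partial_records(pages, key="product_id"):
    """Combine partial records from several pages into one record per key."""
    merged = defaultdict(dict)
    for page in pages:
        for partial in page:
            # Later pages add fields to (or overwrite fields of) the record.
            merged[partial[key]].update(partial)
    return list(merged.values())


# Hypothetical extraction results from three different page types.
listing_page = [
    {"product_id": "A1", "name": "Widget"},
    {"product_id": "B2", "name": "Gadget"},
]
detail_page = [
    {"product_id": "A1", "price": "9.99"},
    {"product_id": "B2", "price": "24.50"},
]
review_page = [
    {"product_id": "A1", "rating": "4.5"},
]

records = merge_partial_records([listing_page, detail_page, review_page])
print(records)
```

Records that are missing on some pages (like the rating for `B2` here) simply end up with fewer fields, which is something the data cleansing step would then have to handle.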
On the downside, you would need your own technical team to absorb the data (e.g., to automate data downloads via the API). Also, note that PromptCloud only provides the data; how you consume it is entirely up to you.
5. Customer support
In line with their belief in a complete scraping solution, PromptCloud provides all their clients with dedicated customer support, so that whenever you have an issue with the data, need some changes, or have a doubt, you can reach out to them directly and have it resolved.
Customer support is very important when you are using a web scraping service, since it gets you quick help with troubleshooting and unforeseen issues.
6. Pricing
PromptCloud has pricing options for different needs. The major per-site plans are $49 (monthly crawl), $79 (weekly crawl), and $99 (daily crawl). These plans come with varying numbers of free records and data fields. PromptCloud’s web scraping service aimed at enterprises that need very large amounts of data starts from $3,999 per month. It is a customized solution with no limits on records, data fields, API access, or data archival period. However, for this one, you will have to request a quote. There is a separate one-time setup fee of $99, which they waive for annual maintenance plans. Separate charges apply for custom requests such as image downloads.
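As a rough worked example, the figures quoted in this review can be combined into a first-year cost estimate. The calculation rests on assumptions that the review does not confirm: that the per-site plan prices are billed per month, that the setup fee is charged once per account, and that the $30 external delivery fee is a flat monthly amount. Treat the `first_year_cost` helper as a back-of-the-envelope sketch, not a quote.

```python
# Prices in USD as quoted in this review.
# ASSUMPTION: plan prices are per site per month.
PLAN_PRICE_PER_SITE = {"monthly": 49, "weekly": 79, "daily": 99}
SETUP_FEE = 99              # ASSUMPTION: charged once per account
EXTERNAL_DELIVERY_FEE = 30  # per month for S3, Dropbox, Box, or FTP delivery


def first_year_cost(sites, frequency, annual_plan=False, external_delivery=False):
    """Estimate the first-year cost for a number of sites at one crawl frequency."""
    monthly = sites * PLAN_PRICE_PER_SITE[frequency]
    if external_delivery:
        monthly += EXTERNAL_DELIVERY_FEE
    # The setup fee is waived for annual plans.
    setup = 0 if annual_plan else SETUP_FEE
    return 12 * monthly + setup


# 5 sites crawled weekly, no annual plan, delivery via the API:
# 12 * (5 * 79) + 99 = 4839
basic = first_year_cost(5, "weekly")

# Same sites on an annual plan with S3 delivery:
# 12 * (5 * 79 + 30) + 0 = 5100
with_s3 = first_year_cost(5, "weekly", annual_plan=True, external_delivery=True)

print(basic, with_s3)
```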
They also have something that other crawling service providers don’t offer: each client gets PromptPoints (their loyalty points) based on the number of sites, referrals, and the length of the partnership. These points can be redeemed for discounts on monthly invoices.
Many companies are struggling with data, and it seems PromptCloud is trying its best to democratize access to web data for them. They are making it simpler for a business owner to explain web scraping needs and get the results back within a short span of time. For small and non-tech companies whose core expertise lies elsewhere, and for whom building an in-house web scraping team would not be justified in terms of ROI, PromptCloud’s fully managed web scraping service can prove to be a boon.