Data scraping with web services and R
On Saturday, the Research Group "Machine Learning and Social Computing" held a seminar. It included a workshop on data scraping and a discussion of its research applications.
Research Group members Anastasia Kuznetsova and Viktor Karepin explained how to scrape data from the Internet. Anastasia talked about Data Miner, a Google Chrome extension with a user-friendly design that lets users specify their own settings and scripts for downloading data. Its main convenience is that you mostly do not have to write your own code for web scraping, and the collected data is immediately available in csv and xlsx formats. It also lets users create public scripts ("recipes"). However, if a site has poorly structured markup, it may be difficult to download the required data.
One solution to this problem is scraping with R. Viktor described how to use the rvest package to download data by selecting the tags of an HTML page. A more complex part of the workshop covered the RSelenium package, a tool that automates the actions of a web browser and so allows you to extract data from a web page.
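As an illustration of tag-based extraction with rvest, here is a minimal sketch. It is not the code shown at the workshop: the HTML fragment, the class names (hotel, name, rating), and the hotel names are hypothetical, chosen only to resemble a listing page. The sketch assumes rvest version 1.0 or later is installed.

```r
library(rvest)

# Hypothetical page fragment standing in for a real listing page.
# minimal_html() builds a parsed document from a string, so no network is needed.
page <- minimal_html('
  <div class="hotel">
    <h2 class="name">Hotel A</h2>
    <span class="rating">4.5</span>
  </div>
  <div class="hotel">
    <h2 class="name">Hotel B</h2>
    <span class="rating">3.9</span>
  </div>
')

# Select every hotel block with a CSS selector (tag plus class).
hotels <- html_elements(page, "div.hotel")

# For each block, take the first matching child element and read its text.
result <- data.frame(
  name   = html_text2(html_element(hotels, "h2.name")),
  rating = as.numeric(html_text2(html_element(hotels, "span.rating")))
)

print(result)
```

For a live site, read_html() with the page URL would take the place of minimal_html(), and the selectors would have to match that site's actual markup.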
The seminar was attended by Anna Shirokanova, senior research fellow at the Laboratory for Comparative Social Research:
It is only rarely that I happen to visit the workshops of the Research Group on Machine Learning, but they are always a wonderful experience. The workshop on topic modeling last year grew into a scientific collaboration, conference presentations, and a publication -- and today was no less interesting! What I liked most was the engaging, excited, and yet clear way both speakers presented their topics -- which is hugely contagious. Besides, it makes for real peer-to-peer learning. I also enjoyed how smoothly the guys switched to examples of how they applied those methods in their own scientific projects, on information about hotels in St. Petersburg or on mapping the tags on the 'museum night.' Personally, I was also interested in comparing the work of RSelenium with what we learned at a Python elective. I think that data culture, which everyone is talking about, is cultivated at seminars like this one.
At the end, the seminar participants discussed possible applications of this data collection method, and Research Group members shared their own experience. In particular, the TripAdvisor case was discussed in detail: its data was used to analyze reviews of hotels in St. Petersburg and to create geo-ratings.
Anna Shirokanova
Associate Professor