Build a simple scraper with Ruby


Web scraping can often seem like a daunting task, especially when dealing with complex websites with many different elements. However, using Ruby and the Nokogiri gem, we can simplify this task and scrape websites more effectively.

In this post, I will illustrate how to scrape Wikipedia to fetch specific elements from a webpage. Please remember that web scraping should always be done responsibly, in compliance with the website’s terms of service.

Setting Up:

First, install the nokogiri gem with the following command: gem install nokogiri. The open-uri library used alongside it ships with Ruby, so it needs no separate installation.

The Code:

require 'nokogiri'
require 'open-uri'

def scrape_wikipedia(url)
  # Download the page and parse it into a searchable document.
  document = Nokogiri::HTML(URI.open(url))

  # The first h1 on the page holds the article title.
  page_title = document.css('h1').first.text
  puts "Page Title: #{page_title}"

  # The infobox is the table carrying both the infobox and vevent classes.
  infobox_vevent = document.css('.infobox.vevent')
  infobox_title = infobox_vevent.css('.infobox-title.summary').text
  puts "Infobox Title: #{infobox_title}"

  # Indexes are zero-based: [2] is the third row and [3] is the fourth.
  tbody_tr_elements = infobox_vevent.css('tbody tr')
  third_tr_element = tbody_tr_elements[2]
  fourth_tr_element = tbody_tr_elements[3]

  # Only print the rows when the infobox has at least four of them.
  if third_tr_element && fourth_tr_element
    third_label = third_tr_element.css('.infobox-label').text.strip
    third_data = third_tr_element.css('.infobox-data').text.strip
    puts "#{third_label}: #{third_data}"

    fourth_label = fourth_tr_element.css('.infobox-label').text.strip
    fourth_data = fourth_tr_element.css('.infobox-data').text.strip
    puts "#{fourth_label}: #{fourth_data}"
  end
end

scrape_wikipedia('https://en.wikipedia.org/wiki/Ruby_(programming_language)')

Code Breakdown:

The script above uses Ruby’s open-uri library and the nokogiri gem to scrape data from Wikipedia. The scrape_wikipedia method takes a URL, and URI.open downloads the HTML at that URL. Nokogiri::HTML then parses the HTML into a document object that we can query from Ruby.

We use CSS selectors to target specific elements on the page. document.css('h1').first.text fetches the text of the first h1 element, which is usually the page’s title.

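The .infobox.vevent selector matches the element that carries both the infobox and vevent classes, which is the infobox on this Wikipedia page; it generally holds summary information about the page’s topic. Within it, the .infobox-title.summary selector matches the cell that holds the infobox’s title.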

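We also fetch two specific rows within the infobox’s tbody, the third and fourth (zero-based indexes 2 and 3), and extract the label and data from each. The if check skips this step when the infobox has fewer than four rows.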

Conclusion:

And there you have it! A simple way to extract specific information from a Wikipedia page using Ruby and Nokogiri.

Remember, while web scraping can be a powerful tool, it’s essential to use it responsibly to respect the website’s terms and resources.

Happy scraping!
