Introduction to Nokogiri. Extract core website data
Recently I was thinking of building a directory website that would list other websites/apps under different categories.
I would need to display some core data for each website, like:
title
description
- logo (
favicon
) - screenshot/
opengraph image
Gem Nokogiri lets us easily parse HTML. It is one of the foundational Ruby gems, that many other gems rely on.
Here’s how we can parse any URL and extract code data with Nokogiri:
require "nokogiri"
require "open-uri"
require "uri"
class UrlCrawlerJob < ApplicationJob
queue_as :default
def perform(url)
html = URI.open(url)
# Parse the HTML with Nokogiri
doc = Nokogiri::HTML(html)
# Extract the page title
page_title = doc.at("title")&.text
# Extract the favicon URL
favicon = doc.at('link[rel="icon"]')
favicon_url = favicon ? URI.join(url, favicon["href"]).to_s : nil
# Extract the meta description
meta_description = doc.at('meta[name="description"]')&.[]("content")
# Extract the OpenGraph image
og_image = doc.at('meta[property="og:image"]')&.[]("content")
og_image_url = og_image ? URI.join(url, og_image).to_s : nil
{
page_title:,
favicon_url:,
meta_description:,
og_image_url:
}
end
end
Store, Retrieve data #
Let’s assume you want to have a model that has a url and stores the extracted data.
I prefer to store this kind of data in json
instead of creating new attributes each time.
rails g model website url payload:json
Parse an url and store the extracted data:
url = "http://superails.com/"
website = Website.create(url:)
payload = UrlCrawlerJob.perform_now(website.url)
website.update(payload:)
Now you can get values with digging the json/hash:
website.payload["page_title"]
=> "SupeRails"
You can make it easier with OpenStruct
:
# app/models/website.rb
require "ostruct"
def payload_data
OpenStruct.new payload
end
# payload_data.favicon
# website.payload_data.page_title
# search anywhere in the payload:
# Listing.ransack(payload_cont: "superails").result
ransacker :payload do
Arel.sql("payload")
end
That’s it!
Did you like this article? Did it save you some time?