FR-DE EU Electronic eInvoicing standard (FacturX, ZUGFeRD)
Reading invoice data from PDF is a $1B+ problem.
Currently, the most common solutions are:
- Paid specialized software like Mindee, where you upload the PDF and they return a JSON with data. (I do not recommend Mindee)
- OpenAI or similar tools that have OCR (optical character recognition) (Good solution)
- You can use a Ruby gem like pdf-reader to extract text from the PDF, but you still need to turn that text into an object with key-value pairs.
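To make the last option concrete, here is a minimal sketch of the second half of that work: turning already-extracted text into key-value pairs. The sample text, the field labels, and the `parse_invoice_text` helper are all hypothetical. Real invoices vary in layout, which is exactly why this approach is fragile.

```ruby
# Hypothetical text, standing in for what a text extractor might return.
SAMPLE_TEXT = <<~TEXT
  Invoice No: 471102
  Invoice Date: 05.03.2018
  Total: 529.87 EUR
TEXT

# Hypothetical label-to-pattern map; every invoice layout needs its own.
FIELD_PATTERNS = {
  invoice_number: /Invoice No:\s*(\S+)/,
  invoice_date: /Invoice Date:\s*(\S+)/,
  total: /Total:\s*([\d.]+)/,
  currency: /Total:\s*[\d.]+\s*([A-Z]{3})/
}.freeze

def parse_invoice_text(text)
  FIELD_PATTERNS.transform_values do |pattern|
    match = text.match(pattern)
    match && match[1] # nil when the label is missing or worded differently
  end
end

p parse_invoice_text(SAMPLE_TEXT)
```

Any change in wording ("Invoice #", "Rechnungsnummer") silently yields `nil`, so each supplier layout ends up needing its own patterns.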
The 🇪🇺 EU eInvoicing standard has a solution: attach XML metadata to the PDF. No need for AI, OCR, or other invoice-parsing dependencies. Just upload the PDF & read the XML.
From 2025, France & Germany will accept only eInvoices (FacturX) for B2G transactions.
A FacturX-compliant PDF will have an XML file named factur-x.xml or xrechnung.xml attached in its metadata.
This XML file will contain ALL the important data that is visible in the PDF structured as an object in XML format.
I hope other countries will catch up soon.
I hope invoicing software will catch up soon, to stay competitive.
Useful links:
- Read about 🇪🇺 EU eInvoicing
- 🇫🇷 La facturation électronique entre entreprises
- Download technical documentation
I built a small tool to convert eInvoices to JSON data.
Here’s how you can do it.
Extract XML from PDF with Ruby #
Install gem hexapdf to parse PDF metadata.
bundle add hexapdf
Create a job (in a Rails app), or a PORO.
rails g job PdfToXmlJob
Here’s a job that extracts embedded files from a PDF document. Only files that are named factur-x.xml or xrechnung.xml are extracted.
# app/jobs/pdf_to_xml_job.rb
# file_path = 'db/fixtures/factur-x/BASIC/BASIC_Einfach.pdf'
# file_path = 'db/fixtures/factur-x/EXTENDED/EXTENDED_Fremdwaehrung.pdf'
# PdfToXmlJob.perform_now(file_path)
class PdfToXmlJob < ApplicationJob
  queue_as :default

  VALID_FILENAME = %w[factur-x.xml xrechnung.xml].freeze

  def perform(file_path)
    pdf = HexaPDF::Document.open(file_path)
    catalog = pdf.catalog

    if catalog.key?(:Names) && catalog[:Names].key?(:EmbeddedFiles)
      embedded_files_tree = catalog[:Names][:EmbeddedFiles]
      # The entries are stored as a flat [name, reference, name, reference, ...] array
      embedded_files = embedded_files_tree.value[:Names]

      embedded_files.each_slice(2) do |name, ref|
        file_spec = pdf.object(ref)
        file_stream = file_spec[:EF][:F]
        # Prefer the Unicode file name (:UF), fall back to the name tree key
        file_name = file_spec[:UF] ? file_spec[:UF].to_s : name
        next unless VALID_FILENAME.include?(file_name)

        # basename with '.*' strips any extension; gsub! would return nil
        # whenever the path does not contain '.pdf'
        new_file = "#{File.basename(file_path, '.*')}.xml"
        File.binwrite("db/fixtures/xml/#{new_file}", file_stream.stream)
        puts "Extracted file: #{new_file}"
      end
    else
      puts 'No embedded files found in the PDF.'
    end
  end
end
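One detail worth isolating is how the output file name is derived from the input path. The sketch below uses a hypothetical helper, `xml_output_path` (not part of the job above), that strips the real extension whatever its case. The `db/fixtures/xml` default simply mirrors the directory used in this article.

```ruby
# Hypothetical helper: derive the XML output path for a given PDF path.
def xml_output_path(pdf_path, output_dir = 'db/fixtures/xml')
  # File.extname returns the actual extension (".pdf", ".PDF", ...),
  # so basename can strip it regardless of case.
  base = File.basename(pdf_path, File.extname(pdf_path))
  File.join(output_dir, "#{base}.xml")
end

puts xml_output_path('db/fixtures/factur-x/BASIC/BASIC_Einfach.pdf')
puts xml_output_path('invoices/SCAN_0042.PDF', 'tmp')
```

By contrast, `String#gsub!` returns `nil` when nothing is replaced, so a path without a lowercase `.pdf` in it would produce a `nil` file name.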
Write tests:
# test/jobs/pdf_to_xml_job_test.rb
require 'test_helper'

class PdfToXmlJobTest < ActiveJob::TestCase
  # try importing all the files from subfolders within db/fixtures/factur-x
  test 'BASIC_Einfach' do
    file_path = 'db/fixtures/factur-x/BASIC/BASIC_Einfach.pdf'
    PdfToXmlJob.perform_now(file_path)
    assert File.exist?('db/fixtures/xml/BASIC_Einfach.xml')
    File.delete('db/fixtures/xml/BASIC_Einfach.xml')
  end

  test 'BASIC_Rechnungskorrektur' do
    file_path = 'db/fixtures/factur-x/BASIC/BASIC_Rechnungskorrektur.pdf'
    PdfToXmlJob.perform_now(file_path)
    assert File.exist?('db/fixtures/xml/BASIC_Rechnungskorrektur.xml')
    File.delete('db/fixtures/xml/BASIC_Rechnungskorrektur.xml')
  end
end
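The comment in the test hints at covering every fixture in the subfolders. Here is a minimal sketch of the discovery step, assuming the same `db/fixtures/factur-x/<PROFILE>/<name>.pdf` layout. The hypothetical `fixture_pairs` helper lists each PDF together with the XML path the job is expected to write.

```ruby
# Hypothetical helper: pair every fixture PDF with its expected XML output.
def fixture_pairs(pdf_root = 'db/fixtures/factur-x', xml_dir = 'db/fixtures/xml')
  # "**" makes the glob descend into BASIC/, EXTENDED/ and any other subfolder.
  Dir.glob(File.join(pdf_root, '**', '*.pdf')).sort.map do |pdf_path|
    xml_name = "#{File.basename(pdf_path, '.*')}.xml"
    [pdf_path, File.join(xml_dir, xml_name)]
  end
end

fixture_pairs.each do |pdf_path, xml_path|
  puts "#{pdf_path} -> #{xml_path}"
end
```

Inside the test class, the same pairs could drive a loop that defines one test per fixture. Note that two fixtures with the same base name in different subfolders would map to the same XML path.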
Parse the stored XML file with Ruby #
Now you can, for example, find the IBAN or Account Name:
file_path = 'db/fixtures/xml/EXTENDED_Fremdwaehrung_invalid.xml'
raw_xml = File.read(file_path)
parsed_xml = Nokogiri::XML(raw_xml)
iban = parsed_xml.xpath('//ram:IBANID').text
account_name = parsed_xml.xpath('//ram:AccountName').text
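Nokogiri resolves the `ram:` prefix in those queries from the namespaces declared on the root element. A more defensive variant binds the prefix to its namespace URI explicitly and returns `nil` for missing nodes. This sketch uses REXML (bundled with Ruby) rather than Nokogiri purely to stay dependency-free. The XML fragment is a hand-written, heavily trimmed stand-in for a real factur-x.xml with made-up values, and the `text_at` helper is hypothetical.

```ruby
require 'rexml/document'

# Hand-written, heavily trimmed fragment; a real factur-x.xml has many more elements.
SAMPLE_XML = <<~XML
  <rsm:CrossIndustryInvoice
      xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100"
      xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100">
    <rsm:SupplyChainTradeTransaction>
      <ram:ApplicableHeaderTradeSettlement>
        <ram:SpecifiedTradeSettlementPaymentMeans>
          <ram:PayeePartyCreditorFinancialAccount>
            <ram:IBANID>DE02120300000000202051</ram:IBANID>
            <ram:AccountName>Example Supplier GmbH</ram:AccountName>
          </ram:PayeePartyCreditorFinancialAccount>
        </ram:SpecifiedTradeSettlementPaymentMeans>
      </ram:ApplicableHeaderTradeSettlement>
    </rsm:SupplyChainTradeTransaction>
  </rsm:CrossIndustryInvoice>
XML

# Explicit prefix-to-URI binding, independent of how the file spells its prefixes.
NAMESPACES = {
  'ram' => 'urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100'
}

# Hypothetical helper: text of the first matching node, or nil if absent.
def text_at(doc, xpath)
  node = REXML::XPath.first(doc, xpath, NAMESPACES)
  node&.text
end

doc = REXML::Document.new(SAMPLE_XML)
iban = text_at(doc, '//ram:IBANID')
account_name = text_at(doc, '//ram:AccountName')
bic = text_at(doc, '//ram:BICID') # not present in this fragment

puts "IBAN: #{iban}, account: #{account_name}, BIC: #{bic.inspect}"
```

Returning `nil` for an absent element makes it possible to tell "missing" apart from "present but empty", which an empty string from `.text` on an empty node set does not.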
I personally prefer working with the data as JSON.
Install the nori gem to convert the XML to a Ruby hash, which can then be serialized to JSON:
file_path = 'db/fixtures/xml/EXTENDED_Fremdwaehrung_invalid.xml'
raw_xml = File.read(file_path)
parser = Nori.new
# Nori#parse takes the raw XML string and returns a nested Ruby hash
hash = parser.parse(raw_xml)
invoice_currency_code = hash.dig('rsm:CrossIndustryInvoice', 'rsm:SupplyChainTradeTransaction', 'ram:ApplicableHeaderTradeSettlement', 'ram:InvoiceCurrencyCode')
due_payable_amount = hash.dig('rsm:CrossIndustryInvoice', 'rsm:SupplyChainTradeTransaction', 'ram:ApplicableHeaderTradeSettlement', 'ram:SpecifiedTradeSettlementHeaderMonetarySummation', 'ram:DuePayableAmount')
invoice_date = DateTime.parse(hash.dig('rsm:CrossIndustryInvoice', 'rsm:ExchangedDocument', 'ram:IssueDateTime', 'udt:DateTimeString'))
invoice_number = hash.dig('rsm:CrossIndustryInvoice', 'rsm:ExchangedDocument', 'ram:ID')
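Since the goal is JSON, the long `dig` paths can be collected in one place and the result serialized with the standard `json` library. In this sketch the `FIELD_PATHS` map and the `invoice_summary` helper are hypothetical, and the nested hash is a hand-written stand-in (with made-up values) for what Nori would return for a real invoice.

```ruby
require 'json'

# Hand-written stand-in for the hash Nori would produce (values are made up).
hash = {
  'rsm:CrossIndustryInvoice' => {
    'rsm:ExchangedDocument' => {
      'ram:ID' => '471102',
      'ram:IssueDateTime' => { 'udt:DateTimeString' => '20180305' }
    },
    'rsm:SupplyChainTradeTransaction' => {
      'ram:ApplicableHeaderTradeSettlement' => {
        'ram:InvoiceCurrencyCode' => 'EUR',
        'ram:SpecifiedTradeSettlementHeaderMonetarySummation' => {
          'ram:DuePayableAmount' => '529.87'
        }
      }
    }
  }
}

DOCUMENT = ['rsm:CrossIndustryInvoice', 'rsm:ExchangedDocument'].freeze
SETTLEMENT = ['rsm:CrossIndustryInvoice', 'rsm:SupplyChainTradeTransaction',
              'ram:ApplicableHeaderTradeSettlement'].freeze

# Hypothetical map from output key to the path inside the parsed hash.
FIELD_PATHS = {
  invoice_number: DOCUMENT + ['ram:ID'],
  invoice_date: DOCUMENT + ['ram:IssueDateTime', 'udt:DateTimeString'],
  invoice_currency_code: SETTLEMENT + ['ram:InvoiceCurrencyCode'],
  due_payable_amount: SETTLEMENT + ['ram:SpecifiedTradeSettlementHeaderMonetarySummation',
                                    'ram:DuePayableAmount']
}.freeze

def invoice_summary(parsed)
  # dig returns nil as soon as a key along the path is missing.
  FIELD_PATHS.transform_values { |path| parsed.dig(*path) }
end

json = JSON.pretty_generate(invoice_summary(hash))
puts json
```

Keeping the paths in one map means a new field is a one-line addition, and a missing element shows up as `null` in the JSON instead of raising.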
Next steps in a Rails app #
- Upload & store PDFs in an Invoice model
- Validate whether the PDF has a valid FacturX XML attached (& store the validity boolean)
- Store the extracted XML files
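For the validation step, a first approximation could be a structural check on the extracted XML. The sketch below is only that: the hypothetical `facturx_xml?` helper confirms that the string parses and that its root element is a `CrossIndustryInvoice` in the expected namespace. It is not a substitute for validating against the official schemas. REXML is used here only to keep the example dependency-free.

```ruby
require 'rexml/document'

CII_NAMESPACE = 'urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100'

# Hypothetical check: does this string look like a Cross Industry Invoice?
def facturx_xml?(raw_xml)
  root = REXML::Document.new(raw_xml).root
  return false if root.nil? # empty or root-less input

  # Element#name is the local name, without the "rsm:" prefix.
  root.name == 'CrossIndustryInvoice' && root.namespace == CII_NAMESPACE
rescue REXML::ParseException
  false # malformed XML
end

minimal = %(<rsm:CrossIndustryInvoice xmlns:rsm="#{CII_NAMESPACE}"/>)
puts facturx_xml?(minimal)
puts facturx_xml?('<Invoice/>')
```

The resulting boolean is what could be stored on the Invoice record next to the uploaded PDF.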