URL to PDF Microservice (github.com/alvarcarto)
319 points by kimmobru on Oct 5, 2017 | 92 comments


Unfortunately, the "sensible defaults" don't seem to check input URLs correctly and allow file:// URLs.

Just try ?url=file:///etc/passwd on the demo instance.

That seems to be quite a common issue with services like this built on generic libraries.


Oh wow, that's a bit embarrassing :) The URLs are now restricted to http and https only. Thanks for noticing!
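
In essence it's just a scheme whitelist. A simplified sketch in Node (not the literal code from the repo, and isAllowedUrl is just an illustrative name):

    const { parse } = require('url');

    const ALLOWED_PROTOCOLS = ['http:', 'https:'];

    // Hypothetical helper: reject anything that isn't plain http/https,
    // so file:///etc/passwd and friends never reach the browser.
    function isAllowedUrl(input) {
      const { protocol } = parse(input); // parse() doesn't throw; protocol may be null
      return ALLOWED_PROTOCOLS.includes(protocol);
    }

    // isAllowedUrl('file:///etc/passwd') -> false
    // isAllowedUrl('https://example.com') -> true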


I am not at a computer now, so I can’t test it, but do you take redirects into account? I hope you are not just whitelisting the initial URL, but also any URLs it redirects to. If you don’t already, you should probably just disable redirects in whatever library you use.


I gave this a thought for a moment. Since we're using a real browser, there are a huge number of ways to get the browser to display a file:// link: a redirect is one, window.location.href is another, etc. The service shouldn't be run publicly on the internet for real use cases. If you do, the server should be designed so that it's not dangerous if the web server user gets read access to the file system. I added a warning about this at the top of the README.


You are using headless Chrome, so you can use group policy to add "file://" to the URL blacklist; see http://www.chromium.org/administrators/policy-list-3#URLBlac...


By default, if headless Chrome hits a redirect to a file:// URL, it returns net::ERR_UNSAFE_REDIRECT.

window.location.href = 'file:///' will produce the console error "Not allowed to load local resource".


This is ripe for an easter egg if you request a file:// URL.

PDFs support compression, right? I wonder how hardened they are to zip bombs...


Consider adding FTP.

It might be better to blacklist file:// rather than trying to have a comprehensive whitelist.


Not sure if this is sound advice. Blacklisting is a cat and mouse game, especially for security. The risk of a missing entry on a blacklist is worse than on a whitelist.


I've moved all my PDF generation to the client side. Less security exposure and less processing on the server.

http://pdfmake.org/#/gettingstarted


Interesting. There's jsPDF as well which is similar: https://github.com/MrRio/jsPDF


Does anyone have advice on rendering wide tables (with lots of text) with this library? We have a constant cat-and-mouse game with it.


Then you can't send PDFs in emails or pre-cache PDFs...


On second thought, if you're OK with linking to PDFs instead of attaching them to emails, then you can pre-cache all the content needed to generate the PDFs. Besides email attachments, I can't think of a use case for generating PDFs server-side.


Is there a browser support matrix somewhere? This looks real promising but we need to support IE9. :(


Also consider non-Western languages (CJK, Russian, Arabic). My team got bit by this when we discovered we couldn't render emoji client-side.


That didn't work for me. Was it patched already?

    {"status":400,"statusText":"Bad Request","errors":[{"field":["url"],"location":"query","messages":["\"url\" must be a valid uri with a scheme matching the http|https pattern"],"types":["string.uriCustomScheme"]}]}


FWIW there are a lot of things Chrome won't do properly out of the box (fonts, emoji... more). Most projects like this will work for _most_ of the web, but there's a lot of nuance in getting it working across the board. This is something I've been working on for quite some time: https://browserless.io


I was playing around with Puppeteer the other day and was wondering if it was possible to render a web page to a single-page PDF (a page with fixed width and variable height). Basically like creating a screenshot without losing the text information. This would solve a lot of problems, such as sticky elements hiding text, as in this example [1].

[1]: https://url-to-pdf-api.herokuapp.com/api/render?url=https://...


That's a good idea! You can achieve this by adding e.g. &pdf.width=1000px&pdf.height=10000px parameters.

Sometimes you can get rid of the sticky headers with the &emulateScreenMedia=false parameter if the page has well-implemented @media print rules in its CSS. We decided to use page.emulateMedia('screen') with Puppeteer to make PDFs look more like the actual web page by default.

Pages which use lazy loading for images may look incorrect when rendered. The &scrollPage=true parameter may help with this: it scrolls the page to the bottom before rendering the PDF.

Using these options makes the PDF better: https://url-to-pdf-api.herokuapp.com/api/render?url=https://...
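
To make that concrete, here's a rough sketch of combining those query parameters into one request URL (parameter names are the ones above; example.com and the exact values are just placeholders):

    const { stringify } = require('querystring');

    // Render a single tall page: scroll first so lazy-loaded images appear,
    // then use a fixed width and a generous height.
    const params = stringify({
      url: 'https://example.com',
      scrollPage: true,
      'pdf.width': '1000px',
      'pdf.height': '10000px',
    });

    console.log(`https://url-to-pdf-api.herokuapp.com/api/render?${params}`);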


I bet that breaks on infinite scroll pages?


Why use headless chrome instead of the PDF lib used by chrome directly? https://pdfium.googlesource.com/pdfium/


For starters, documentation: you can't even understand what PDFium IS from that page. After some searching I see that it can do rasterization, from PDF to e.g. PNG, but I couldn't find any mention of it being capable of generating a PDF from a URL. Can it?


PDFium is a PDF renderer, i.e. it takes a PDF and turns it into an image. That's not what this does.


Seems useful; https://github.com/arachnys/athenapdf does a pretty similar thing.


I wonder how scalable the service is...

At least with PhantomJS I felt like my system would begin to lock up if there were too many instances rendering at the same time (and it didn't appear to be an issue of too little memory).

Nonetheless, this looks promising.


Scaling is definitely a challenge. Rendering image-heavy sites requires quite a lot of RAM. But as others have stated, the good news is that you can quite easily scale this horizontally by adding more servers. There's no shared state between the server instances behind a load balancer.

There's also room for improvement in how efficiently a single server instance can render PDFs. The API doesn't yet support resource pooling, which would make it possible to reuse the same Chrome process (with e.g. 4 tabs). The implementation requires careful consideration, since in that model it's possible to accidentally share content from a previous request with new requesters.
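
For what it's worth, a very rough sketch of what a tab pool could look like with Puppeteer (hypothetical code, not the API's implementation; a real pool would also have to reset page state between requests for exactly that isolation reason):

    const puppeteer = require('puppeteer');

    // Hypothetical pool: one Chrome process, N reusable tabs.
    async function createPagePool(size) {
      const browser = await puppeteer.launch();
      const idle = await Promise.all(
        Array.from({ length: size }, () => browser.newPage())
      );
      return {
        async render(url) {
          const page = idle.pop(); // naive: assumes a tab is free
          try {
            await page.goto(url, { waitUntil: 'load' });
            return await page.pdf({ format: 'A4' });
          } finally {
            idle.push(page); // real code should clear cookies/storage here
          }
        },
      };
    }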


> Rendering image-heavy sites requires quite a lot of RAM.

Is that the limiting factor? How many would you optimally do in parallel if RAM wasn't an issue?


You need to fire up a new headless Chrome process every time you create a PDF; it's not scalable, but it works.

I wonder if this part of chrome could be easily extracted as a C++ library.


You can actually reuse a running instance and create a new context with this: https://chromedevtools.github.io/devtools-protocol/tot/Targe.... The issue is that most libraries don't have an API for this (not sure why), and long-running Chrome instances can get into a quirky state, plus other issues. Certain parameters require you to start Chrome with the right flags, so reusing a running Chrome process doesn't always work.
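
The "reuse a running instance" part is doable with Puppeteer alone by connecting to an existing Chrome over its DevTools WebSocket; a sketch (the URL is a placeholder, and the isolated-context part via Target.createBrowserContext is exactly what most libraries don't expose):

    const puppeteer = require('puppeteer');

    (async () => {
      // Launch Chrome once and remember its DevTools WebSocket endpoint...
      const browser = await puppeteer.launch();
      const endpoint = browser.wsEndpoint();

      // ...later requests reconnect to the same process and just open a new
      // tab, avoiding Chrome startup cost (but inheriting its accumulated state).
      const reused = await puppeteer.connect({ browserWSEndpoint: endpoint });
      const page = await reused.newPage();
      await page.goto('https://example.com');
      const pdf = await page.pdf();
      await page.close();
    })();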


I suspect that "extracting" this part of chrome into a library would result in the same thing: a library that starts up the full chrome environment and prints a page to PDF.

The PDF-ization isn't the part that's hard to extract (there are libraries to create PDFs from scratch already available, and they're small/intelligible). Rather, it's the rendering of a webpage for display that's the hard part, and what most of the code in any web browser is concerned with. Whether that display is a monitor or a PDF doesn't change much.


I've looked, and it's not easy. There are different ways you can wrap the internals of Chrome, but really, it's hard (and officially not recommended, from memory) to pull out just a subset to work with.


It's using headless Chrome, and there is a serverless version of headless Chrome (chromeless), so it should be pretty scalable.


A Chrome is a Chrome... headless Chrome uses almost the same resources as desktop Chrome (same engine and all that).


Yes, but that doesn't matter in regards to the question "Is it scalable?" which OP asked.


But it's not scalable; making it serverless just makes it scalable at a huge cost.


You are contradicting yourself.

It is scalable. It scales linearly (and for practical purposes indefinitely) with the amount of money you spend on AWS Lambda. It might not have a nice constant factor, but it is scalable.


This is just pedantic. Anything can be scalable using that definition. Heck, I could hire a third-world worker to manually draw the documents, scan them to PDF using a scanner, and put them on a server, and it would "scale linearly with the amount of money I spend" on labor.


> Anything can be scalable using that definition.

No, some things can't be, like badly architected monoliths, or databases. Especially with databases it's not a given, which is why, for quite some time in recent years, things like "MongoDB is web scale" blew up and people started mindlessly asking "is it scalable" (which, as you figured out, means very little for a lot of systems). I'm also pretty sure that your example scales worse than linearly, since you have to introduce multiple levels of management at some point.

"scalable" = "can it be scaled", which is not a given for every system


Redrawing the PDFs and scanning them would likely not scale linearly with the amount of money you spend on labor. Labor tends to have diminishing marginal returns. The cost of hiring two workers is more than double that of hiring one worker because there is additional complexity in coordinating the workers.

Also, for the record, this comment I'm making right now is just pedantic.


As long as the workers do not need to coordinate, it should scale.

Round-robin those folks.


If there is only one worker the requests can flow directly to that one worker. If there are two workers, something would need to sit in the middle and determine how to distribute the requests. That extra hop is where the extra complexity comes from.


Kanban


Web rendering is hard. You'll probably have to scale horizontally at some point, and balance load between servers. But a microservice architecture makes this not terrible.


Cool. Nice to learn of another option. What would be the advantages of using this over wkhtmltopdf (https://wkhtmltopdf.org)?


Thanks for the comment! I haven't personally used wkhtmltopdf much, but I like having Chrome as the rendering engine. In theory at least, debugging the PDFs can be done with desktop Chrome's print preview. I don't know about wkhtmltopdf, but url-to-pdf-api supports dynamic single-page apps, which can be beneficial depending on the use case.

Headless Chrome is quite new, so it still has some bugs, but I have a hunch that it will in the end have the most reliable and expected render results.


I am testing Puppeteer/headless Chrome; PDF export still has some bugs. I hit an issue with page sizing, so I am still using wkhtmltopdf for now; the bug is also present in mainline Chrome's save-as-PDF. Chrome has better font rendering, and it knows to embed the web fonts into the PDF, whereas in wkhtmltopdf I have to do it manually.

Link to size bug: https://github.com/GoogleChrome/puppeteer/issues/666


wkhtmltopdf is based on QtWebKit, which is truly a pain in the ass to work with. Way more bugs and much less support for modern CSS than Chromium.


Oddly enough, I just implemented my own one of these in 34 SLOC using Flask and WeasyPrint. I chose to only have it accept HTML in a POST rather than a URL, so that it could render non-publicly accessible pages. You can also pass it a base_url (which it passes on to WeasyPrint) for resolving relative URLs for static assets in the HTML, which are usually publicly accessible. Runs on Heroku for simplicity.


I am using PhantomJS for a similar project running on AWS Lambda and am running into all sorts of rendering bugs/crashes. I wanted to make the switch to Puppeteer, but as of now it requires a higher version of Node.js than Lambda supports. I was in the process of looking at Docker containers for my service; does anyone have any thoughts on Heroku vs Docker?


Hey, I'd love to talk to you more about your issues with getting over to Puppeteer. I'm working on a product that separates application infrastructure from Chrome, as it's a nightmare to try and scale with.

I've written somewhat extensively about the deployment approaches here: https://hackernoon.com/more-than-you-want-to-know-about-head...


That was a great read!


Hey, thanks for the kind words, really appreciate hearing that!



Oh hey! I created something similar with Flask and Docker, except you POST the HTML content and receive a PDF document back. It uses wkhtmltopdf, so it's pretty fast. https://github.com/halfnibble/pdf_service


wkhtmltopdf is a dead end, but very useful and relatively lightweight if it still works for your use case.

Thanks for packaging this up!


Took a quick skim of the README. I have a general question for these web-to-PDF services: is the priority to honor the page styles as set forth in a print.css-type file? Or is it to be as close to a screenshot of the webpage as possible, which is what I think the majority of laypeople would expect?

One of the small things I've recently let myself be bothered by is how divergent HTML/browser snapshots from web.archive.org, archive.is, and Google cache can be, even for relatively simple pages. I've already given up on trying to make (not even sure if it's a good idea) HTML look nice as PDFs.


This service seems to be targeted at producing controlled artefacts from your application (e.g. invoices) where you know what CSS is in use. If you wanted to capture the original design of the page you can use headless Chrome to capture screenshots automatically (https://medium.com/@dschnr/using-headless-chrome-as-an-autom...). Perhaps headless Chrome can also save the HTML plus assets, or some other archive format that would be interactive.
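
If you go the screenshot route, a minimal sketch with Puppeteer (viewport size and output path are made up for illustration):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.setViewport({ width: 1280, height: 800 });
      await page.goto('https://example.com');
      // fullPage captures the whole scrollable page, not just the viewport
      await page.screenshot({ path: 'page.png', fullPage: true });
      await browser.close();
    })();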

Edit: this seems to be a take on saving the HTML plus assets - https://github.com/pirate/bookmark-archiver.


I think PhantomJS is another option. I'm personally looking for a way to screenshot charts produced on a site of mine so I can create thumbnails for a quick preview. I think PhantomJS is what I will try first.

Edit: ahh someone else mentioned it too.


Nice find on the bookmark-archiver tool; thanks for sharing!


I have been aiming for default settings which would render the site as you see it (screenshot style), but so that you can switch the settings to honor @media print CSS rules.

One of the biggest values I think this yet-another-PDF-service has is that if you open the print preview on a desktop Chrome, it should be really close to what the API renders. Should make debugging a bit easier.

The main use case is to render content generated by yourself, e.g. receipts and invoices, but I don't see a reason why it couldn't be used for rendering news or blog articles.


If PDF is not really important for you and HTML is fine, take a look at SingleFile [1]. This is a Chrome extension I wrote some years ago to save a page and all its resources in an HTML file.

[1] https://chrome.google.com/webstore/detail/singlefile/mpiodij...


Partially off topic: does anybody know of a hosted solution that turns a PDF into an HTML page and hosts the output HTML (optionally, also hosting the PDF with a downloadable link)?


https://mozilla.github.io/pdf.js/ would probably be what you would need to build your own. I think you could also use just Google Drive to make something like this.

Not sure of more straightforward hosting options.


Maybe this is too obvious (I assume you've already rejected it as an option), but depending on how desperate/minimum-viable-product you are: Google Drive, Dropbox, etc. offer PDF preview functionality.


You may want to take a look at pdf.js


OT (maybe) question: I've gotten more and more annoyed over the years as links are deleted, decay, etc. Even more so recently. Is there a plug-in that makes a 'personal archive', or potentially on-sends the page to the Archive? It would be useful to be able to search/go back in my timeline to webpages, even if they were just static PDFs.


Wallabag: a self-hostable application for saving web pages | https://news.ycombinator.com/item?id=14686882 (Jul 2017)

Show HN: Kozmos – A Personal Library | https://news.ycombinator.com/item?id=14980075 (Aug 2017)

specifically: https://addons.mozilla.org/en-US/firefox/addon/scrapbook-x/ and https://chrome.google.com/webstore/detail/worldbrain-the-res...


Linked elsewhere in this discussion:

https://github.com/pirate/bookmark-archiver python script

https://chrome.google.com/webstore/detail/singlefile/mpiodij... save as a single html file

> uses "data URI" scheme to embed image and frame contents into the page : the resulting format is not MHT/MHTML


Thanks for the suggestions - not quite what I was looking for. It's not so much bookmarking for later reading as a background process that dumps an archive file onto my hard drive. I'm especially not interested in someone else's side project, since the risk is that they'll fold in X years. My use scenario is being able to re-read a web page I know I saw even if I forgot to bookmark it, i.e., just about anything to do with climate change on a US govt site. A kind of distributed collection of 'truthy-ish' references (a large number of people have the same hash of the original page in their personal archive -> truthy-ish).




I've been working on a WYSIWYG PDF creator and API for a few months, mainly to solve a need at my day job: https://fetchpdf.com

PDF generation is 'fun', and I've tried several of the different options out there for HTML to PDF generation, but settled on wkhtmltopdf.


Ooooooh, there have been a couple projects where I wished I had something like this. Definitely bookmarking this.


If you're interested in archiving/PDF'ing a large list of links (e.g. your browser bookmarks or Pocket list), check out https://github.com/pirate/bookmark-archiver


I would like to adopt something like this, but there are some pretty normal table functions in our current print solution that I don't know how to support in HTML.

E.g. on a multi-page invoice, show a sub-total row at the bottom of each page. Does anyone know how to create this kind of function?


IIRC you can solve that using CSS for print media.

The browser will apply different CSS for the printer, so to speak.


Yes, and instead of px or em, you can use mm for css units (positioning, size, etc.).


Use a <tfoot> element (unless you're actually trying to show a sub-total of just the page you're looking at).


Yeah, I literally need to add an invoice subtotal for the items on that page.


I'd be interested in this as well. The industry standard, PrinceXML, has a huge licensing fee (currently $3800 per server), but it's also rock solid, handles modern CSS, and has hooks for adding things like headers and footers for pages and tables.


Can you get the PDF page to adjust its size based on the HTML page to be rendered?



It looks like Chrome's JavaScript interface exposes options that the command line doesn't. Or else I'm overlooking something, because I couldn't find a way to hide the header and footer (which shows the date, title, URL, and page number) using the command line. But this project does hide the header and footer.

I can't use an externally hosted service like this because some of my URLs are non-public. So when the user requests a PDF, I render the HTML to a temp file on the server, invoke chrome via command line, and serve up the converted PDF.
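
For reference, the knob the JS interface exposes is displayHeaderFooter on page.pdf(); a sketch of that temp-file workflow with Puppeteer instead of the command line (paths are hypothetical):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('file:///tmp/report.html');
      await page.pdf({
        path: '/tmp/report.pdf',
        displayHeaderFooter: false, // omit the date/title/URL/page-number header and footer
        printBackground: true,
      });
      await browser.close();
    })();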


Just in case you missed it, you can clone the git repo for this project and use it inside your network. I'm playing with it now that way.

Right now, I'm using wkhtmltopdf which supports custom headers and footers and I'm mulling over how to do the same with this solution.


@kimmobru, would you happen to know how this would handle printing a multi-page table to PDF? Specifically, I'm hoping for repeating headers per page, and no problems with rows printing half on one page and half on the next page. I would love to replace a paid solution we use which handles this use case with something based on headless Chrome.


We use urlbox.io and are really happy with it. Don't recommend building this yourself.



with no privacy policy link or info on how long the documents are stored on the server.

Stuff like receipts, etc., to be converted to PDF will contain customers' information - not something to be put on a site with no info on how it's used.


That's a good reminder, thanks. I'm working on a similar service, and need to be very clear about our data retention policy.


Doesn't really work


This is Stallman's wet dream.



