I am not at a computer now, so I can't test it, but do you take redirects into account? I hope you are not just whitelisting the initial URL, but also any URLs it redirects to. If you don't already, you should probably just disable redirects in whatever library you use.
I gave this some thought for a moment. Since we're using a real browser, there are a huge number of ways to get the browser to display a file:// link. A redirect is one, window.location.href is another, etc. The service shouldn't be run publicly on the internet for real use cases. If you do, the server should be designed so that it's not dangerous if the web server user gets read access to the file system. I added a warning about this at the top of the README.
Not sure if this is sound advice. Blacklisting is a cat and mouse game, especially for security. The risk of a missing entry on a blacklist is worse than on a whitelist.
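If you want to enforce that in the browser itself rather than only validating the initial URL, request interception in Puppeteer can act as a scheme whitelist. A rough, untested sketch, not taken from this project's code:

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Whitelist approach: abort anything that is not plain http/https,
    // which should also cover redirect hops and script-driven navigations.
    await page.setRequestInterception(true);
    page.on('request', request => {
      if (/^https?:\/\//i.test(request.url())) {
        request.continue();
      } else {
        request.abort();
      }
    });

    await page.goto('https://example.com');
    await page.pdf({ path: 'out.pdf' });
    await browser.close();
  })();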
On second thought, if you're OK with linking to PDFs instead of attaching them in emails, then you can pre-cache all the content needed to generate the PDFs. Besides email attachments, I can't think of a use case for generating PDFs server side.
{"status":400,"statusText":"Bad Request","errors":[{"field":["url"],"location":"query","messages":["\"url\" must be a valid uri with a scheme matching the http|https pattern"],"types":["string.uriCustomScheme"]}]}
FWIW, there are a lot of things Chrome won't do properly out of the box (fonts, emoji, and more). Most projects like this will work for _most_ of the web, but there's a lot of nuance in getting it working across the board. This is something I've been working on for quite some time: https://browserless.io
I was playing around with Puppeteer the other day and was wondering if it was possible to render a web page to a single-page PDF (a page with fixed width and variable height), basically like creating a screenshot without losing the text information. This would solve a lot of problems, such as sticky elements hiding text like in this example [1].
That's a good idea! You can achieve this by adding e.g. &pdf.width=1000px&pdf.height=10000px parameters.
Sometimes you can get rid of the sticky headers with the &emulateScreenMedia=false parameter if the page has well-implemented @media print rules in its CSS. We decided to use page.emulateMedia('screen') with Puppeteer to make PDFs look more like the actual web page by default.
Pages which use lazy loading for images may look incorrect when rendered. The &scrollPage=true parameter may help with this: it scrolls the page to the bottom before rendering the PDF.
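To make that concrete, a request combining those parameters might look roughly like this (assuming the service is running locally on the default port; check the README for the exact endpoint path):

  http://localhost:9000/api/render?url=https://example.com&pdf.width=1000px&pdf.height=10000px&scrollPage=true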
For starters, documentation: you can't even understand what PDFium IS from that page. After some searching I see that it can do rasterization, from PDF to e.g. PNG, but I couldn't find any mention of it being capable of generating a PDF from a URL. Can it?
At least with PhantomJS, I felt like my system would begin to lock up if there were too many instances rendering at the same time (and it didn't appear to be an issue of too little memory).
Scaling is definitely a challenge. Rendering image-heavy sites requires quite a lot of RAM. But as others have stated, the good news is that you can quite easily scale this horizontally by adding more servers. There's no shared state between the server instances behind a load balancer.
There's also room for improvement in how efficiently a single server instance can render PDFs. The API doesn't yet support resource pooling, which would make it possible to reuse the same Chrome process (with e.g. 4 tabs). The implementation requires careful consideration, since in that model it's possible to accidentally leak content from previous requests to new requesters.
You can actually reuse a running instance and create a new context with this: https://chromedevtools.github.io/devtools-protocol/tot/Targe.... The issue is that most libraries don't have an API for this (not sure why), and that long-running Chrome instances can get into a quirky state, among other issues. Certain parameters require you to start Chrome with the right flags, so reusing a running Chrome process doesn't always work.
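To sketch what that per-request isolation could look like against an already-running Chrome (started with --remote-debugging-port=9222), using chrome-remote-interface; this is untested and the attach/print steps are elided:

  const CDP = require('chrome-remote-interface');

  async function renderInFreshContext(url) {
    const client = await CDP();  // connects to localhost:9222 by default
    const { Target } = client;
    try {
      // Isolate this request in its own browser context (like an incognito profile)
      const { browserContextId } = await Target.createBrowserContext();
      const { targetId } = await Target.createTarget({ url, browserContextId });

      // ... attach to targetId, wait for the page to load, call Page.printToPDF ...

      await Target.closeTarget({ targetId });
      await Target.disposeBrowserContext({ browserContextId });
    } finally {
      await client.close();
    }
  }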
I suspect that "extracting" this part of chrome into a library would result in the same thing: a library that starts up the full chrome environment and prints a page to PDF.
The PDF-ization isn't the part that's hard to extract (there are libraries to create PDFs from scratch already available, and they're small/intelligible). Rather, it's the rendering of a webpage for display that's the hard part, and what most of the code in any web browser is concerned with. Whether that display is a monitor or a PDF doesn't change much.
I've looked, and it's not easy. There are different ways you can wrap the internals of Chrome, but really, it's hard (and, from memory, officially not recommended) to pull out just a subset to work with.
It is scalable. It scales linearly (and for practical purposes indefinitely) with the amount of money you spend on AWS Lambda. It might not have a nice constant factor, but it is scalable.
This is just pedantic. Anything can be scalable using that definition. Heck, I could hire a worker to manually draw the pages, scan them to PDF with a scanner, and put them on a server, and it would "scale linearly with the amount of money I spend" on labor.
No, some things can't be, like badly architected monoliths, or databases. Especially with databases it's not a given, which is why a few years back things like "MongoDB is web scale" blew up and people started mindlessly asking "is it scalable?" (which, as you figured out, means very little for a lot of systems). I'm also pretty sure that your example scales worse than linearly, since you have to introduce multiple levels of management at some point.
"scalable" = "can it be scaled", which is not a given for every system
Redrawing the PDF's and scanning them would likely not scale linearly with the amount of money you spend on labor. Labor tends to have diminishing marginal returns. The cost of hiring two workers is more than double that of hiring one worker because there is additional complexity in coordinating the workers.
Also, for the record, this comment I'm making right now is just pedantic.
If there is only one worker the requests can flow directly to that one worker. If there are two workers, something would need to sit in the middle and determine how to distribute the requests. That extra hop is where the extra complexity comes from.
Web rendering is hard. You'll probably have to scale horizontally at some point, and balance load between servers. But a microservice architecture makes this not terrible.
Thanks for the comment! I haven't personally used wkhtmltopdf much, but I like having Chrome as the rendering engine. In theory at least, debugging the PDFs can be done with desktop Chrome's print preview. I don't know about wkhtmltopdf, but url-to-pdf-api supports dynamic single-page apps, which can be beneficial depending on the use case.
Headless Chrome is quite new, so it still has some bugs, but I have a hunch that in the end it will have the most reliable and predictable rendering results.
I am testing Puppeteer/headless Chrome. PDF export still has some bugs; I hit an issue with page sizing, so I am still using wkhtmltopdf for now. The bug is also present in mainline Chrome's "Save as PDF".
Chrome has better font rendering, and it knows to embed the web fonts into the PDF, whereas with wkhtmltopdf I have to do it manually.
Oddly enough, I just implemented my own one of these in 34 SLOC using Flask and WeasyPrint. I chose to have it accept HTML in a POST rather than a URL so that it could render non-publicly accessible pages. You can also pass it a base_url (which it passes on to WeasyPrint) for resolving relative URLs to static assets in the HTML, which are usually publicly accessible. It runs on Heroku for simplicity.
I am using PhantomJS for a similar project running on AWS Lambda and am running into all sorts of rendering bugs and crashes. I wanted to make the switch to Puppeteer, but as of now it requires a higher version of Node.js than what Lambda supports. I was in the process of looking at Docker containers for my service; does anyone have thoughts on Heroku vs Docker?
Hey, I'd love to talk to you more about your issues with moving over to Puppeteer. I'm working on a product that separates application infrastructure from Chrome, as Chrome is a nightmare to try to scale with.
Oh hey! I created something similar with Flask and Docker, except you POST the HTML content and receive a PDF document back. It uses wkhtmltopdf, so it's pretty fast.
https://github.com/halfnibble/pdf_service
Took a quick skim of the README. I have a general question for these web-to-PDF services: is the priority to honor the page styles as set forth in a print.css-type file? Or is it to be as close as possible to a screenshot of the webpage, which is what I think the majority of laypeople would expect?
One of the small things I've recently let myself be bothered by is how divergent HTML/browser snapshots from web.archive.org, archive.is, and Google cache can be, even for relatively simple pages. I've already given up on trying to make HTML look nice as PDFs (and I'm not even sure that's a good idea).
This service seems to be targeted at producing controlled artefacts from your application (e.g. invoices) where you know what CSS is in use. If you want to capture the original design of the page, you can use headless Chrome to capture screenshots automatically (https://medium.com/@dschnr/using-headless-chrome-as-an-autom...). Perhaps headless Chrome can also save the HTML plus assets, or some other archive format that would stay interactive.
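If you go the Puppeteer route, the screenshot case is only a few lines. A rough, untested sketch; the URL, viewport size, and output path are placeholders:

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto('https://example.com', { waitUntil: 'networkidle0' });
    // fullPage captures the whole scroll height; drop it for a viewport-sized thumbnail source
    await page.screenshot({ path: 'capture.png', fullPage: true });
    await browser.close();
  })();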
I think PhantomJS is another option. I'm personally looking for a way to screenshot charts produced on a site of mine so I can create thumbnails for a quick preview. I think PhantomJS is what I'll try first.
I have been aiming for default settings that render the site as you see it (screenshot style), while still letting you switch the settings to honor @media print CSS rules.
One of the biggest values I think this yet-another-PDF-service has is that if you open the print preview in desktop Chrome, it should be really close to what the API renders. That should make debugging a bit easier.
The main use case is to render content generated by yourself, e.g. receipts and invoices, but I don't see a reason why it couldn't be used for rendering news or blog articles.
If PDF is not really important to you and HTML is fine, take a look at SingleFile [1]. It's a Chrome extension I wrote some years ago to save a page and all its resources in a single HTML file.
Partially off topic: does anybody know of a hosted solution that turns a PDF into an HTML page and hosts the output HTML (optionally also hosting the PDF, with a download link)?
https://mozilla.github.io/pdf.js/ would probably be what you'd need to build your own. I think you could also just use Google Drive to make something like this.
Maybe this is too obvious (I assume you've already rejected it as an option), but depending on how desperate/minimum-viable-product you are: Google Drive, Dropbox, etc. offer PDF preview functionality.
OT (maybe) question: I've gotten more and more annoyed over the years as links are deleted, decay, etc., even more so recently. Is there a plug-in that makes a "personal archive" or, optionally, sends the page on to the Archive? It would be useful to be able to search and go back in my timeline to webpages even if they were just static PDFs.
Thanks for the suggestions, but that's not quite what I was looking for. It's not so much bookmarking for later reading as a background process that dumps an archive file onto my hard drive. I'm especially not interested in someone else's side project, since the risk is that they'll fold in x years. My use case is being able to re-read a web page I know I saw even if I forgot to bookmark it, e.g. just about anything to do with climate change on a US government site. A kind of distributed collection of "truthy-ish" references (a large number of people having the same hash of the original page in their personal archives -> truthy-ish).
I would like to adopt something like this, but there are some pretty normal table functions in our current print solution that I don't know how to support in HTML.
E.g. on a multi-page invoice, show a sub-total row at the bottom of each page. Does anyone know how to create this kind of function?
I'd be interested in this as well. The industry standard, PrinceXML, has a huge licensing fee (currently $3800 per server), but it's also rock solid, handles modern CSS, and has hooks for adding things like headers and footers for pages and tables.
It looks like Chrome's JavaScript interface exposes options that the command line doesn't. Or else I'm overlooking something, because I couldn't find a way to hide the header and footer (which show the date, title, URL, and page number) using the command line. But this project does hide the header and footer.
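Presumably it does that through the printToPDF options that the JavaScript side exposes; in Puppeteer it would look something like this (a sketch, not taken from this project's code):

  // inside an async function, with `page` being a Puppeteer Page
  await page.pdf({
    path: 'out.pdf',
    format: 'A4',
    printBackground: true,
    // false suppresses the date/title/URL/page-number header and footer
    displayHeaderFooter: false
  });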
I can't use an externally hosted service like this because some of my URLs are non-public. So when the user requests a PDF, I render the HTML to a temp file on the server, invoke Chrome via the command line, and serve up the converted PDF.
@kimmobru, would you happen to know how this would handle printing a multi-page table to PDF? Specifically, I'm hoping for repeating headers on each page and no problems with rows printing half on one page and half on the next. I would love to replace a paid solution we use, which handles these use cases, with something based on headless Chrome.
Just try ?url=file:///etc/passwd on the demo instance.
That seems to be a quite common issue with services like this built on generic libraries.