InfoReseaux remarked that the licensing of this data is suspicious, to say the least: it is published under CC BY-NC 4.0 but contains ODbL-licensed data from OpenStreetMap, and also from Microsoft.
Christian Quest sent the following message to the authors:
"I’m writing to you because I’m surprised by the choice of data license you’ve set on the GlobalBuildingAtlas dataset.
As mentioned and explained in your paper, at least two data sources you’ve been using to create this dataset are under the Open Database License (ODbL): OpenStreetMap and Microsoft building datasets.
I’ve downloaded the extract of data you’re proposing to have a look at the final dataset, and it confirms that building polygons from OSM (and Microsoft) are present in the resulting dataset in a substantial portion.
In such case, your dataset must be published under the ODbL licence (see 4.2), because it is a derivative database (see 1.0 of ODbL license for definition).
A copy of this message has also been sent to the Legal Working Group of the OSM Foundation.
Thanks in advance to fix quickly the license of the dataset you published. This will also allow OpenStreetMap contributors to use it to improve OpenStreetMap, which is not possible with the CC-BY-NC you choose."
Yes, the dataset also has three entries for Virginia Giuffre, "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)"
I read a recent observation that people subject to discovery are often making purposeful typos in key names in order for the communication to remain under the radar.
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.
I didn't go down the route of LLMs for the clean up, as you're getting into scale and context issues with larger datasets.
I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and your isolated subgraphs are merger candidates.
I wrapped up my code in a little library if you're into this sort of thing.
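For anyone curious what that looks like, here is a minimal sketch of the idea, not the library mentioned above. Two things are swapped in so it runs on the standard library alone, and both are assumptions of this sketch: character-trigram cosine similarity stands in for real semantic embeddings, and a brute-force pairwise loop stands in for Annoy's approximate nearest-neighbour search. The names `trigram_vector`, `cosine`, `merge_candidates` and the `0.6` cutoff are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def trigram_vector(name):
    """Bag of character trigrams; a crude stand-in for an embedding."""
    text = f"  {name.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def merge_candidates(names, threshold=0.6):
    """Link names whose similarity passes the cutoff, then return the
    connected components with more than one member."""
    vectors = [trigram_vector(n) for n in names]
    parent = list(range(len(names)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute force over all pairs; an ANN index would replace this loop at scale.
    for i, j in combinations(range(len(names)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


names = [
    "Virginia Giuffre",
    "Virginia L. Giuffre",
    "Virginia Roberts Giuffre",
    "Jane Doe Number 3 (Virginia Roberts)",
    "Donald Trump",
]
for group in merge_candidates(names):
    print(group)
```

With this cutoff the three "Giuffre" spellings end up in one component. Character overlap alone does not link the "Jane Doe Number 3" entry, which is the kind of case where embedding-based similarity (or a lower threshold and manual review) would have to do the work.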
And in French the inhabitants of "les Etats-Unis" are "Etats-uniens". I've taken the habit of referring to them as USAians, which often gets negative reactions and remains rare - but I find it is the most accurate demonym and I'll keep pushing it.
I look forward to the world inventing demonyms for the citizens of the European Union, because at least it will mean that our emerging national body is getting mindshare !
> I look forward to the world inventing demonyms for the citizens of the European Union, because at least it will mean that our emerging national body is getting mindshare !
The European Union is an emerging country - it is my country. For now, many don't yet understand how common necessity binds us, and some remain under the illusion that they can make it alone against China and the USA, but ever closer union is real and whoever has been on Erasmus student exchange knows we are one people. On my French passport, "Union Européenne" is written above "République Française" - that is the hierarchy. A nation is people who will to live together, and the European Union is that... The rest is a couple treaties and a few decades away !
> Contrast this to the “medias” like Threads, Bluesky, etc - moderation becomes impossible just because of the sheer scale of it all.
Wut ? Moderation at Bluesky is fantastic: users build their block lists and share them for others to subscribe to - moderation à la carte... Power to the users !
I had two accounts banned from BlueSky and they didn't say why. One was parodying Donald Trump so fair enough if they don't want content like that, and they told me it was banned for impersonating Donald Trump. The other, no idea at all because I don't think I even tweeted anything very controversial, and the email was just a very generic "you violated terms of service". My third account was not banned, but I don't use BlueSky any more. It's not a ban-evasion ban, since they're logged in together in the same web browser, with the menu to switch accounts active, and yet my third account was not banned.
My point of sharing this info is that BlueSky is not a user-driven moderation system. It arbitrarily and centrally bans accounts, just like Twitter.
You're right, Bluesky moderation is centralized. Unless content is served p2p, some moderation has to be centralized. At the end of the day, there's a server serving content and that server operator is legally obligated to remove illegal material.
Hopefully, atproto + community will provide alternatives for moderation services. Work is being done on this, we'll see what we end up getting.
I feel that a competitive ecosystem of moderation services is probably the best answer we can hope for to that inherently messy problem.
I don't use anything from pro and I use datastar at work. I do believe in making open source maintainable though so bought the license.
The pro stuff is mostly a collection of footguns that you shouldn't use and that are a support burden for the core team. In some niche corporate contexts they are useful.
You can also implement your own plugins with the same functionality if you want; it's just going to cost you time instead of money.
I find devs complaining about paying for things never gets old. A one-off lifetime license? How scandalous! Sustainable open source? Disgusting. Oh, a proprietary AI model that is built on others' work without their consent and steals my data? Only $100 a month? Take my money!
Thread on the OpenStreetMap forum: https://community.openstreetmap.org/t/is-globalbuildingatlas...