Where there’s data, there’s money
If you’re a creator, marketer, or startup, you’re in the right place. Below is our weekly rundown of top internet moments, AI news, creator trends, and social platform updates.
Does anyone care that Big Tech is scraping the internet to train AI models?
Average users probably don’t, but publishers do. Not just because their information could be proprietary (it probably is) but also because where there’s data, there’s money. And for publishers like online news sites, blogs, and social forums, data has always been the currency.
In the early days, it was phone books and addresses, which evolved into sales lists and CRMs for more detailed, automated targeting. The open web adds to the existing value of user data, which is now contextual, easily accessible, and higher quality than ever.
The value of data hasn’t changed, but the goalposts have shifted.
AI, after all, needs high-quality data to be successful. But publishers aren’t about to let their IP get swallowed up without someone paying for it.
Over the last year, we've seen high-profile legal battles play out between media and AI: The New York Times suing OpenAI and Microsoft over alleged "widespread copying" of content, Getty Images taking Stability AI, the company behind Stable Diffusion, to court over unauthorized use of copyrighted photos, and the ongoing debates around the copyrightability of AI-created works.
There’s a delicate balance between flexing the rules and being protected by IP laws, but Big Tech is pushing the boundary.
A new article from the New York Times, "How Tech Giants Cut Corners to Harvest Data for AI," details the lengths tech companies like OpenAI, Google, and Meta have gone to in collecting data for AI training. And, as the article notes, they’re not really playing by the rules. These companies have done things like:
Transcribed YouTube videos: OpenAI reportedly developed its Whisper audio transcription model to transcribe over a million hours of YouTube videos to train its GPT-4 language model, despite knowing this was a legal grey area.
Considered purchasing publishing houses: Meta explored the idea of buying publishing house Simon & Schuster to procure long works, potentially violating copyright laws.
Discussed obtaining copyrighted data: According to the report, Meta discussed gathering copyrighted data from across the internet, even if it meant facing lawsuits.
Attempted to tap Google’s goldmine: Google's legal department asked the privacy team to draft language broadening how the company could use consumer data from Google Docs, Sheets, and Slides in its AI products.
This brings us to a big point: AI needs lots of good data to work well. As the tech gets better, the demand for more and different kinds of information grows with it. Using internet archives and big data sets isn’t necessarily a new practice. What’s new is the mass awareness and user adoption now in the picture.
So what are media platforms and publishers doing about it?
They’re either lawyering up (see above) or finding a way to benefit.
For example, the Associated Press (AP) agreed to license its archive of news stories going back to 1985 to OpenAI in a deal that also gives the AP access to OpenAI's technology to experiment with in its own work. Reddit recently said it will start charging companies for access to its application programming interface (API), which AI developers use to gather training data, rather than pursuing legal action.
Why does all of this matter?
For the everyday power user of tools like ChatGPT or Stable Diffusion, the impact isn’t immediate, but the nuances are worth paying attention to.
The legal grey area of what’s okay to use and what’s not when it comes to training AI models will continue to shift, blur, and be redefined altogether as data and usage laws for AI take shape.
While the red tape gets untangled, creators should operate under the assumption that their content could be used as training data — if that matters to you. Marketing teams and small businesses using AI tools for content should keep in mind the associated legal risks and tread carefully to avoid copyright infringement.
It’s a complicated issue that, I fear, will get more tangled before it gets clearer. For now, we comply, triple fact-check, and assess the risks worth taking.
(P.S. Scroll to the end for some stray links I bookmarked for you.)
Stay saucey ✌️
—
Generative AI is changing content marketing processes — here’s how
Leaders can now tap generative models to streamline creativity across the lifecycle, from conception to optimization. Read my breakdown of where AI fits into the current content workstream for larger organizations.
Creator & Social News
❇️ Threads tests Trending Topics. Threads is trying a new feature called "Trending Topics" which includes a notifier for live events (like the recent solar eclipse).
It’s likely to compete with 𝕏, which still holds the top spot when it comes to the first platform people turn to during major events like last week’s earthquake. It hasn’t shown major promise on Threads yet since only the solar eclipse has been highlighted, but we’ll keep you posted.
📸 TikTok to enter the photo app market. TikTok is getting close to launching a dedicated photo app which has, apparently, been in development for a while. The app, referred to as "TikTok Notes," might be a response to Instagram's "Notes" and a way for TikTok to replicate some of Meta's offerings. Already, users have reported receiving notifications that their photo posts will be shared to a new "TikTok Notes" app unless they opt out.
🛠 𝕏 undergoes another massive bot purge. Even Elon noticed his followers dropped by 43,000 during a bot purge on 𝕏 last week. Continuing its initiative to combat bots and spam, the 𝕏 Safety account said the routine purge is part of an effort to secure the app and reduce the presence of spam accounts.
🤖 Meta doubles down on AI labeling. Starting in May, Meta will label AI-generated content on Facebook, Instagram, and Threads, and by July it will revise its policy so that it no longer removes manipulated content. The policy aims to improve transparency and promote authenticity, but it could also impact the way users interact with and trust content on the platforms.
📺 LinkedIn gets new ad inventory. LinkedIn announced a new CTV (connected TV) live event ads format, which gives B2B marketers a new way of reaching audiences on TV. The ads can be used to promote events, webinars, and product launches while viewers stream shows and movies.
👻 Snapchat rolls out Advanced Partner Program. Snapchat has launched a new Advanced Partner Program for agencies and selected partners to work closely with Snapchat to create “innovative solutions that build stronger full funnel campaigns and drive results.” Basically, selected partners will get a flashy badge and access to a curated package of “educational and business opportunities” to hone their Snapchat expertise.
Sauce snippets
Stray links, memes, and trends from the internet ✨
The eclipse was beautiful but here’s an honest review
Why Sam Altman is now a billionaire
The best NYC earthquake memes compiled into one thread
The industrial grade glycine discourse explained
My favorite industrial grade glycine influencer content
As a last resort, a woman turned to the internet to relocate her lost husband and it worked — immediately
POV: you wake up next to Zuckerberg