The release of GPT-4 sparked both excitement and concern among experts: excitement over its impressive capabilities, and concern over its ethical implications and the closed nature of the model.
GPT-4 is here.
It’s the successor to the GPT-3 and GPT-3.5 models that brought generative AI to the masses via ChatGPT and set off a frenzy of investor and business activity in gen AI.
A number of notable AI releases are being built and anticipated at the moment. But given that ChatGPT hit 100 million active users two months after its release (making it the fastest-growing app in history), the suspense for GPT-4 was at its apex. In just the three months before this writing, Google and Microsoft both infused their internet search and productivity suites with gen AI (among countless companies adding the technology to their products). So it’s no wonder we got GPT-4 as fast as we did.
OpenAI one-upped the already astounding GPT-3 model, giving GPT-4 multimodality (image understanding), greater accuracy, more creativity and support for longer prompts, among other upgrades.
“The continued improvements along many dimensions are remarkable,” Oren Etzioni, advisor and board member at the Allen Institute for AI, told MIT Technology Review. “GPT-4 is now the standard by which all foundation models will be evaluated.”
The model is already employed by organizations including Duolingo, Stripe, Morgan Stanley, Khan Academy, the Government of Iceland and even us here at Jasper (among many other models we tap). One company, Be My Eyes, is tapping the model’s image recognition to give the visually impaired an app-based, GPT-4-powered visual assistant. The model will also be more widely accessible through ChatGPT Plus, which costs $20 a month, and is reportedly the foundation of Microsoft’s new Bing search engine.
While GPT-4 is already being praised for its impressive capabilities, its release is just as much about its limitations and what it cannot accomplish. Equally important are the ethical considerations GPT-4 presents, along with the fact that the model is closed, in contrast to previous iterations.
All these elements are present in the 98-page technical paper OpenAI wrote on GPT-4. But since many won't get a chance to read that, we’ve broken down the key themes within it along with some reactions from experts in the field.
According to the model’s official blog page, “GPT-4 is more creative and collaborative than ever before. It can generate, edit, and iterate with users on creative and technical writing tasks, such as composing songs, writing screenplays, or learning a user’s writing style.”
It can handle over 25,000 words of text, making it even more suitable for long-form content creation, extensive conversations and document analysis. It’s also 82 percent less likely to create inappropriate content and 40 percent more likely to generate correct responses compared to GPT-3.5, according to OpenAI’s internal evaluations.
“When provided with an article from The New York Times, the new chatbot can give a precise and accurate summary of the story almost every time,” wrote Cade Metz in that very publication. “If you add a random sentence to the summary and ask the bot if the summary is inaccurate, it will point to the added sentence.”
GPT-4 can also create text outputs based on a combination of both text and images. It can handle various kinds of data, such as documents with text, photos, diagrams or screenshots, just as well as it handles pure text inputs.
“GPT-4 accepts prompts consisting of both images and text, which — parallel to the text-only setting — lets the user specify any vision or language task,” OpenAI wrote in GPT-4’s technical paper. “The model generates text outputs given inputs consisting of arbitrarily interlaced text and images. Over a range of domains — including documents with text and photographs, diagrams, or screenshots — GPT-4 exhibits similar capabilities as it does on text-only inputs.”
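To make the interleaved text-and-image prompting concrete, here is a minimal sketch of what such a request payload might look like, using the chat-completions message format. This is illustrative only: the model name and the availability of image input through the API are assumptions, since image input was not generally available to API users at GPT-4’s release.

```python
def build_multimodal_prompt(question: str, image_url: str) -> dict:
    """Assemble a chat-completions request payload that interleaves
    a text question with an image reference, as described in the
    GPT-4 technical paper. Nothing is sent over the network here;
    this only constructs the request body."""
    return {
        # Hypothetical model identifier; substitute whichever
        # vision-capable model your account has access to.
        "model": "gpt-4",
        "messages": [
            {
                "role": "user",
                # "content" is a list so text and image parts can be
                # arbitrarily interlaced, per the paper's description.
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


# Example usage with a placeholder image URL:
payload = build_multimodal_prompt(
    "What is funny about this meme?",
    "https://example.com/meme.png",
)
```

In practice the payload would be passed to the provider’s chat-completions endpoint; the structure above simply shows how a vision task and a language task can be specified in one prompt.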
In one published example, the model not only assessed the full contents of a meme (text and imagery), it offered insight on why it’s funny.
“I don’t want to make it sound like we have solved reasoning or intelligence, which we certainly have not,” OpenAI CEO Sam Altman told the New York Times. “But this is a big step forward from what is already out there.”
GPT-4 has also demonstrated performance levels comparable to humans on a wide range of professional and academic exams. One noteworthy accomplishment is its successful completion of the Uniform Bar Examination, scoring in the top 10 percent of test takers. OpenAI’s tests also show that GPT-4 scored 1,300 (out of 1,600) on the SAT and earned top scores on Advanced Placement exams in biology, calculus, macroeconomics, psychology, statistics and history. Previous GPT iterations failed the Uniform Bar Exam and scored significantly lower on most Advanced Placement tests.
Many early testers have been blown away by GPT-4 capabilities. But it’s far from a perfect model.
“Despite its capabilities, GPT-4 has similar limitations as earlier GPT models,” OpenAI wrote in GPT’s technical report. “Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts.
“GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training data cuts off in September 2021, and does not learn from its experience. It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains, or be overly gullible in accepting obviously false statements from a user. It can fail at hard problems the same way humans do.”
The technical report also discussed GPT-4’s potential to generate harmful, biased and dangerous outputs. Adversa AI, an AI research and deployment startup based in Israel, tested GPT-4 a few hours after its release and bypassed some of the model's safeguards to reveal those same negative results.
Alongside the pitfalls of hallucination and bias comes another key aspect of the technical paper: it did not include details on how the model works or what data its generations are rooted in.
“Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar,” the report read.
Many professionals in and around the generative AI space are concerned that the model’s lack of transparency makes it harder to track where GPT-4’s poor or harmful outputs may stem from. Open-sourced models let outside researchers cross-check the training methods, dataset construction and other elements for sources of bias or harmful results. Shining a light on these pitfalls gives creators opportunities to seal the gaps and improve the models. Some fear that closing a model shuts out outside influence and the community of ethicists working to improve AI for the public good.
“You don’t know what the data is,” Sasha Luccioni, a research scientist at HuggingFace, told Nature. “So you can’t improve it.”
“It’s very hard as a human being to be accountable for something that you cannot oversee,” said Claudi Bockting, professor of Clinical Psychology in Psychiatry at the University of Amsterdam's Faculty of Medicine, to Nature. “One of the concerns is they could be far more biased than for instance, the bias that human beings have by themselves.”
Many experts, like Lightning AI CEO and PyTorch Lightning creator William Falcon, noted why OpenAI may have closed its model to begin with. Competitively, other companies replicated GPT-3 for their own gain because it was open. Closing GPT-4 gave OpenAI a competitive advantage. And while that advantage comes at the cost of transparency, Falcon did note that OpenAI is still thinking about ethics, based on GPT-4’s report and its investment in improving the model.
“They definitely have concerns about ethics and making sure that things don’t harm people,” Falcon told VentureBeat. “I think they’ve been thoughtful about that. In this case, I think it’s really just about people not replicating because, if you notice, every time they launch something new [it gets replicated].”
OpenAI dedicated significant space in the report not only to GPT-4’s limitations but also to the nuanced ethical concerns the model presents. For example, researchers found GPT-3 could adeptly create content that was intentionally misleading, yet persuasive, in attempts to change the narrative around a politically charged subject. GPT-4 is expected to be even better at such efforts, “which increases the risk that bad actors could use GPT-4 to create misleading content and that society’s future epistemic views could be partially shaped by persuasive LLMs,” OpenAI wrote.
The company touched on the model’s potential negative influence over other ethical concerns like the proliferation of conventional and unconventional weapons, privacy, cybersecurity and more. But two in particular stuck out: user overreliance and the economic impacts of GPT-4.
“Overreliance occurs when users excessively trust and depend on the model, potentially leading to unnoticed mistakes and inadequate oversight,” OpenAI wrote. “As users become more comfortable with the system, dependency on the model may hinder the development of new skills or even lead to the loss of important skills. Overreliance is a failure mode that likely increases with model capability and reach. As mistakes become harder for the average human user to detect and general trust in the model grows, users are less likely to challenge or verify the model’s responses.”
Because its model is so powerful, OpenAI is warning users about being too dependent on something like ChatGPT Plus or another API that uses GPT-4.
Regarding economic impacts, OpenAI said, “While these models also create new opportunities for innovation in various industries by enabling more personalized and efficient services and create new opportunities for job seekers, particular attention should be paid to how they are deployed in the workplace over time.”
AI companies caution that these tools should not be a one-for-one replacement for human-operated roles, for all of the reliability and accuracy reasons stated above.
OpenAI also pledged to take things further than just admitting to GPT-4’s imperfections.
“We are investing in efforts to continue to monitor the impacts of GPT-4, including experiments on how worker performance changes on more complex tasks given access to models, surveys to our users and firms building on our technology, and our researcher access program,” GPT-4’s report said.
The researcher access program that language alluded to also ran before GPT-4’s release: “red-teamers” at outside organizations tested GPT-4 early to expose and mend its flaws. One red-teamer was Andrew White, a chemical engineer at the University of Rochester, who tested GPT-4 for six months and provided input that helped make the model safer and more powerful for users.
“We are committed to independent auditing of our technologies, and shared some initial steps and ideas in this area in the system card accompanying this release,” OpenAI’s report read. “We plan to make further technical details available to additional third parties who can advise us on how to weigh the competitive and safety considerations above against the scientific value of further transparency.”
Not everything can be accounted for in reports like OpenAI’s. The technology will present some unpleasant surprises around use cases and long-term impacts. And as the gen AI space continues to heat up, there will also be instances of organizations making tactical, business-first moves around their products. OpenAI’s report was very transparent in some ways and opaque in others, which presents interesting food for thought for the nature of product releases and open- versus closed-source models within this industry.
But ultimately, OpenAI’s efforts to address the harmful outputs and diverse ethical concerns GPT-4 can present are a step in the right direction. More work in that realm will be necessary if these tools are to serve, rather than damage, the public good. All the ethical concerns we have now, and ones we can’t yet imagine, will grow even more pressing as we see GPT-5 and beyond.