The Last Six Months in LLMs: What Actually Changed, and What Didn’t

Over the past six months, large language models have become better at a few practical things: working across text, images, and audio; using tools such as search and code runners; fitting into software products more smoothly; and doing more work at lower cost. That matters because most people now meet AI inside everyday apps, not as a lab demo. The real debate is whether this amounts to a major leap in capability or mostly better packaging around systems that still make basic mistakes.

My view is that the biggest change is usability, not magic. LLMs are moving from impressive chatbots to more useful software components. That is real progress, and it should not be dismissed. But it also does not mean the hard problems are solved. Hallucinations, weak judgment, privacy concerns, copyright disputes, and unclear business value are still very much part of the story.

First, what people mean by “LLM”

An LLM is the part of an AI system that generates language. It is trained on huge amounts of text, and in newer systems it can often work with images, audio, or files too. But one important point often gets lost in the headlines: many recent gains do not come from the model alone. They come from the system around it.

If an AI assistant can search the web, read a PDF, run code, pull data from your company wiki, and then produce an answer, that is not just “a smarter model.” It is also better software design. For non-engineers, this distinction matters because it explains why AI can feel much more useful without becoming fully reliable.

Multimodal stopped being a special feature

One of the clearest shifts is that multimodal AI has become normal. That simply means one system can handle more than one kind of input, such as text, images, screenshots, charts, PDFs, audio clips, or video. Six months ago, this still felt like a headline feature. Now it is increasingly part of the default product experience.

That matters because most real work is not made of plain text. People deal with slides, receipts, forms, whiteboard photos, meeting recordings, dashboards, and messy documents. An AI system that can read a chart, summarize a call, and compare it with a spreadsheet is more useful than one that only responds to typed questions.

The promise here is obvious: less switching between tools, faster summaries, easier search, and better help for people who work across languages or formats. The risk is also obvious: these systems can still misread a chart, ignore small print in a contract, or confidently describe something that is not actually in the image. Multimodal does not mean dependable.

Smaller models got stronger, and that changes the market

Another important development is that smaller models became more capable. Not every useful AI task now requires the biggest and most expensive model on the market. For drafting, summarizing, classification, extraction, translation, customer support routing, and internal search, a smaller model is often good enough.

This is more important than it sounds. Smaller models are cheaper to run, usually faster, and sometimes easier to deploy on private servers or even on devices. That can help with privacy, speed, and cost control. For a business, that may matter more than chasing the very top benchmark score.

The counterpoint is fair: on harder tasks such as complex coding, advanced reasoning, and messy multi-step analysis, the best frontier models still often do better. But the market no longer looks like a simple race where one giant model wins everything. It looks more like a layered market, with different tools for different jobs.
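To make the idea of a layered market concrete, here is a minimal sketch of how a team might route tasks to different model tiers. The task labels, the tier names, and the `choose_tier` function are hypothetical and chosen only for illustration; a real setup would base this on its own testing.

```python
# Hypothetical task labels, grouped the way the article describes them.
SMALL_MODEL_TASKS = {
    "drafting", "summarizing", "classification", "extraction",
    "translation", "support_routing", "internal_search",
}
FRONTIER_TASKS = {"complex_coding", "advanced_reasoning", "multi_step_analysis"}


def choose_tier(task: str) -> str:
    """Return an illustrative model tier for a task label."""
    if task in SMALL_MODEL_TASKS:
        return "small"
    if task in FRONTIER_TASKS:
        return "frontier"
    # Unknown tasks are not guessed at; a person decides.
    return "needs_review"


print(choose_tier("translation"))
print(choose_tier("complex_coding"))
print(choose_tier("tax_advice"))
```

The point of the sketch is the shape of the decision, not the specific lists: routine tasks go to the cheaper tier by default, and anything unrecognized is escalated rather than silently handled.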

The industry shifted from chatbot answers to tool-using systems

The biggest practical change may be this one: the chat box is no longer the whole product. More AI systems now use tools behind the scenes. They search documents, call APIs, run code, check a database, browse websites, or trigger actions in other apps.

This is where the word “agent” usually appears. In simple terms, an agent is an AI system that can take several steps instead of giving one isolated answer. It may search first, then read a file, then compare two sources, then draft a reply. That can make it much more useful for real workflows.

For example, instead of asking a model a general question about your company policy, you can ask a system that searches the current policy documents, pulls the relevant section, and drafts a response. Instead of asking for a rough summary of sales, you can let it query the latest spreadsheet and produce a chart description. This is often a bigger improvement than a small increase in raw model quality.
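The policy example can be sketched in a few lines. This is a simplified illustration, not how any particular product works: the documents in `POLICY_DOCS` are made up, the search is plain keyword overlap rather than a real retrieval system, and the final step only builds the text that would be handed to a model instead of calling one.

```python
import re

# Made-up policy documents, standing in for a company's current files.
POLICY_DOCS = {
    "travel-policy": "Employees may book economy flights. "
                     "Hotel stays are capped at three nights without approval.",
    "expense-policy": "Receipts are required for any expense above 25 dollars. "
                      "Claims must be filed within 30 days.",
}


def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_relevant_section(question, docs):
    """Return (document name, sentence) sharing the most words with the question."""
    question_words = tokenize(question)
    best_name, best_sentence, best_score = None, None, 0
    for name, text in docs.items():
        for sentence in re.split(r"(?<=\.)\s+", text):
            score = len(question_words & tokenize(sentence))
            if score > best_score:
                best_name, best_sentence, best_score = name, sentence, score
    return best_name, best_sentence


def build_prompt(question, docs):
    """Build the text a model would receive, or None if nothing relevant was found."""
    name, section = find_relevant_section(question, docs)
    if section is None:
        return None
    return (f"Answer using only this excerpt from {name}:\n"
            f"{section}\n\nQuestion: {question}")


print(build_prompt("Are receipts required for an expense?", POLICY_DOCS))
```

Even in this toy version, the answer is tied to a named source, and the system declines to build a prompt when it finds nothing relevant.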

But tool use adds new failure points. A system can retrieve the wrong document, use outdated data, mishandle permissions, or take an action you did not mean to approve. In other words, it can now be wrong in what it does, not just in what it says.
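One way to picture these failure points is as checks that run before a tool call goes through. The sketch below is an assumption-laden illustration: the `ToolCall` fields, the `check_tool_call` function, and the 30-day freshness limit are all hypothetical, and real systems would need much more than this.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ToolCall:
    tool: str                  # hypothetical tool name, e.g. "issue_refund"
    user_may_access: bool      # does the user have permission for this data?
    data_as_of: date           # when the underlying data was last updated
    changes_state: bool        # does the call do something, not just read?
    approved_by_user: bool = False


def check_tool_call(call, today, max_age_days=30):
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    if not call.user_may_access:
        problems.append("permission denied")
    if today - call.data_as_of > timedelta(days=max_age_days):
        problems.append("data may be outdated")
    if call.changes_state and not call.approved_by_user:
        problems.append("action needs explicit approval")
    return problems


refund = ToolCall("issue_refund", True, date(2024, 1, 1), changes_state=True)
print(check_tool_call(refund, today=date(2024, 6, 1)))
```

In this example the refund is blocked for two reasons: the data is older than the allowed limit, and nobody approved the action.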

“Reasoning” became the key selling term

If you follow AI news, you have probably seen a lot more talk about reasoning. In product language, this usually means a model spends more time or more compute on a problem before answering. In some cases, that does lead to better results on math, code, logic puzzles, and multi-step tasks.

This is a real improvement, but it is also easy to overstate. A model that performs better on a reasoning benchmark is not automatically better at office work, customer service, law, medicine, or education. Real-world tasks include unclear instructions, incomplete data, conflicting goals, and responsibility for mistakes. Benchmarks only capture part of that.

It is best to treat “reasoning mode” as a trade-off. It is usually slower and more expensive, but it can do better on hard tasks. That is useful. It is not the same thing as broad, stable judgment.

Open models gained ground, even if closed products still lead in some areas

Another major shift is the continued rise of open-weight models. That means the model weights can be downloaded and run by others, rather than being available only through one company’s service. For developers, businesses, and researchers, this matters a lot.

Open models offer more control. They can be fine-tuned for a specific language or domain, deployed in private environments, and studied more closely. They also put price pressure on the market. If a task can be done well enough with an open model, many teams will choose that route.

Closed frontier systems still have advantages: top performance on some tasks, smoother consumer products, bigger safety teams, and stronger integration across services. So the story is not that open models “won.” The story is that users now have more viable choices, and that weakens the idea that only a few companies can provide useful AI.

Longer context windows mattered, but less than people hoped

You may also have heard more about context windows. A context window is the amount of information a model can take in and work with at one time. Bigger context windows are genuinely useful. They allow longer documents, bigger codebases, longer chats, and more reference material in one go.

But this is also an area where marketing can confuse people. A model that can accept a very long document does not necessarily understand it well. It may still miss the one line that matters, focus on the wrong section, or combine unrelated details into a neat but wrong answer.

So yes, larger context is progress. But it is input capacity, not guaranteed comprehension. For non-engineers, that is the key distinction.
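The difference between capacity and comprehension shows up even in a simple check. The sketch below only answers the capacity question. It assumes a rough rule of thumb of about four characters per token for English text, which varies by model and language, and the function names and limits are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def fits_in_context(text: str, context_limit_tokens: int,
                    reserved_for_answer: int = 1000) -> bool:
    """Check whether the text fits, leaving room for the model's reply."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit_tokens


short_report = "a" * 4_000    # placeholder text, about 1,000 estimated tokens
long_contract = "a" * 40_000  # placeholder text, about 10,000 estimated tokens

print(fits_in_context(short_report, context_limit_tokens=8_000))
print(fits_in_context(long_contract, context_limit_tokens=8_000))
```

A result of "it fits" says nothing about whether the model will notice the one clause that matters. That still has to be tested on real documents.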

What did not change

For all the progress, several core limitations are still with us.

They still hallucinate. An LLM can produce false statements, invented citations, or wrong summaries in a very confident tone.
They still depend heavily on setup. The same model can perform very differently depending on the prompt, the tools it can access, and the quality of the source data.
They still do not “know” when they are out of their depth. Guardrails can help, but reliable self-awareness remains weak.
They still create privacy and legal risk. Uploading company files, client data, or copyrighted material is never a neutral act.
They still need human accountability. If an AI system gives bad tax advice, leaks sensitive data, or approves the wrong refund, the responsibility belongs to people and organizations, not the model.

This is why the basic question has not changed: not “Can it answer?” but “Can we trust the answer enough for this use case?”

The real debate: better intelligence, or just better products?

There are two common reactions to recent AI progress, and both are incomplete. One side says the new wave is mostly wrappers and product polish. The other says skeptics are missing a real capability jump. In truth, both sides have part of the picture.

Yes, many improvements came from product engineering: better interfaces, cleaner retrieval, smarter tool calling, faster systems, and tighter integrations. But that does not mean the gains are fake. Those improvements are exactly what turn a novelty into a useful product.

At the same time, raw model ability has improved too, especially on coding, multimodal input, and some structured reasoning tasks. Dismissing all recent progress as marketing would be lazy. But calling it a solved path to reliable digital workers would be just as careless.

The clearest reading of the last six months is this: LLMs became easier to deploy and more useful in narrow workflows, but they did not become universally trustworthy.

What this means for ordinary users

If you are not an engineer, the best way to read AI news is to ignore the noise and ask a few simple questions.

What task actually improved? “Smarter” is vague. “Can extract invoice fields more accurately” is meaningful.
What tools or data is the system using? Many impressive demos rely on search, retrieval, code execution, or access to files.
How expensive is it to be wrong? A weak draft is easy to fix. A bad legal clause, medical summary, or payment action is not.
Does this need a frontier model? Often the best answer is a smaller, cheaper, more controllable system.
Who is checking the output? AI works best where review is built into the process, not added as an afterthought.
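The questions about cost of error and review can be turned into a simple rule of thumb. The sketch below is illustrative only: the task names, the risk levels assigned to them, and the review steps are assumptions, not a standard.

```python
# Hypothetical tasks and how costly a mistake would be for each.
COST_OF_ERROR = {
    "draft_email": "low",
    "invoice_extraction": "medium",
    "legal_clause": "high",
    "medical_summary": "high",
    "payment_action": "high",
}

# Illustrative review step for each level of risk.
REVIEW_STEP = {
    "low": "spot-check",
    "medium": "review before use",
    "high": "mandatory human sign-off",
}


def review_policy(task: str) -> str:
    """Return the review step for a task; unknown tasks are treated as high risk."""
    level = COST_OF_ERROR.get(task, "high")
    return REVIEW_STEP[level]


print(review_policy("draft_email"))
print(review_policy("payment_action"))
print(review_policy("something_new"))
```

Treating unknown tasks as high risk is a deliberate choice here: review is the default, and a task earns lighter checking only once someone has judged that mistakes are cheap to fix.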

The bottom line

The last six months in LLMs were not about one magical breakthrough. They were about steady, sometimes important improvements that made these systems more practical: more multimodal, more connected to tools, cheaper at some tiers, and easier to fit into real software.

That is enough to matter. It is not enough to stop asking hard questions. The sensible response is neither hype nor dismissal. Use LLMs where they save time, where mistakes can be checked, and where the workflow is clear. Be careful where the cost of error is high.

The most useful question now is no longer, “Is this model amazing?” It is, “For this task, under these conditions, is it reliable enough to help?”