Microsoft Researchers Are Training AI to Analyze Spreadsheets

Understanding spreadsheets can be challenging for generative AI models. To address this, Microsoft researchers published a paper on arXiv on July 12 introducing SpreadsheetLLM, an encoding framework designed to help large language models “read” spreadsheets more effectively.

According to the researchers, SpreadsheetLLM has the potential to revolutionize spreadsheet data management and analysis, making user interactions more intelligent and efficient. One of its key benefits for businesses is that users could draw on spreadsheet formulas without having to master them, simply by asking the AI model questions in natural language.

Why Are Spreadsheets a Challenge for LLMs?

Spreadsheets pose several difficulties for large language models (LLMs):

Size: Spreadsheets can be extensive, often exceeding the number of tokens an LLM can process in one go.
Structure: Unlike the linear and sequential inputs that LLMs handle well, spreadsheets have a two-dimensional layout that is more complex to interpret.
Formatting: LLMs are typically not trained to understand cell addresses and specific spreadsheet formats.

How SpreadsheetLLM Works

SpreadsheetLLM comprises two main components:

SheetCompressor: This part of the framework compresses spreadsheets into a format that LLMs can more easily understand. It includes:
- Structural Anchors: These help identify rows and columns within the spreadsheet.
- Token Reduction: This method reduces the number of tokens required for the LLM to process the spreadsheet.
- Cell Clustering: This technique groups similar cells together to improve efficiency.
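
To make the grouping idea concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' implementation: the function names and the sample grid are hypothetical, and the technique shown is a plain inverted index that skips empty cells and lists each distinct value once with all the addresses that hold it.

```python
from collections import defaultdict


def col_letter(index):
    """Convert a 0-based column index to a spreadsheet label (0 -> A, 26 -> AA)."""
    label = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def naive_encoding(grid):
    """Baseline: one 'address,value' entry per cell, empty cells included."""
    return "|".join(
        f"{col_letter(c)}{r + 1},{value}"
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    )


def inverted_encoding(grid):
    """Map each distinct non-empty value to the addresses that contain it."""
    index = defaultdict(list)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != "":  # empty cells carry no information, so skip them
                index[value].append(f"{col_letter(c)}{r + 1}")
    return dict(index)


# Hypothetical sample sheet: a header row, repeated values, and an empty column.
grid = [
    ["Region", "Status", "Notes"],
    ["North", "Open", ""],
    ["South", "Open", ""],
    ["East", "Closed", ""],
]

print(naive_encoding(grid))
print(inverted_encoding(grid))
```

On this small grid the baseline emits one entry for each of the 12 cells, while the inverted index keeps 8 entries because the empty cells vanish and the repeated value is stored once. Real spreadsheets with long runs of blank or repeated cells would shrink far more.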

Using these methods, the researchers reported a 96% reduction in the tokens needed to encode a spreadsheet, along with a 12.3% improvement over the previous leading approach on spreadsheet table detection. They tested their method on various LLMs, including OpenAI’s GPT-4 and GPT-3.5, Meta’s Llama 2 and Llama 3, Microsoft’s Phi-3, and Mistral AI’s Mistral-v2.
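
For a sense of scale, the 96% figure can be converted into a compression ratio with a few lines of Python. This is simple arithmetic rather than anything from the paper's code, and the 100,000-token sheet is a hypothetical example.

```python
def compression_ratio(reduction_fraction):
    """Return original size divided by compressed size for a given reduction."""
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reduction_fraction)


def tokens_after(original_tokens, reduction_fraction):
    """Return the token count left after removing the given fraction."""
    return round(original_tokens * (1.0 - reduction_fraction))


# A 96% reduction leaves 4% of the tokens, i.e. roughly a 25x compression.
print(round(compression_ratio(0.96), 2))

# Hypothetical sheet that needs 100,000 tokens before compression.
print(tokens_after(100_000, 0.96))
```

In other words, a sheet that would have needed 100,000 tokens would need about 4,000 after compression, which is the difference between overflowing many models' input limits and fitting comfortably within them.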

Chain of Spreadsheet: This methodology teaches LLMs how to identify relevant parts of a compressed spreadsheet when answering questions and how to generate responses based on that information.
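
The two-step flow described above can be sketched roughly as follows. This is an assumed outline, not the researchers' code: the prompt wording, `ask_llm`, `load_region`, and the canned `fake_llm` stand-in are all hypothetical, and a real system would call an actual model in place of the stub.

```python
def chain_of_spreadsheet(compressed_sheet, question, ask_llm, load_region):
    """Two-stage question answering: find the relevant range, then answer from it."""
    # Stage 1: show the model the compressed sheet and ask where to look.
    locate_prompt = (
        "LOCATE\n"
        f"Sheet:\n{compressed_sheet}\n"
        f"Question: {question}\n"
        "Reply with the cell range of the relevant table only."
    )
    cell_range = ask_llm(locate_prompt).strip()

    # Stage 2: send only that region back and ask for the answer.
    answer_prompt = (
        "ANSWER\n"
        f"Table ({cell_range}):\n{load_region(cell_range)}\n"
        f"Question: {question}"
    )
    return cell_range, ask_llm(answer_prompt).strip()


def fake_llm(prompt):
    """Canned stand-in for a real model call so the sketch runs offline."""
    return "A1:B3" if prompt.startswith("LOCATE") else "300"


def load_region(cell_range):
    """Hypothetical lookup that returns the cells of a range as text."""
    regions = {"A1:B3": "Item,Sales\nApples,100\nPears,200"}
    return regions[cell_range]


cell_range, answer = chain_of_spreadsheet(
    "Item:A1|Sales:B1|Apples:A2|Pears:A3|100:B2|200:B3",
    "What are total sales?",
    fake_llm,
    load_region,
)
print(cell_range, answer)
```

Splitting the task this way means the second prompt contains only the region that matters, which keeps the input short at the cost of an extra model call.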

Implications for Microsoft’s AI Efforts

For Microsoft, SpreadsheetLLM could significantly enhance its AI assistant, Copilot, which integrates with Microsoft 365 applications like Excel. This advancement is part of Microsoft’s broader effort to make generative AI more practical, especially for users who are less familiar with advanced spreadsheet features.

Real-World Usage and Future Directions

While the 12.3% improvement is notable academically, its practical and economic impact remains to be seen. Generative AI models have been criticized for producing inaccurate information, which could compromise the integrity of large datasets. As noted by the researchers, understanding a spreadsheet’s format is different from accurately generating data within it.

The methodology currently requires substantial computing power and multiple processing passes through an LLM, which can be less efficient than traditional methods. Moving forward, the research team aims to enhance the framework by incorporating details like cell background colors and improving the LLM’s comprehension of how cell contents relate to each other.