
AI for more AI

The HouseReader project was conceived as a research initiative focused on the applicability of computer vision models, multimodal Large Language Models (LLMs), and other artificial intelligence algorithms to the interpretation of interior videos of residential-type assets.

Before delving into the specifics of the project, it's crucial to underscore the significance of generative artificial intelligence's consolidation as a work tool. This advancement has accelerated and unlocked capabilities previously unachievable just a year ago, particularly following the launch of ChatGPT and its accessibility to end users.

For a profile like mine, grounded in technology but not deeply versed in low-level, highly technical, full-stack development, assistants and chatbots such as ChatGPT mark a greater advance in developing a concept, idea, or project than any other tool available to date.

Modern assistants and chatbots offer a genuine "superpower" that goes beyond saving time on known, repetitive, daily tasks: they enable people to tackle challenges, tasks, or projects that were previously beyond their reach or outside their areas of expertise and knowledge.

I want to emphasize a crucial point: While it might seem commonplace today, the effective use of AI chatbots is, and will continue to be, a decisive factor in executing personal or professional projects. Additionally, these tools offer invaluable learning opportunities for the end user.

We are discussing the enhancement of capabilities when facing a challenge, conducting market research, addressing problems, developing concepts, proposing functionalities, evaluating alternative approaches, and seeking solutions to achieve specific goals. This often demands extensive technical knowledge in areas like programming languages, infrastructure, DevOps, security, networks, marketing, design, and more. Typically, ideating and executing a project of this nature would require the involvement of several specialized and dedicated profiles.

The project was ultimately carried out entirely by myself, utilizing spare time during weekdays and weekends over the past two months. The only assistance I had was the GPT chatbot, which served as a faithful companion throughout this journey. Without the aid of ChatGPT and similar assistants, it's not just that the project would have been delayed by months or even a year; it would have been outright impossible for an individual with my profile to execute it within a comparable timeframe.

In conclusion, a potent synergy has emerged: the use of artificial intelligence for the development of AI projects. This combination has arrived and is here to stay.

Project Abstract

Over the past 12 months, a series of presentations and publications have showcased multimodal Large Language Models (LLMs) capable of interpreting videos or image sequences. Trained with billions of parameters, these models merge vision encoders with both open and proprietary large language models. Their objective is to provide end users with advanced instruction and reasoning capabilities.

Promising outcomes have been noted in all computer vision models and tools tested so far, particularly in short-duration videos where multimodal LLMs effectively respond to basic questions about the scene. However, their performance notably diminishes when processing longer videos and striving for a higher degree of response reliability. Additionally, using more advanced proprietary models could significantly raise the cost.

Further complicating matters are additional uncertainty factors, like processing end-user recordings of their home interiors, which can be up to five minutes long, in various formats and qualities. There's also the potential for unintentional overlapping or repetition of spaces. The aim is to interpret spaces, layouts, and count and identify objects, rendering the task highly complex.

The solution adopted for this challenge combines several pre-existing models and logic layers into a comprehensive algorithm, one capable of reliably interpreting a given video of a specific space using only the user's video as input, without any additional hints or parameters.

HouseReader, by executing its video analysis algorithm, is capable of:

  • Providing a general description of your home.
  • Determining the distribution of spaces and rooms that make up the home.
  • Estimating the cost value of the identified elements inside.
  • Detecting potential risks related to:

    • Rapid deterioration of the property.
    • Lifestyles of the people living in the home.
    • Fire, combustion, poor ventilation, or pests.
  • Compiling a census of household objects, with images and labeling.
  • Generating a final report for the user with a complete summary of the analysis.
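These capabilities are produced by chained stages of the analysis pipeline. As a minimal illustration (the stage names and stub implementations below are hypothetical placeholders, not HouseReader's actual code, which spans 30+ scripts), each stage can consume the extracted frames and contribute its partial result to a single report:

```python
# Hypothetical sketch of a staged analysis pipeline; stage names and
# stub bodies are illustrative only, not HouseReader's real implementation.
def describe_home(frames):
    return {"description": f"Home reconstructed from {len(frames)} frames"}

def detect_rooms(frames):
    return {"rooms": ["kitchen", "living room"]}  # stub result

def estimate_contents_value(frames):
    return {"estimated_value_eur": 12000}  # stub result

PIPELINE = [describe_home, detect_rooms, estimate_contents_value]

def build_report(frames):
    """Run every stage in order and merge its output into one report dict."""
    report = {}
    for stage in PIPELINE:
        report.update(stage(frames))
    return report
```

The final PDF generation step would then render this accumulated dictionary for the user.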

This project is scalable to various fields, types of spaces, and functionalities, although the current algorithm primarily focuses on the residential domain.

Computer Vision and Multimodal LLMs

Multimodal LLMs for video analysis can be classified based on how they process the video content. The two main types are:

  • LLM Frame-by-Frame: These models process each frame of the video individually. This can be efficient for tasks such as object and person recognition, but may be less accurate for tasks requiring an understanding of the video context, like video description or video segmentation.
  • LLM Stream-Based: These models process the video as a continuous stream. This can be more efficient for tasks that require an understanding of the video context, but may be less accurate for tasks requiring the recognition of specific objects and people.

In addition to these two main types, there are also other types of multimodal LLMs for video analysis, such as Hybrid LLMs, Attention-based LLMs, and Deep Learning LLMs.
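A frame-by-frame approach must first decide which frames are worth sending to the vision model, since processing every frame of a multi-minute video is wasteful. A minimal sketch of that selection step (the one-second sampling interval is an assumption; any per-frame model could then consume the chosen indices):

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0):
    """Return frame indices spaced roughly `every_seconds` apart, so a
    per-frame model only sees a manageable subset of the full video."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))
```

For a 10-second clip at 30 fps, this yields 10 evenly spaced frames instead of 300, keeping per-frame inference tractable.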

2023 has seen more advances in the publication of visual models and assistants than any previous year. Some examples of the most relevant models that have emerged in this area:

[Figure: examples of relevant multimodal video models]

After conducting multiple tests with various conversational video models, using recordings ranging from 1 to 5 minutes, none extracted information with 100% accuracy regarding the number of rooms in the house, the elements in each room, or their location and distribution. Some models, like LLaVA, can answer some of these questions correctly, but the conclusion is that conversational video models that fully understand video content still leave a vast field open for development.

Potential Use Cases

HouseReader is capable of addressing a range of real-life use cases. To begin with, the following have been identified:

  • Insurance Evaluation for Homes: HouseReader can serve as a tool for gathering initial information from clients about the condition of their homes when seeking home insurance. It is designed to estimate the interior value of the house and identify potential risks that could affect the insurance premium range, either increasing or decreasing it.
  • Pre-Evaluation for Insurance or Technical Interventions: If a home's spaces and elements are pre-labeled and accessible in a database, HouseReader can conduct a preliminary assessment for insurance companies or technicians. This aids in determining or planning an onsite intervention pertinent to a specific incident.
  • Inventory Management for Short-term Rentals or Vacation Properties: HouseReader can be employed to perform an inventory of objects and elements in residential properties designated for short-term or vacation rentals. This tool can keep track of all items within the property before and after each stay, ensuring there are no discrepancies or missing items.
  • Home Inspections and Valuation Support: HouseReader can aid in home inspections and support valuation processes, providing a detailed overview of the property's condition and contents. This comprehensive insight is essential for accurate property appraisal.
  • Enhancement for Massive Commercial Listings: This tool can enhance functionality by allowing users to automatically generate property descriptions or feature lists for large-scale commercial listings, eliminating the need for manual specification.

Technical Overview

Here is a summary of the technological stack utilized, along with the execution and processing phases of the video by the algorithm, which comprises over 30 scripts.

The end-to-end video processing flow is summarized in the following phases:

[Figure: phases of the end-to-end video processing flow]

To illustrate these phases more graphically, the processing follows this sequence, beginning with a video and culminating in the PDF that the user receives in a fully automated manner:

[Figure: processing sequence from input video to the final PDF report]

Cost Efficiency

Cost is a critical consideration in projects like these, which lack formal financial support and face the challenge of high computing demands from the algorithms and the potentially extensive use of proprietary LLMs like GPT-4 or GPT-4 Vision. Consequently, the scope of development has been strategically designed for maximum efficiency, ensuring that the solution remains fully operational and viable at minimal cost.

  • Computing and Infrastructure: The setup involves the use of two interconnected AWS EC2 machines. One machine is dedicated to hosting the Web application, while the other is allocated for computer vision algorithms and additional computations. The algorithm machine is managed (both started and shut down) by the web server machine, ensuring it operates only when a new video is received for processing.
  • Development in Python: The abundance of available and free libraries in Python, coupled with its simplicity and versatility in development, makes the implementation of these applications straightforward.
  • Utilization of Pre-trained, Open-source ML Models: The project employs machine learning models that are pre-trained and openly available on the Hugging Face platform.
  • Limited Use of Proprietary APIs: For tasks such as validating potentially repeated spaces or generating property descriptions, the system makes only 1 to 3 API calls per processed video to GPT-4 Vision.

In summary, the costs for both the project construction phase and ongoing operations are very low. With a typical processing volume, the total operational cost is estimated to be around 6-8€ per day, with the capacity to process nearly 100 videos daily.
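As a back-of-the-envelope check on that figure, daily cost can be modeled as on-demand compute hours plus per-video API calls. All unit prices below are illustrative assumptions, not actual AWS or OpenAI pricing:

```python
def daily_cost_eur(videos_per_day, ec2_eur_per_hour, minutes_per_video,
                   gpt4v_calls_per_video, eur_per_call):
    """Estimate daily operating cost: on-demand compute time plus API calls.
    All inputs are assumptions supplied by the caller, not real price data."""
    compute_hours = videos_per_day * minutes_per_video / 60
    compute_cost = compute_hours * ec2_eur_per_hour
    api_cost = videos_per_day * gpt4v_calls_per_video * eur_per_call
    return compute_cost + api_cost

# Illustrative numbers only: 100 videos/day, a €0.50/h instance,
# 3 minutes of processing per video, 2 GPT-4V calls at €0.01 each.
# daily_cost_eur(100, 0.50, 3, 2, 0.01) -> 4.5
```

With assumptions in this range, the model lands in the same order of magnitude as the €6-8/day estimate above, and it makes clear that shutting down the algorithm machine between videos is the dominant saving.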

Limitations and Roadmap

Current limitations of the development:

  • Object Duplicity: Counting multiple identical objects within a single frame at scale remains an open challenge, since the masking process selects the area where one or several identical objects appear. A potential solution is passing the mask through a Q&A call to a multimodal model (GPT-4V or LLaVA).
  • Pre-trained Models (COCO) for Semantic Segmentation: The pre-trained model used for identifying masks and assigning labels supports 150 general visual classes. There is an opportunity to train a model dedicated to the identification and labeling of interior elements in residential spaces.
  • Sequential Execution of Video Algorithms: Currently, the algorithm processes videos sequentially. Optimizing the platform architecture for parallel executions could be beneficial.
  • Low Quality of Mask Images: This limitation restricts the ability to accurately identify the material of an object, its condition, etc. Improvements in this area are expected progressively.
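Part of the object-duplicity problem above can be mitigated before any multimodal Q&A call by collapsing overlapping detections that share a label. A minimal sketch, assuming boxes as (x1, y1, x2, y2) tuples and an arbitrary 0.5 overlap threshold:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def dedupe(detections, threshold=0.5):
    """Drop detections whose box overlaps an already-kept box with the same label."""
    kept = []
    for label, box in detections:
        if not any(l == label and iou(box, b) > threshold for l, b in kept):
            kept.append((label, box))
    return kept
```

This only removes near-duplicate boxes; genuinely adjacent identical objects inside one mask would still need the multimodal counting step described above.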

Possible new functionalities to incorporate:

  • Adding Input for Location or Cadastral Reference: Enriching the final report with data about the surroundings (POIs, transportation, public facilities, price/m2 in the neighborhood, etc.) would be feasible and valuable.
  • Calculation of Floor Area in Square Meters: This is a complex challenge, but multiple depth-estimation techniques are under active development, such as the Dense Prediction Transformer (DPT) available through Hugging Face's depth-estimation task.
  • Adaptation to Other Types of Assets: The approach could potentially be replicated for office spaces, shopping centers, industrial or logistics spaces, etc.
  • Real Price Retrieval of Objects through Integration with External Sources: This would add significant value by providing more detailed cost analysis.
  • Production of (almost) professional quality renders: The possibility of suggesting real-time redesign of home spaces to the end user based on preferences could be a potential business opportunity for renovations and designers.
  • Automated residential tours: Introducing integrations with tools based on Gaussian Splatting or NeRF techniques for tours in real estate spaces could enhance the output quality significantly.
  • AVM – Automated Valuation Model for Estimating House Value: Utilizing the multiple characteristics collected to estimate the value of the property.
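The AVM idea in the last bullet could start as simply as a weighted score over features HouseReader already extracts. The feature names, base value, and weights below are purely hypothetical placeholders for illustration; a real AVM would be fitted on market data:

```python
# Purely hypothetical base value and weights, for illustration only.
BASE_VALUE_EUR = 100_000
FEATURE_WEIGHTS = {
    "rooms": 15_000,        # contribution per identified room
    "contents_value": 0.5,  # fraction of the estimated interior contents value
    "risk_flags": -5_000,   # penalty per detected risk
}

def estimate_house_value(features: dict) -> float:
    """Linear score: base value plus the weighted contribution of each feature."""
    value = BASE_VALUE_EUR
    for name, weight in FEATURE_WEIGHTS.items():
        value += weight * features.get(name, 0)
    return value
```

Even this naive linear form shows how the census, valuation, and risk outputs of the existing pipeline could feed a property-value estimate; a production model would replace the hand-set weights with a regression trained on comparable sales.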

One Vision

Following the launch of ChatGPT in November 2022, the wave of Large Language Models (LLMs) that has appeared over the past year, and of course all the advances in image generation and understanding (Stable Diffusion, DALL·E 3, Adobe Firefly, ...), the next revolution will come from video analysis.

Just as large language models have been trained by massively ingesting text collected from multiple sources, a similar process has occurred with images.

We will soon see models trained with videos. To date, video transcriptions can be used by converting them into text format and enriching an LLM or a RAG. However, a video is not just audio; it is an immense source of visual information that is currently not being exploited, such as textures, colors, places, movements, reactions, contexts within a scene, and more. This infinite information could be a new revolution in AI and the future models to come.

Conclusion

Given the rapid advancements in computer vision over the past year, it's likely that this project has somewhat preempted the capabilities that are set to emerge in multimodal LLMs, where analyzing long-duration videos with high precision will become perfectly viable.

As I mentioned in the introduction of this paper, for me, developing the concept and the project overall has been an enriching journey and learning experience through computer vision models, programming, infrastructure, LLMs, etc.

I'm uncertain about how it will end or what the specific next steps for the functional project itself will be, but I take away the experience and knowledge gained. I'm grateful for the collaboration in testing from friends, colleagues, and family members who participated, especially my wife Montse and my daughters Violeta and Alba, from whom I have stolen holidays and weekends, and finally to ChatGPT (😉), my always available partner these last few months in this and other AI projects.

I hope you like HouseReader!


Pedro Sánchez Alvarez

Email: pedrosanchezal@gmail.com