News Center

Report on trends in artificial intelligence, machine learning, and data engineering

2023-09-06 15:40:24

Key Takeaways

Generative AI, driven by large language models (LLMs) such as GPT-3 and GPT-4, has gained prominence in the AI and machine learning industry, with products such as ChatGPT driving widespread adoption.

Major tech companies such as Google and Meta have announced their own generative AI models, demonstrating the industry’s commitment to advancing these technologies.

Vector databases and embedding stores have drawn attention for their role in enhancing the observability of generative AI applications.

There is growing concern about responsible and ethical AI, with calls for stricter safeguards on large language models and an emphasis on using AI to improve the lives of all.

Modern data engineering is moving toward decentralized and flexible approaches, exemplified by concepts such as data mesh, which advocates federated data platforms partitioned across domains.

Trend reports provide readers with a high-level overview of topics we believe architects and technology leaders should be paying attention to. In addition, they help the editorial team focus on writing news and recruiting article writers to cover innovative technologies.


In this annual report, the editors explore the current state of AI, ML, and data engineering, as well as emerging trends that you should be paying attention to as a software engineer, architect, or data scientist. We organize the discussion into a technology adoption curve and provide supporting commentary to help you understand where things are going.


For this year's podcast, the editorial team invited external guest Sherin Thomas, a software engineer at Chime, to join the discussion. The following section of the article summarizes some of these trends and where different technologies fall on the technology adoption curve.


Generative AI

Generative AI, including large language models (LLMs) such as GPT-3, GPT-4, and ChatGPT, has become a major force in the AI and machine learning industry. These technologies have received a great deal of attention, especially considering the progress they have made in the past year, and we have seen widespread user adoption, driven above all by ChatGPT. Several companies, such as Google and Meta, have announced their own generative AI models.


We expect the next step will be a greater focus on LLMOps to operate these large language models in enterprise environments. We are currently divided on whether prompt engineering will remain a hot topic in the future, or whether its applications will become so widespread that everyone can contribute to the prompts being used.
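To make the prompt engineering discussion concrete, here is a minimal sketch of one common practice, prompt templating. The template text and function names are illustrative assumptions, not any specific tool's API.

```python
# A minimal sketch of prompt templating, one practice under the
# "prompt engineering" umbrella. All names here are illustrative.

SUMMARY_PROMPT = (
    "You are a helpful assistant.\n"
    "Summarize the following support ticket in one sentence.\n\n"
    "Ticket: {ticket}\n"
    "Summary:"
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a prompt template with user-supplied fields."""
    return template.format(**fields)

prompt = build_prompt(SUMMARY_PROMPT, ticket="App crashes when I tap Export.")
print(prompt)
```

Centralizing templates like this is one of the things LLMOps tooling aims to manage: versioning prompts, testing them, and rolling them out like any other artifact.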


Vector Databases and Embedding Stores

With the rise of LLM technology, vector databases and embedding stores are attracting growing attention. One interesting application is using sentence embeddings to enhance the observability of generative AI applications.


The need for vector search databases stems from a limitation of large language models: their context windows are finite. Vector databases can store document summaries as embedding vectors generated by these language models, and a corpus may yield millions of such vectors or more. With traditional databases, finding relevant documents becomes increasingly difficult as the dataset grows. Vector search databases enable efficient similarity search, allowing users to find the nearest neighbors of a query vector and thereby enhancing the search process.
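The nearest-neighbor lookup described above can be illustrated with a toy brute-force scan over an in-memory store. The document IDs and vectors below are invented for the example; real systems such as Pinecone, Milvus, or Chroma use approximate indexes (e.g. HNSW) precisely to avoid scanning every vector.

```python
# Toy illustration of the similarity search a vector database performs,
# as a brute-force cosine-similarity scan over an in-memory dict.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, vectors):
    """Return the (id, score) of the stored vector most similar to the query."""
    return max(
        ((doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()),
        key=lambda pair: pair[1],
    )

store = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}
print(nearest([0.9, 0.15, 0.0], store))  # doc-c is the closest match
```

The brute-force scan is O(n) per query, which is exactly why dedicated vector databases exist: approximate indexes trade a little recall for sub-linear query time on millions of vectors.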


A notable trend is the surge in funding for these technologies, indicating investors' recognition of their importance. Developer adoption has so far been slower, but is expected to accelerate in the coming years. Vector search databases like Pinecone and Milvus, and open source solutions like Chroma, are gaining traction. The choice of database depends on the specific application scenario and the nature of the data being searched.


Vector databases have shown their potential in various fields, including Earth observation. For example, NASA uses self-supervised learning and vector search techniques to analyze Earth satellite images to help scientists track long-term changes in weather phenomena such as hurricanes.


Robotics and drone technology

The cost of robots is falling. Legged balancing robots were once difficult to obtain, but models are now available for about $1,500, putting robotics within reach of more developers. The Robot Operating System (ROS) remains the leading software framework in this field, but companies like VIAM are also developing middleware solutions to make it easier to integrate and configure plugins for robot development.


We expect that advances in unsupervised learning and foundation models will translate into more powerful capabilities. For example, integrating large language models into a robot's path-planning stack would enable planning via natural language.


Responsible and ethical AI

As AI begins to affect all of humanity, interest in responsible and ethical AI is growing. There have been calls for stricter safeguards on large language models, as well as frustration when model outputs do little more than remind users of the safeguards already in place.


Engineers still need to keep in mind the need to improve the lives of all, not just a few. We expect AI regulation to have an impact similar to that of the General Data Protection Regulation (GDPR) a few years ago.


We have already seen AI failures caused by bad data. Data discovery, data manipulation, lineage, labeling, and good model development practices will become a focus. Data is also critical for interpretability.


Data Engineering

Modern data engineering is shifting toward a more decentralized and flexible approach to manage the growing volume of data. A newer concept, data mesh, has emerged to address the challenge of centralized data management teams becoming bottlenecks for data operations. It advocates a federated data platform partitioned across domains that treats data as a product. This gives domain owners ownership and control over their data products, reducing reliance on central teams. While promising, data mesh adoption may face barriers related to expertise, and it requires advanced tools and infrastructure to enable self-service capabilities.
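The "data as a product" idea above can be sketched as an explicit, discoverable contract that each domain team publishes. The `DataProduct` shape below is purely illustrative, an assumption for this example rather than the schema of any real data mesh tool.

```python
# Sketch of "data as a product": each domain team publishes a data product
# with an explicit contract (schema, owner, freshness SLA). Illustrative only.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner_domain: str          # the domain team accountable for this product
    schema: dict               # column name -> type: the published contract
    freshness_sla_hours: int   # how stale consumers should expect data to be
    tags: list = field(default_factory=list)

orders = DataProduct(
    name="orders.daily",
    owner_domain="checkout",
    schema={"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
    freshness_sla_hours=24,
    tags=["pii-free"],
)
print(orders.owner_domain)  # ownership lives with the domain, not a central team
```

The point of the contract is self-service: consumers in other domains can discover and use `orders.daily` without filing a ticket with a central data team.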


Data observability has become critical in data engineering, much as system observability is in application architecture, and it matters especially in machine learning. Trust in data is critical to the success of AI, and data observability solutions are essential for monitoring data quality, detecting model drift, and supporting exploratory data analysis to ensure reliable machine learning results. This paradigm shift in data management, and the integration of observability across data and machine learning pipelines, reflects the evolution of the modern data engineering landscape.
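As a concrete (and deliberately simplified) example of one such check, the sketch below flags feature drift by comparing a live sample's mean against a training baseline. Production observability tools use richer statistics such as PSI or KS tests; the threshold and data here are arbitrary illustrations.

```python
# Minimal sketch of a drift check: alert if the live feature mean moves more
# than z_threshold baseline standard deviations from the baseline mean.
import statistics

def drift_alert(baseline, live, z_threshold=3.0):
    """Return True if the live sample looks shifted relative to the baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    z = abs(statistics.mean(live) - mean) / stdev
    return z > z_threshold

baseline = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
print(drift_alert(baseline, [10.1, 9.9, 10.4]))  # False: distribution unchanged
print(drift_alert(baseline, [25.0, 26.0, 24.5]))  # True: obvious shift
```

In practice such checks run continuously against incoming data, feeding the same alerting pipelines that application observability already uses.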


Updates to the Adoption Curve

This trend report also includes an updated chart showing our predictions for the current state of some technologies. The categories are based on the book Crossing the Chasm by Geoffrey Moore. We focus on those categories that have not yet crossed the chasm.


One notable promotion, from Innovators to Early Adopters, is AI coding assistants. Although these tools launched only about a year ago, more and more companies are offering them to employees as a service to make them more efficient. They are not yet a default part of every technology stack, and we are still exploring how to use them most effectively, but we believe adoption will continue to grow.


We believe the area currently crossing the chasm is natural language processing. This is not surprising: following the huge success of ChatGPT, many companies are trying to incorporate generative AI features into their products. We have therefore moved it across the chasm into the Early Majority category. There is still plenty of room for growth in this area, and time will tell what the best practices and capabilities of this technology look like.


There are a few categories to watch that have not moved at all. These include synthetic data generation, brain-computer interfaces, and robotics, all of which appear stuck in the Innovators category. The most promising of these is synthetic data generation, which has received more attention recently amid the GenAI hype. We do see more companies talking about generating additional training data, but we have not seen enough applications actually using such data in their stacks to move it to the Early Adopters category. Robotics has been in the spotlight for many years, but its adoption is still too low for us to justify a change.


We also introduced a few new categories in the chart. One notable addition is vector search databases, a byproduct of the GenAI craze: as our understanding of how to represent concepts as vectors improves, the need for efficient storage and retrieval of vectors has grown. We have also added explainable AI to the Innovators category. We believe the ability of computers to explain why they made a particular decision is essential for widespread adoption and for combating hallucinations and other risks. However, we do not yet see enough results in industry to promote it to a higher category.


Conclusion

The fields of artificial intelligence, machine learning, and data engineering are booming year after year, and both their technical capabilities and potential applications continue to expand. For the editors, it is exciting to be this close to the progress, and we look forward to continuing to cover it next year. In the podcast, we made some predictions for the year ahead, ranging from "General artificial intelligence will no longer exist" to "Autonomous agents will become a reality."

