MIT Researchers Teach AI System to Predict Acts From Key Video Frames

MIT researchers have developed an add-on module for Artificial Intelligence (AI) systems that can, by analysing a few frames of a video feed, predict how objects will be changed or transformed by human action.

The module is called Temporal Relation Network (TRN) and it gives AI systems the ability to learn how objects can undergo changes at different times in a video.

The researchers at the Massachusetts Institute of Technology aim to build AI systems that recognise activities more accurately and better comprehend what is happening in the world around them.

Artificial Intelligence Laboratory

Bolei Zhou, a former PhD student in MIT's Computer Science and Artificial Intelligence Laboratory, commented in a blog post: “We built an artificial intelligence system to recognise the transformation of objects, rather than appearance of objects.”

“The system doesn’t go through all the frames — it picks up key frames and, using the temporal relation of frames, recognise what’s going on. That improves the efficiency of the system and makes it run in real-time accurately.”

“That’s important for robotics applications, you want [a robot] to anticipate and forecast what will happen early on, when you do a specific action,” Zhou – currently an assistant professor of computer science at the Chinese University of Hong Kong – added.

MIT Researchers

The researchers trained and tested the module on three crowd-sourced video datasets containing footage of various activities being performed.

The first, made by the company TwentyBN, features 200,000 videos in 174 action categories, for example a hand poking and knocking over a stack of cans.

The second, dubbed Jester, contains 150,000 videos showing 27 different hand gestures. The last dataset, called Charades, teaches the module what different activities look like, such as playing basketball or carrying a bicycle.

According to MIT, when the TRN is fed a video, it “simultaneously processes ordered frames in groups of two, three, and four”, with the frames in each group spaced some time apart. It then judges whether the transformation of objects across those key frames is the result of a specific activity.

“If it processes two frames, where the later frame shows an object at the bottom of the screen and the earlier shows the object at the top, it will assign a high probability to the activity class, ‘moving object down’,” MIT researchers note.
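
To make the idea of ordered frame groups concrete, here is a minimal Python sketch. It is not MIT's code and not the TRN itself: the function names, the random sampling scheme, and the hand-written "moving down" rule are all assumptions made for illustration. The real module learns its frame relations from training data, whereas this toy simply checks whether an object's vertical position increases from earlier to later frames within each sampled group.

```python
import random


def sample_ordered_frame_groups(num_frames, group_sizes=(2, 3, 4),
                                groups_per_size=3, seed=0):
    """Pick frame-index tuples of each size, kept in temporal order.

    Hypothetical helper: the sampling scheme is an assumption,
    not the published TRN procedure.
    """
    rng = random.Random(seed)
    groups = {}
    for size in group_sizes:
        if size > num_frames:
            raise ValueError("group size exceeds number of frames")
        groups[size] = [
            # sorted() keeps each group in earlier-to-later order
            tuple(sorted(rng.sample(range(num_frames), size)))
            for _ in range(groups_per_size)
        ]
    return groups


def moving_down_score(object_y, groups):
    """Toy stand-in for a learned relation.

    Returns the fraction of sampled groups in which the object's y
    coordinate (pixels from the top of the screen) strictly increases
    from each frame to the next, i.e. the object moves down.
    """
    votes = []
    for size_groups in groups.values():
        for indices in size_groups:
            ys = [object_y[i] for i in indices]
            votes.append(all(a < b for a, b in zip(ys, ys[1:])))
    return sum(votes) / len(votes)


# Illustrative data: an object's y position in eight frames of a clip.
object_y = [10, 20, 35, 50, 70, 90, 120, 150]
groups = sample_ordered_frame_groups(len(object_y))
print(moving_down_score(object_y, groups))
```

Because only a handful of frames per group are inspected, the amount of work does not grow with every frame in the clip, which mirrors the efficiency argument Zhou makes above.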

Further Research

The next step for the researchers at MIT will be to integrate object recognition with the activity recognition software. Fortunately, work on training AI to identify objects in video frames is already well under way.
