A Large-Scale RGB-D Dataset for Robots to Recognize Daily Activities of The Elderly

Background

   As part of the response to an aging society, research on elder care robots has been actively carried out around the world. For robots to understand the elderly and provide context-sensitive services, robotic intelligence technologies that can identify various human attributes are essential. Among them, action recognition is a fundamental technology for understanding the intentions behind human behavior and grasping the daily life patterns of human users.
  The massive success of the deep learning approach has enabled rapid improvements in many computer vision tasks. Efforts to create large-scale datasets to accelerate deep learning studies have been underway in extensive research areas, including human action understanding. However, despite the large number of publicly available datasets, adequate data for robots to recognize the daily activities of human users is severely lacking. Most datasets do not consider the robotic environment in which humans and robots live together. Furthermore, there is no large-scale visual dataset at all that deals with the everyday behavior of the elderly. The absence of datasets centered on robots and humans has been a serious impediment to robot intelligence research, especially for elder care robots.

Introduction

[Figure 1] Sample frames of daily actions from the proposed dataset, shown together with the corresponding depth maps and skeleton information obtained from Kinect v2 sensors. Actions (from left to right): eating food with a fork, vacuuming the floor, spreading bedding, washing a towel by hands, hanging out laundry, handshaking.

  To address this shortage of datasets, we collect and release ETRI-Activity3D, the first large-scale RGB-D dataset of the daily activities of the elderly for human care robots.
   The dataset is collected with Kinect v2 sensors and consists of three synchronized data modalities: RGB videos, depth maps, and skeleton sequences. To capture the visual data, 50 elderly people are recruited. The elderly subjects span a wide range of ages, from 64 to 88, which leads to realistic intra-class variation of the actions. In addition, we acquire data from 50 young people in their 20s in the same way as for the older subjects. In total, 112,620 samples are obtained.
  We hope that the proposed dataset, which comprehensively considers the elderly, the robots, and the environment in which they interact, can contribute to the advancement of robot intelligence.

Item                       Contents
Number of samples          112,620
Number of action classes   55
Number of subjects         100 (50 old people, 50 young people)
Collection environment     Residential environment (apartment)
Data modalities            RGB videos, depth map frames, body index frames, 3D skeletal data
Sensor                     Kinect v2

Sample videos of our dataset can be downloaded from the link below.

Download samples

The unique characteristics and advantages of the proposed dataset over existing ones are as follows.

1) A new visual dataset based on observations of the daily activities of the elderly
2) A realistic dataset considering the service situation of human care robots
3) A large-scale RGB-D action recognition dataset that overcomes the limitations of previous datasets

Action Classes

  A closer understanding of what older people actually do in their daily lives is important for determining practical action categories. We visit the homes of 53 elderly people over the age of 70 and carefully monitor and document their daily behavior from morning to night. Based on the most frequent behaviors observed, 55 action classes are defined.

ID Action description ID Action description
1 eating food with a fork 29 hanging out laundry
2 pouring water into a cup 30 looking around for something
3 taking medicine 31 using a remote control
4 drinking water 32 reading a book
5 putting food in the fridge/taking food from the fridge 33 reading a newspaper
6 trimming vegetables 34 handwriting
7 peeling fruit 35 talking on the phone
8 using a gas stove 36 playing with a mobile phone
9 cutting vegetable on the cutting board 37 using a computer
10 brushing teeth 38 smoking
11 washing hands 39 clapping
12 washing face 40 rubbing face with hands
13 wiping face with a towel 41 doing freehand exercise
14 putting on cosmetics 42 doing neck roll exercise
15 putting on lipstick 43 massaging a shoulder oneself
16 brushing hair 44 taking a bow
17 blow drying hair 45 talking to each other
18 putting on a jacket 46 handshaking
19 taking off a jacket 47 hugging each other
20 putting on/taking off shoes 48 fighting each other
21 putting on/taking off glasses 49 waving a hand
22 washing the dishes 50 flapping a hand up and down (beckoning)
23 vacuuming the floor 51 pointing with a finger
24 scrubbing the floor with a rag 52 opening the door and walking in
25 wiping off the dining table 53 fallen on the floor
26 rubbing up furniture 54 sitting up/standing up
27 spreading bedding/folding bedding 55 lying down
28 washing a towel by hands    

Collected Data

   The resolution of the RGB videos is 1920 × 1080. Depth maps are stored frame by frame at a resolution of 512 × 424. The skeleton data contains the 3D locations of 25 body joints for each tracked human body.

Collected Data       Resolution    File Format   Size
RGB Videos           1920 × 1080   MP4           296 GB
Depth Map Frames     512 × 424     PNG           4.08 TB
Body Index Frames    512 × 424     PNG           42.60 GB
3D Skeletal Data     25 joints     CSV           20.83 GB
Total                                            4.44 TB
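The exact directory layout and skeleton CSV column order are not documented on this page, so the following Python sketch only illustrates one plausible way to read a depth map frame and a skeleton sequence; the file paths and the assumed x/y/z column layout are hypothetical.

```python
# Minimal loading sketch. Assumptions: the file paths, the 16-bit PNG depth
# encoding, and the per-frame (25 joints x 3 coordinates) CSV layout are
# illustrative, not the dataset's documented format.
import cv2
import numpy as np
import pandas as pd

# Depth map frames are 512 x 424 PNGs; read them without converting to 8-bit.
depth = cv2.imread("depth/frame_000001.png", cv2.IMREAD_UNCHANGED)
print(depth.shape, depth.dtype)          # expected (424, 512)

# Skeleton data: 25 body joints with 3D positions for each tracked body.
skel = pd.read_csv("skeleton/sample_000001.csv")

# Reshape the first frame's coordinates into a (25, 3) array, assuming the
# CSV stores consecutive x, y, z triples for the 25 joints.
joints = skel.iloc[0].to_numpy(dtype=np.float32)[:25 * 3].reshape(25, 3)
print(len(skel), joints.shape)
```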

Setup

   Considering the height of home robots, each shooting device is equipped with two Kinect sensors mounted at heights of 70 cm and 120 cm, as shown in Figure 2. Four shooting devices form a group, so eight synchronized sensors capture the subjects' actions at the same time. Instead of placing the devices at fixed horizontal angular intervals, we place them at positions where a robot could plausibly appear inside the house. The distance between the sensors and the subject also varies from 1.5 to 3.5 meters. Actions that can be performed anywhere (e.g., taking medicine and talking on the phone) are shot up to five times, changing the place where they occur. In this way, we provide further intra-class variation through different views and background conditions. The group and camera numbers are encoded in the file name of each video sample.

[Figure 2] Layout of the rooms and configuration of the data acquisition system
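Since the group and camera numbers are encoded in each file name, samples can be filtered by view directly from the file list. The field layout below (action, person, group, and camera indices such as A001_P001_G001_C001) is an assumed pattern for illustration; the page above only states that group and camera numbers appear in the file name.

```python
# Hypothetical file name parser; the A/P/G/C field layout is an assumption,
# not a documented naming convention of the dataset.
import re
from typing import Dict, Optional

FILENAME_PATTERN = re.compile(
    r"A(?P<action>\d{3})_P(?P<person>\d{3})_G(?P<group>\d{3})_C(?P<camera>\d{3})"
)

def parse_sample_name(name: str) -> Optional[Dict[str, int]]:
    """Extract action, person (subject), group, and camera indices from a sample name."""
    match = FILENAME_PATTERN.search(name)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}

print(parse_sample_name("A001_P001_G001_C001.mp4"))
# {'action': 1, 'person': 1, 'group': 1, 'camera': 1}
```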

Publications

All documents and papers that report on research that uses the ETRI-Activity3D dataset should cite the following paper:

Jinhyeok Jang, Dohyung Kim, Cheonshu Park, Minsu Jang, Jaeyeon Lee, and Jaehong Kim, “ETRI-Activity3D: A Large-Scale RGB-D Dataset for Robots to Recognize Daily Activities of the Elderly,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10990-10997.

Download

Please follow the link below and join as a member to access the download page:

Contact

Please email dhkim008@etri.re.kr if you have any questions or comments.

Acknowledgment

  • The protocol and consent procedure for data collection were approved by the Institutional Review Board (IRB) at Suwon Science College, our joint research institute.
  • This work was supported by the ICT R&D program of MSIP/IITP [2017-0-00162, Development of Human-care Robot Technology for Aging Society].