Hand hygiene among food handlers is essential for food safety, and maintaining it promotes public welfare. However, traditional methods of evaluating hygiene during the food handling process, such as visual auditing by human experts, can be costly and inefficient compared to a computer vision system. Because real-world food processing sites vary in conditions and locations, computer vision systems for recognizing handwashing actions can be susceptible to changes in lighting and environment. We therefore design a robust and generalizable video-based system, built on ResNet50, that combines a hand extraction method with a two-stream network for classifying handwashing actions. Specifically, the hand extraction method eliminates the background and helps the classifier focus on hand regions under changing lighting conditions and environments. Our results show that the system with hand extraction improves action recognition accuracy and generalizes better when evaluated on completely unseen data, achieving an improvement of over 20% in overall classification accuracy.
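To make the background-elimination idea concrete, the following is a minimal sketch, not the method used in this work. It assumes a per-frame binary hand mask is already available from some upstream hand detector or segmenter, which the abstract does not specify. The function name `extract_hands`, the array shapes, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def extract_hands(frame: np.ndarray, hand_mask: np.ndarray) -> np.ndarray:
    """Keep only the pixels inside the hand mask and zero out the rest.

    frame:     (H, W, 3) image array (assumed layout for this sketch).
    hand_mask: (H, W) boolean array, True where a hand is present.
               How the mask is produced is outside this sketch.
    """
    if frame.shape[:2] != hand_mask.shape:
        raise ValueError("frame and hand_mask must share height and width")
    # Broadcast the (H, W) mask across the colour channels.
    return np.where(hand_mask[..., None], frame, 0).astype(frame.dtype)


# Toy example: a uniform 4x4 frame with a 2x2 "hand" region in the middle.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
hand_mask = np.zeros((4, 4), dtype=bool)
hand_mask[1:3, 1:3] = True

hands_only = extract_hands(frame, hand_mask)
```

In a pipeline of this kind, frames masked this way would be passed to the classifier in place of the raw frames, so that background and lighting variation outside the hand regions cannot influence the prediction.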