Q11: Design a distributed application using MapReduce which processes a log file of a system.

# Setup Guide: MapReduce Log Processor

---

## 1. Install Java

```bash
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version
```

---

## 2. Download & Extract Hadoop

Run these from your home directory so that Hadoop ends up in `~/hadoop`, which is the path the later steps use.

```bash
cd ~
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop
```

---

## 3. Set Environment Variables

```bash
nano ~/.bashrc
```

Add at the bottom:

```bash
export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

Reload the file so the variables take effect in the current shell:

```bash
source ~/.bashrc
```

---

## 4. Configure Hadoop

**core-site.xml** (`nano ~/hadoop/etc/hadoop/core-site.xml`)

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

**hdfs-site.xml** (`nano ~/hadoop/etc/hadoop/hdfs-site.xml`)

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

---

## 5. Fix JAVA_HOME in Hadoop Config

Find where Java is installed:

```bash
readlink -f $(which java)
# Output example: /usr/lib/jvm/java-8-openjdk-amd64/bin/java
```

`JAVA_HOME` is that path without the trailing `/bin/java`. Open the Hadoop environment file:

```bash
nano ~/hadoop/etc/hadoop/hadoop-env.sh
```

Find and update:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

---

## 6. Format Namenode

```bash
hdfs namenode -format
```

---

## 7. Set Up SSH (Passwordless)

```bash
sudo apt install openssh-server -y
sudo service ssh start
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
```

---

## 8. Start Hadoop

```bash
start-dfs.sh
start-yarn.sh
jps
```

`jps` should list the Hadoop daemons: `NameNode`, `DataNode`, `SecondaryNameNode`, `ResourceManager`, and `NodeManager`.

---

## 9. Install Python

```bash
sudo apt install python3 -y
sudo apt install python3-pip -y
sudo apt install python-is-python3 -y
python --version
```

---

## 10. Verify Hadoop is Ready

```bash
hadoop version
hdfs dfs -ls /
```
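Hadoop reads each setting in `core-site.xml` and `hdfs-site.xml` from a `<name>`/`<value>` pair inside a `<property>` element, so a mistyped tag can leave the setting unset without an obvious error. The sketch below is one way to sanity-check a config file from step 4 before starting the daemons. The helper `read_hadoop_properties` and the `SAMPLE` string are illustrative names, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config.

    Hypothetical helper for illustration; it is not a Hadoop API.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is None:
            # Catches mistyped tags such as <n> in place of <name>.
            raise ValueError("property is missing its <name> element")
        props[name.strip()] = (value or "").strip()
    return props


# Sample content mirroring the core-site.xml from step 4.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for key, val in read_hadoop_properties(SAMPLE).items():
        print(f"{key} = {val}")
```

To check a real file, read its text with `open(path).read()` and pass it to the helper. Once the cluster is running, `hdfs getconf -confKey fs.defaultFS` reports the value Hadoop actually loaded.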

# Run Guide: MapReduce Log Processor

---

## 1. Create mapper.py

```bash
nano mapper.py
```

Paste this code:

```python
#!/usr/bin/env python3
"""Mapper: emit (log level, 1) and (IP, 1) for each log line."""
import sys

for line in sys.stdin:
    parts = line.split()
    # Expected line format: <ip> <level> <message...>
    if len(parts) >= 3:
        ip, log_type = parts[0], parts[1]
        print(f"{log_type}\t1")
        print(f"{ip}\t1")
```

Save and close:

- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename

---

## 2. Create reducer.py

```bash
nano reducer.py
```

Paste this code:

```python
#!/usr/bin/env python3
"""Reducer: sum the counts for each key (input arrives sorted by key)."""
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    key, value = line.rsplit("\t", 1)
    value = int(value)
    if key == current_key:
        current_count += value
    else:
        # Key changed: emit the finished total, then start a new one.
        if current_key is not None:
            print(f"{current_key}\t{current_count}")
        current_key = key
        current_count = value

# Emit the total for the last key.
if current_key is not None:
    print(f"{current_key}\t{current_count}")
```

Save and close the same way as in step 1.

---

## 3. Create log.txt

```bash
nano log.txt
```

Paste this data:

```
192.168.1.1 INFO User login
192.168.1.2 ERROR Disk failure
192.168.1.1 WARNING Low memory
192.168.1.3 INFO File uploaded
192.168.1.2 ERROR Network issue
```

Save and close the same way as in step 1.

---

## 4. Give Execute Permission

```bash
chmod +x mapper.py reducer.py
```

---

## 5. Test Locally First

The `sort` between the two scripts stands in for Hadoop's shuffle phase, which delivers keys to the reducer in sorted order.

```bash
cat log.txt | ./mapper.py | sort | ./reducer.py
```

Expected output (key and count are tab-separated; IP addresses sort before the log levels):

```
192.168.1.1 2
192.168.1.2 2
192.168.1.3 1
ERROR 2
INFO 2
WARNING 1
```

---

## 6. Upload Log File to HDFS

```bash
hdfs dfs -mkdir -p /input
hdfs dfs -put log.txt /input
hdfs dfs -ls /input
```

---

## 7. Run MapReduce on Hadoop

Run this from the directory that contains `mapper.py` and `reducer.py`. The `-files` option ships both scripts to the tasks, and the first command clears any output left by a previous run.

```bash
hdfs dfs -rm -r -f /output
hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -files mapper.py,reducer.py \
  -input /input/log.txt \
  -output /output \
  -mapper mapper.py \
  -reducer reducer.py
```

> If JAR not found, locate it first:
> ```bash
> find ~/hadoop -name "*streaming*.jar"
> ```

---

## 8. View Output

```bash
hdfs dfs -cat /output/part-00000
```

Expected output:

```
192.168.1.1 2
192.168.1.2 2
192.168.1.3 1
ERROR 2
INFO 2
WARNING 1
```

---

## Troubleshooting

**Output folder already exists:**

```bash
hdfs dfs -rm -r /output
```

**log.txt not in HDFS:**

```bash
hdfs dfs -mkdir /input
hdfs dfs -put log.txt /input
```
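To see why the job produces one count per IP address and per log level, it can help to run the map, shuffle, and reduce phases in a single Python process. The following is a minimal sketch, not how Hadoop executes the job: `map_line` and `run_job` are illustrative names, the in-memory sort is a stand-in for Hadoop's shuffle, and `LOG_LINES` repeats the sample data from `log.txt`.

```python
from itertools import groupby

# Sample data copied from log.txt in the guide.
LOG_LINES = [
    "192.168.1.1 INFO User login",
    "192.168.1.2 ERROR Disk failure",
    "192.168.1.1 WARNING Low memory",
    "192.168.1.3 INFO File uploaded",
    "192.168.1.2 ERROR Network issue",
]


def map_line(line):
    """Map phase: yield (key, 1) for the log level and the IP of one line."""
    parts = line.split()
    if len(parts) >= 3:
        yield parts[1], 1  # log level
        yield parts[0], 1  # IP address


def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce in memory."""
    pairs = [pair for line in lines for pair in map_line(line)]
    # Shuffle stand-in: sorting brings equal keys next to each other.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum the values of each run of identical keys.
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }


if __name__ == "__main__":
    for key, total in run_job(LOG_LINES).items():
        print(f"{key}\t{total}")
```

Running the script prints the same six key and count pairs as the Hadoop job. The reducer's running-total logic depends on the sort step: without it, equal keys would not be adjacent and each key could be emitted more than once.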


