Solution and implementation for Q11 from Data Science Laboratory (ds).
# Setup Guide — MapReduce Log Processor
---
## 1. Install Java
```bash
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version
```
---
## 2. Download & Extract Hadoop
```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop
```
---
## 3. Set Environment Variables
```bash
nano ~/.bashrc
```
Add at the bottom:
```bash
export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```
Reload the file so the variables take effect in the current shell:
```bash
source ~/.bashrc
```
---
## 4. Configure Hadoop
**core-site.xml** — `nano ~/hadoop/etc/hadoop/core-site.xml`
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
**hdfs-site.xml** — `nano ~/hadoop/etc/hadoop/hdfs-site.xml`
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
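To double-check the edited files, a short script can parse them and list the properties Hadoop will read. The following is a minimal sketch using only the Python standard library. `read_properties` is a hypothetical helper name, and the XML string stands in for the contents of `core-site.xml`.

```python
import xml.etree.ElementTree as ET

# Stand-in for the text of ~/hadoop/etc/hadoop/core-site.xml (assumption).
CORE_SITE = """\
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            # A property without a <name> child would not be usable by Hadoop.
            raise ValueError("property element is missing <name>")
        props[name] = prop.findtext("value")
    return props


print(read_properties(CORE_SITE))
```

A `ValueError` here usually points to a mistyped tag inside a `<property>` element.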
---
## 5. Fix JAVA_HOME in Hadoop Config
```bash
readlink -f $(which java)
# Output example: /usr/lib/jvm/java-8-openjdk-amd64/bin/java
```
```bash
nano ~/hadoop/etc/hadoop/hadoop-env.sh
```
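Find the `JAVA_HOME` line (commented out by default) and set it to the JDK directory, that is, the path printed above without the trailing `/bin/java`: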
```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```
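The step from the `readlink` output to the `JAVA_HOME` value can be sketched as a small function. This is an illustration only: `java_home_from_binary` is a hypothetical helper, and the handling of a `jre/` component is an assumption based on how some JDK 8 packages lay out their files.

```python
from pathlib import PurePosixPath


def java_home_from_binary(java_path):
    """Derive a JAVA_HOME directory from the resolved path of the java binary."""
    path = PurePosixPath(java_path)
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"not a java binary path: {java_path}")
    home = path.parent.parent  # drop the trailing bin/java
    # Assumption: some JDK 8 installs resolve to <home>/jre/bin/java.
    if home.name == "jre":
        home = home.parent
    return str(home)


print(java_home_from_binary("/usr/lib/jvm/java-8-openjdk-amd64/bin/java"))
```

If the printed directory differs from the value in `hadoop-env.sh`, use the printed one.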
---
## 6. Format Namenode
```bash
hdfs namenode -format
```
---
## 7. Set Up SSH (Passwordless)
```bash
sudo apt install openssh-server -y
sudo service ssh start
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
```
---
## 8. Start Hadoop
```bash
start-dfs.sh
start-yarn.sh
jps
```
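On a single-node setup, `jps` should list `NameNode`, `DataNode`, and `SecondaryNameNode` (started by `start-dfs.sh`) plus `ResourceManager` and `NodeManager` (started by `start-yarn.sh`). The sketch below checks a captured `jps` listing for missing daemons. `missing_daemons` is a hypothetical helper, and the process IDs in the sample are made up.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}

# Illustrative jps output; the PIDs are invented for this example.
SAMPLE_JPS = """\
4211 NameNode
4377 DataNode
4590 SecondaryNameNode
4812 ResourceManager
4950 NodeManager
5301 Jps
"""


def missing_daemons(jps_output):
    """Return the expected Hadoop daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # each line is "<pid> <process name>"
            running.add(fields[1])
    return EXPECTED_DAEMONS - running


print(missing_daemons(SAMPLE_JPS) or "all daemons running")
```

A missing daemon is usually explained in the log files under `$HADOOP_HOME/logs`.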
---
## 9. Install Python
```bash
sudo apt install python3 -y
sudo apt install python3-pip -y
sudo apt install python-is-python3 -y
python --version
```
---
## 10. Verify Hadoop is Ready
```bash
hadoop version
hdfs dfs -ls /
```
# Run Guide — MapReduce Log Processor
---
## 1. Create mapper.py
```bash
nano mapper.py
```
Paste this code:
```python
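#!/usr/bin/env python3
"""Mapper: emit one count per log level and one per source IP."""
import sys

for line in sys.stdin:
    parts = line.strip().split()
    # Expect "<ip> <level> <message...>"; skip anything shorter.
    if len(parts) >= 3:
        ip = parts[0]
        log_type = parts[1]
        print(f"{log_type}\t1")
        print(f"{ip}\t1")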
```
Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename
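To see what the mapper emits for a single record, the same logic can be written as a function and called on one line of input. This is a sketch for experimenting in a Python shell; `map_line` is a hypothetical name and is not part of `mapper.py`.

```python
def map_line(line):
    """Return the (key, 1) pairs the mapper would emit for one log line."""
    parts = line.split()
    if len(parts) < 3:
        return []  # malformed lines produce no output
    ip, log_type = parts[0], parts[1]
    return [(log_type, 1), (ip, 1)]


print(map_line("192.168.1.2 ERROR Disk failure"))
# [('ERROR', 1), ('192.168.1.2', 1)]
```

Each valid line therefore contributes two records: one keyed by log level and one keyed by IP address.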
---
## 2. Create reducer.py
```bash
nano reducer.py
```
Paste this code:
```python
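#!/usr/bin/env python3
"""Reducer: sum the counts for each key; input must arrive sorted by key."""
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    key, value = line.split("\t", 1)
    value = int(value)
    if current_key == key:
        current_count += value
    else:
        # Key changed: flush the finished key before starting the next one.
        if current_key is not None:
            print(f"{current_key}\t{current_count}")
        current_key = key
        current_count = value

# Flush the last key.
if current_key is not None:
    print(f"{current_key}\t{current_count}")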
```
Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename
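The reducer only compares each key with the previous one, so it relies on its input being sorted by key. The sketch below expresses the same idea with `itertools.groupby` and shows what happens when the input is not sorted. `reduce_sorted` is a hypothetical name used only for this illustration.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts per key, assuming equal keys are adjacent in the input."""
    pairs = (line.strip().split("\t", 1) for line in lines if line.strip())
    return [
        (key, sum(int(value) for _, value in group))
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]


sorted_input = ["ERROR\t1", "ERROR\t1", "INFO\t1"]
unsorted_input = ["ERROR\t1", "INFO\t1", "ERROR\t1"]

print(reduce_sorted(sorted_input))    # [('ERROR', 2), ('INFO', 1)]
print(reduce_sorted(unsorted_input))  # [('ERROR', 1), ('INFO', 1), ('ERROR', 1)]
```

Hadoop sorts the mapper output before it reaches the reducer; when testing locally, the `sort` command plays that role.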
---
## 3. Create log.txt
```bash
nano log.txt
```
Paste this data:
```
192.168.1.1 INFO User login
192.168.1.2 ERROR Disk failure
192.168.1.1 WARNING Low memory
192.168.1.3 INFO File uploaded
192.168.1.2 ERROR Network issue
```
Save and close:
- `Ctrl + X` → Exit
- `Y` → Confirm save
- `Enter` → Keep filename
---
## 4. Give Execute Permission
```bash
chmod +x mapper.py reducer.py
```
---
## 5. Test Locally First
```bash
cat log.txt | ./mapper.py | sort | ./reducer.py
```
Expected output:
```
192.168.1.1 2
192.168.1.2 2
192.168.1.3 1
ERROR 2
INFO 2
WARNING 1
```
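As a cross-check that does not depend on the shell, the whole map, sort, and reduce pipeline can be imitated in one short script on the sample data. This is a sketch, not a replacement for the scripts above: `run_pipeline` is a hypothetical name, and `Counter` stands in for the sort-and-sum step.

```python
from collections import Counter

LOG = """\
192.168.1.1 INFO User login
192.168.1.2 ERROR Disk failure
192.168.1.1 WARNING Low memory
192.168.1.3 INFO File uploaded
192.168.1.2 ERROR Network issue
"""


def run_pipeline(text):
    """Count log levels and IPs, returning (key, count) pairs sorted by key."""
    keys = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            keys.append(parts[1])  # log level
            keys.append(parts[0])  # IP address
    # Python compares strings by code point, so digits sort before letters.
    return sorted(Counter(keys).items())


for key, count in run_pipeline(LOG):
    print(f"{key}\t{count}")
```

The IP addresses come first in the result because keys that start with a digit sort ahead of keys that start with an uppercase letter.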
---
## 6. Upload Log File to HDFS
```bash
hdfs dfs -mkdir -p /input
hdfs dfs -put log.txt /input
hdfs dfs -ls /input
```
---
## 7. Run MapReduce on Hadoop
```bash
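# Remove any previous output; -f suppresses the error if /output does not exist
hdfs dfs -rm -r -f /output
# -files ships both scripts with the job; generic options must come before -input
hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -files mapper.py,reducer.py \
  -input /input/log.txt \
  -output /output \
  -mapper mapper.py \
  -reducer reducer.py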
```
> If the JAR is not found, locate it first:
> ```bash
> find ~/hadoop -name "*streaming*.jar"
> ```
---
## 8. View Output
```bash
hdfs dfs -cat /output/part-00000
```
Expected output:
```
192.168.1.1 2
192.168.1.2 2
192.168.1.3 1
ERROR 2
INFO 2
WARNING 1
```
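Because the job writes log-level counts and IP counts into the same file, it can help to separate them afterwards. The sketch below does this with the standard `ipaddress` module. `split_counts` is a hypothetical helper, and the sample string stands in for the text returned by `hdfs dfs -cat /output/part-00000`.

```python
import ipaddress

# Stand-in for the job output: tab-separated key and count per line.
SAMPLE_OUTPUT = (
    "192.168.1.1\t2\n"
    "192.168.1.2\t2\n"
    "192.168.1.3\t1\n"
    "ERROR\t2\n"
    "INFO\t2\n"
    "WARNING\t1\n"
)


def split_counts(output_text):
    """Split reducer output into (counts_by_ip, counts_by_level) dicts."""
    by_ip, by_level = {}, {}
    for line in output_text.splitlines():
        if not line.strip():
            continue
        key, count = line.split("\t", 1)
        try:
            ipaddress.ip_address(key)  # raises ValueError for non-IP keys
        except ValueError:
            by_level[key] = int(count)
        else:
            by_ip[key] = int(count)
    return by_ip, by_level


by_ip, by_level = split_counts(SAMPLE_OUTPUT)
print(by_ip)
print(by_level)
```

An alternative design is to have the mapper prefix its keys (for example `ip:` and `level:`), which would make this classification step unnecessary.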
---
## Troubleshooting
**Output folder already exists:**
```bash
hdfs dfs -rm -r /output
```
**log.txt not in HDFS:**
```bash
hdfs dfs -mkdir -p /input
hdfs dfs -put log.txt /input
```